* Re: Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
From: Alexei Starovoitov @ 2018-04-13 15:07 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Peter Zijlstra, Daniel Borkmann, netdev@vger.kernel.org, LKML,
Björn Töpel
In-Reply-To: <20180413152237.3727911f@redhat.com>
On Fri, Apr 13, 2018 at 03:22:37PM +0200, Jesper Dangaard Brouer wrote:
> Hi Peter,
>
> Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
> for several samples/bpf programs. I'm unsure what the best way forward
> is to unbreak these...
>
> The issue is that these samples are build with LLVM/clang (which
> doesn't like 'asm goto' constructs). And they end up including
> arch/x86/include/asm/cpufeature.h via a long include path, see build
> examples below (through different path to include/linux/thread_info.h).
>
> Maybe Alexei or Daniel have an idea how to work around this?
> As tools/testing/selftests/bpf/ does not seem to fail!?
Right. All of bpf tracing and samples/bpf/ broke.
Here is the proposed fix that we're asking Peter to apply and send to Linus asap.
https://lkml.org/lkml/2018/4/10/825
> Build error#1:
> --------------
> clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
> -I./tools/testing/selftests/bpf/ \
> -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
> -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
> -Wno-gnu-variable-sized-type-not-at-end \
> -Wno-address-of-packed-member -Wno-tautological-compare \
> -Wno-unknown-warning-option \
> -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
> In file included from samples/bpf/sockex2_kern.c:3:
> In file included from ./include/uapi/linux/in.h:24:
> In file included from ./include/linux/socket.h:8:
> In file included from ./include/linux/uio.h:13:
> In file included from ./include/linux/thread_info.h:38:
> In file included from ./arch/x86/include/asm/thread_info.h:53:
> ./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
> asm_volatile_goto("1: jmp 6f\n"
> ^
> ./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
> #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
^ permalink raw reply
* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 15:03 UTC (permalink / raw)
To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <c7a994ee-763c-1f94-2c1e-4348d7b4cc62@solarflare.com>
From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 15:52:25 +0100
> On 13/04/18 15:45, David Miller wrote:
>> I understand the constraints you are working under, but do realize
>> that the real root of the problems is that you are implementing what
>> is defined clearly as a synchronous operation as asynchronous.
> Yes, it is unfortunate that we are unable to perform synchronous filter
> insertions, but you go to war with the hardware you have :(
Whilst you may not be able to program the filter into the hardware
synchronously, you should be able to allocate the ID and get all of
the software state setup.
At least then you can return the proper return value from the netdev
op.
People really aren't going to be happy if their performance goes into
the tank because their filter update rate, for whatever reason, hits
this magic backlog limit.
^ permalink raw reply
* Re: tcp hang when socket fills up ?
From: Eric Dumazet @ 2018-04-13 15:01 UTC (permalink / raw)
To: Dominique Martinet, netdev
In-Reply-To: <20180406090720.GA31845@nautica>
On 04/06/2018 02:07 AM, Dominique Martinet wrote:
> (current kernel: vanilla 4.14.29)
>
> I've been running into troubles recently where a sockets "fills up" (as
> in, select() will no longer return it in its outfd / attempting to write
> to it after setting it to NONBLOCK will return -EWOULDBLOCK) and it
> doesn't seem to ever "unfill" until the tcp connexion timeout.
>
> The previous time I pushed it down to the application for not trying to
> read the socket either as I assume the buffers could be shared and
> just waiting won't take data out, but this time I'm a bit more
> skeptical as socat waits for the fd in both read and write...
>
> Let me take a minute to describe my setup, I don't think that how the
> socket was created matters but it might be interesting:
> - I have two computers behind NATs, no port forwarding on either side
> - One (call it C for client) runs ssh with a proxycommand ncat/socat to
> control the source port, e.g.
> $ ssh -o ProxyCommand="socat stdio tcp:<server public ip>:<port1>,sourceport=<port2>" server
> - The server runs another socat to connect to that and forwards to ssh
> locally, e.g.
> $ socat tcp:<client public ip>:<port2>,sourceport=<port1> tcp:127.0.0.1:22
>
> (yes, both are connect() calls, and that just works™ thanks to initial
> syn replay and conntrack on routers)
>
> When things stall, the first socat is in a select with both fd in
> reading, so it's waiting data.
> The second socat has data coming from ssh and is in a select with the
> client-facing socket in both read and write - that select never returns
> so the socket is not available for reading or writing and does not free
> up until the connection eventually times out a few minutes later.
>
> At this point, I only see tcp replays in tcpdump/wireshark. I've
> compared dumps from both sides and there are no lost packets, only
> reordering - there always is a batch of acks that were sent regularily
> coming in shortly before the hang. Here's the trace on the server:
>
> 16:49:26.735042 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739882 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.740255 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.746756 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], length 0
> 16:49:26.747923 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], length 0
> 16:49:26.749083 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], length 0
> 16:49:26.750171 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], length 0
> 16:49:26.750808 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], length 0
> 16:49:26.754648 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], length 0
> 16:49:26.755985 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], length 0
> 16:49:26.758513 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], length 0
> 16:49:26.759096 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759421 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759755 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], length 0
> 16:49:26.760653 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], length 0
> 16:49:26.761453 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], length 0
> 16:49:26.762199 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0
> 16:49:26.763547 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36
> 16:49:26.763553 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0
> 16:49:26.764298 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0
> 16:49:26.764676 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0
> 16:49:26.807754 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374
> 16:49:26.876467 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0
There is no way a regular TCP stack (including linux) could send the following frame.
> 16:49:27.048760 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374
So something is mangling the packet, maybe NAT or something.
> 16:49:27.051791 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36
> 16:49:27.076444 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36
> 16:49:27.519862 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374
> 16:49:27.547662 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.883372 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36
> 16:49:28.511861 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374
> 16:49:28.538891 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:28.907197 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36
> 16:49:30.431864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374
> 16:49:30.459127 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:30.955388 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36
> 16:49:34.207879 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374
> 16:49:34.235726 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:35.256285 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36
> 16:49:42.143864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374
> 16:49:42.171531 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:43.448262 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36
>
> (I still have the pcap files if someone wants to see them, but I'd
> rather not give my work IP publicly so will send it privately on
> request)
>
>
> At this point, only the same 3 packets keep being replayed over and
> over... According to 'ss' the Send-Q isn't empty on either side, the
> client has some ~1k to send but the server has much more (87k)
> After increasing the window size through net.*wmem sysctl it got stuck
> with over 1MB in Send-Q, which makes sense because the socket is full...
>
>
> So, what I don't get is, why are these acks continuously replayed? Given
> the timing it looks like the server doesn't take the client acks into
> account, despite having received that precise 33378 ack earlier and I
> believe it should accept a higher ack value anyway ?
>
> The ultimate question being, how can I go about debugging that?
> I'm working on getting perf probe/crash to work on the server so I can
> look at the kernel side now, I can reproduce this semi-easily so I'm
> sure I'll get down to it eventually, but if anyone has an idea I'm all
> ears.
>
>
> Thanks!
>
^ permalink raw reply
* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: David Miller @ 2018-04-13 14:57 UTC (permalink / raw)
To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <be5251775c626534633ce7658f6d7da93e7aecd5.1523558015.git.g.nault@alphalink.fr>
From: Guillaume Nault <g.nault@alphalink.fr>
Date: Thu, 12 Apr 2018 20:50:33 +0200
> l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> tunnel, therefore it can be freed whenever the caller uses it.
> This patch defines l2tp_tunnel_get_nth() which works similarly, but
> also takes a reference on the returned tunnel. The caller then has to
> drop it after it stops using the tunnel.
>
> Convert netlink dumps to make them safe against concurrent tunnel
> deletion.
>
> Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
mutex is held.
Therefore no tunnel configuration changes may occur and the tunnel
object will persist and is safe to access.
The netlink dump should be safe as-is.
Were you actually able to trigger a crash or KASAN warning or is
this purely from code inspection?
^ permalink raw reply
* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Arnd Bergmann @ 2018-04-13 14:52 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
netfilter-devel, coreteam, Networking, Linux Kernel Mailing List
In-Reply-To: <20180413131558.3jw5dhub5gcyotyt@salvia>
On Fri, Apr 13, 2018 at 3:15 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
>> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > Hi Arnd,
>> >
>> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
>> >
>> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
>> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
>> > and CONFIG_NF_REJECT_IPV6=m.
>> >
>> > I mean, just like we do with NFT_FIB_INET.
>>
>> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
>> again, so that code gets built as a loadable module if
>> CONFIG_NF_REJECT_IPV6=m.
>>
>> > BTW, I think this problem has been is not related to the recent patch,
>> > but something older that kbuild robot has triggered more easily for
>> > some reason?
>>
>> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
>> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
>> restricted to a loadable module with IPV6=m, but can now be
>> built-in, which causes that link error.
>
> Still one more spin on this, I would like to see if we have a way to
> fix this by simplifing things a bit.
>
> Would this one I'm attaching would work?
One disadvantage is that it makes the vmlinux bigger since
NF_REJECT_IPV{4,6} can no longer be a module at all now.
I suspect you also stil get a link error with IPV6=m, this time because
the nf_reject_ipv6.o file fails to link against the ipv6 code, e.g.
ipv6_skip_exthdr() and icmpv6_send() appear to be unreachable here.
I haven't tried that though, so I might be missing something.
Arnd
^ permalink raw reply
* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 14:52 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <97aad935-d5fb-713e-fd0f-d84bbd733a8f@solarflare.com>
On 13/04/18 13:36, Edward Cree wrote:
> It turns out this may all be moot anyway: I figured out why I was seeing
> ARFS storms and it wasn't the configuration issue I originally blamed.
Hmm, correction, while the fix I mentioned in my previous email is needed,
it doesn't prevent the ARFS storms (although seems to lessen their
severity, given that the machine didn't actually fall over this time),
so we do also need some kind of limiting.
On 12/04/18 16:33, David Miller wrote:
> Then simply make the work process a queue, and add entries to the queue
> here if the work is already scheduled.
>
> Is there a reason why that wouldn't work?
That has the same problem as the existing code, that the length of the queue
can grow without bound, potentially causing a very long lag between the
request and its execution. This then can quickly become exponential as,
while waiting for the filter to be inserted, further packets from the same
flow are received (still unsteered) and trigger duplicate ARFS requests
(though I suppose it would be possible to scan the queue for matching flow
IDs; but the concurrency / locking problems with that are 'interesting').
I'm not sure why you object to the dropping of requests - it seems reasonable
to me to treat ARFS as a 'best-effort' thing. The packets will still be
received (just not necessarily on the core nearest the application), and if
the flow continues it will generate more steering requests after the ones
currently in flight have been processed.
And in practice we only get into this situation in the first place when we
have interrupt affinities configured in such a way as to make ARFS
practically useless anyway, so our failure to insert the filters is not of
great significance.
On 13/04/18 15:45, David Miller wrote:
> I understand the constraints you are working under, but do realize
> that the real root of the problems is that you are implementing what
> is defined clearly as a synchronous operation as asynchronous.
Yes, it is unfortunate that we are unable to perform synchronous filter
insertions, but you go to war with the hardware you have :(
-Ed
^ permalink raw reply
* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 14:45 UTC (permalink / raw)
To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <97aad935-d5fb-713e-fd0f-d84bbd733a8f@solarflare.com>
From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 13:36:28 +0100
> It turns out this may all be moot anyway: I figured out why I was seeing
> ARFS storms and it wasn't the configuration issue I originally blamed.
> My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
> 0 for success, but the caller expects a filter ID to be returned (which
> we can't give it because we don't know what the filter ID will be until
> we start mucking around in the software state that's now protected by a
> sleepable lock).
> As a result, when we call rps_may_expire_flow(), and pass it the _actual_
> filter ID, this doesn't match the one set_rps_cpu() recorded, so the
> function returns true and we immediately expire the filter. Then the
> next packet to come along isn't steered, so ARFS asks us to insert a
> steering filter again.
> As a quick fix I've simply tried making the rps_may_expire_flow() calls
> also pass a filter ID of 0, which prevents the ARFS storms. This is
> safe; it may cause us to delay expiring a filter when flow_ids collide,
> but that can happen anyway with other drivers' implementations (e.g.
> mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
> I'll post a v2 with that fix in place of this Patch #2 shortly, then try
> to follow up with a counter-generated ID (similar to what mlx have).
I understand the constraints you are working under, but do realize
that the real root of the problems is that you are implementing what
is defined clearly as a synchronous operation as asynchronous.
^ permalink raw reply
* Re: Atlantic driver 4.16 stable request
From: David Miller @ 2018-04-13 14:43 UTC (permalink / raw)
To: igor.russkikh; +Cc: netdev, Nadezhda.Krupnina, simon.edelhaus
In-Reply-To: <9af8632f-6f5a-686d-0da1-bdf5b5b6bb07@aquantia.com>
From: Igor Russkikh <igor.russkikh@aquantia.com>
Date: Fri, 13 Apr 2018 15:03:29 +0300
> Could you please consider queuing to v4.16:
>
> 9a11aff net: aquantia: oops when shutdown on already stopped device
> cce96d1 net: aquantia: Regression on reset with 1.x firmware
>
> These are both critical and well tested by our team.
Queued up, thank you.
^ permalink raw reply
* Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Bjorn Helgaas @ 2018-04-13 14:06 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tal Gilboa, Tariq Toukan, Jacob Keller, Ariel Elior,
Ganesh Goudar, Jeff Kirsher, everest-linux-l2, intel-wired-lan,
netdev, linux-kernel, linux-pci
In-Reply-To: <20180412213249.06661048@cakuba.netronome.com>
On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > + if (bw_avail >= bw_cap)
> > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > + else
> > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> > + bw_avail, PCIE_SPEED2STR(speed), width,
> > + limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
>
> I was just looking at using this new function to print PCIe BW for a
> NIC, but I'm slightly worried that there is nothing in the message that
> says PCIe... For a NIC some people may interpret the bandwidth as NIC
> bandwidth:
>
> [ 39.839989] nfp 0000:04:00.0: Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe
> [ 39.848943] nfp 0000:04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> [ 39.857146] nfp 0000:04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
>
> It's not a 63Gbps NIC... I'm sorry if this was discussed before and I
> didn't find it. Would it make sense to add the "PCIe: " prefix to the
> message like bnx2x used to do? Like:
>
> nfp 0000:04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
I agree, that does look potentially confusing. How about this:
nfp 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
I did have to look twice at this before I remembered that we're
printing Gb/s (not GB/s). Most of the references I found on the web
use GB/s when talking about total PCIe bandwidth.
But either way I think it's definitely worth mentioning PCIe
explicitly.
^ permalink raw reply
* Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
From: Jesper Dangaard Brouer @ 2018-04-13 13:22 UTC (permalink / raw)
To: Peter Zijlstra, Daniel Borkmann, Alexei Starovoitov
Cc: brouer, netdev@vger.kernel.org, LKML, Björn Töpel
Hi Peter,
Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
for several samples/bpf programs. I'm unsure what the best way forward
is to unbreak these...
The issue is that these samples are build with LLVM/clang (which
doesn't like 'asm goto' constructs). And they end up including
arch/x86/include/asm/cpufeature.h via a long include path, see build
examples below (through different path to include/linux/thread_info.h).
Maybe Alexei or Daniel have an idea how to work around this?
As tools/testing/selftests/bpf/ does not seem to fail!?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Build error#1:
--------------
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option \
-O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:3:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
asm_volatile_goto("1: jmp 6f\n"
^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
^
Build error#2:
--------------
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option \
-O2 -emit-llvm -c samples/bpf/tracex1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/tracex1_kern.o
In file included from samples/bpf/tracex1_kern.c:7:
In file included from ./include/linux/skbuff.h:19:
In file included from ./include/linux/time.h:6:
In file included from ./include/linux/seqlock.h:36:
In file included from ./include/linux/spinlock.h:51:
In file included from ./include/linux/preempt.h:81:
In file included from ./arch/x86/include/asm/preempt.h:7:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
asm_volatile_goto("1: jmp 6f\n"
^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
^
Build error#3:
--------------
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86
/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option \
-O2 -emit-llvm -c samples/bpf/xdp1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/xdp1_kern.o
In file included from samples/bpf/xdp1_kern.c:9:
In file included from ./include/linux/in.h:23:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
asm_volatile_goto("1: jmp 6f\n"
^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
^ permalink raw reply
* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Pablo Neira Ayuso @ 2018-04-13 13:15 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
netfilter-devel, coreteam, Networking, Linux Kernel Mailing List
In-Reply-To: <CAK8P3a1GH5OSu62TBh46S0OdiXTJODsvV3To5=Jp2nL+fz1Puw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]
On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi Arnd,
> >
> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
> >
> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> > and CONFIG_NF_REJECT_IPV6=m.
> >
> > I mean, just like we do with NFT_FIB_INET.
>
> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
> again, so that code gets built as a loadable module if
> CONFIG_NF_REJECT_IPV6=m.
>
> > BTW, I think this problem has been is not related to the recent patch,
> > but something older that kbuild robot has triggered more easily for
> > some reason?
>
> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
> restricted to a loadable module with IPV6=m, but can now be
> built-in, which causes that link error.
Still one more spin on this, I would like to see if we have a way to
fix this by simplifing things a bit.
Would this one I'm attaching would work?
Thanks for you patience.
[-- Attachment #2: 0001-netfilter-CONFIG_NF_REJECT_IPV-4-6-becomes-bool-togg.patch --]
[-- Type: text/x-diff, Size: 2586 bytes --]
>From af07bc7ff5d34ce54e7913233912c058e6699e3c Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Fri, 13 Apr 2018 10:48:40 +0200
Subject: [PATCH] netfilter: CONFIG_NF_REJECT_IPV{4,6} becomes bool toggle
Arnd reports that we get a new link error with CONFIG_NFT_REJECT_INET=y
and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules
are linked together:
net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'
The problem is that with NF_TABLES_INET set, we implicitly try to use
the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to
a loadable module, it's impossible to reach that.
This patch fixes this problem by building-in nf_reject_ipv{4,6}.c, IPv6
symbol dependencies for the IPv6 reject infrastructure are located in
exthdrs_core.c, ip6_checksum.c and ip6_icmp.c which are also built-in,
so let's do the same to simplify this.
Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv4/netfilter/Kconfig | 3 +--
net/ipv6/netfilter/Kconfig | 3 +--
net/netfilter/Kconfig | 2 ++
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index 280048e1e395..3e4e0ae2a9a1 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -104,8 +104,7 @@ config NF_LOG_IPV4
select NF_LOG_COMMON
config NF_REJECT_IPV4
- tristate "IPv4 packet rejection"
- default m if NETFILTER_ADVANCED=n
+ bool "IPv4 packet rejection"
config NF_NAT_IPV4
tristate "IPv4 NAT"
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index ccbfa83e4bb0..1e5d040a60b8 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -87,8 +87,7 @@ config NF_DUP_IPV6
packet to be rerouted to another destination.
config NF_REJECT_IPV6
- tristate "IPv6 packet rejection"
- default m if NETFILTER_ADVANCED=n
+ bool "IPv6 packet rejection"
config NF_LOG_IPV6
tristate "IPv6 packet logging"
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 4189f574f5ec..d7b3272fe821 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -609,6 +609,8 @@ config NFT_REJECT
config NFT_REJECT_INET
depends on NF_TABLES_INET
+ select NF_REJECT_IPV4
+ select NF_REJECT_IPV6
default NFT_REJECT
tristate
--
2.11.0
^ permalink raw reply related
* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dan Streetman @ 2018-04-13 12:43 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CACT4Y+Z2kdjsYUs-H9aSTcNW-FSKqm1__fg3eOq16H-E=X8oZg@mail.gmail.com>
On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
> <tommi.t.rantala@nokia.com> wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>>
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>
>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>
>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>
>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>> latest upstream tree.
>>>>>>
>>>>>
>>>>> These two commits are in v4.16-rc1:
>>>>>
>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>> Date: Mon Feb 5 21:48:14 2018 +0200
>>>>>
>>>>> sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>> ...
>>>>> Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>> Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>> secondary
>>>>> addresses")
>>>>>
>>>>>
>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>> Date: Mon Feb 5 15:10:35 2018 +0300
>>>>>
>>>>> sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>> ...
>>>>> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>> secondary
>>>>> addresses for ipv6")
>>>>>
>>>>>
>>>>> I guess we missed something if it's still reproducible.
>>>>>
>>>>> I can check it later this week, unless someone else beat me to it.
>>>>
>>>>
>>>> Hi Tommi,
>>>>
>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>> another one then. But I am still seeing these:
>>>>
>>>> [ 58.799130] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [ 60.847138] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [ 62.895093] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [ 64.943103] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>>
>>>> on upstream tree pulled ~12 hours ago.
>>>>
>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>> out a
>>> backtrace if the device name matches "lo". That should tell us
>>> definitively if
>>> the problem is in the same location or not
>>
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?
I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
I added a summary of what I've found to be the cause (or at least, one
possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
I'm working on a patch to work around the main side-effect of this,
which is hanging while holding the global net mutex. Hangs will still
happen (e.g. if a dst leaks) but should not affect anything else,
other than a leak of the dst and its net namespace.
Fixing the dst leaks is important too, of course, but a dst leak (or
other cause) shouldn't break the entire system.
^ permalink raw reply
* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 12:36 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180412.113300.475644883265871487.davem@davemloft.net>
It turns out this may all be moot anyway: I figured out why I was seeing
ARFS storms and it wasn't the configuration issue I originally blamed.
My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
0 for success, but the caller expects a filter ID to be returned (which
we can't give it because we don't know what the filter ID will be until
we start mucking around in the software state that's now protected by a
sleepable lock).
As a result, when we call rps_may_expire_flow(), and pass it the _actual_
filter ID, this doesn't match the one set_rps_cpu() recorded, so the
function returns true and we immediately expire the filter. Then the
next packet to come along isn't steered, so ARFS asks us to insert a
steering filter again.
As a quick fix I've simply tried making the rps_may_expire_flow() calls
also pass a filter ID of 0, which prevents the ARFS storms. This is
safe; it may cause us to delay expiring a filter when flow_ids collide,
but that can happen anyway with other drivers' implementations (e.g.
mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
I'll post a v2 with that fix in place of this Patch #2 shortly, then try
to follow up with a counter-generated ID (similar to what mlx have).
-Ed
^ permalink raw reply
* Atlantic driver 4.16 stable request
From: Igor Russkikh @ 2018-04-13 12:03 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Nadezhda Krupnina, Simon Edelhaus
Hi David,
Could you please consider queuing to v4.16:
9a11aff net: aquantia: oops when shutdown on already stopped device
cce96d1 net: aquantia: Regression on reset with 1.x firmware
These are both critical and well tested by our team.
Thanks in advance!
^ permalink raw reply
* [PATCH net] team: avoid adding twice the same option to the event list
From: Paolo Abeni @ 2018-04-13 11:59 UTC (permalink / raw)
To: netdev; +Cc: Jiri Pirko, David S. Miller
When parsing the options provided by the user space,
team_nl_cmd_options_set() insert them in a temporary list to send
multiple events with a single message.
While each option's attribute is correctly validated, the code does
not check for duplicate entries before inserting into the event
list.
Exploiting the above, the syzbot was able to trigger the following
splat:
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:ffff8801b04bf248 EFLAGS: 00010286
RAX: 0000000000000058 RBX: ffff8801c8fc7a90 RCX: 0000000000000000
RDX: 0000000000000058 RSI: ffffffff815fbf41 RDI: ffffed0036097e3f
RBP: ffff8801b04bf260 R08: ffff8801b0b2a700 R09: ffffed003b604f90
R10: ffffed003b604f90 R11: ffff8801db027c87 R12: ffff8801c8fc7a90
R13: ffff8801c8fc7a90 R14: dffffc0000000000 R15: 0000000000000000
FS: 0000000000b98880(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043fc30 CR3: 00000001afe8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__list_add include/linux/list.h:60 [inline]
list_add include/linux/list.h:79 [inline]
team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
___sys_sendmsg+0x805/0x940 net/socket.c:2117
__sys_sendmsg+0x115/0x270 net/socket.c:2155
SYSC_sendmsg net/socket.c:2164 [inline]
SyS_sendmsg+0x29/0x30 net/socket.c:2162
do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:00007ffd1d4a7278 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00000000004458b9
RDX: 0000000000000010 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00000000004a74ed R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000213 R12: 00007ffd1d4a7348
R13: 0000000000402a60 R14: 0000000000000000 R15: 0000000000000000
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48
89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f
40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41
RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: ffff8801b04bf248
This changeset addresses the avoiding list_add() if the current
option is already present in the event list.
Reported-and-tested-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")
---
drivers/net/team/team.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a6c6ce19eeee..acbe84967834 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -261,6 +261,17 @@ static void __team_option_inst_mark_removed_port(struct team *team,
}
}
+static bool __team_option_inst_tmp_find(const struct list_head *opts,
+ const struct team_option_inst *needle)
+{
+ struct team_option_inst *opt_inst;
+
+ list_for_each_entry(opt_inst, opts, tmp_list)
+ if (opt_inst == needle)
+ return true;
+ return false;
+}
+
static int __team_options_register(struct team *team,
const struct team_option *option,
size_t option_count)
@@ -2568,6 +2579,14 @@ static int team_nl_cmd_options_set(struct sk_buff *skb, struct genl_info *info)
if (err)
goto team_put;
opt_inst->changed = true;
+
+ /* dumb/evil user-space can send us duplicate opt,
+ * keep only the last one
+ */
+ if (__team_option_inst_tmp_find(&opt_inst_list,
+ opt_inst))
+ continue;
+
list_add(&opt_inst->tmp_list, &opt_inst_list);
}
if (!opt_found) {
--
2.14.3
^ permalink raw reply related
* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Neil Horman @ 2018-04-13 11:37 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Tommi Rantala, Xin Long, David Ahern, Daniel Borkmann, Cong Wang,
David Miller, Eric Dumazet, Willem de Bruijn, Jakub Kicinski,
Rasmus Villemoes, netdev, LKML, Alexey Kuznetsov,
Hideaki YOSHIFUJI, syzkaller, Dan Streetman, Eric W. Biederman,
Alexey Kodanev, Marcelo Ricardo Leitner <marcelo.leit
In-Reply-To: <CACT4Y+Z2kdjsYUs-H9aSTcNW-FSKqm1__fg3eOq16H-E=X8oZg@mail.gmail.com>
On Thu, Apr 12, 2018 at 02:15:30PM +0200, Dmitry Vyukov wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
> <tommi.t.rantala@nokia.com> wrote:
> > On 20.02.2018 18:26, Neil Horman wrote:
> >>
> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> >>>
> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
> >>> <tommi.t.rantala@nokia.com> wrote:
> >>>>
> >>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
> >>>>>
> >>>>> Is this meant to be fixed already? I am still seeing this on the
> >>>>> latest upstream tree.
> >>>>>
> >>>>
> >>>> These two commits are in v4.16-rc1:
> >>>>
> >>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> >>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
> >>>> Date: Mon Feb 5 21:48:14 2018 +0200
> >>>>
> >>>> sctp: fix dst refcnt leak in sctp_v4_get_dst
> >>>> ...
> >>>> Fixes: 410f03831 ("sctp: add routing output fallback")
> >>>> Fixes: 0ca50d12f ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses")
> >>>>
> >>>>
> >>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> >>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
> >>>> Date: Mon Feb 5 15:10:35 2018 +0300
> >>>>
> >>>> sctp: fix dst refcnt leak in sctp_v6_get_dst()
> >>>> ...
> >>>> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses for ipv6")
> >>>>
> >>>>
> >>>> I guess we missed something if it's still reproducible.
> >>>>
> >>>> I can check it later this week, unless someone else beat me to it.
> >>>
> >>>
> >>> Hi Tommi,
> >>>
> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> >>> another one then. But I am still seeing these:
> >>>
> >>> [ 58.799130] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [ 60.847138] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [ 62.895093] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [ 64.943103] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>>
> >>> on upstream tree pulled ~12 hours ago.
> >>>
> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing
> >> out a
> >> backtrace if the device name matches "lo". That should tell us
> >> definitively if
> >> the problem is in the same location or not
> >
> >
> > Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> > sent in the first email in this thread:
> >
> > With 4.16-rc2 unable to reproduce.
> >
> > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> > lo to become free. Usage count = 3"
> >
> > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> > cherry-picked on top, unable to reproduce.
> >
> >
> > Is syzkaller doing something else now to trigger the bug...?
> > Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
As the name implies, its supposed to wait for the reference count to be zero
indefinately, but yes, under normal operation, its intended to not have to wait
very long at all. The issuance of the NETDEV_UNREGISTER_FINAL notification is
meant to be a subscribable signal to any code path holding a reference that it
needs to be dropped so that the progress can be made.
Note that the "waiting for %s to become free" message is triggered after 10
seconds of waiting, and is likely the trigger you want, Its just an emergency
level log message rather a WARN. I don't think we want to change that
permanently, but you could certainly alter it in the code to cause syzbot to
catch it (i.e. WARN_ON(time_after(jiffies, warning_time + 10 * HZ)) )
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?
Well, it drops everything and sleeps periodically, so its safe in and of itself.
The problem is its waiting for the reference count of a device to drop to zero,
and some other execution context isn't taking the appropriate action to do that.
Neil
>
^ permalink raw reply
* Re: [net] xfrm: cover crypto status in xfrm_input
From: Steffen Klassert @ 2018-04-13 11:09 UTC (permalink / raw)
To: Jacek Kalwas; +Cc: davem, netdev, intel-wired-lan
In-Reply-To: <20180412190315.3102-3-jacek.kalwas@intel.com>
On Thu, Apr 12, 2018 at 12:03:15PM -0700, Jacek Kalwas wrote:
> Status checking in xfrm_input doesn't cover CRYPTO_GENERIC_ERROR and
> CRYPTO_INVALID_PACKET_SYNTAX.
>
> Given patch adds additional check for CRYPTO_INVALID_PACKET_SYNTAX and
> treats CRYPTO_GENERIC_ERROR as status matching LINUX_MIB_XFRMINERROR.
>
> Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com>
> ---
> net/xfrm/xfrm_input.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
> index 352abca2605f..08d70ea774f9 100644
> --- a/net/xfrm/xfrm_input.c
> +++ b/net/xfrm/xfrm_input.c
> @@ -285,7 +285,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
> goto drop;
> }
>
> - XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
> + if (xo->status & CRYPTO_INVALID_PACKET_SYNTAX) {
> + XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
> + goto drop;
> + }
Please consider adding separate statistic counters for offloading.
Reusing some other counter does not make it more usfull as it is now.
Some time ago, each statistic counter was bumped at a unique place,
so it was easy to identify where the packet was dropped. Unfortunately
this changed over the years. This was one of the concerns the userspace
IPsec developers had during the IPsec workshop we held recently. So I
think it is better to add new counters insted of reusing old ones here.
^ permalink raw reply
* Re: tcp hang when socket fills up ?
From: Dominique Martinet @ 2018-04-13 9:42 UTC (permalink / raw)
To: netdev
In-Reply-To: <20180406090720.GA31845@nautica>
Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.
Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)
reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)
Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)
> [please find previous email for setup/tcpdump output]
So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..
With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in
With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.
First, abusing next, the flow seems to be something like
(I folded if/else not taken)
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sacktag_state sack_state;
struct rate_sample rs = { .prior_delivered = 0 };
u32 prior_snd_una = tp->snd_una;
bool is_sack_reneg = tp->is_sack_reneg;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
u32 ack = TCP_SKB_CB(skb)->ack_seq;
bool is_dupack = false;
int prior_packets = tp->packets_out;
u32 delivered = tp->delivered;
u32 lost = tp->lost;
int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
losses */
u32 prior_fack;
sack_state.first_sackt = 0;
sack_state.rate = &rs;
/* We very likely will need to access rtx queue. */
prefetch(sk->tcp_rtx_queue.rb_node);
/* If the ack is older than previous acks
* then we can probably ignore it.
*/
if (before(ack, prior_snd_una)) {
}
/* If the ack includes data we haven't sent yet, discard
* this segment (RFC793 Section 3.9).
*/
if (after(ack, tp->snd_nxt))
if (after(ack, prior_snd_una)) {
}
prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
tp->snd_una;
rs.prior_in_flight = tcp_packets_in_flight(tp);
/* ts_recent update must be made after we are sure that the
packet
* is in window.
*/
if (flag & FLAG_UPDATE_TS_RECENT)
if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
} else {
u32 ack_ev_flags = CA_ACK_SLOWPATH;
if (ack_seq != TCP_SKB_CB(skb)->end_seq)
else
NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPPUREACKS);
flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);
if (TCP_SKB_CB(skb)->sacked)
if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
}
if (flag & FLAG_WIN_UPDATE)
tcp_in_ack_event(sk, ack_ev_flags);
}
/* We passed data and got it acked, remove any soft error
* log. Something worked...
*/
sk->sk_err_soft = 0;
icsk->icsk_probes_out = 0;
tp->rcv_tstamp = tcp_jiffies32;
if (!prior_packets)
goto no_queue;
no_queue:
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK)
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
&flag,
&rexmit);
/* If this ack opens up a zero window, clear backoff. It was
* being used to time the probes, and is probably far higher
than
* it needs to be for normal retransmission.
*/
tcp_ack_probe(sk);
if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
return 1;
And here is 'info local' towards the end of the function:
icsk = 0xffff88001740b800
tp = 0xffff88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0xffff88003fc83ae0, flag = 0, mss_now = 0}
rs = {prior_mstamp = 0, prior_delivered = 0, delivered = 0, interval_us = 0, rtt_us = 0, losses = 0, acked_sacked = 0, prior_in_flight = 0, is_app_limited = false, is_retrans = false, is_ack_delayed = false}
prior_snd_una = 2896339291
is_sack_reneg = false
ack_seq = 2651638615
ack = 2896339291
is_dupack = false
prior_packets = 0
delivered = 172
lost = 0
rexmit = 0
prior_fack = 2896339291
__vpp_verify = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
So as I was seeing, we have ack = prior_fack = prior_snd_una (which is > ack_seq)
We're in the !prior_packets goto to no_queue though so we don't pass
tcp_clean_rtx_queue - so that explains why the replays don't help. I'm
ok with that. But all these traces actually don't tell me why the first
time we saw that packed didn't take the ack and decreate sk_wmem_queued
properly.
This isn't easy enough to reproduce that I feel confident grabbing the
first call to that function that fails though, so I guess I'll actually
have to read the code I'm debugging instead of trying to just plow
through it...
Hints still welcome, if anyone has an idea.
--
Dominique Martinet | Asmadeus
^ permalink raw reply
* [PATCH v2] selftests: add headers_install to lib.mk
From: Anders Roxell @ 2018-04-13 9:03 UTC (permalink / raw)
To: shuah, yamada.masahiro, michal.lkml
Cc: linux-kselftest, linux-kbuild, linux-kernel, netdev, linux-gpio,
fathi.boudra, Anders Roxell
In-Reply-To: <20180412102302.4532-1-anders.roxell@linaro.org>
If the kernel headers aren't installed we can't build all the tests.
Add a new make target rule 'khdr' in the file lib.mk to generate the
kernel headers and that gets include for every test-dir Makefile that
includes lib.mk If the testdir in turn have its own sub-dirs the
top_srcdir needs to be set to the linux-rootdir to be able to generate
the kernel headers.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Fathi Boudra <fathi.boudra@linaro.org>
---
Makefile | 14 +-------------
scripts/subarch.include | 13 +++++++++++++
tools/testing/selftests/android/Makefile | 2 +-
tools/testing/selftests/android/ion/Makefile | 1 +
tools/testing/selftests/bpf/Makefile | 5 ++---
tools/testing/selftests/futex/functional/Makefile | 1 +
tools/testing/selftests/gpio/Makefile | 4 ----
tools/testing/selftests/kvm/Makefile | 6 +-----
tools/testing/selftests/lib.mk | 10 ++++++++++
tools/testing/selftests/vm/Makefile | 3 ---
10 files changed, 30 insertions(+), 29 deletions(-)
create mode 100644 scripts/subarch.include
diff --git a/Makefile b/Makefile
index 74567b0ec2f0..0c47935d48a2 100644
--- a/Makefile
+++ b/Makefile
@@ -286,19 +286,7 @@ KERNELRELEASE = $(shell cat include/config/kernel.release 2> /dev/null)
KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION
-# SUBARCH tells the usermode build what the underlying arch is. That is set
-# first, and if a usermode build is happening, the "ARCH=um" on the command
-# line overrides the setting of ARCH below. If a native build is happening,
-# then ARCH is assigned, getting whatever value it gets normally, and
-# SUBARCH is subsequently ignored.
-
-SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
- -e s/sun4u/sparc64/ \
- -e s/arm.*/arm/ -e s/sa110/arm/ \
- -e s/s390x/s390/ -e s/parisc64/parisc/ \
- -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
- -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
- -e s/riscv.*/riscv/)
+include scripts/subarch.include
# Cross compiling and selecting different set of gcc/bin-utils
# ---------------------------------------------------------------------------
diff --git a/scripts/subarch.include b/scripts/subarch.include
new file mode 100644
index 000000000000..650682821126
--- /dev/null
+++ b/scripts/subarch.include
@@ -0,0 +1,13 @@
+# SUBARCH tells the usermode build what the underlying arch is. That is set
+# first, and if a usermode build is happening, the "ARCH=um" on the command
+# line overrides the setting of ARCH below. If a native build is happening,
+# then ARCH is assigned, getting whatever value it gets normally, and
+# SUBARCH is subsequently ignored.
+
+SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
+ -e s/sun4u/sparc64/ \
+ -e s/arm.*/arm/ -e s/sa110/arm/ \
+ -e s/s390x/s390/ -e s/parisc64/parisc/ \
+ -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
+ -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
+ -e s/riscv.*/riscv/)
diff --git a/tools/testing/selftests/android/Makefile b/tools/testing/selftests/android/Makefile
index f6304d2be90c..087390bbad68 100644
--- a/tools/testing/selftests/android/Makefile
+++ b/tools/testing/selftests/android/Makefile
@@ -6,7 +6,7 @@ TEST_PROGS := run.sh
include ../lib.mk
-all:
+all: khdr
@for DIR in $(SUBDIRS); do \
BUILD_TARGET=$(OUTPUT)/$$DIR; \
mkdir $$BUILD_TARGET -p; \
diff --git a/tools/testing/selftests/android/ion/Makefile b/tools/testing/selftests/android/ion/Makefile
index e03695287f76..14ecd9805748 100644
--- a/tools/testing/selftests/android/ion/Makefile
+++ b/tools/testing/selftests/android/ion/Makefile
@@ -11,6 +11,7 @@ $(TEST_GEN_FILES): ipcsocket.c ionutils.c
TEST_PROGS := ion_test.sh
include ../../lib.mk
+top_srcdir = ../../../../../
$(OUTPUT)/ionapp_export: ionapp_export.c ipcsocket.c ionutils.c
$(OUTPUT)/ionapp_import: ionapp_import.c ipcsocket.c ionutils.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 0a315ddabbf4..cc611a284087 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -16,9 +16,8 @@ LDLIBS += -lcap -lelf -lrt -lpthread
TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
all: $(TEST_CUSTOM_PROGS)
-$(TEST_CUSTOM_PROGS): urandom_read
-
-urandom_read: urandom_read.c
+$(TEST_CUSTOM_PROGS):| khdr
+$(TEST_CUSTOM_PROGS): urandom_read.c
$(CC) -o $(TEST_CUSTOM_PROGS) -static $<
# Order correspond to 'make run_tests' order
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index ff8feca49746..9f602fb40241 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -19,5 +19,6 @@ TEST_GEN_FILES := \
TEST_PROGS := run.sh
include ../../lib.mk
+top_srcdir = ../../../../../
$(TEST_GEN_FILES): $(HEADERS)
diff --git a/tools/testing/selftests/gpio/Makefile b/tools/testing/selftests/gpio/Makefile
index 1bbb47565c55..768b2be010db 100644
--- a/tools/testing/selftests/gpio/Makefile
+++ b/tools/testing/selftests/gpio/Makefile
@@ -25,7 +25,3 @@ $(BINARIES): ../../../gpio/gpio-utils.o ../../../../usr/include/linux/gpio.h
../../../gpio/gpio-utils.o:
make ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) -C ../../../gpio
-
-../../../../usr/include/linux/gpio.h:
- make -C ../../../.. headers_install INSTALL_HDR_PATH=$(shell pwd)/../../../../usr/
-
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index dc44de904797..ba03ce334212 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -31,9 +31,5 @@ $(LIBKVM_OBJ): $(OUTPUT)/%.o: %.c
$(OUTPUT)/libkvm.a: $(LIBKVM_OBJ)
$(AR) crs $@ $^
-$(LINUX_HDR_PATH):
- make -C $(top_srcdir) headers_install
-
-all: $(STATIC_LIBS) $(LINUX_HDR_PATH)
+all: $(STATIC_LIBS)
$(TEST_GEN_PROGS): $(STATIC_LIBS)
-$(TEST_GEN_PROGS) $(LIBKVM_OBJ): | $(LINUX_HDR_PATH)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..1b1e8135715e 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -16,8 +16,18 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
+top_srcdir ?= ../../../../
+include $(top_srcdir)/scripts/subarch.include
+ARCH ?= $(SUBARCH)
+
all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
+.PHONY: khdr
+khdr:
+ make ARCH=$(ARCH) -C $(top_srcdir) headers_install
+
+$(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES):| khdr
+
.ONESHELL:
define RUN_TESTS
@export KSFT_TAP_LEVEL=`echo 1`;
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index fdefa2295ddc..1e34a40745ef 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -29,6 +29,3 @@ $(OUTPUT)/userfaultfd: ../../../../usr/include/linux/kernel.h
$(OUTPUT)/userfaultfd: LDLIBS += -lpthread
$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
-
-../../../../usr/include/linux/kernel.h:
- make -C ../../../.. headers_install
--
2.16.3
^ permalink raw reply related
* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-13 8:57 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <CAJ3xEMj824Y-Erod3e27VO0GmntyWE8u1PQ5-Ff-iZrCn=2X9A@mail.gmail.com>
On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>>
>>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>>> <sridhar.samudrala@intel.com> wrote:
>>>>
>>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>>>>
>>>>> Hi Dave and all,
>>>>>
>>>>> During and after the BoF on SRIOV switchdev mode, we came into a
>>>>> consensus among the developers from four different HW vendors (CC
>>>>> audience) that a correct thing to do would be to disallow any new
>>>>> extensions to the legacy mode.
>>>>>
>>>>> The idea is to put focus on the new mode and not add new UAPIs and
>>>>> kernel code which was turned to be a wrong design which does not allow
>>>>> for properly offloading a kernel switching SW model to e-switch HW.
>>>>>
>>>>> We also had a good session the day after regarding alignment for the
>>>>> representation model of the uplink (physical port) and PF/s.
>>>>>
>>>>> The VF representor netdevs exist for all drivers that support the new
>>>>> mode but the representation for the uplink and PF wasn't the same for
>>>>> all. The decision was to represent the uplink and PFs vports in the
>>>>> same manner done for VFs, using rep netdevs. This alignment would
>>>>> provide a more strict and clear view of the kernel model for e-switch
>>>>> to users and upper layer control plane SW.
>>>>>
>>>> I don't see any changes in the Mellanox/other drivers to move to this new
>>>> model to enable the uplink and PF port representors, any updates?
>>>
>>> Yeah, I am worked on that but didn't get to finalize the upstreaming
>>> so far. I have resumed
>>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep
>>> for 4.18
>>>
>>>> It would be really nice to highlight the pros and cons of the old versus
>>>> the
>>>> new model.
>>>>
>>>> We are looking into adding switchdev support for our new 100Gb ice driver
>>>> and could use some feedback on the direction we should be taking.
>>>
>>> good news.
>>>
>>> The uplink rep is clear cut that needs to be a rep device representing
>>> the uplink just like vf
>>> rep represents the vport toward the vf - please just do it correct
>>> from the begining
>>>
>> Having an uplink rep will definitely help implement the slow path with
>> flat/vlan network
>> scenarios by not having to add PF to the bridge.
>>
>> But how do they help with a vxlan overlay scenario? In case of overlays, the
>> slow path has to go via vxlan -> ip stack -> pf?
>
> in overlay networks scheme, the uplink has the VTEP ip and is not connected
the uplink rep has the vtep ip
> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
> and the ip stack routes through the uplink rep
>
>>
>> What about pf-rep?
>>
^ permalink raw reply
* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-13 8:56 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <d1e4cfb4-513e-14d4-9007-25d1b02e8dcd@intel.com>
On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>
>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>>>
>>>> Hi Dave and all,
>>>>
>>>> During and after the BoF on SRIOV switchdev mode, we came into a
>>>> consensus among the developers from four different HW vendors (CC
>>>> audience) that a correct thing to do would be to disallow any new
>>>> extensions to the legacy mode.
>>>>
>>>> The idea is to put focus on the new mode and not add new UAPIs and
>>>> kernel code which was turned to be a wrong design which does not allow
>>>> for properly offloading a kernel switching SW model to e-switch HW.
>>>>
>>>> We also had a good session the day after regarding alignment for the
>>>> representation model of the uplink (physical port) and PF/s.
>>>>
>>>> The VF representor netdevs exist for all drivers that support the new
>>>> mode but the representation for the uplink and PF wasn't the same for
>>>> all. The decision was to represent the uplink and PFs vports in the
>>>> same manner done for VFs, using rep netdevs. This alignment would
>>>> provide a more strict and clear view of the kernel model for e-switch
>>>> to users and upper layer control plane SW.
>>>>
>>> I don't see any changes in the Mellanox/other drivers to move to this new
>>> model to enable the uplink and PF port representors, any updates?
>>
>> Yeah, I am worked on that but didn't get to finalize the upstreaming
>> so far. I have resumed
>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep
>> for 4.18
>>
>>> It would be really nice to highlight the pros and cons of the old versus
>>> the
>>> new model.
>>>
>>> We are looking into adding switchdev support for our new 100Gb ice driver
>>> and could use some feedback on the direction we should be taking.
>>
>> good news.
>>
>> The uplink rep is clear cut that needs to be a rep device representing
>> the uplink just like vf
>> rep represents the vport toward the vf - please just do it correct
>> from the begining
>>
> Having an uplink rep will definitely help implement the slow path with
> flat/vlan network
> scenarios by not having to add PF to the bridge.
>
> But how do they help with a vxlan overlay scenario? In case of overlays, the
> slow path has to go via vxlan -> ip stack -> pf?
in overlay networks scheme, the uplink has the VTEP ip and is not connected
to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
and the ip stack routes through the uplink rep
>
> What about pf-rep?
>
^ permalink raw reply
* Re: KMSAN: uninit-value in __netif_receive_skb_core
From: Dmitry Vyukov @ 2018-04-13 8:31 UTC (permalink / raw)
To: Toshiaki Makita
Cc: syzbot, bpoirier, David Miller, Eric Dumazet, Reshetova, Elena,
Hans Liljestrand, Kees Cook, LKML, Mike Maloney, netdev,
rami.rosen, syzkaller-bugs, Willem de Bruijn
In-Reply-To: <be9e24ac-c9b8-6972-f560-c9daea817d28@lab.ntt.co.jp>
On Fri, Apr 13, 2018 at 10:20 AM, Toshiaki Makita
<makita.toshiaki@lab.ntt.co.jp> wrote:
> On 2018/04/12 17:03, Dmitry Vyukov wrote:
>> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
>> <syzbot+b202b7208664142954fa@syzkaller.appspotmail.com> wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>>> commit
>>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +0000)
>>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>>> syzbot dashboard link:
>>> https://syzkaller.appspot.com/bug?extid=b202b7208664142954fa
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>> Raw console output:
>>> https://syzkaller.appspot.com/x/log.txt?id=5356516437655552
>>> Kernel config:
>>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>>> compiler: clang version 7.0.0 (trunk 329391)
>>
>> +Toshiaki as this seems to be related to the recent vlan tagging changes.
>
> seems not...
> "Uninit was stored to memory at:" shows uninitialized memory was stored
> before where I modified the code (skb_reorder_vlan_header).
>
> I'm not sure what this uninit memory means.
> To me it looks like the memory is initialized by user provided data.
>
> (iov in packet sock -> skb->data -> skb->protocol)
>
> The reproducer provides 4 bytes after ethernet header, so it should be
> sufficient for a vlan tag. This will set skb->len to 4 and fill the
> 4-byte contents in packet_snd(). skb_vlan_untag() is reading the user
> provided 4-byte skb->data. It is ensured that skb_vlan_untag() does not
> read beyond skb->len since it calls pskb_may_pull(). At this point I am
> failing to find what I am missing.
Eric,
You mentioned something about assumption that the
__vlan_insert_inner_tag() helper would only be called from
__netif_receive_skb_core(). Can you elaborate?
>> This also seems to be related to
>> https://groups.google.com/d/msg/syzkaller-bugs/VRH9NnUi2k0/90GYsAeRBgAJ
>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+b202b7208664142954fa@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>> ==================================================================
>>> BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:197
>>> [inline]
>>> BUG: KMSAN: uninit-value in deliver_ptype_list_skb net/core/dev.c:1908
>>> [inline]
>>> BUG: KMSAN: uninit-value in __netif_receive_skb_core+0x4630/0x4a80
>>> net/core/dev.c:4545
>>> CPU: 0 PID: 5999 Comm: syz-executor3 Not tainted 4.16.0+ #82
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> Call Trace:
>>> <IRQ>
>>> __dump_stack lib/dump_stack.c:17 [inline]
>>> dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>>> kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>>> __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>>> __read_once_size include/linux/compiler.h:197 [inline]
>>> deliver_ptype_list_skb net/core/dev.c:1908 [inline]
>>> __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545
>>> __netif_receive_skb net/core/dev.c:4627 [inline]
>>> process_backlog+0x62d/0xe20 net/core/dev.c:5307
>>> napi_poll net/core/dev.c:5705 [inline]
>>> net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>>> __do_softirq+0x56d/0x93d kernel/softirq.c:285
>>> do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
>>> </IRQ>
>>> do_softirq kernel/softirq.c:329 [inline]
>>> __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
>>> local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>>> rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>>> __dev_queue_xmit+0x2a31/0x2b60 net/core/dev.c:3584
>>> dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>>> packet_snd net/packet/af_packet.c:2944 [inline]
>>> packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>>> sock_sendmsg_nosec net/socket.c:630 [inline]
>>> sock_sendmsg net/socket.c:640 [inline]
>>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>> vfs_writev fs/read_write.c:977 [inline]
>>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> RIP: 0033:0x455259
>>> RSP: 002b:00007fb53ede8c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
>>> RAX: ffffffffffffffda RBX: 00007fb53ede96d4 RCX: 0000000000455259
>>> RDX: 0000000000000001 RSI: 00000000200010c0 RDI: 0000000000000013
>>> RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
>>> R13: 00000000000006cd R14: 00000000006fd3d8 R15: 0000000000000000
>>>
>>> Uninit was stored to memory at:
>>> kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>> kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
>>> kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
>>> __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
>>> skb_vlan_untag+0x950/0xee0 include/linux/if_vlan.h:597
>>> __netif_receive_skb_core+0x70a/0x4a80 net/core/dev.c:4460
>>> __netif_receive_skb net/core/dev.c:4627 [inline]
>>> process_backlog+0x62d/0xe20 net/core/dev.c:5307
>>> napi_poll net/core/dev.c:5705 [inline]
>>> net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>>> __do_softirq+0x56d/0x93d kernel/softirq.c:285
>>> Uninit was created at:
>>> kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>> kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>>> kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>>> kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>>> slab_post_alloc_hook mm/slab.h:445 [inline]
>>> slab_alloc_node mm/slub.c:2737 [inline]
>>> __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>>> __kmalloc_reserve net/core/skbuff.c:138 [inline]
>>> __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>>> alloc_skb include/linux/skbuff.h:984 [inline]
>>> alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>>> sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>>> packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>>> packet_snd net/packet/af_packet.c:2894 [inline]
>>> packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>>> sock_sendmsg_nosec net/socket.c:630 [inline]
>>> sock_sendmsg net/socket.c:640 [inline]
>>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>> vfs_writev fs/read_write.c:977 [inline]
>>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> ==================================================================
>>>
>>>
>>> ---
>>> This bug is generated by a dumb bot. It may contain errors.
>>> See https://goo.gl/tpsmEJ for details.
>>> Direct all questions to syzkaller@googlegroups.com.
>>>
>>> syzbot will keep track of this bug report.
>>> If you forgot to add the Reported-by tag, once the fix for this bug is
>>> merged
>>> into any tree, please reply to this email with:
>>> #syz fix: exact-commit-title
>>> To mark this as a duplicate of another syzbot report, please reply with:
>>> #syz dup: exact-subject-of-another-report
>>> If it's a one-off invalid bug report, please reply with:
>>> #syz invalid
>>> Note: if the crash happens again, it will cause creation of a new bug
>>> report.
>>> Note: all commands must start from beginning of the line in the email body.
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "syzkaller-bugs" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to syzkaller-bugs+unsubscribe@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/syzkaller-bugs/94eb2c059ce01f643c0569a228ee%40google.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>
> --
> Toshiaki Makita
>
^ permalink raw reply
* Re: KMSAN: uninit-value in netif_skb_features
From: Toshiaki Makita @ 2018-04-13 8:28 UTC (permalink / raw)
To: Dmitry Vyukov, syzbot
Cc: bpoirier, David Miller, Eric Dumazet, Reshetova, Elena, Kees Cook,
LKML, Mike Maloney, netdev, rami.rosen, syzkaller-bugs,
Willem de Bruijn
In-Reply-To: <CACT4Y+YfWyEx_amzVYkyaNzB2st6zcuDoVmJsE+=gAq2iJVLjg@mail.gmail.com>
On 2018/04/12 17:03, Dmitry Vyukov wrote:
> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
> <syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com> wrote:
>> Hello,
>>
>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>> commit
>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +0000)
>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=0bbe42c764feafa82c5a
>>
>> So far this crash happened 30 times on
>> https://github.com/google/kmsan.git/master.
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4850744041668608
>> syzkaller reproducer:
>> https://syzkaller.appspot.com/x/repro.syz?id=6289386287136768
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=4577411249209344
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>> compiler: clang version 7.0.0 (trunk 329391)
>
> +Toshiaki as this seems to be related to the recent vlan tagging changes.
Seems not.
Probably skb_vlan_tagged_multi() needs to call pskb_may_pull() before
accessing skb->data? I'll confirm it later.
> This also seems to be related to
> https://groups.google.com/d/msg/syzkaller-bugs/FNEavkB4QaM/efXl2AeRBgAJ
>
>
>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> ==================================================================
>> BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283
>> [inline]
>> BUG: KMSAN: uninit-value in skb_vlan_tagged_multi
>> include/linux/if_vlan.h:656 [inline]
>> BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672
>> [inline]
>> BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
>> BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0
>> net/core/dev.c:3009
>> CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>> __dump_stack lib/dump_stack.c:17 [inline]
>> dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>> kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>> __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>> eth_type_vlan include/linux/if_vlan.h:283 [inline]
>> skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
>> vlan_features_check include/linux/if_vlan.h:672 [inline]
>> dflt_features_check net/core/dev.c:2949 [inline]
>> netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
>> validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
>> __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
>> dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>> packet_snd net/packet/af_packet.c:2944 [inline]
>> packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>> sock_sendmsg_nosec net/socket.c:630 [inline]
>> sock_sendmsg net/socket.c:640 [inline]
>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>> vfs_writev fs/read_write.c:977 [inline]
>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x43ffa9
>> RSP: 002b:00007fff2cff3948 EFLAGS: 00000217 ORIG_RAX: 0000000000000014
>> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ffa9
>> RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
>> RBP: 00000000006cb018 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004018d0
>> R13: 0000000000401960 R14: 0000000000000000 R15: 0000000000000000
>>
>> Uninit was created at:
>> kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>> kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>> kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>> kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>> slab_post_alloc_hook mm/slab.h:445 [inline]
>> slab_alloc_node mm/slub.c:2737 [inline]
>> __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>> __kmalloc_reserve net/core/skbuff.c:138 [inline]
>> __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>> alloc_skb include/linux/skbuff.h:984 [inline]
>> alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>> sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>> packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>> packet_snd net/packet/af_packet.c:2894 [inline]
>> packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>> sock_sendmsg_nosec net/socket.c:630 [inline]
>> sock_sendmsg net/socket.c:640 [inline]
>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>> vfs_writev fs/read_write.c:977 [inline]
>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> ==================================================================
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkaller@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> If you want to test a patch for this bug, please reply with:
>> #syz test: git://repo/address.git branch
>> and provide the patch inline or as an attachment.
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "syzkaller-bugs" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to syzkaller-bugs+unsubscribe@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/syzkaller-bugs/089e082d0cb81b67d10569a2283f%40google.com.
>> For more options, visit https://groups.google.com/d/optout.
>
>
--
Toshiaki Makita
^ permalink raw reply
* Re: KMSAN: uninit-value in __netif_receive_skb_core
From: Toshiaki Makita @ 2018-04-13 8:20 UTC (permalink / raw)
To: Dmitry Vyukov, syzbot
Cc: bpoirier, David Miller, Eric Dumazet, Reshetova, Elena,
Hans Liljestrand, Kees Cook, LKML, Mike Maloney, netdev,
rami.rosen, syzkaller-bugs, Willem de Bruijn
In-Reply-To: <CACT4Y+Y+fco9rusF0ZYStBKkbyhZDOvkx50kp1-Lynu3UmgsLA@mail.gmail.com>
On 2018/04/12 17:03, Dmitry Vyukov wrote:
> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
> <syzbot+b202b7208664142954fa@syzkaller.appspotmail.com> wrote:
>> Hello,
>>
>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>> commit
>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +0000)
>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=b202b7208664142954fa
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=5356516437655552
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>> compiler: clang version 7.0.0 (trunk 329391)
>
> +Toshiaki as this seems to be related to the recent vlan tagging changes.
seems not...
"Uninit was stored to memory at:" shows uninitialized memory was stored
before where I modified the code (skb_reorder_vlan_header).
I'm not sure what this uninit memory means.
To me it looks like the memory is initialized by user provided data.
(iov in packet sock -> skb->data -> skb->protocol)
The reproducer provides 4 bytes after ethernet header, so it should be
sufficient for a vlan tag. This will set skb->len to 4 and fill the
4-byte contents in packet_snd(). skb_vlan_untag() is reading the user
provided 4-byte skb->data. It is ensured that skb_vlan_untag() does not
read beyond skb->len since it calls pskb_may_pull(). At this point I am
failing to find what I am missing.
> This also seems to be related to
> https://groups.google.com/d/msg/syzkaller-bugs/VRH9NnUi2k0/90GYsAeRBgAJ
>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+b202b7208664142954fa@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> ==================================================================
>> BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:197
>> [inline]
>> BUG: KMSAN: uninit-value in deliver_ptype_list_skb net/core/dev.c:1908
>> [inline]
>> BUG: KMSAN: uninit-value in __netif_receive_skb_core+0x4630/0x4a80
>> net/core/dev.c:4545
>> CPU: 0 PID: 5999 Comm: syz-executor3 Not tainted 4.16.0+ #82
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>> <IRQ>
>> __dump_stack lib/dump_stack.c:17 [inline]
>> dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>> kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>> __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>> __read_once_size include/linux/compiler.h:197 [inline]
>> deliver_ptype_list_skb net/core/dev.c:1908 [inline]
>> __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545
>> __netif_receive_skb net/core/dev.c:4627 [inline]
>> process_backlog+0x62d/0xe20 net/core/dev.c:5307
>> napi_poll net/core/dev.c:5705 [inline]
>> net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>> __do_softirq+0x56d/0x93d kernel/softirq.c:285
>> do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
>> </IRQ>
>> do_softirq kernel/softirq.c:329 [inline]
>> __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
>> local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>> rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>> __dev_queue_xmit+0x2a31/0x2b60 net/core/dev.c:3584
>> dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>> packet_snd net/packet/af_packet.c:2944 [inline]
>> packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>> sock_sendmsg_nosec net/socket.c:630 [inline]
>> sock_sendmsg net/socket.c:640 [inline]
>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>> vfs_writev fs/read_write.c:977 [inline]
>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x455259
>> RSP: 002b:00007fb53ede8c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
>> RAX: ffffffffffffffda RBX: 00007fb53ede96d4 RCX: 0000000000455259
>> RDX: 0000000000000001 RSI: 00000000200010c0 RDI: 0000000000000013
>> RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
>> R13: 00000000000006cd R14: 00000000006fd3d8 R15: 0000000000000000
>>
>> Uninit was stored to memory at:
>> kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>> kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
>> kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
>> __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
>> skb_vlan_untag+0x950/0xee0 include/linux/if_vlan.h:597
>> __netif_receive_skb_core+0x70a/0x4a80 net/core/dev.c:4460
>> __netif_receive_skb net/core/dev.c:4627 [inline]
>> process_backlog+0x62d/0xe20 net/core/dev.c:5307
>> napi_poll net/core/dev.c:5705 [inline]
>> net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>> __do_softirq+0x56d/0x93d kernel/softirq.c:285
>> Uninit was created at:
>> kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>> kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>> kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>> kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>> slab_post_alloc_hook mm/slab.h:445 [inline]
>> slab_alloc_node mm/slub.c:2737 [inline]
>> __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>> __kmalloc_reserve net/core/skbuff.c:138 [inline]
>> __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>> alloc_skb include/linux/skbuff.h:984 [inline]
>> alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>> sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>> packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>> packet_snd net/packet/af_packet.c:2894 [inline]
>> packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>> sock_sendmsg_nosec net/socket.c:630 [inline]
>> sock_sendmsg net/socket.c:640 [inline]
>> sock_write_iter+0x3b9/0x470 net/socket.c:909
>> do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>> do_iter_write+0x30d/0xd40 fs/read_write.c:932
>> vfs_writev fs/read_write.c:977 [inline]
>> do_writev+0x3c9/0x830 fs/read_write.c:1012
>> SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>> SyS_writev+0x56/0x80 fs/read_write.c:1082
>> do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> ==================================================================
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkaller@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "syzkaller-bugs" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to syzkaller-bugs+unsubscribe@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/syzkaller-bugs/94eb2c059ce01f643c0569a228ee%40google.com.
>> For more options, visit https://groups.google.com/d/optout.
>
>
--
Toshiaki Makita
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-13 7:15 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <ebfbfc3e-9e52-9331-61b6-258e131f822d@redhat.com>
On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> >
> > This RFC implements packed ring support for virtio driver.
> >
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> >
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> >
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> >
> > Thanks!
> >
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > include/linux/virtio_ring.h | 8 +-
> > include/uapi/linux/virtio_config.h | 12 +-
> > include/uapi/linux/virtio_ring.h | 61 ++
> > 4 files changed, 980 insertions(+), 195 deletions(-)
[...]
> > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> > + unsigned int total_sg,
> > + gfp_t gfp)
> > +{
> > + struct vring_packed_desc *desc;
> > +
> > + /*
> > + * We require lowmem mappings for the descriptors because
> > + * otherwise virt_to_phys will give us bogus addresses in the
> > + * virtqueue.
> > + */
> > + gfp &= ~__GFP_HIGHMEM;
> > +
> > + desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
>
> Can we simply check vq->packed here to avoid duplicating helpers?
Then it would be something like this:
static void *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
void *data;
/*
* We require lowmem mappings for the descriptors because
* otherwise virt_to_phys will give us bogus addresses in the
* virtqueue.
*/
gfp &= ~__GFP_HIGHMEM;
if (vq->packed) {
data = kmalloc(total_sg * sizeof(struct vring_packed_desc),
gfp);
if (!data)
return NULL;
} else {
struct vring_desc *desc;
unsigned int i;
desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;
for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
data = desc;
}
return data;
}
I would prefer to have two simpler helpers (and to the callers,
it's already very clear about which one they should call), i.e.
the current implementation:
static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
unsigned int total_sg,
gfp_t gfp)
{
struct vring_packed_desc *desc;
/*
* We require lowmem mappings for the descriptors because
* otherwise virt_to_phys will give us bogus addresses in the
* virtqueue.
*/
gfp &= ~__GFP_HIGHMEM;
desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
return desc;
}
static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
unsigned int total_sg,
gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;
/*
* We require lowmem mappings for the descriptors because
* otherwise virt_to_phys will give us bogus addresses in the
* virtqueue.
*/
gfp &= ~__GFP_HIGHMEM;
desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;
for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
return desc;
}
>
> > +
> > + return desc;
> > +}
[...]
> > +static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > + struct scatterlist *sgs[],
> > + unsigned int total_sg,
> > + unsigned int out_sgs,
> > + unsigned int in_sgs,
> > + void *data,
> > + void *ctx,
> > + gfp_t gfp)
> > +{
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > + struct vring_packed_desc *desc;
> > + struct scatterlist *sg;
> > + unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
> > + __virtio16 uninitialized_var(head_flags), flags;
> > + int head, wrap_counter;
> > + bool indirect;
> > +
> > + START_USE(vq);
> > +
> > + BUG_ON(data == NULL);
> > + BUG_ON(ctx && vq->indirect);
> > +
> > + if (unlikely(vq->broken)) {
> > + END_USE(vq);
> > + return -EIO;
> > + }
> > +
> > +#ifdef DEBUG
> > + {
> > + ktime_t now = ktime_get();
> > +
> > + /* No kick or get, with .1 second between? Warn. */
> > + if (vq->last_add_time_valid)
> > + WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
> > + > 100);
> > + vq->last_add_time = now;
> > + vq->last_add_time_valid = true;
> > + }
> > +#endif
> > +
> > + BUG_ON(total_sg == 0);
> > +
> > + head = vq->next_avail_idx;
> > + wrap_counter = vq->wrap_counter;
> > +
> > + /* If the host supports indirect descriptor tables, and we have multiple
> > + * buffers, then go indirect. FIXME: tune this threshold */
> > + if (vq->indirect && total_sg > 1 && vq->vq.num_free)
>
> Let's introduce a helper like virtqueue_need_indirect() to avoid duplicating
> codes and FIXME.
Okay.
>
> > + desc = alloc_indirect_packed(_vq, total_sg, gfp);
> > + else {
> > + desc = NULL;
> > + WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
> > + }
> > +
> > + if (desc) {
> > + /* Use a single buffer which doesn't continue */
> > + indirect = true;
> > + /* Set up rest to use this indirect table. */
> > + i = 0;
> > + descs_used = 1;
> > + } else {
> > + indirect = false;
> > + desc = vq->vring_packed.desc;
> > + i = head;
> > + descs_used = total_sg;
> > + }
> > +
> > + if (vq->vq.num_free < descs_used) {
> > + pr_debug("Can't add buf len %i - avail = %i\n",
> > + descs_used, vq->vq.num_free);
> > + /* FIXME: for historical reasons, we force a notify here if
> > + * there are outgoing parts to the buffer. Presumably the
> > + * host should service the ring ASAP. */
> > + if (out_sgs)
> > + vq->notify(&vq->vq);
> > + if (indirect)
> > + kfree(desc);
> > + END_USE(vq);
> > + return -ENOSPC;
> > + }
> > +
> > + for (n = 0; n < out_sgs + in_sgs; n++) {
> > + for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > + dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> > + DMA_TO_DEVICE : DMA_FROM_DEVICE);
> > + if (vring_mapping_error(vq, addr))
> > + goto unmap_release;
> > +
> > + flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
> > + (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
> > + VRING_DESC_F_AVAIL(vq->wrap_counter) |
> > + VRING_DESC_F_USED(!vq->wrap_counter));
> > + if (!indirect && i == head)
> > + head_flags = flags;
> > + else
> > + desc[i].flags = flags;
> > +
> > + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > + desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
> > + desc[i].id = cpu_to_virtio32(_vq->vdev, head);
>
> Similar to V1, we only need this for the last descriptor.
Okay, will just set it for the last desc.
>
> > + prev = i;
>
> It looks to me there's no need to track prev inside the loop here.
>
> > + i++;
> > + if (!indirect && i >= vq->vring_packed.num) {
> > + i = 0;
> > + vq->wrap_counter ^= 1;
> > + }
> > + }
> > + }
> > + /* Last one doesn't continue. */
> > + if (total_sg == 1)
> > + head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > + else
> > + desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
>
> The only case when prev != i - 1 is i == 0, we can add a if here.
It's just a mirror of the existing implementation in split ring.
It seems that split ring implementation needs this just because
it's much harder for it to find the prev, which is not true for
packed ring. So I'll take your suggestion. Thanks!
>
[...]
> > +static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > +{
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > + u16 new, old, off_wrap;
> > + bool needs_kick;
> > +
> > + START_USE(vq);
> > + /* We need to expose the new flags value before checking notification
> > + * suppressions. */
> > + virtio_mb(vq->weak_barriers);
> > +
> > + old = vq->next_avail_idx - vq->num_added;
> > + new = vq->next_avail_idx;
> > + vq->num_added = 0;
> > +
> > +#ifdef DEBUG
> > + if (vq->last_add_time_valid) {
> > + WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > + vq->last_add_time)) > 100);
> > + }
> > + vq->last_add_time_valid = false;
> > +#endif
> > +
> > + off_wrap = virtio16_to_cpu(_vq->vdev, vq->vring_packed.device->off_wrap);
> > +
> > + if (vq->event) {
>
> It looks to me we should examine RING_EVENT_FLAGS_DESC in desc_event_flags
> instead of vq->event here. Spec does not forces to use evenf_off and
> event_wrap if event index is enabled.
>
> > + // FIXME: fix this!
> > + needs_kick = ((off_wrap >> 15) == vq->wrap_counter) &&
> > + vring_need_event(off_wrap & ~(1<<15), new, old);
>
> Why need a & here?
Because wrap_counter (the most significant bit in off_wrap)
isn't part of the index.
>
> > + } else {
>
> Need a smp_rmb() to make sure desc_event_flags was checked before flags.
I don't get your point, if my understanding is correct,
desc_event_flags is vq->vring_packed.device->flags. So
what's the "flags"?
>
> > + needs_kick = (vq->vring_packed.device->flags !=
> > + cpu_to_virtio16(_vq->vdev, VRING_EVENT_F_DISABLE));
> > + }
> > + END_USE(vq);
> > + return needs_kick;
> > +}
[...]
> > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > + void **ctx)
> > +{
> > + struct vring_packed_desc *desc;
> > + unsigned int i, j;
> > +
> > + /* Clear data ptr. */
> > + vq->desc_state[head].data = NULL;
> > +
> > + i = head;
> > +
> > + for (j = 0; j < vq->desc_state[head].num; j++) {
> > + desc = &vq->vring_packed.desc[i];
> > + vring_unmap_one_packed(vq, desc);
> > + desc->flags = 0x0;
>
> Looks like this is unnecessary.
It's safer to zero it. If we don't zero it, after we
call virtqueue_detach_unused_buf_packed() which calls
this function, the desc is still available to the
device.
>
> > + i++;
> > + if (i >= vq->vring_packed.num)
> > + i = 0;
> > + }
[...]
> > +static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > +{
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > + u16 last_used_idx, wrap_counter, off_wrap;
> > +
> > + START_USE(vq);
> > +
> > + last_used_idx = vq->last_used_idx;
> > + wrap_counter = vq->wrap_counter;
> > +
> > + if (last_used_idx > vq->next_avail_idx)
> > + wrap_counter ^= 1;
> > +
> > + off_wrap = last_used_idx | (wrap_counter << 15);
> > +
> > + /* We optimistically turn back on interrupts, then check if there was
> > + * more to do. */
> > + /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> > + * either clear the flags bit or point the event index at the next
> > + * entry. Always do both to keep code simple. */
> > + if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC:
> > + VRING_EVENT_F_ENABLE;
> > + vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > + vq->event_flags_shadow);
> > + }
>
> A smp_wmb() is missed here?
>
> > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev, off_wrap);
>
> And according to the spec, it looks to me write a VRING_EVENT_F_ENABLE is
> sufficient here.
I didn't think much when implementing the event suppression
for packed ring previously. After I saw your comments, I found
something new. Indeed, unlike the split ring, for the packed
ring, spec doesn't say we must use VRING_EVENT_F_DESC when
EVENT_IDX is negotiated. So do you think below thought is
right or makes sense?
- For virtqueue_enable_cb_prepare(), we just need to enable
the ring by setting flags to VRING_EVENT_F_ENABLE in any
case.
- We will try to use VRING_EVENT_F_DESC (if EVENT_IDX is
negotiated) only when we want to delay the interrupts
virtqueue_enable_cb_delayed().
>
> > + END_USE(vq);
> > + return last_used_idx;
> > +}
> > +
[...]
> > @@ -1157,14 +1852,18 @@ void vring_transport_features(struct virtio_device *vdev)
> > for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
> > switch (i) {
> > - case VIRTIO_RING_F_INDIRECT_DESC:
> > +#if 0
> > + case VIRTIO_RING_F_INDIRECT_DESC: // FIXME not tested yet.
> > break;
> > - case VIRTIO_RING_F_EVENT_IDX:
> > + case VIRTIO_RING_F_EVENT_IDX: // FIXME probably not work.
> > break;
> > +#endif
>
> It would be better if you can split EVENT_IDX and INDIRECT_DESC into
> separate patches too.
Sure. Will do it in the next version.
Thanks for the review!
>
> Thanks
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox