Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 10/18] rhashtable: remove rhashtable_walk_peek()
From: Tom Herbert @ 2018-06-04 21:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Herbert Xu, Thomas Graf, Linux Kernel Network Developers, LKML,
	Tom Herbert
In-Reply-To: <87sh63pakb.fsf@notabene.neil.brown.name>

On Sun, Jun 3, 2018 at 7:09 PM, NeilBrown <neilb@suse.com> wrote:
> On Sun, Jun 03 2018, Tom Herbert wrote:
>
>> On Sun, Jun 3, 2018 at 5:30 PM, NeilBrown <neilb@suse.com> wrote:
>>> On Sat, Jun 02 2018, Herbert Xu wrote:
>>>
>>>> On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
>>>>> This function has a somewhat confused behavior that is not properly
>>>>> described by the documentation.
>>>>> Sometimes is returns the previous object, sometimes it returns the
>>>>> next one.
>>>>> Sometimes it changes the iterator, sometimes it doesn't.
>>>>>
>>>>> This function is not currently used and is not worth keeping, so
>>>>> remove it.
>>>>>
>>>>> A future patch will introduce a new function with a
>>>>> simpler interface which can meet the same need that
>>>>> this was added for.
>>>>>
>>>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>>>
>>>> Please keep Tom Herbert in the loop.  IIRC he had an issue with
>>>> this patch.
>>>
>>> Yes you are right - sorry for forgetting to add Tom.
>>>
>>> My understanding of where this issue stands is that Tom raised issue and
>>> asked for clarification, I replied, nothing further happened.
>>>
>>> It summary, my position is that:
>>> - most users of my new rhashtable_walk_prev() will use it like
>>>    rhasthable_talk_prev() ?: rhashtable_walk_next()
>>>   which is close to what rhashtable_walk_peek() does
>>> - I know of no use-case that could not be solved if we only had
>>>   the combined operation
>>> - BUT it is hard to document the combined operation, as it really
>>>   does two things.  If it is hard to document, then it might be
>>>   hard to understand.
>>>
>>> So provide the most understandable/maintainable solution, I think
>>> we should provide rhashtable_walk_prev() as a separate interface.
>>>
>> I'm still missing why requiring two API operations instead of one is
>> simpler or easier to document. Also, I disagree that
>> rhashtable_walk_peek does two things-- it just does one which is to
>> return the current element in the walk without advancing to the next
>> one. The fact that the iterator may or may not move is immaterial in
>> the API, that is an implementation detail. In fact, it's conceivable
>> that we might completely reimplement this someday such that the
>> iterator works completely differently implementation semantics but the
>> API doesn't change. Also the naming in your proposal is confusing,
>> we'd have operations to get the previous, and the next next object--
>> so the user may ask where's the API to get the current object in the
>> walk? The idea that we get it by first trying to get the previous
>> object, and then if that fails getting the next object seems
>> counterintuitive.
>
> To respond to your points out of order:
>
> - I accept that "rhashtable_walk_prev" is not a perfect name.  It
>   suggests a stronger symmetry with rhasthable_walk_next than actually
>   exist.  I cannot think of a better name, but I think the
>   description "Return the previously returned object if it is
>   still in the table" is clear and simple and explains the name.
>   I'm certainly open to suggestions for a better name.
>
> - I don't think it is meaningful to talk about a "current" element in a
>   table where asynchronous insert/remove is to be expected.
>   The best we can hope for is a "current location" is the sequence of
>   objects in the table - a location which is after some objects and
>   before all others.  rhashtable_walk_next() returns the next object
>   after the current location, and advances the location pointer past
>   that object.
>   rhashtable_walk_prev() *doesn't* return the previous object in the
>   table.  It returns the previously returned object. ("previous" in
>   time, but not in space, if you like).
>
> - rhashtable_walk_peek() currently does one of two different things.
>   It either returns the previously returned object (iter->p) if that
>   is still in the table, or it find the next object, steps over it, and
>   returns it.
>
> - I would like to suggest that when an API acts on a iterator object,
>   the question of whether or not the iterator is advanced *must* be a
>   fundamental question, not one that might change from time to time.
>
> Maybe a useful way forward would be for you to write documentation for
> the rhashtable_walk_peek() interface which correctly describes what it
> does and how it is used.  Given that, I can implement that interface
> with the stability improvements that I'm working on.
>

Here's how it's documented currently:

"rhashtable_walk_peek - Return the next object but don't advance the iterator"

I don't see what is incorrect about that. Peek returns the next object
in the walk, however does not move the iterator past that object, so
sucessive calls to peek return the same object. In other words it's a
way to inspect the next object but not "consume" it. This is what is
needed when netlink returns in the middle of a walk. The last object
retrieved from the table may not have been processed completely, so it
needs to be the first one processed on the next invocation to netlink.

This is also easily distinguishable from

"rhashtable_walk_next - Return the next object and advance the iterator"

Where the only difference is that peek and walk is that, walk advances
the iterator and peek does not. Hence why "peek" is a descriptive name
for what is happening.

Tom

> Thanks,
> NeilBrown

^ permalink raw reply

* Re: [PATCH 1/2 v2 net-next] net_failover: Use netdev_features_t instead of u32
From: David Miller @ 2018-06-04 21:30 UTC (permalink / raw)
  To: dan.carpenter; +Cc: sridhar.samudrala, netdev, kernel-janitors
In-Reply-To: <20180604144321.kjemxdfhi4qqb5ji@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Mon, 4 Jun 2018 17:43:21 +0300

> The features mask needs to be a netdev_features_t (u64) because a u32
> is not big enough.
> 
> Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> v2: In the original patch, I thought that the & should be | and I
>     introduced a bug.

Applied.

^ permalink raw reply

* Re: [PATCH net-next] qed: use dma_zalloc_coherent instead of allocator/memset
From: David Miller @ 2018-06-04 21:28 UTC (permalink / raw)
  To: yuehaibing; +Cc: Ariel.Elior, netdev, linux-kernel, everest-linux-l2
In-Reply-To: <20180604131031.24476-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Mon, 4 Jun 2018 21:10:31 +0800

> Use dma_zalloc_coherent instead of dma_alloc_coherent
> followed by memset 0.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] wan/fsl_ucc_hdlc: use dma_zalloc_coherent instead of allocator/memset
From: David Miller @ 2018-06-04 21:28 UTC (permalink / raw)
  To: yuehaibing; +Cc: qiang.zhao, netdev, linux-kernel, linuxppc-dev
In-Reply-To: <20180604130759.25024-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Mon, 4 Jun 2018 21:07:59 +0800

> Use dma_zalloc_coherent instead of dma_alloc_coherent
> followed by memset 0.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: David Miller @ 2018-06-04 21:27 UTC (permalink / raw)
  To: gerlitz.or; +Cc: jeffrey.t.kirsher, netdev
In-Reply-To: <CAJ3xEMg1cSodYJwXLyoUkdwDJicKoeG67Y_vnfFfPQU====vZA@mail.gmail.com>

From: Or Gerlitz <gerlitz.or@gmail.com>
Date: Tue, 5 Jun 2018 00:11:35 +0300

> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
> merged for this window -- AFAIU from [1], it's still under
> examination/development/research for non Intel HWs, am I correct or
> this is going to get in now?

All of the pending AF_XDP changes will be merged this merge window.

I think Intel folks need to review things as fast as possible because
I pretty much refuse to revert the series or disable it in Kconfig at
this point.

Thank you.

^ permalink raw reply

* Re: [PATCH net] ipmr: fix error path when mr_table_alloc fails
From: David Miller @ 2018-06-04 21:25 UTC (permalink / raw)
  To: sd; +Cc: netdev, edumazet, nikolay, yuvalm, ivecera
In-Reply-To: <ffc3366697c8d8789cbe314b6944b910eceb38e7.1528112758.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Mon,  4 Jun 2018 13:55:54 +0200

> commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> refactored ipmr_new_table, so that it now returns NULL when
> mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
> expect an ERR_PTR. commit 66fb33254f45 ("ipmr: properly check
> rhltable_init() return value") followed suit.
> 
> This can result in NULL deref, when ipmr_rules_exit calls
> ipmr_free_table with NULL net->ipv4.mrt in the
> !CONFIG_IP_MROUTE_MULTIPLE_TABLES version.
> 
> This patch makes mr_table_alloc return errors, and changes
> ip6mr_new_table and its callers to return/expect error pointers as
> well. It also removes the version of mr_table_alloc defined under
> !CONFIG_IP_MROUTE_COMMON, since it is never used.
> 
> Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> Fixes: 66fb33254f45 ("ipmr: properly check rhltable_init() return value")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

This adds a new warning with gcc-8.1.1 on Fedora 28

  CC [M]  net/ipv6/ip6mr.o
In file included from ./arch/x86/include/asm/current.h:5,
                 from ./include/linux/sched.h:12,
                 from ./include/linux/uaccess.h:5,
                 from net/ipv6/ip6mr.c:19:
net/ipv6/ip6mr.c: In function ‘ip6_mroute_setsockopt’:
./include/linux/compiler.h:177:26: warning: ‘mrt’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
                          ^
net/ipv6/ip6mr.c:1752:20: note: ‘mrt’ was declared here
   struct mr_table *mrt;
                    ^~~

^ permalink raw reply

* Re: pull request: bluetooth-next 2018-06-04
From: David Miller @ 2018-06-04 21:22 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20180604101532.GA8919@x1c>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Mon, 4 Jun 2018 12:15:32 +0200

> Here's one last bluetooth-next pull request for the 4.18 kernel:
> 
>  - New USB device IDs for Realtek 8822BE and 8723DE
>  - reset/resume fix for Dell Inspiron 5565
>  - Fix HCI_UART_INIT_PENDING flag behavior
>  - Fix patching behavior for some ATH3012 models
>  - A few other minor cleanups & fixes
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.

^ permalink raw reply

* Re: [PATCH] docs: networking: fix minor typos in various documentation files
From: David Miller @ 2018-06-04 21:21 UTC (permalink / raw)
  To: olivier.gayot; +Cc: netdev
In-Reply-To: <20180604100737.4228-1-olivier.gayot@sigexec.com>

From: Olivier Gayot <olivier.gayot@sigexec.com>
Date: Mon,  4 Jun 2018 12:07:37 +0200

> This patch fixes some typos/misspelling errors in the
> Documentation/networking files.
> 
> Signed-off-by: Olivier Gayot <olivier.gayot@sigexec.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] net: qualcomm: rmnet: Fix use after free while sending command ack
From: David Miller @ 2018-06-04 21:16 UTC (permalink / raw)
  To: subashab; +Cc: netdev
In-Reply-To: <1528064275-3205-1-git-send-email-subashab@codeaurora.org>

From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Sun,  3 Jun 2018 16:17:55 -0600

> When sending an ack to a command packet, the skb is still referenced
> after it is sent to the real device. Since the real device could
> free the skb, the device pointer would be invalid.
> 
> Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

This doesn't apply cleanly to the current net-next tree, please respin.

^ permalink raw reply

* Re: [PATCH iproute2 1/2] ip: display netns name instead of nsid
From: Stephen Hemminger @ 2018-06-04 21:12 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <20180604121253.2140-2-nicolas.dichtel@6wind.com>

On Mon,  4 Jun 2018 14:12:52 +0200
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index c7c7e7df4e81..aee09c7ff6df 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -819,6 +819,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
>  	unsigned int m_flag = 0;
>  	SPRINT_BUF(b1);
>  
> +	netns_nsid_socket_init();
> +	netns_map_init();
> +

The idea of printing network namespace is good but I am concerned that
setting up yet another netlink socket and scanning the netns directory on
each ip link command will impact some users.

Can this setup be deferred until the first net ns lookup happens?

^ permalink raw reply

* Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices
From: David Miller @ 2018-06-04 21:15 UTC (permalink / raw)
  To: subashab; +Cc: netdev
In-Reply-To: <1528062874-19250-1-git-send-email-subashab@codeaurora.org>

From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Sun,  3 Jun 2018 15:54:34 -0600

> RAWIP devices such as rmnet do not have a hardware address and
> instead require the kernel to generate a random IID for the
> temporary addresses. For permanent addresses, the device IID is
> used along with prefix received.
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

Please address yoshfuji's feedback, thank you.

^ permalink raw reply

* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: David Miller @ 2018-06-04 21:14 UTC (permalink / raw)
  To: zenczykowski; +Cc: maze, edumazet, netdev
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>

From: "Maciej Żenczykowski" <zenczykowski@gmail.com>
Date: Sun,  3 Jun 2018 10:47:05 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH] net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization
From: David Miller @ 2018-06-04 21:13 UTC (permalink / raw)
  To: zenczykowski; +Cc: maze, edumazet, netdev, ncardwell, ycheng, weiwan
In-Reply-To: <20180603174117.48539-1-zenczykowski@gmail.com>

From: "Maciej Żenczykowski" <zenczykowski@gmail.com>
Date: Sun,  3 Jun 2018 10:41:17 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
> to an integer.
> 
> It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
> while 2 enables timewait socket reuse only for sockets that we can
> prove are loopback connections:
>   ie. bound to 'lo' interface or where one of source or destination
>   IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.
> 
> This enables quicker reuse of ephemeral ports for loopback connections
> - where tcp_tw_reuse is 100% safe from a protocol perspective
> (this assumes no artificially induced packet loss on 'lo').
> 
> This also makes estblishing many loopback connections *much* faster
> (allocating ports out of the first half of the ephemeral port range
> is significantly faster, then allocating from the second half)
> 
> Without this change in a 32K ephemeral port space my sample program
> (it just establishes and closes [::1]:ephemeral -> [::1]:server_port
> connections in a tight loop) fails after 32765 connections in 24 seconds.
> With it enabled 50000 connections only take 4.7 seconds.
> 
> This is particularly problematic for IPv6 where we only have one local
> address and cannot play tricks with varying source IP from 127.0.0.0/8
> pool.
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next v2] qed: Add srq core support for RoCE and iWARP
From: David Miller @ 2018-06-04 21:12 UTC (permalink / raw)
  To: yuval.bason
  Cc: netdev, jgg, dledford, linux-rdma, michal.kalderon, ariel.elior
In-Reply-To: <20180603161307.29832-1-yuval.bason@cavium.com>

From: Yuval Bason <yuval.bason@cavium.com>
Date: Sun, 3 Jun 2018 19:13:07 +0300

> This patch adds support for configuring SRQ and provides the necessary
> APIs for rdma upper layer driver (qedr) to enable the SRQ feature.
> 
> Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com>
> Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
> Signed-off-by: Yuval Bason <yuval.bason@cavium.com>
> ---
> Changes from v1:
> 	- sparse warnings
> 	- replace memset with ={}

Applied, thanks.

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: Or Gerlitz @ 2018-06-04 21:11 UTC (permalink / raw)
  To: David Miller; +Cc: Jeff Kirsher, Linux Netdev List
In-Reply-To: <20180604.163017.1841938428912069611.davem@davemloft.net>

On Mon, Jun 4, 2018 at 11:30 PM, David Miller <davem@davemloft.net> wrote:
> It's open a day or two more to deal with the AF_XDP issues...

Dave,

Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be merged for
this window -- AFAIU from [1], it's still under
examination/development/research for
non Intel HWs, am I correct or this is going to get in now?

Or

[1] https://marc.info/?l=linux-netdev&m=152810546108060&w=2

^ permalink raw reply

* Re: [PATCH 0/2] net: bnx2: Fix checkpatch and clang warnings
From: David Miller @ 2018-06-04 21:07 UTC (permalink / raw)
  To: rvarsha016
  Cc: rasesh.mody, harish.patil, Dept-GELinuxNICDev, netdev,
	linux-kernel, der.herr, lukas.bulwahn
In-Reply-To: <cover.1528025568.git.rvarsha016@gmail.com>

From: Varsha Rao <rvarsha016@gmail.com>
Date: Sun,  3 Jun 2018 17:18:04 +0530

> This patchset fixes NULL comparison and extra parentheses, checkpatch
> and clang warnings.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: gemini: fix spelling mistake: "it" -> "is"
From: David Miller @ 2018-06-04 21:05 UTC (permalink / raw)
  To: yuehaibing; +Cc: linus.walleij, netdev, linux-kernel
In-Reply-To: <20180603081001.23396-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Sun, 3 Jun 2018 16:10:01 +0800

> Trivial fix to spelling mistake in gemini dev_warn message
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: chelsio: Use zeroing memory allocator instead of allocator/memset
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: yuehaibing; +Cc: santosh, netdev, linux-kernel, ganeshgr, leedom
In-Reply-To: <20180603024015.12744-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Sun, 3 Jun 2018 10:40:15 +0800

> Use dma_zalloc_coherent for allocating zeroed
> memory and remove unnecessary memset function.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next V2 2/2] cls_flower: Fix comparing of old filter mask with new filter
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: paulb
  Cc: jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid, shahark, markb,
	ogerlitz
In-Reply-To: <1528009574-63306-2-git-send-email-paulb@mellanox.com>

From: Paul Blakey <paulb@mellanox.com>
Date: Sun,  3 Jun 2018 10:06:14 +0300

> We incorrectly compare the mask and the result is that we can't modify
> an already existing rule.
> 
> Fix that by comparing correctly.
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Reported-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next V2 1/2] cls_flower: Fix missing free of rhashtable
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: paulb
  Cc: jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid, shahark, markb,
	ogerlitz
In-Reply-To: <1528009574-63306-1-git-send-email-paulb@mellanox.com>

From: Paul Blakey <paulb@mellanox.com>
Date: Sun,  3 Jun 2018 10:06:13 +0300

> When destroying the instance, destroy the head rhashtable.
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Reported-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: skbuff.h: drop unneeded <linux/slab.h>
From: David Miller @ 2018-06-04 21:02 UTC (permalink / raw)
  To: rdunlap; +Cc: netdev, linux-kernel, akpm
In-Reply-To: <86ee9a53-3136-7a2c-e59a-57d4d078188f@infradead.org>

From: Randy Dunlap <rdunlap@infradead.org>
Date: Sat, 2 Jun 2018 21:40:19 -0700

> From: Randy Dunlap <rdunlap@infradead.org>
> 
> <linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this
> header file from skbuff.h.
> 
> <linux/skbuff.h> is currently #included in around 1200 C source and
> header files, making it the 31st most-used header file.
> 
> Build tested [allmodconfig] on 20 arch-es.
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>

Applied, thanks Randy.

^ permalink raw reply

* Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
From: Alexander Duyck @ 2018-06-04 20:53 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, mykyta.iziumtsev, John Fastabend, Willem de Bruijn,
	Michael S. Tsirkin, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, qi.z.zhang, francois.ozog, ilias.apalodimas,
	brian.brooks, Andy Gospodarek, Michael Chan
In-Reply-To: <20180604120601.18123-11-bjorn.topel@gmail.com>

On Mon, Jun 4, 2018 at 5:06 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This will result in other devices
> doing XDP_REDIRECT to an AF_XDP enabled queue will have its packets
> dropped.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++++++++++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
>  include/net/xdp_sock.h                      |  14 +++
>  6 files changed, 242 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 8c602424d339..98c18c41809d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> -       if (ring_is_xdp(ring))
> +       ring->clean_tx_irq = i40e_clean_tx_irq;
> +       if (ring_is_xdp(ring)) {
>                 ring->xsk_umem = i40e_xsk_umem(ring);
> +               if (ring->xsk_umem)
> +                       ring->clean_tx_irq = i40e_clean_tx_irq_zc;

Again, I am worried what the performance penalty on this will be given
the retpoline penalty for function pointers.

> +       }
>
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> @@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_bpf                = i40e_xdp,
>         .ndo_xdp_xmit           = i40e_xdp_xmit,
>         .ndo_xdp_flush          = i40e_xdp_flush,
> +       .ndo_xsk_async_xmit     = i40e_xsk_async_xmit,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 6b1142fbc697..923bb84a93ab 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -10,16 +10,6 @@
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
>
> -static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> -                               u32 td_tag)
> -{
> -       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> -                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> -                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> -                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> -                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> -}
> -
>  #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
>  /**
>   * i40e_fdir - Generate a Flow Director descriptor based on fdata
> @@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
>         if (!tx_ring->tx_bi)
>                 return;
>
> -       /* Free all the Tx ring sk_buffs */
> -       for (i = 0; i < tx_ring->count; i++)
> -               i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
> +       /* Cleanup only needed for non XSK TX ZC rings */
> +       if (!tx_ring->xsk_umem) {
> +               /* Free all the Tx ring sk_buffs */
> +               for (i = 0; i < tx_ring->count; i++)
> +                       i40e_unmap_and_free_tx_resource(tx_ring,
> +                                                       &tx_ring->tx_bi[i]);
> +       }
>
>         bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
>         memset(tx_ring->tx_bi, 0, bi_size);
> @@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>         }
>  }
>
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes)
> +{
> +       u64_stats_update_begin(&tx_ring->syncp);
> +       tx_ring->stats.bytes += total_bytes;
> +       tx_ring->stats.packets += total_packets;
> +       u64_stats_update_end(&tx_ring->syncp);
> +       tx_ring->q_vector->tx.total_bytes += total_bytes;
> +       tx_ring->q_vector->tx.total_packets += total_packets;
> +}
> +
>  #define WB_STRIDE 4
>
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget)
> +{
> +       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> +               /* check to see if there are < 4 descriptors
> +                * waiting to be written back, then kick the hardware to force
> +                * them to be written back in case we stay in NAPI.
> +                * In this mode on X722 we do not enable Interrupt.
> +                */
> +               unsigned int j = i40e_get_tx_pending(tx_ring, false);
> +
> +               if (budget &&
> +                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> +                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> +                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> +                       tx_ring->arm_wb = true;
> +       }
> +}
> +
>  /**
>   * i40e_clean_tx_irq - Reclaim resources after transmit completes
>   * @vsi: the VSI we care about
> @@ -778,8 +804,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>   *
>   * Returns true if there's any budget left (e.g. the clean is finished)
>   **/
> -static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> -                             struct i40e_ring *tx_ring, int napi_budget)
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget)
>  {
>         u16 i = tx_ring->next_to_clean;
>         struct i40e_tx_buffer *tx_buf;
> @@ -874,27 +900,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>
>         i += tx_ring->count;
>         tx_ring->next_to_clean = i;
> -       u64_stats_update_begin(&tx_ring->syncp);
> -       tx_ring->stats.bytes += total_bytes;
> -       tx_ring->stats.packets += total_packets;
> -       u64_stats_update_end(&tx_ring->syncp);
> -       tx_ring->q_vector->tx.total_bytes += total_bytes;
> -       tx_ring->q_vector->tx.total_packets += total_packets;
> -
> -       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> -               /* check to see if there are < 4 descriptors
> -                * waiting to be written back, then kick the hardware to force
> -                * them to be written back in case we stay in NAPI.
> -                * In this mode on X722 we do not enable Interrupt.
> -                */
> -               unsigned int j = i40e_get_tx_pending(tx_ring, false);
>
> -               if (budget &&
> -                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> -                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> -                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> -                       tx_ring->arm_wb = true;
> -       }
> +       i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
> +       i40e_arm_wb(tx_ring, vsi, budget);
>
>         if (ring_is_xdp(tx_ring))
>                 return !!budget;
> @@ -2467,10 +2475,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>          * budget and be more aggressive about cleaning up the Tx descriptors.
>          */
>         i40e_for_each_ring(ring, q_vector->tx) {
> -               if (!i40e_clean_tx_irq(vsi, ring, budget)) {
> +               if (!ring->clean_tx_irq(vsi, ring, budget)) {
>                         clean_complete = false;
>                         continue;
>                 }
> +
>                 arm_wb |= ring->arm_wb;
>                 ring->arm_wb = false;
>         }
> @@ -3595,6 +3604,12 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return -ENXIO;
> +
>         if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>                 return -EINVAL;
>
> @@ -3633,5 +3648,11 @@ void i40e_xdp_flush(struct net_device *dev)
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return;
> +
>         i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>  }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index cddb185cd2f8..b9c42c352a8d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -426,6 +426,8 @@ struct i40e_ring {
>
>         int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
>         bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> +       bool (*clean_tx_irq)(struct i40e_vsi *vsi, struct i40e_ring *ring,
> +                            int budget);
>         struct xdp_umem *xsk_umem;
>
>         struct zero_copy_allocator zca; /* ZC allocator anchor */
> @@ -506,6 +508,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                   u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
>  int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget);
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -687,6 +692,16 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>         writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
>  }
>
> +static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> +                               u32 td_tag)
> +{
> +       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> +                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> +                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> +                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> +                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> +}
> +
>  void i40e_fd_handle_status(struct i40e_ring *rx_ring,
>                            union i40e_rx_desc *rx_desc, u8 prog_id);
>  int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> @@ -696,4 +711,12 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>                              u8 rx_ptype);
>  void i40e_receive_skb(struct i40e_ring *rx_ring,
>                       struct sk_buff *skb, u16 vlan_tag);
> +
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes);
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget);
> +
>  #endif /* _I40E_TXRX_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9d16924415b9..021fec5b5799 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -535,3 +535,143 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>         return failure ? budget : (int)total_rx_packets;
>  }
>
> +/* Returns true if the work is finished */
> +static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> +{
> +       unsigned int total_packets = 0, total_bytes = 0;
> +       struct i40e_tx_buffer *tx_bi;
> +       struct i40e_tx_desc *tx_desc;
> +       bool work_done = true;
> +       dma_addr_t dma;
> +       u32 len;
> +
> +       while (budget-- > 0) {
> +               if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> +                       xdp_ring->tx_stats.tx_busy++;
> +                       work_done = false;
> +                       break;
> +               }
> +
> +               if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
> +                       break;
> +
> +               dma_sync_single_for_device(xdp_ring->dev, dma, len,
> +                                          DMA_BIDIRECTIONAL);
> +
> +               tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
> +               tx_bi->bytecount = len;
> +               tx_bi->gso_segs = 1;
> +
> +               tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> +               tx_desc->buffer_addr = cpu_to_le64(dma);
> +               tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
> +                                                       | I40E_TX_DESC_CMD_EOP,
> +                                                         0, len, 0);
> +
> +               total_packets++;
> +               total_bytes += len;
> +
> +               xdp_ring->next_to_use++;
> +               if (xdp_ring->next_to_use == xdp_ring->count)
> +                       xdp_ring->next_to_use = 0;
> +       }
> +
> +       if (total_packets > 0) {
> +               /* Request an interrupt for the last frame and bump tail ptr. */
> +               tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
> +                                                I40E_TXD_QW1_CMD_SHIFT);
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +
> +               xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> +               i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
> +       }
> +

So this code is likely vulnerable to an issue we were seeing where the
Tx was stalling and surging when xmit_more was in use. We found the
issue was that we were only setting the RS once per ring fill. As a
result the ring was either full or empty from the drivers perspective.
This ends up with poor Tx performance when it occurs. As such you may
want to set the RS bit at least twice per fill so you may want to look
at going through the lesser of either half the ring size or the
budget, then set the RS, and repeat with whatever budget you have
remaining. That way the ring should on average be 50% utilized instead
of either 100% or 0%.

> +       return !!budget && work_done;
> +}
> +
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget)
> +{
> +       struct xdp_umem *umem = tx_ring->xsk_umem;
> +       u32 head_idx = i40e_get_head(tx_ring);
> +       unsigned int budget = vsi->work_limit;
> +       bool work_done = true, xmit_done;
> +       u32 completed_frames;
> +       u32 frames_ready;
> +
> +       if (head_idx < tx_ring->next_to_clean)
> +               head_idx += tx_ring->count;
> +       frames_ready = head_idx - tx_ring->next_to_clean;
> +
> +       if (frames_ready == 0) {
> +               goto out_xmit;
> +       } else if (frames_ready > budget) {
> +               completed_frames = budget;
> +               work_done = false;
> +       } else {
> +               completed_frames = frames_ready;
> +       }
> +
> +       tx_ring->next_to_clean += completed_frames;
> +       if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
> +               tx_ring->next_to_clean -= tx_ring->count;
> +
> +       xsk_umem_complete_tx(umem, completed_frames);
> +
> +       i40e_arm_wb(tx_ring, vsi, budget);
> +
> +out_xmit:
> +       xmit_done = i40e_xmit_zc(tx_ring, budget);
> +
> +       return work_done && xmit_done;
> +}

I am not a fan of using head write-back. This code just seems shaky at
best to me. Am I understanding correctly that you are using the Tx
cleanup to transmit frames?

> +
> +/**
> + * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
> + * @n: napi context
> + *
> + * Returns true if NAPI is scheduled.
> + **/
> +static bool i40e_napi_is_scheduled(struct napi_struct *n)
> +{
> +       unsigned long val, new;
> +
> +       do {
> +               val = READ_ONCE(n->state);
> +               if (val & NAPIF_STATE_DISABLE)
> +                       return true;
> +
> +               if (!(val & NAPIF_STATE_SCHED))
> +                       return false;
> +
> +               new = val | NAPIF_STATE_MISSED;
> +       } while (cmpxchg(&n->state, val, new) != val);
> +

This code does not belong here. This is core kernel code, not anything
driver specific. Probably not needed if you drop the call below.

> +       return true;
> +}
> +
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
> +{
> +       struct i40e_netdev_priv *np = netdev_priv(dev);
> +       struct i40e_vsi *vsi = np->vsi;
> +       struct i40e_ring *ring;
> +
> +       if (test_bit(__I40E_VSI_DOWN, vsi->state))
> +               return -ENETDOWN;
> +
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return -ENXIO;
> +
> +       if (queue_id >= vsi->num_queue_pairs)
> +               return -ENXIO;
> +
> +       if (!vsi->xdp_rings[queue_id]->xsk_umem)
> +               return -ENXIO;
> +
> +       ring = vsi->xdp_rings[queue_id];
> +
> +       if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
> +               i40e_force_wb(vsi, ring->q_vector);

We really shouldn't have napi being scheduled by the Tx path.

> +
> +       return 0;
> +}

Again, more comments might be helpful here.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> index 757ac5ca8511..bd006f1a4397 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -13,5 +13,7 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
>  void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
>  bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
>  int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget);
>
>  #endif /* _I40E_XSK_H_ */
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ec8fd3314097..63aa05abf11d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -103,6 +103,20 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>  static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
>  {
>  }
> +
> +static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
> +{
> +}
> +
> +static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
> +                                      u32 *len)
> +{
> +       return false;
> +}
> +
> +static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
> +{
> +}
>  #endif /* CONFIG_XDP_SOCKETS */
>
>  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> --
> 2.14.1
>

^ permalink raw reply

* Re: [PATCH net] net/ipv6: prevent use after free in ip6_route_mpath_notify
From: Eric Dumazet @ 2018-06-04 20:45 UTC (permalink / raw)
  To: dsahern, netdev; +Cc: David Ahern, Eric Dumazet
In-Reply-To: <20180604204142.8941-1-dsahern@kernel.org>



On 06/04/2018 01:41 PM, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> syzbot reported a use-after-free:
> 
> BUG: KASAN: use-after-free in ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
> Read of size 4 at addr ffff8801bf789cf0 by task syz-executor756/4555
> 
> Fix by not setting rt_last until the it is verified the insert succeeded.
> 
> Fixes: 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
> Cc: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks David !

^ permalink raw reply

* [PATCH net] net/ipv6: prevent use after free in ip6_route_mpath_notify
From: dsahern @ 2018-06-04 20:41 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, Eric Dumazet

From: David Ahern <dsahern@gmail.com>

syzbot reported a use-after-free:

BUG: KASAN: use-after-free in ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
Read of size 4 at addr ffff8801bf789cf0 by task syz-executor756/4555

CPU: 1 PID: 4555 Comm: syz-executor756 Not tainted 4.17.0-rc7+ #78
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
 ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
 ip6_route_multipath_add+0x615/0x1910 net/ipv6/route.c:4303
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

Allocated by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
 dst_alloc+0xbb/0x1d0 net/core/dst.c:104
 __ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:361
 ip6_dst_alloc+0x29/0xb0 net/ipv6/route.c:376
 ip6_route_info_create+0x4d4/0x3a30 net/ipv6/route.c:2834
 ip6_route_multipath_add+0xc7e/0x1910 net/ipv6/route.c:4240
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

Freed by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
 dst_destroy+0x267/0x3c0 net/core/dst.c:140
 dst_release_immediate+0x71/0x9e net/core/dst.c:205
 fib6_add+0xa40/0x1650 net/ipv6/ip6_fib.c:1305
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
 ip6_route_multipath_add+0x513/0x1910 net/ipv6/route.c:4267
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

The problem is that rt_last can point to a deleted route if the insert
fails.

One reproducer is to insert a route and then add a multipath route that
has a duplicate nexthop.e.g,:
    $ ip -6 ro add vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::2
    $ ip -6 ro append vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::4 nexthop via 2001:db8:1::2

Fix by not setting rt_last until the it is verified the insert succeeded.

Fixes: 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4d61736c41a..c516f8556dbe 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4263,11 +4263,15 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,

 	err_nh = NULL;
 	list_for_each_entry(nh, &rt6_nh_list, next) {
-		rt_last = nh->rt6_info;
 		err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
-		/* save reference to first route for notification */
-		if (!rt_notif && !err)
-			rt_notif = nh->rt6_info;
+		if (!err) {
+			/* save reference to last route successfully inserted */
+			rt_last = nh->rt6_info;
+
+			/* save reference to first route for notification */
+			if (!rt_notif)
+				rt_notif = nh->rt6_info;
+		}

 		/* nh->rt6_info is used or freed at this point, reset to NULL*/
 		nh->rt6_info = NULL;
-- 
2.11.0

^ permalink raw reply related

* Re: [bug] cxgb4: vrf stopped working with cxgb4 card
From: David Ahern @ 2018-06-04 20:35 UTC (permalink / raw)
  To: AMG Zollner Robert, ganeshgr; +Cc: netdev
In-Reply-To: <93057e04-d4ff-e16f-02ac-132501a8e08f@cloudmedia.eu>

On 6/4/18 1:14 PM, AMG Zollner Robert wrote:
> Yes, I was enslaving while the interface was up.
> 
> Just tested some of the builds that where not working earlier and they
> are working if I keep the interface down when enslaving as you suggested.
> 
> Is this the expected behavior?

Not expected from my perspective.

The VRF device cycles interfaces when they are enslaved or unenslaved to
clean up route and neighbor tables. This is a day 1 property of VRF.

I guessed that was the problem based on the commit you bisected the
problem to. If nothing else, it gives you a workaround until it is fixed.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox