* [GIT PULL nf] IPVS Fixes for v3.18
From: Simon Horman @ 2014-10-28 1:05 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
Julian Anastasov, Simon Horman
Hi Pablo,
please consider this fix for v3.18.
It fixes a null-pointer dereference that may occur when logging
errors.
This problem was introduced by 4a4739d56b0 ("ipvs: Pull out
crosses_local_route_boundary logic") in v3.17-rc5. As such I would
also like it considered for 3.17-stable.
The following changes since commit 7965ee93719921ea5978f331da653dfa2d7b99f5:
netfilter: nft_compat: fix wrong target lookup in nft_target_select_ops() (2014-10-27 22:17:46 +0100)
are available in the git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git tags/ipvs-fixes-for-v3.18
for you to fetch changes up to 3d53666b40007b55204ee8890618da79a20c9940:
ipvs: Avoid null-pointer deref in debug code (2014-10-28 09:48:31 +0900)
----------------------------------------------------------------
Alex Gartrell (1):
ipvs: Avoid null-pointer deref in debug code
net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
^ permalink raw reply
* Re: [PATCH net-next 2/2] udp: Reset flow table for flows over unconnected sockets
From: Tom Herbert @ 2014-10-28 1:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, Linux Netdev List
In-Reply-To: <1414451970.2922.27.camel@edumazet-glaptop2.roam.corp.google.com>
On Mon, Oct 27, 2014 at 4:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-10-27 at 12:36 -0700, Tom Herbert wrote:
>
>> Please try this patch and provide real data to support your points.
>>
>
> Yep. This is not good, I confirm my fear.
>
> Google servers are shifting to serve both TCP & UDP traffic (QUIC
> protocol), with an increasing UDP load.
>
> Millions of packets per second per host, from millions of different
> sources...
>
This indicates nothing about the merits of this patch. Nevertheless,
in order to avoid further rat-holing and since this patch does change
a long-standing behavior, I will respin to make it enabled only by
sysctl.
Tom
> And your patch voids the RFS table, adds another cache miss in fast path
> for UDP rx path which is already too expensive.
>
>
>> If a TCP connection is hot it will continually refresh the table for
>> that connection, if connection becomes idle it only takes one received
>> packet to restore the CPU. The only time there could be a persistent
>> problem is if collision rate is high (which probably means table is
>> too small).
>
>
> RFS already has a low hit/miss rate, and this patch helps neither
> UDP nor TCP.
>
> Ideally, RFS should be enabled on a protocol base, not an agnostic u32
> flow hash.
>
> Whatever strategy you implement, as long as different protocols share a
> common hash table, it won't be perfect for mixed workloads.
>
> The fundamental problem is that when a UDP packet comes in, it's not possible
> to know if it's a 'flow' or 'not', unless we perform an expensive lookup,
> and then RPS/RFS cost becomes prohibitive.
>
> While for TCP, the current RFS cache miss is good enough, because about
> all packets are for connected flows. We eventually have bad steering for
> <not yet established> flows where the stack performs poorly anyway.
>
>
>
^ permalink raw reply
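The shared-table concern quoted above can be illustrated with a minimal sketch. It assumes a toy steering table indexed by the low bits of a u32 flow hash; the flow_table_* names, the table size, and the NO_CPU marker are hypothetical and are not the kernel's RFS structures. It only shows that any two flows whose hashes land in the same bucket, whatever their protocol, overwrite each other's CPU hint.

```c
#include <stdint.h>

/* Hypothetical, simplified model of a steering table shared by all protocols. */
#define FLOW_TABLE_SIZE 256u	/* assumed size; must be a power of two */
#define NO_CPU 0xffffu		/* marker for "no CPU recorded" */

struct flow_table {
	uint16_t cpu[FLOW_TABLE_SIZE];	/* desired CPU per hash bucket */
};

/* Mark every bucket as empty. */
static void flow_table_init(struct flow_table *t)
{
	for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
		t->cpu[i] = NO_CPU;
}

/* Record that the flow with this hash was last consumed on 'cpu'. */
static void flow_table_record(struct flow_table *t, uint32_t hash, uint16_t cpu)
{
	t->cpu[hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Return the CPU hint stored in the bucket this hash maps to. */
static uint16_t flow_table_lookup(const struct flow_table *t, uint32_t hash)
{
	return t->cpu[hash & (FLOW_TABLE_SIZE - 1)];
}
```

In this model, a burst of packets from many distinct sources touches many buckets, so hints recorded for long-lived flows are likely to be replaced before they are used again. Whether that matters in practice depends on table size and traffic mix, which is what the thread is arguing about.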
* Re: [PATCH] bridge: Add support for IEEE 802.11 Proxy ARP
From: Stephen Hemminger @ 2014-10-28 1:20 UTC (permalink / raw)
To: Kyeyoon Park; +Cc: davem, jouni, netdev
In-Reply-To: <1414100957-8288-1-git-send-email-kyeyoonp@qca.qualcomm.com>
On Thu, 23 Oct 2014 14:49:17 -0700
Kyeyoon Park <kyeyoonp@qca.qualcomm.com> wrote:
> From: Kyeyoon Park <kyeyoonp@codeaurora.org>
>
> This feature is defined in IEEE Std 802.11-2012, 10.23.13. It allows
> the AP devices to keep track of the hardware-address-to-IP-address
> mapping of the mobile devices within the WLAN network.
>
> The AP will learn this mapping via observing DHCP, ARP, and NS/NA
> frames. When a request for such information is made (i.e. ARP request,
> Neighbor Solicitation), the AP will respond on behalf of the
> associated mobile device. In the process of doing so, the AP will drop
> the multicast request frame that was intended to go out to the wireless
> medium.
>
> It was recommended at the LKS workshop to do this implementation in
> the bridge layer. vxlan.c is already doing something very similar.
> The DHCP snooping code will be added to the userspace application
> (hostapd) per the recommendation.
>
> This RFC commit is only for IPv4. A similar approach in the bridge
> layer will be taken for IPv6 as well.
>
> Signed-off-by: Kyeyoon Park <kyeyoonp@codeaurora.org>
Looks good. Maybe at some point VXLAN and bridge should share
more code or at least the same options.
I am a little worried that this could be DoS'd.
^ permalink raw reply
* [PATCH] mac80211_hwsim: release driver when ieee80211_register_hw fails
From: Junjie Mao @ 2014-10-28 1:31 UTC (permalink / raw)
To: Martin Pitt
Cc: Junjie Mao, Fengguang Wu, linux-wireless, netdev, linux-kernel
The driver is not released when ieee80211_register_hw fails in
mac80211_hwsim_create_radio, leading to an access to the unregistered (and
possibly freed) device in platform_driver_unregister:
[ 0.447547] mac80211_hwsim: ieee80211_register_hw failed (-2)
[ 0.448292] ------------[ cut here ]------------
[ 0.448854] WARNING: CPU: 0 PID: 1 at ../include/linux/kref.h:47 kobject_get+0x33/0x50()
[ 0.449839] CPU: 0 PID: 1 Comm: swapper Not tainted 3.17.0-00001-gdd46990-dirty #2
[ 0.450813] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.451512] 00000000 00000000 78025e38 7967c6c6 78025e68 7905e09b 7988b480 00000000
[ 0.452579] 00000001 79887d62 0000002f 79170bb3 79170bb3 78397008 79ac9d74 00000001
[ 0.453614] 78025e78 7905e15d 00000009 00000000 78025e84 79170bb3 78397000 78025e8c
[ 0.454632] Call Trace:
[ 0.454921] [<7967c6c6>] dump_stack+0x16/0x18
[ 0.455453] [<7905e09b>] warn_slowpath_common+0x6b/0x90
[ 0.456067] [<79170bb3>] ? kobject_get+0x33/0x50
[ 0.456612] [<79170bb3>] ? kobject_get+0x33/0x50
[ 0.457155] [<7905e15d>] warn_slowpath_null+0x1d/0x20
[ 0.457748] [<79170bb3>] kobject_get+0x33/0x50
[ 0.458274] [<7925824f>] get_device+0xf/0x20
[ 0.458779] [<7925b5cd>] driver_detach+0x3d/0xa0
[ 0.459331] [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[ 0.459927] [<7925bf80>] ? class_unregister+0x40/0x80
[ 0.460660] [<7925bad7>] driver_unregister+0x47/0x50
[ 0.461248] [<7925c033>] ? class_destroy+0x13/0x20
[ 0.461824] [<7925d07b>] platform_driver_unregister+0xb/0x10
[ 0.462507] [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[ 0.463161] [<79b30c58>] do_one_initcall+0x106/0x1a9
[ 0.463758] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.464393] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.465001] [<79071935>] ? parse_args+0x2f5/0x480
[ 0.465569] [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[ 0.466345] [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[ 0.466972] [<79b304d6>] ? do_early_param+0x7a/0x7a
[ 0.467546] [<79677b1b>] kernel_init+0xb/0xe0
[ 0.468072] [<79075f42>] ? schedule_tail+0x12/0x40
[ 0.468658] [<79686580>] ret_from_kernel_thread+0x20/0x30
[ 0.469303] [<79677b10>] ? rest_init+0xc0/0xc0
[ 0.469829] ---[ end trace ad8ac403ff8aef5c ]---
[ 0.470509] ------------[ cut here ]------------
[ 0.471047] WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3161 __lock_acquire.isra.22+0x7aa/0xb00()
[ 0.472163] DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)
[ 0.472774] CPU: 0 PID: 1 Comm: swapper Tainted: G W 3.17.0-00001-gdd46990-dirty #2
[ 0.473815] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.474492] 78025de0 78025de0 78025da0 7967c6c6 78025dd0 7905e09b 79888931 78025dfc
[ 0.475515] 00000001 79888a93 00000c59 7907f33a 7907f33a 78028000 fffe9d09 00000000
[ 0.476519] 78025de8 7905e10e 00000009 78025de0 79888931 78025dfc 78025e24 7907f33a
[ 0.477523] Call Trace:
[ 0.477821] [<7967c6c6>] dump_stack+0x16/0x18
[ 0.478352] [<7905e09b>] warn_slowpath_common+0x6b/0x90
[ 0.478976] [<7907f33a>] ? __lock_acquire.isra.22+0x7aa/0xb00
[ 0.479658] [<7907f33a>] ? __lock_acquire.isra.22+0x7aa/0xb00
[ 0.480417] [<7905e10e>] warn_slowpath_fmt+0x2e/0x30
[ 0.480479] [<7907f33a>] __lock_acquire.isra.22+0x7aa/0xb00
[ 0.480479] [<79078aa5>] ? sched_clock_cpu+0xb5/0xf0
[ 0.480479] [<7907fd06>] lock_acquire+0x56/0x70
[ 0.480479] [<7925b5e8>] ? driver_detach+0x58/0xa0
[ 0.480479] [<79682d11>] mutex_lock_nested+0x61/0x2a0
[ 0.480479] [<7925b5e8>] ? driver_detach+0x58/0xa0
[ 0.480479] [<7925b5e8>] ? driver_detach+0x58/0xa0
[ 0.480479] [<7925b5e8>] driver_detach+0x58/0xa0
[ 0.480479] [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[ 0.480479] [<7925bf80>] ? class_unregister+0x40/0x80
[ 0.480479] [<7925bad7>] driver_unregister+0x47/0x50
[ 0.480479] [<7925c033>] ? class_destroy+0x13/0x20
[ 0.480479] [<7925d07b>] platform_driver_unregister+0xb/0x10
[ 0.480479] [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[ 0.480479] [<79b30c58>] do_one_initcall+0x106/0x1a9
[ 0.480479] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.480479] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.480479] [<79071935>] ? parse_args+0x2f5/0x480
[ 0.480479] [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[ 0.480479] [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[ 0.480479] [<79b304d6>] ? do_early_param+0x7a/0x7a
[ 0.480479] [<79677b1b>] kernel_init+0xb/0xe0
[ 0.480479] [<79075f42>] ? schedule_tail+0x12/0x40
[ 0.480479] [<79686580>] ret_from_kernel_thread+0x20/0x30
[ 0.480479] [<79677b10>] ? rest_init+0xc0/0xc0
[ 0.480479] ---[ end trace ad8ac403ff8aef5d ]---
[ 0.495478] BUG: unable to handle kernel paging request at 00200200
[ 0.496257] IP: [<79682de5>] mutex_lock_nested+0x135/0x2a0
[ 0.496923] *pde = 00000000
[ 0.497290] Oops: 0002 [#1]
[ 0.497653] CPU: 0 PID: 1 Comm: swapper Tainted: G W 3.17.0-00001-gdd46990-dirty #2
[ 0.498659] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.499321] task: 78028000 ti: 78024000 task.ti: 78024000
[ 0.499955] EIP: 0060:[<79682de5>] EFLAGS: 00010097 CPU: 0
[ 0.500620] EIP is at mutex_lock_nested+0x135/0x2a0
[ 0.501145] EAX: 00200200 EBX: 78397434 ECX: 78397460 EDX: 78025e70
[ 0.501816] ESI: 00000246 EDI: 78028000 EBP: 78025e8c ESP: 78025e54
[ 0.502497] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 0.503076] CR0: 8005003b CR2: 00200200 CR3: 01b9d000 CR4: 00000690
[ 0.503773] Stack:
[ 0.503998] 00000000 00000001 00000000 7925b5e8 78397460 7925b5e8 78397474 78397460
[ 0.504944] 00200200 11111111 78025e70 78397000 79ac9d74 00000001 78025ea0 7925b5e8
[ 0.505451] 79ac9d74 fffffffe 00000001 78025ebc 7925a3ff 7a251398 78025ec8 7925bf80
[ 0.505451] Call Trace:
[ 0.505451] [<7925b5e8>] ? driver_detach+0x58/0xa0
[ 0.505451] [<7925b5e8>] ? driver_detach+0x58/0xa0
[ 0.505451] [<7925b5e8>] driver_detach+0x58/0xa0
[ 0.505451] [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[ 0.505451] [<7925bf80>] ? class_unregister+0x40/0x80
[ 0.505451] [<7925bad7>] driver_unregister+0x47/0x50
[ 0.505451] [<7925c033>] ? class_destroy+0x13/0x20
[ 0.505451] [<7925d07b>] platform_driver_unregister+0xb/0x10
[ 0.505451] [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[ 0.505451] [<79b30c58>] do_one_initcall+0x106/0x1a9
[ 0.505451] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.505451] [<79b517b8>] ? if_spi_init_module+0xac/0xac
[ 0.505451] [<79071935>] ? parse_args+0x2f5/0x480
[ 0.505451] [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[ 0.505451] [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[ 0.505451] [<79b304d6>] ? do_early_param+0x7a/0x7a
[ 0.505451] [<79677b1b>] kernel_init+0xb/0xe0
[ 0.505451] [<79075f42>] ? schedule_tail+0x12/0x40
[ 0.505451] [<79686580>] ret_from_kernel_thread+0x20/0x30
[ 0.505451] [<79677b10>] ? rest_init+0xc0/0xc0
[ 0.505451] Code: 89 d8 e8 cf 9b 9f ff 8b 4f 04 8d 55 e4 89 d8 e8 72 9d 9f ff 8d 43 2c 89 c1 89 45 d8 8b 43 30 8d 55 e4 89 53 30 89 4d e4 89 45 e8 <89> 10 8b 55 dc 8b 45 e0 89 7d ec e8 db af 9f ff eb 11 90 31 c0
[ 0.505451] EIP: [<79682de5>] mutex_lock_nested+0x135/0x2a0 SS:ESP 0068:78025e54
[ 0.505451] CR2: 0000000000200200
[ 0.505451] ---[ end trace ad8ac403ff8aef5e ]---
[ 0.505451] Kernel panic - not syncing: Fatal exception
Fixes: 9ea927748ced ("mac80211_hwsim: Register and bind to driver")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Junjie Mao <eternal.n08@gmail.com>
---
drivers/net/wireless/mac80211_hwsim.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index babbdc1ce741..c9ad4cf1adfb 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -1987,7 +1987,7 @@ static int mac80211_hwsim_create_radio(int channels, const char *reg_alpha2,
if (err != 0) {
printk(KERN_DEBUG "mac80211_hwsim: device_bind_driver failed (%d)\n",
err);
- goto failed_hw;
+ goto failed_bind;
}
skb_queue_head_init(&data->pending);
@@ -2183,6 +2183,8 @@ static int mac80211_hwsim_create_radio(int channels, const char *reg_alpha2,
return idx;
failed_hw:
+ device_release_driver(data->dev);
+failed_bind:
device_unregister(data->dev);
failed_drvdata:
ieee80211_free_hw(hw);
--
1.9.3
^ permalink raw reply related
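The unwinding order that the mac80211_hwsim patch above establishes can be sketched in plain C. This is a minimal model, not the driver code: create_radio_model(), the fail_point enum, and the trace strings are hypothetical stand-ins, and the real function has further labels (such as failed_drvdata) that are left out. It shows how cleanup labels stacked in reverse order of setup let each failure point undo exactly what has been done so far.

```c
#include <string.h>

/* Hypothetical failure injection points for the model. */
enum fail_point { FAIL_NONE, FAIL_BIND, FAIL_REGISTER_HW };

/* Appends one step name to a caller-supplied, NUL-terminated trace buffer. */
static void trace(char *buf, size_t len, const char *step)
{
	strncat(buf, step, len - strlen(buf) - 1);
}

/* Models setup followed by goto-based unwinding; returns 0 or a negative error. */
static int create_radio_model(enum fail_point fail, char *buf, size_t len)
{
	int err = 0;

	buf[0] = '\0';
	trace(buf, len, "register_device;");

	if (fail == FAIL_BIND) {
		err = -1;
		goto failed_bind;	/* no driver bound yet, skip the release */
	}
	trace(buf, len, "bind_driver;");

	if (fail == FAIL_REGISTER_HW) {
		err = -2;
		goto failed_hw;		/* driver is bound, release it first */
	}
	trace(buf, len, "register_hw;");
	return 0;

failed_hw:
	trace(buf, len, "release_driver;");
failed_bind:
	trace(buf, len, "unregister_device;");
	return err;
}
```

Calling create_radio_model(FAIL_REGISTER_HW, buf, sizeof(buf)) leaves a trace in which release_driver precedes unregister_device, which is the ordering the patch adds; with FAIL_BIND the release step is skipped because nothing was bound.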
* Re: irq disable in __netdev_alloc_frag() ?
From: Christoph Lameter @ 2014-10-28 2:30 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, Alexander Duyck, Alexei Starovoitov, Eric Dumazet,
Network Development
In-Reply-To: <20141027213523.799da09c@redhat.com>
On Mon, 27 Oct 2014, Jesper Dangaard Brouer wrote:
> > Same could be done with some kmem_cache_alloc() : SLAB uses hard irq
> > masking while some caches are never used from hard irq context.
>
> Sounds interesting.
SLUB does not disable interrupts in the fast paths.
^ permalink raw reply
* Re: irq disable in __netdev_alloc_frag() ?
From: Eric Dumazet @ 2014-10-28 2:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck,
Alexei Starovoitov, Network Development
In-Reply-To: <alpine.DEB.2.11.1410272129530.21936@gentwo.org>
On Mon, Oct 27, 2014 at 7:30 PM, Christoph Lameter <cl@linux.com> wrote:
> On Mon, 27 Oct 2014, Jesper Dangaard Brouer wrote:
>
>> > Same could be done with some kmem_cache_alloc() : SLAB uses hard irq
>> > masking while some caches are never used from hard irq context.
>>
>> Sounds interesting.
>
> SLUB does not disable interrupts in the fast paths.
>
Unfortunately, SLUB is more expensive than SLAB for many networking workloads.
The cost of disabling interrupts is pure noise compared to cache line misses.
SLUB has poor behavior compared to SLAB with alien caches,
even with the side effect that 'struct page' is 64 bytes aligned
instead of being 56 bytes with SLAB.
Note that I am not doing SLUB/SLAB tests every day, so it might be
better nowadays.
^ permalink raw reply
* [PATCH net-next] tcp: allow for bigger reordering level
From: Eric Dumazet @ 2014-10-28 4:45 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Yaogong Wang
From: Eric Dumazet <edumazet@google.com>
While testing Yaogong's upcoming patch (converting the out-of-order queue
into an RB tree), I hit the max reordering level of the Linux TCP stack.
Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.
Allow a new max limit of 300, and add a sysctl so that admins can set
bigger (or lower) values if needed.
[1] Aggregation of links, per packet load balancing, fabrics not doing
deep packet inspections, alternative TCP congestion modules...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
---
Documentation/networking/bonding.txt | 7 ++-----
Documentation/networking/ip-sysctl.txt | 10 +++++++++-
include/linux/tcp.h | 4 ++--
include/net/tcp.h | 4 +---
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/tcp_input.c | 3 ++-
6 files changed, 23 insertions(+), 12 deletions(-)
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index eeb5b2e97bedac5ce910a06cc03bf42035d544d4..7ddd70df4d9aaa76b5806bf4a74fd1583ba7e198 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -2230,11 +2230,8 @@ balance-rr: This mode is the only mode that will permit a single
It is possible to adjust TCP/IP's congestion limits by
altering the net.ipv4.tcp_reordering sysctl parameter. The
- usual default value is 3, and the maximum useful value is 127.
- For a four interface balance-rr bond, expect that a single
- TCP/IP stream will utilize no more than approximately 2.3
- interface's worth of throughput, even after adjusting
- tcp_reordering.
+ usual default value is 3. But keep in mind TCP stack is able
+ to automatically increase this when it detects reorders.
Note that the fraction of packets that will be delivered out of
order is highly variable, and is unlikely to be zero. The level
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 0307e2875f2159cb669b741f9d6a949618c3a055..9028b879a97baebc29832c42694896361ecfba03 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -376,9 +376,17 @@ tcp_orphan_retries - INTEGER
may consume significant resources. Cf. tcp_max_orphans.
tcp_reordering - INTEGER
- Maximal reordering of packets in a TCP stream.
+ Initial reordering level of packets in a TCP stream.
+ TCP stack can then dynamically adjust flow reordering level
+ between this initial value and tcp_max_reordering
Default: 3
+tcp_max_reordering - INTEGER
+ Maximal reordering level of packets in a TCP stream.
+ 300 is a fairly conservative value, but you might increase it
+ if paths are using per packet load balancing (like bonding rr mode)
+ Default: 300
+
tcp_retrans_collapse - BOOLEAN
Bug-to-bug compatibility with some broken printers.
On retransmit try to send bigger packets to work around bugs in
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c2dee7deefa8cb32af530d20e5aa32a61b10ce68..f566b8567892ef0bb213de0540b37cfc6ac03ca0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -204,10 +204,10 @@ struct tcp_sock {
u16 urg_data; /* Saved octet of OOB data and control flags */
u8 ecn_flags; /* ECN status bits. */
- u8 reordering; /* Packet reordering metric. */
+ u8 keepalive_probes; /* num of allowed keep alive probes */
+ u32 reordering; /* Packet reordering metric. */
u32 snd_up; /* Urgent pointer */
- u8 keepalive_probes; /* num of allowed keep alive probes */
/*
* Options received (usually on last packet, some only on SYN packets).
*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c73fc145ee4533c3f65adf5370e9c0348dfb4395..3a35b1500359446d98ee9f1cd0b55d34ac66d477 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -70,9 +70,6 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
/* After receiving this amount of duplicate ACKs fast retransmit starts. */
#define TCP_FASTRETRANS_THRESH 3
-/* Maximal reordering. */
-#define TCP_MAX_REORDERING 127
-
/* Maximal number of ACKs sent quickly to accelerate slow-start. */
#define TCP_MAX_QUICKACKS 16U
@@ -252,6 +249,7 @@ extern int sysctl_tcp_abort_on_overflow;
extern int sysctl_tcp_max_orphans;
extern int sysctl_tcp_fack;
extern int sysctl_tcp_reordering;
+extern int sysctl_tcp_max_reordering;
extern int sysctl_tcp_dsack;
extern long sysctl_tcp_mem[3];
extern int sysctl_tcp_wmem[3];
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b3c53c8b331efc3d5cf6437fd3ec7634a154263c..e0ee384a448fb0e6eb5b957d98dbcb272ea97edb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -496,6 +496,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "tcp_max_reordering",
+ .data = &sysctl_tcp_max_reordering,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "tcp_dsack",
.data = &sysctl_tcp_dsack,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a12b455928e52211efdc6b471ef54de6218f5df0..9a18cdd633f37e6a805f0f096edece0b0852bc20 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -81,6 +81,7 @@ int sysctl_tcp_window_scaling __read_mostly = 1;
int sysctl_tcp_sack __read_mostly = 1;
int sysctl_tcp_fack __read_mostly = 1;
int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
+int sysctl_tcp_max_reordering __read_mostly = 300;
EXPORT_SYMBOL(sysctl_tcp_reordering);
int sysctl_tcp_dsack __read_mostly = 1;
int sysctl_tcp_app_win __read_mostly = 31;
@@ -833,7 +834,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
if (metric > tp->reordering) {
int mib_idx;
- tp->reordering = min(TCP_MAX_REORDERING, metric);
+ tp->reordering = min(sysctl_tcp_max_reordering, metric);
/* This exciting event is worth to be remembered. 8) */
if (ts)
^ permalink raw reply related
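The effect of the tcp_input.c hunk above, and the reason the field is widened from u8 to u32, can be sketched outside the kernel. This is a minimal model under stated assumptions: update_reordering() and store_in_u8() are hypothetical helper names, and the model ignores negative metrics, which the kernel code does not special-case.

```c
#include <stdint.h>

/*
 * Model of the update: a metric larger than the stored level replaces it,
 * clamped to 'max_reordering' (the role played by the new sysctl).
 */
static uint32_t update_reordering(uint32_t current, int metric, int max_reordering)
{
	if (metric > 0 && (uint32_t)metric > current) {
		int clamped = metric < max_reordering ? metric : max_reordering;

		return (uint32_t)clamped;
	}
	return current;
}

/* What an 8-bit field would keep of a value: only the low 8 bits. */
static uint8_t store_in_u8(int value)
{
	return (uint8_t)value;
}
```

With a cap of 300, update_reordering(3, 500, 300) yields 300, while store_in_u8(300) yields 44. A cap above 255 therefore cannot be stored in the old u8 field, which is consistent with the struct tcp_sock change in the patch.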
* Re: [PATCH] ovs: Turn vports with dependencies into separate modules
From: David Miller @ 2014-10-28 4:48 UTC (permalink / raw)
To: pshelar; +Cc: tgraf, dev, netdev
In-Reply-To: <CALnjE+p5b5EzLkY6_6J7jvwyf9rUdu-JGbdf5r3bDuhgndoeeg@mail.gmail.com>
From: Pravin Shelar <pshelar@nicira.com>
Date: Mon, 27 Oct 2014 17:27:11 -0700
> On Mon, Oct 27, 2014 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> The patch also brings additional flexibility to users of
>> distributions. Distros typically ship something like an allmodconfig
>> so a user can either run openvswitch.ko with all encaps compiled in
>> or not run openvswitch.ko. With vports as module, a user can blacklist
>> a certain encap type.
>>
>> Another advantage is obviously that users can run additional vport
>> types on top of their distribution kernels.
>>
>> Is there anything specific that you are concerned with in regard
>> to this proposed change?
>
> There is not a lot of OVS vport code, and making it a pluggable module
> does not save much space.
People don't blacklist modules to "save space".
^ permalink raw reply
* Re: [PATCH net-next 2/2] udp: Reset flow table for flows over unconnected sockets
From: David Miller @ 2014-10-28 4:51 UTC (permalink / raw)
To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <CA+mtBx_eQKOkM-0PEXG2WEMosXDtqHgwT3j7NnQpP62KdZeJKQ@mail.gmail.com>
From: Tom Herbert <therbert@google.com>
Date: Mon, 27 Oct 2014 18:09:25 -0700
> This indicates nothing about the merits of this patch. Nevertheless,
> in order to avoid further rat-holing and since this patch does change
> a long-standing behavior, I will respin to make it enabled only by
> sysctl.
I am kind of disappointed that you haven't addressed Eric's
main points, which are that:
1) A hash table shared between protocols will perform poorly for
mixed workloads, which are becoming increasingly common.
2) UDP is fundamentally different from TCP when it comes to the
issue of 'flow' vs. 'non-flow' packets.
I personally do not see you avoiding this conversation by simply
hiding the new behavior behind a sysctl; I still want you to address
it before I apply anything.
^ permalink raw reply
* Re: irq disable in __netdev_alloc_frag() ?
From: David Miller @ 2014-10-28 4:56 UTC (permalink / raw)
To: edumazet; +Cc: cl, brouer, eric.dumazet, alexander.duyck, ast, netdev
In-Reply-To: <CANn89i+U0=YrwoUSASejsS37EiXO7dKR25Vx04at3PqGA1EpHA@mail.gmail.com>
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Oct 2014 19:46:20 -0700
> Unfortunately, SLUB is more expensive than SLAB for many networking
> workloads.
>
> The cost of disabling interrupts is pure noise compared to cache
> line misses.
>
> SLUB has poor behavior compared to SLAB with alien caches, even with
> the side effect that 'struct page' is 64 bytes aligned instead of
> being 56 bytes with SLAB.
And SLAB completely shits itself when lots of memory gets cached up on
a foreign node.
This discussion has happened many times. SLAB may be faster when things
work out nicely, but it acts poorly wrt. keeping foreign memory from
being cached too aggressively.
And there is a cost for that, which is that foreign memory has to be
properly balanced back to its home node.
^ permalink raw reply
* [PATCH net 0/1] cnic: Update the rcu_access_pointer() usages
From: Nilesh Javali @ 2014-10-28 5:18 UTC (permalink / raw)
To: davem
Cc: netdev, Dept-GELinuxNICDev, sudarsana.kalluru, vikas.chaudhary,
giridhar.malavali, tej.parkash
This patch updates the rcu_access_pointer usages:
Tej Parkash (1):
cnic: Update the rcu_access_pointer() usages
drivers/net/ethernet/broadcom/cnic.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
Please apply this patch to net.
Thanks,
Nilesh
^ permalink raw reply
* [PATCH net 1/1] cnic: Update the rcu_access_pointer() usages
From: Nilesh Javali @ 2014-10-28 5:18 UTC (permalink / raw)
To: davem
Cc: netdev, Dept-GELinuxNICDev, sudarsana.kalluru, vikas.chaudhary,
giridhar.malavali, tej.parkash
In-Reply-To: <1414473495-24790-1-git-send-email-nilesh.javali@qlogic.com>
From: Tej Parkash <tej.parkash@qlogic.com>
1. Remove the rcu_read_lock/unlock around rcu_access_pointer
2. Replace the rcu_dereference with rcu_access_pointer
Signed-off-by: Tej Parkash <tej.parkash@qlogic.com>
---
drivers/net/ethernet/broadcom/cnic.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 23f23c9..f05fab6 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -382,10 +382,8 @@ static int cnic_iscsi_nl_msg_recv(struct cnic_dev *dev, u32 msg_type,
if (l5_cid >= MAX_CM_SK_TBL_SZ)
break;
- rcu_read_lock();
if (!rcu_access_pointer(cp->ulp_ops[CNIC_ULP_L4])) {
rc = -ENODEV;
- rcu_read_unlock();
break;
}
csk = &cp->csk_tbl[l5_cid];
@@ -414,7 +412,6 @@ static int cnic_iscsi_nl_msg_recv(struct cnic_dev *dev, u32 msg_type,
}
}
csk_put(csk);
- rcu_read_unlock();
rc = 0;
}
}
@@ -615,7 +612,7 @@ static int cnic_unregister_device(struct cnic_dev *dev, int ulp_type)
cnic_send_nlmsg(cp, ISCSI_KEVENT_IF_DOWN, NULL);
mutex_lock(&cnic_lock);
- if (rcu_dereference(cp->ulp_ops[ulp_type])) {
+ if (rcu_access_pointer(cp->ulp_ops[ulp_type])) {
RCU_INIT_POINTER(cp->ulp_ops[ulp_type], NULL);
cnic_put(dev);
} else {
--
1.5.6
^ permalink raw reply related
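The distinction the cnic patch above relies on is that rcu_access_pointer() is intended for cases where only the pointer value is needed, such as a NULL test, and the pointer is not followed. The sketch below is a userspace analogy using C11 atomics, not kernel RCU: the ulp_table type and ulp_* helpers are hypothetical, and a relaxed atomic load merely stands in for reading the pointer value without dereferencing it.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_ULP_TYPE_MODEL 4	/* assumed number of slots in the model */

struct ulp_ops_model { int id; };

struct ulp_table {
	_Atomic(struct ulp_ops_model *) ops[MAX_ULP_TYPE_MODEL];
};

/* Start with every slot empty. */
static void ulp_table_init(struct ulp_table *t)
{
	for (int i = 0; i < MAX_ULP_TYPE_MODEL; i++)
		atomic_init(&t->ops[i], NULL);
}

/* Only tests whether a slot is populated; the pointer is never followed. */
static int ulp_is_registered(struct ulp_table *t, int type)
{
	return atomic_load_explicit(&t->ops[type], memory_order_relaxed) != NULL;
}

/* Publish an ops pointer so that later readers see initialized contents. */
static void ulp_register(struct ulp_table *t, int type, struct ulp_ops_model *ops)
{
	atomic_store_explicit(&t->ops[type], ops, memory_order_release);
}

/* Clear a slot; returns -1 if nothing was registered there. */
static int ulp_unregister(struct ulp_table *t, int type)
{
	if (!ulp_is_registered(t, type))
		return -1;
	atomic_store_explicit(&t->ops[type], NULL, memory_order_relaxed);
	return 0;
}
```

In the model, ulp_unregister() mirrors the shape of the cnic_unregister_device() hunk: test the slot, then clear it. The model assumes, as the kernel code does with cnic_lock, that callers serialize updates by some other means.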
* [PATCH net-next 0/2] r8152: support nway_reset
From: Hayes Wang @ 2014-10-28 6:05 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
Fix the CHECK from checkpatch.pl and support nway_reset.
Hayes Wang (2):
r8152: rename tx_underun
r8152: support nway_reset of ethtool
drivers/net/usb/r8152.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
--
1.9.3
^ permalink raw reply
* [PATCH net-next 1/2] r8152: rename tx_underun
From: Hayes Wang @ 2014-10-28 6:05 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <1394712342-15778-66-Taiwan-albertk@realtek.com>
Replace tx_underun with tx_underrun for checkpatch.pl.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index e3d84c3..fdea194 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -486,7 +486,7 @@ struct tally_counter {
__le64 rx_broadcast;
__le32 rx_multicast;
__le16 tx_aborted;
- __le16 tx_underun;
+ __le16 tx_underrun;
};
struct rx_desc {
@@ -3420,7 +3420,7 @@ static void rtl8152_get_ethtool_stats(struct net_device *dev,
data[9] = le64_to_cpu(tally.rx_broadcast);
data[10] = le32_to_cpu(tally.rx_multicast);
data[11] = le16_to_cpu(tally.tx_aborted);
- data[12] = le16_to_cpu(tally.tx_underun);
+ data[12] = le16_to_cpu(tally.tx_underrun);
}
static void rtl8152_get_strings(struct net_device *dev, u32 stringset, u8 *data)
--
1.9.3
^ permalink raw reply related
* [PATCH net-next 2/2] r8152: support nway_reset of ethtool
From: Hayes Wang @ 2014-10-28 6:05 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <1394712342-15778-66-Taiwan-albertk@realtek.com>
Support the nway_reset() function for ethtool.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index fdea194..e1810bc 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3558,11 +3558,33 @@ out:
return ret;
}
+static int rtl8152_nway_reset(struct net_device *dev)
+{
+ struct r8152 *tp = netdev_priv(dev);
+ int ret;
+
+ ret = usb_autopm_get_interface(tp->intf);
+ if (ret < 0)
+ goto out;
+
+ mutex_lock(&tp->control);
+
+ ret = mii_nway_restart(&tp->mii);
+
+ mutex_unlock(&tp->control);
+
+ usb_autopm_put_interface(tp->intf);
+
+out:
+ return ret;
+}
+
static struct ethtool_ops ops = {
.get_drvinfo = rtl8152_get_drvinfo,
.get_settings = rtl8152_get_settings,
.set_settings = rtl8152_set_settings,
.get_link = ethtool_op_get_link,
+ .nway_reset = rtl8152_nway_reset,
.get_msglevel = rtl8152_get_msglevel,
.set_msglevel = rtl8152_set_msglevel,
.get_wol = rtl8152_get_wol,
--
1.9.3
^ permalink raw reply related
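For readers unfamiliar with what an nway reset does at the register level, here is a minimal model of the behavior the patch above delegates to mii_nway_restart(). As far as I understand the generic MII helper, it restarts autonegotiation only if autonegotiation is enabled in the PHY's basic mode control register, and reports -EINVAL otherwise. The phy_model struct and nway_restart_model() are hypothetical; the two bit values are meant to match BMCR_ANRESTART and BMCR_ANENABLE.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to match BMCR_ANRESTART and BMCR_ANENABLE in the MII register map. */
#define MODEL_BMCR_ANRESTART 0x0200u
#define MODEL_BMCR_ANENABLE  0x1000u

/* Stand-in for a PHY: just the basic mode control register. */
struct phy_model {
	uint16_t bmcr;
};

/* Request an autonegotiation restart if autonegotiation is enabled. */
static int nway_restart_model(struct phy_model *phy)
{
	if (!(phy->bmcr & MODEL_BMCR_ANENABLE))
		return -EINVAL;		/* nothing to restart */

	phy->bmcr |= MODEL_BMCR_ANRESTART;
	return 0;
}
```

In the driver, the equivalent register access goes over USB, which is presumably why the patch wraps the call in usb_autopm_get_interface()/usb_autopm_put_interface() and the tp->control mutex.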
* Re: [net 1/2] sctp: add transport state in /proc/net/sctp/remaddr
From: Michele Baldessari @ 2014-10-28 7:20 UTC (permalink / raw)
To: David Miller; +Cc: linux-sctp, vyasevich, nhorman, netdev, dborkman
In-Reply-To: <20141027.185545.551457974536550723.davem@davemloft.net>
Hi David,
On Mon, Oct 27, 2014 at 06:55:45PM -0400, David Miller wrote:
> From: Michele Baldessari <michele@acksyn.org>
> Date: Thu, 23 Oct 2014 21:48:40 +0200
>
> > It is often quite helpful to be able to know the state of a transport
> > outside of the application itself (for troubleshooting purposes or for
> > monitoring purposes). Add it under /proc/net/sctp/remaddr.
> >
> > Signed-off-by: Michele Baldessari <michele@acksyn.org>
>
> You can't change the layout of procfs files, applications parse
> these files and any modification can potentially break such tools.
Thanks for the review. I assumed that extending a procfile by adding
a column at the end is ok and that tools must cope with that anyway.
(i.e. like it's been done in f19c29e3e391a66a273e9afebaf01917245148cd)
> Secondly, even if this change were acceptable, targeting this
> change at anything other than the net-next tree is not appropriate
> because it is a new feature.
Ok. Unless you are against adding a column, I'll resubmit to net-next
later this week.
Thanks,
Michele
--
Michele Baldessari <michele@acksyn.org>
C2A5 9DA3 9961 4FFB E01B D0BC DDD4 DCCB 7515 5C6D
^ permalink raw reply
* [PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-28 7:42 UTC (permalink / raw)
To: netdev; +Cc: davem, ben, lorenzo, hannes, Erik Kline
Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes. Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.
This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).
The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s)), but not in the case of multiple
distinct networks.
For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try to use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.
Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts. If DAD fails, a separate
RTM_DELADDR is always sent.
Also: add an entry in ip-sysctl.txt for optimistic_dad.
Signed-off-by: Erik Kline <ek@google.com>
---
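The LTE/wifi scenario described in the commit message can be modelled with a very reduced ranking. This is a sketch under stated assumptions, not the code in addrconf.c: it keeps only three ordered rules (avoid deprecated, prefer the outgoing interface, prefer non-optimistic), and the cand_model struct and *_model functions are hypothetical names. The position of the last rule is an assumption made for illustration.

```c
#include <stdbool.h>

/* One candidate source address in the reduced model. */
struct cand_model {
	bool deprecated;	/* address is deprecated */
	bool optimistic;	/* address is optimistic (DAD still running) */
	bool on_out_iface;	/* address belongs to the outgoing interface */
};

/* Builds a comparable score: earlier rules occupy higher bits. */
static int cand_score_model(const struct cand_model *c, bool use_optimistic)
{
	/* Without the sysctl, optimistic addresses are ranked like deprecated ones. */
	bool avoid = c->deprecated || (c->optimistic && !use_optimistic);
	int score = 0;

	score |= (avoid ? 0 : 1) << 2;			/* avoid deprecated */
	score |= (c->on_out_iface ? 1 : 0) << 1;	/* prefer outgoing interface */
	score |= (c->optimistic ? 0 : 1) << 0;		/* preferred beats optimistic */
	return score;
}

/* Returns the index of the best candidate, or -1 if n == 0. */
static int pick_source_model(const struct cand_model *c, int n, bool use_optimistic)
{
	int best = -1, best_score = -1;

	for (int i = 0; i < n; i++) {
		int s = cand_score_model(&c[i], use_optimistic);

		if (s > best_score) {
			best_score = s;
			best = i;
		}
	}
	return best;
}
```

With a preferred LTE address off the outgoing interface and an optimistic wifi address on it, the model picks the LTE address when use_optimistic is off and the wifi address when it is on. On a single interface, a preferred address still wins over an optimistic one.
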
Documentation/networking/ip-sysctl.txt | 13 +++++++++
include/linux/ipv6.h | 1 +
include/uapi/linux/ipv6.h | 1 +
net/ipv6/addrconf.c | 52 ++++++++++++++++++++++++++++++++--
4 files changed, 64 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 0307e28..e03cf49 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1452,6 +1452,19 @@ suppress_frag_ndisc - INTEGER
1 - (default) discard fragmented neighbor discovery packets
0 - allow fragmented neighbor discovery packets
+optimistic_dad - BOOLEAN
+ Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+ 0: disabled (default)
+ 1: enabled
+
+use_optimistic - BOOLEAN
+ If enabled, do not classify optimistic addresses as deprecated during
+ source address selection. Preferred addresses will still be chosen
+ before optimistic addresses, subject to other ranking in the source
+ address selection algorithm.
+ 0: disabled (default)
+ 1: enabled
+
icmp/*:
ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ff56053..7121a2e 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -42,6 +42,7 @@ struct ipv6_devconf {
__s32 accept_ra_from_local;
#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
__s32 optimistic_dad;
+ __s32 use_optimistic;
#endif
#ifdef CONFIG_IPV6_MROUTE
__s32 mc_forwarding;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index efa2666..e863d08 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -164,6 +164,7 @@ enum {
DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL,
DEVCONF_SUPPRESS_FRAG_NDISC,
DEVCONF_ACCEPT_RA_FROM_LOCAL,
+ DEVCONF_USE_OPTIMISTIC,
DEVCONF_MAX
};
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 50b95b2..7161743 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1170,6 +1170,9 @@ enum {
IPV6_SADDR_RULE_PRIVACY,
IPV6_SADDR_RULE_ORCHID,
IPV6_SADDR_RULE_PREFIX,
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ IPV6_SADDR_RULE_NOT_OPTIMISTIC,
+#endif
IPV6_SADDR_RULE_MAX
};
@@ -1197,6 +1200,15 @@ static inline int ipv6_saddr_preferred(int type)
return 0;
}
+static inline bool ipv6_use_optimistic_addr(struct inet6_dev *idev)
+{
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ return idev && idev->cnf.optimistic_dad && idev->cnf.use_optimistic;
+#else
+ return false;
+#endif
+}
+
static int ipv6_get_saddr_eval(struct net *net,
struct ipv6_saddr_score *score,
struct ipv6_saddr_dst *dst,
@@ -1257,10 +1269,16 @@ static int ipv6_get_saddr_eval(struct net *net,
score->scopedist = ret;
break;
case IPV6_SADDR_RULE_PREFERRED:
+ {
/* Rule 3: Avoid deprecated and optimistic addresses */
+ u8 avoid = IFA_F_DEPRECATED;
+
+ if (!ipv6_use_optimistic_addr(score->ifa->idev))
+ avoid |= IFA_F_OPTIMISTIC;
ret = ipv6_saddr_preferred(score->addr_type) ||
- !(score->ifa->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC));
+ !(score->ifa->flags & avoid);
break;
+ }
#ifdef CONFIG_IPV6_MIP6
case IPV6_SADDR_RULE_HOA:
{
@@ -1306,6 +1324,14 @@ static int ipv6_get_saddr_eval(struct net *net,
ret = score->ifa->prefix_len;
score->matchlen = ret;
break;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ case IPV6_SADDR_RULE_NOT_OPTIMISTIC:
+ /* Optimistic addresses still have lower precedence than other
+ * preferred addresses.
+ */
+ ret = !(score->ifa->flags & IFA_F_OPTIMISTIC);
+ break;
+#endif
default:
ret = 0;
}
@@ -3222,8 +3248,15 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
* Optimistic nodes can start receiving
* Frames right away
*/
- if (ifp->flags & IFA_F_OPTIMISTIC)
+ if (ifp->flags & IFA_F_OPTIMISTIC) {
ip6_ins_rt(ifp->rt);
+ if (ipv6_use_optimistic_addr(idev)) {
+ /* Because optimistic nodes can use this address,
+ * notify listeners. If DAD fails, RTM_DELADDR is sent.
+ */
+ ipv6_ifa_notify(RTM_NEWADDR, ifp);
+ }
+ }
addrconf_dad_kick(ifp);
out:
@@ -3354,7 +3387,11 @@ static void addrconf_dad_completed(struct inet6_ifaddr *ifp)
* Configure the address for reception. Now it is valid.
*/
- ipv6_ifa_notify(RTM_NEWADDR, ifp);
+ /* If optimistic DAD is in use, the notification was already sent
+ * in addrconf_dad_begin().
+ */
+ if (!ipv6_use_optimistic_addr(ifp->idev))
+ ipv6_ifa_notify(RTM_NEWADDR, ifp);
/* If added prefix is link local and we are prepared to process
router advertisements, start sending router solicitations.
@@ -4330,6 +4367,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_SOURCE_ROUTE] = cnf->accept_source_route;
#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
array[DEVCONF_OPTIMISTIC_DAD] = cnf->optimistic_dad;
+ array[DEVCONF_USE_OPTIMISTIC] = cnf->use_optimistic;
#endif
#ifdef CONFIG_IPV6_MROUTE
array[DEVCONF_MC_FORWARDING] = cnf->mc_forwarding;
@@ -5155,6 +5193,14 @@ static struct addrconf_sysctl_table
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "use_optimistic",
+ .data = &ipv6_devconf.use_optimistic,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+
+ },
#endif
#ifdef CONFIG_IPV6_MROUTE
{
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* Re: [PATCH] ovs: Turn vports with dependencies into separate modules
From: Thomas Graf @ 2014-10-28 8:10 UTC (permalink / raw)
To: Pravin Shelar; +Cc: dev@openvswitch.org, netdev
In-Reply-To: <CALnjE+p5b5EzLkY6_6J7jvwyf9rUdu-JGbdf5r3bDuhgndoeeg@mail.gmail.com>
On 10/27/14 at 05:27pm, Pravin Shelar wrote:
> On Mon, Oct 27, 2014 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > What I mean specifically is the following dependency logic which will
> > no longer be required:
> >
> > depends on NET_IPGRE_DEMUX && !(OPENVSWITCH=y && NET_IPGRE_DEMUX=m)
> >
> > The patch also brings additional flexibility to users of
> > distributions. Distros typically ship something like an allmodconfig
> > so a user can either run openvswitch.ko with all encaps compiled in
> > or not run openvswitch.ko. With vports as module, a user can blacklist
> > a certain encap type.
> >
> > Another advantage is obviously that users can run additional vport
> > types on top of their distribution kernels.
> >
> > Is there anything specific that you are concerned with in regard
> > to this proposed change?
>
> OVS vport code is not a lot and making it a pluggable module does not save
> much space. Even with this patch a user cannot load any vport type
> since we still need to define the type in the kernel interface and add the
> support in the userspace netdev layer. Therefore this patch adds
> complexity without much gain.
Defining the type in the header now only serves the purpose of
reserving unique vport types. It will be perfectly fine to compile a
vport module of a newer OVS user space against an older kernel (that
has the vport API) and load the vport module even though that kernel
version does not have any explicit awareness of that type. This is
something users of distribution kernels like to do because they
typically can't recompile the kernel without breaking support contracts.
^ permalink raw reply
* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Yanko Kaneti @ 2014-10-28 8:12 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Jay Vosburgh, Josh Boyer, Eric W. Biederman, Cong Wang,
Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141027174539.GC27568@linux.vnet.ibm.com>
On Mon-10/27/14-2014 10:45, Paul E. McKenney wrote:
> On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> > > Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> > > >> Looking at the dmesg, the early boot messages seem to be
> > > >> confused as to how many CPUs there are, e.g.,
> > > >>
> > > >> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> > > >> [ 0.000000] Hierarchical RCU implementation.
> > > >> [ 0.000000] RCU debugfs-based tracing is enabled.
> > > >> [ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
> > > >> [ 0.000000] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> > > >> [ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
> > > >> [ 0.000000] NR_IRQS:16640 nr_irqs:456 0
> > > >> [ 0.000000] Offload RCU callbacks from all CPUs
> > > >> [ 0.000000] Offload RCU callbacks from CPUs: 0-3.
> > > >>
> > > >> but later shows 2:
> > > >>
> > > >> [ 0.233703] x86: Booting SMP configuration:
> > > >> [ 0.236003] .... node #0, CPUs: #1
> > > >> [ 0.255528] x86: Booted up 1 node, 2 CPUs
> > > >>
> > > >> In any event, the E8400 is a 2 core CPU with no hyperthreading.
> > > >
> > > >Well, this might explain some of the difficulties. If RCU decides to wait
> > > >on CPUs that don't exist, we will of course get a hang. And rcu_barrier()
> > > >was definitely expecting four CPUs.
> > > >
> > > >So what happens if you boot with maxcpus=2? (Or build with
> > > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang. If so,
> > > >I might have some ideas for a real fix.
> > >
> > > Booting with maxcpus=2 makes no difference (the dmesg output is
> > > the same).
> > >
> > > Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> > > dmesg has different CPU information at boot:
> > >
> > > [ 0.000000] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> > > [ 0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > > [...]
> > > [ 0.000000] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1
> > > [...]
> > > [ 0.000000] Hierarchical RCU implementation.
> > > [ 0.000000] RCU debugfs-based tracing is enabled.
> > > [ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
> > > [ 0.000000] NR_IRQS:4352 nr_irqs:440 0
> > > [ 0.000000] Offload RCU callbacks from all CPUs
> > > [ 0.000000] Offload RCU callbacks from CPUs: 0-1.
> >
> > Thank you -- this confirms my suspicions on the fix, though I must admit
> > to being surprised that maxcpus made no difference.
>
> And here is an alleged fix, lightly tested at this end. Does this patch
> help?
Tested this on top of rc2 (as found in Fedora, and failing without the patch)
with all my modprobe scenarios and it seems to have fixed it.
Thanks
-Yanko
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_barrier() understand about missing rcuo kthreads
>
> Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> avoids creating rcuo kthreads for CPUs that never come online. This
> fixes a bug in many instances of firmware: Instead of lying about their
> age, these systems instead lie about the number of CPUs that they have.
> Before commit 35ce7f29a44a, this could result in huge numbers of useless
> rcuo kthreads being created.
>
> It appears that experience indicates that I should have told the
> people suffering from this problem to fix their broken firmware, but
> I instead produced what turned out to be a partial fix. The missing
> piece supplied by this commit makes sure that rcu_barrier() knows not to
> post callbacks for no-CBs CPUs that have not yet come online, because
> otherwise rcu_barrier() will hang on systems having firmware that lies
> about the number of CPUs.
>
> It is tempting to simply have rcu_barrier() refuse to post a callback on
> any no-CBs CPU that does not have an rcuo kthread. This unfortunately
> does not work because rcu_barrier() is required to wait for all pending
> callbacks. It is therefore required to wait even for those callbacks
> that cannot possibly be invoked. Even if doing so hangs the system.
>
> Given that posting a callback to a no-CBs CPU that does not yet have an
> rcuo kthread can hang rcu_barrier(), it is tempting to report an error
> in this case. Unfortunately, this will result in false positives at
> boot time, when it is perfectly legal to post callbacks to the boot CPU
> before the scheduler has started, in other words, before it is legal
> to invoke rcu_barrier().
>
> So this commit instead has rcu_barrier() avoid posting callbacks to
> CPUs having neither rcuo kthread nor pending callbacks, and has it
> complain bitterly if it finds CPUs having no rcuo kthread but some
> pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
> kthread but pending callbacks, as noted earlier, it has no choice but
> to hang indefinitely.
>
> Reported-by: Yanko Kaneti <yaneti@declera.com>
> Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
> Reported-by: Meelis Roos <mroos@linux.ee>
> Reported-by: Eric B Munson <emunson@akamai.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> index aa8e5eea3ab4..c78e88ce5ea3 100644
> --- a/include/trace/events/rcu.h
> +++ b/include/trace/events/rcu.h
> @@ -660,18 +660,18 @@ TRACE_EVENT(rcu_torture_read,
> /*
> * Tracepoint for _rcu_barrier() execution. The string "s" describes
> * the _rcu_barrier phase:
> - * "Begin": rcu_barrier_callback() started.
> - * "Check": rcu_barrier_callback() checking for piggybacking.
> - * "EarlyExit": rcu_barrier_callback() piggybacked, thus early exit.
> - * "Inc1": rcu_barrier_callback() piggyback check counter incremented.
> - * "Offline": rcu_barrier_callback() found offline CPU
> - * "OnlineNoCB": rcu_barrier_callback() found online no-CBs CPU.
> - * "OnlineQ": rcu_barrier_callback() found online CPU with callbacks.
> - * "OnlineNQ": rcu_barrier_callback() found online CPU, no callbacks.
> + * "Begin": _rcu_barrier() started.
> + * "Check": _rcu_barrier() checking for piggybacking.
> + * "EarlyExit": _rcu_barrier() piggybacked, thus early exit.
> + * "Inc1": _rcu_barrier() piggyback check counter incremented.
> + * "OfflineNoCB": _rcu_barrier() found callback on never-online CPU
> + * "OnlineNoCB": _rcu_barrier() found online no-CBs CPU.
> + * "OnlineQ": _rcu_barrier() found online CPU with callbacks.
> + * "OnlineNQ": _rcu_barrier() found online CPU, no callbacks.
> * "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
> * "CB": An rcu_barrier_callback() invoked a callback, not the last.
> * "LastCB": An rcu_barrier_callback() invoked the last callback.
> - * "Inc2": rcu_barrier_callback() piggyback check counter incremented.
> + * "Inc2": _rcu_barrier() piggyback check counter incremented.
> * The "cpu" argument is the CPU or -1 if meaningless, the "cnt" argument
> * is the count of remaining callbacks, and "done" is the piggybacking count.
> */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f6880052b917..7680fc275036 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3312,11 +3312,16 @@ static void _rcu_barrier(struct rcu_state *rsp)
> continue;
> rdp = per_cpu_ptr(rsp->rda, cpu);
> if (rcu_is_nocb_cpu(cpu)) {
> - _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> - rsp->n_barrier_done);
> - atomic_inc(&rsp->barrier_cpu_count);
> - __call_rcu(&rdp->barrier_head, rcu_barrier_callback,
> - rsp, cpu, 0);
> + if (!rcu_nocb_cpu_needs_barrier(rsp, cpu)) {
> + _rcu_barrier_trace(rsp, "OfflineNoCB", cpu,
> + rsp->n_barrier_done);
> + } else {
> + _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> + rsp->n_barrier_done);
> + atomic_inc(&rsp->barrier_cpu_count);
> + __call_rcu(&rdp->barrier_head,
> + rcu_barrier_callback, rsp, cpu, 0);
> + }
> } else if (ACCESS_ONCE(rdp->qlen)) {
> _rcu_barrier_trace(rsp, "OnlineQ", cpu,
> rsp->n_barrier_done);
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4beab3d2328c..8e7b1843896e 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -587,6 +587,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
> static void print_cpu_stall_info_end(void);
> static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> static void increment_cpu_stall_ticks(void);
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu);
> static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
> static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
> static void rcu_init_one_nocb(struct rcu_node *rnp);
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 927c17b081c7..68c5b23b7173 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2050,6 +2050,33 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> }
>
> /*
> + * Does the specified CPU need an RCU callback for the specified flavor
> + * of rcu_barrier()?
> + */
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> +{
> + struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
> + struct rcu_head *rhp;
> +
> + /* No-CBs CPUs might have callbacks on any of three lists. */
> + rhp = ACCESS_ONCE(rdp->nocb_head);
> + if (!rhp)
> + rhp = ACCESS_ONCE(rdp->nocb_gp_head);
> + if (!rhp)
> + rhp = ACCESS_ONCE(rdp->nocb_follower_head);
> +
> + /* Having no rcuo kthread but CBs after scheduler starts is bad! */
> + if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp) {
> + /* RCU callback enqueued before CPU first came online??? */
> + pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n",
> + cpu, rhp->func);
> + WARN_ON_ONCE(1);
> + }
> +
> + return !!rhp;
> +}
> +
> +/*
> * Enqueue the specified string of rcu_head structures onto the specified
> * CPU's no-CBs lists. The CPU is specified by rdp, the head of the
> * string by rhp, and the tail of the string by rhtp. The non-lazy/lazy
> @@ -2646,6 +2673,11 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
>
> #else /* #ifdef CONFIG_RCU_NOCB_CPU */
>
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> +{
> + return false; /* No no-CBs CPUs without CONFIG_RCU_NOCB_CPU. */
> +}
> +
> static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> {
> }
>
^ permalink raw reply
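The decision made by the rcu_nocb_cpu_needs_barrier() helper in the patch quoted above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the names model_cb, model_nocb_cpu, model_barrier_action and model_needs_barrier() are hypothetical, the three callback lists are plain pointers, and nothing about concurrency (ACCESS_ONCE) or the real rcu_data layout is modeled. As the quoted code is written, a CPU with no pending callbacks is skipped whether or not it has an rcuo kthread, and a CPU with callbacks but no kthread is still waited for after a warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued RCU callback. */
struct model_cb {
	int id;
};

/* Hypothetical stand-in for the no-CBs part of the per-CPU RCU data. */
struct model_nocb_cpu {
	struct model_cb *nocb_head;          /* callbacks not yet picked up */
	struct model_cb *nocb_gp_head;       /* callbacks waiting for a grace period */
	struct model_cb *nocb_follower_head; /* callbacks ready to be invoked */
	bool has_kthread;                    /* models rdp->nocb_kthread != NULL */
};

enum model_barrier_action {
	MODEL_BARRIER_SKIP,          /* nothing pending: do not post a callback */
	MODEL_BARRIER_POST,          /* pending callbacks: post and wait */
	MODEL_BARRIER_POST_AND_WARN  /* pending callbacks but no kthread: warn, then wait */
};

static enum model_barrier_action
model_needs_barrier(const struct model_nocb_cpu *cpu)
{
	/* Callbacks might sit on any of the three lists. */
	const struct model_cb *cb = cpu->nocb_head;

	if (!cb)
		cb = cpu->nocb_gp_head;
	if (!cb)
		cb = cpu->nocb_follower_head;

	if (!cb)
		return MODEL_BARRIER_SKIP;
	return cpu->has_kthread ? MODEL_BARRIER_POST
				: MODEL_BARRIER_POST_AND_WARN;
}
```

In this model, a CPU that firmware reported but that never came online has no kthread and no callbacks, so the barrier skips it instead of waiting forever.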
* Re: [PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Lorenzo Colitti @ 2014-10-28 8:19 UTC (permalink / raw)
To: Erik Kline
Cc: netdev@vger.kernel.org, David Miller, Ben Hutchings,
Hannes Frederic Sowa
In-Reply-To: <1414482141-27912-1-git-send-email-ek@google.com>
On Tue, Oct 28, 2014 at 4:42 PM, Erik Kline <ek@google.com> wrote:
> * Configure the address for reception. Now it is valid.
> */
>
> - ipv6_ifa_notify(RTM_NEWADDR, ifp);
> + /* If optimistic DAD is in use, the notification was already sent
> + * in addrconf_dad_begin().
> + */
> + if (!ipv6_use_optimistic_addr(ifp->idev))
> + ipv6_ifa_notify(RTM_NEWADDR, ifp);
Won't this result in not sending RTM_NEWADDR messages if
use_optimistic is enabled on the interface, but the IP address that
has just completed DAD is not an optimistic address (e.g., if it's a
manually-configured address)?
^ permalink raw reply
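The concern raised in the message above can be checked with a small userspace model that counts RTM_NEWADDR notifications per address. This is a sketch, not kernel code: model_dev, model_addr and the two counting functions are hypothetical names, and the model assumes every address goes through both the DAD-begin and the DAD-completed steps. The second function shows one possible alternative (always notify on completion); it is included only for comparison and is an assumption, not something the v1 patch contains.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-device knobs and the address flag. */
struct model_dev {
	bool optimistic_dad; /* models cnf.optimistic_dad */
	bool use_optimistic; /* models cnf.use_optimistic */
};

struct model_addr {
	bool optimistic; /* models IFA_F_OPTIMISTIC being set on the address */
};

static bool model_use_optimistic(const struct model_dev *dev)
{
	return dev && dev->optimistic_dad && dev->use_optimistic;
}

/* Notifications sent for one address with the v1 logic. */
static int model_newaddr_count_v1(const struct model_dev *dev,
				  const struct model_addr *addr)
{
	int sent = 0;

	/* DAD begin: early notification, optimistic addresses only. */
	if (addr->optimistic && model_use_optimistic(dev))
		sent++;
	/* DAD completed: suppressed whenever the device knob is on. */
	if (!model_use_optimistic(dev))
		sent++;
	return sent;
}

/* Alternative for comparison: always notify when DAD completes. */
static int model_newaddr_count_unconditional(const struct model_dev *dev,
					     const struct model_addr *addr)
{
	int sent = 0;

	if (addr->optimistic && model_use_optimistic(dev))
		sent++;
	sent++;
	return sent;
}
```

With both knobs enabled and a non-optimistic (for example manually configured) address, the v1 function returns 0, which is the missing-notification case described above; the unconditional variant returns 1 for that address and 2 for an optimistic one.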
* Re: [PATCH iproute2] xfrm: add support of ESN and anti-replay window
From: Rongqing Li @ 2014-10-28 8:30 UTC (permalink / raw)
To: Nicolas Dichtel; +Cc: shemminger, netdev, dingzhi, Adrien Mazarguil
In-Reply-To: <1413796984-9867-1-git-send-email-nicolas.dichtel@6wind.com>
On 10/20/2014 05:23 PM, Nicolas Dichtel wrote:
> From: dingzhi <zhi.ding@6wind.com>
>
> This patch allows configuring ESN and the anti-replay window.
>
> Signed-off-by: dingzhi <zhi.ding@6wind.com>
> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
> ip/ipxfrm.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ip/xfrm_state.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
Docs or man needs to be updated.
-Roy
^ permalink raw reply
* Re: [PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-28 8:59 UTC (permalink / raw)
To: Lorenzo Colitti
Cc: netdev@vger.kernel.org, David Miller, Ben Hutchings,
Hannes Frederic Sowa
In-Reply-To: <CAKD1Yr13oLyBsbuh92F91Fjno3Xv4vBQ1B__8Csoi0XUjJv9nw@mail.gmail.com>
>> * Configure the address for reception. Now it is valid.
>> */
>>
>> - ipv6_ifa_notify(RTM_NEWADDR, ifp);
>> + /* If optimistic DAD is in use, the notification was already sent
>> + * in addrconf_dad_begin().
>> + */
>> + if (!ipv6_use_optimistic_addr(ifp->idev))
>> + ipv6_ifa_notify(RTM_NEWADDR, ifp);
>
> Won't this result in not sending RTM_NEWADDR messages if
> use_optimistic is enabled on the interface, but the IP address that
> has just completed DAD is not an optimistic address (e.g., if it's a
> manually-configured address)?
Gah, yes. I originally unconditionally sent the RTM_NEWADDR, but
there was some concern about sending duplicates so this was a weak
attempt to reduce spurious messages.
I still think sending the RTM_NEWADDR unconditionally is the right
thing, since we send them all the time when something about the
address changes (including timer-refresh on receipt of RAs).
I'll revert that bit and send an updated patch ASAP.
Thanks.
^ permalink raw reply
* [PATCH v2 net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-28 9:11 UTC (permalink / raw)
To: netdev; +Cc: davem, ben, lorenzo, hannes, Erik Kline
Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes. Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.
This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).
The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s)), but not in the multiple distinct
networks case.
For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try to use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.
Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMISTIC
flag appropriately set). A second RTM_NEWADDR is sent if DAD
completes (the address flags have changed), otherwise an
RTM_DELADDR is sent.
Also: add an entry in ip-sysctl.txt for optimistic_dad.
Signed-off-by: Erik Kline <ek@google.com>
---
Documentation/networking/ip-sysctl.txt | 13 ++++++++++
include/linux/ipv6.h | 1 +
include/uapi/linux/ipv6.h | 1 +
net/ipv6/addrconf.c | 46 ++++++++++++++++++++++++++++++++--
4 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 0307e28..e03cf49 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1452,6 +1452,19 @@ suppress_frag_ndisc - INTEGER
1 - (default) discard fragmented neighbor discovery packets
0 - allow fragmented neighbor discovery packets
+optimistic_dad - BOOLEAN
+ Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+ 0: disabled (default)
+ 1: enabled
+
+use_optimistic - BOOLEAN
+ If enabled, do not classify optimistic addresses as deprecated during
+ source address selection. Preferred addresses will still be chosen
+ before optimistic addresses, subject to other ranking in the source
+ address selection algorithm.
+ 0: disabled (default)
+ 1: enabled
+
icmp/*:
ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ff56053..7121a2e 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -42,6 +42,7 @@ struct ipv6_devconf {
__s32 accept_ra_from_local;
#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
__s32 optimistic_dad;
+ __s32 use_optimistic;
#endif
#ifdef CONFIG_IPV6_MROUTE
__s32 mc_forwarding;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index efa2666..e863d08 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -164,6 +164,7 @@ enum {
DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL,
DEVCONF_SUPPRESS_FRAG_NDISC,
DEVCONF_ACCEPT_RA_FROM_LOCAL,
+ DEVCONF_USE_OPTIMISTIC,
DEVCONF_MAX
};
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 50b95b2..8d12b7c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1170,6 +1170,9 @@ enum {
IPV6_SADDR_RULE_PRIVACY,
IPV6_SADDR_RULE_ORCHID,
IPV6_SADDR_RULE_PREFIX,
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ IPV6_SADDR_RULE_NOT_OPTIMISTIC,
+#endif
IPV6_SADDR_RULE_MAX
};
@@ -1197,6 +1200,15 @@ static inline int ipv6_saddr_preferred(int type)
return 0;
}
+static inline bool ipv6_use_optimistic_addr(struct inet6_dev *idev)
+{
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ return idev && idev->cnf.optimistic_dad && idev->cnf.use_optimistic;
+#else
+ return false;
+#endif
+}
+
static int ipv6_get_saddr_eval(struct net *net,
struct ipv6_saddr_score *score,
struct ipv6_saddr_dst *dst,
@@ -1257,10 +1269,16 @@ static int ipv6_get_saddr_eval(struct net *net,
score->scopedist = ret;
break;
case IPV6_SADDR_RULE_PREFERRED:
+ {
/* Rule 3: Avoid deprecated and optimistic addresses */
+ u8 avoid = IFA_F_DEPRECATED;
+
+ if (!ipv6_use_optimistic_addr(score->ifa->idev))
+ avoid |= IFA_F_OPTIMISTIC;
ret = ipv6_saddr_preferred(score->addr_type) ||
- !(score->ifa->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC));
+ !(score->ifa->flags & avoid);
break;
+ }
#ifdef CONFIG_IPV6_MIP6
case IPV6_SADDR_RULE_HOA:
{
@@ -1306,6 +1324,14 @@ static int ipv6_get_saddr_eval(struct net *net,
ret = score->ifa->prefix_len;
score->matchlen = ret;
break;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+ case IPV6_SADDR_RULE_NOT_OPTIMISTIC:
+ /* Optimistic addresses still have lower precedence than other
+ * preferred addresses.
+ */
+ ret = !(score->ifa->flags & IFA_F_OPTIMISTIC);
+ break;
+#endif
default:
ret = 0;
}
@@ -3222,8 +3248,15 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
* Optimistic nodes can start receiving
* Frames right away
*/
- if (ifp->flags & IFA_F_OPTIMISTIC)
+ if (ifp->flags & IFA_F_OPTIMISTIC) {
ip6_ins_rt(ifp->rt);
+ if (ipv6_use_optimistic_addr(idev)) {
+ /* Because optimistic nodes can use this address,
+ * notify listeners. If DAD fails, RTM_DELADDR is sent.
+ */
+ ipv6_ifa_notify(RTM_NEWADDR, ifp);
+ }
+ }
addrconf_dad_kick(ifp);
out:
@@ -4330,6 +4363,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_SOURCE_ROUTE] = cnf->accept_source_route;
#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
array[DEVCONF_OPTIMISTIC_DAD] = cnf->optimistic_dad;
+ array[DEVCONF_USE_OPTIMISTIC] = cnf->use_optimistic;
#endif
#ifdef CONFIG_IPV6_MROUTE
array[DEVCONF_MC_FORWARDING] = cnf->mc_forwarding;
@@ -5155,6 +5189,14 @@ static struct addrconf_sysctl_table
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "use_optimistic",
+ .data = &ipv6_devconf.use_optimistic,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+
+ },
#endif
#ifdef CONFIG_IPV6_MROUTE
{
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
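The effect of use_optimistic on source address selection, as described in the patch above, can be sketched with a reduced userspace model. Only three rules are modeled, in the order they are evaluated: the "avoid deprecated/optimistic" rule changed by the patch, the "prefer outgoing interface" rule (RFC 6724 Rule 5), and the new final tie-breaker against optimistic addresses. All names (model_cand, model_compare, the MODEL_F_* flags) are hypothetical, all other rules are assumed to tie, and the ipv6_saddr_preferred() shortcut is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_F_OPTIMISTIC 0x04u /* stands in for IFA_F_OPTIMISTIC */
#define MODEL_F_DEPRECATED 0x20u /* stands in for IFA_F_DEPRECATED */

/* Hypothetical candidate source address. */
struct model_cand {
	unsigned int flags;
	bool on_outgoing_if;     /* address is on the route's outgoing interface */
	bool dev_use_optimistic; /* optimistic_dad && use_optimistic on its device */
};

/* Rule 3 as modified by the patch: optimistic is avoided unless the knob is on. */
static int model_rule_preferred(const struct model_cand *c)
{
	unsigned int avoid = MODEL_F_DEPRECATED;

	if (!c->dev_use_optimistic)
		avoid |= MODEL_F_OPTIMISTIC;
	return !(c->flags & avoid);
}

/* Rule 5: prefer an address on the outgoing interface. */
static int model_rule_oif(const struct model_cand *c)
{
	return c->on_outgoing_if ? 1 : 0;
}

/* New last rule: non-optimistic addresses win a remaining tie. */
static int model_rule_not_optimistic(const struct model_cand *c)
{
	return !(c->flags & MODEL_F_OPTIMISTIC);
}

/* Returns > 0 if a is chosen over b, < 0 if b is chosen, 0 on a full tie. */
static int model_compare(const struct model_cand *a, const struct model_cand *b)
{
	int d;

	d = model_rule_preferred(a) - model_rule_preferred(b);
	if (d)
		return d;
	d = model_rule_oif(a) - model_rule_oif(b);
	if (d)
		return d;
	return model_rule_not_optimistic(a) - model_rule_not_optimistic(b);
}
```

In the LTE/wifi scenario from the commit message, a route via wifi compares an optimistic wifi address (on the outgoing interface) against a preferred LTE address. With the knob off, the LTE address wins at the first rule; with the knob on, the two tie there and the wifi address wins on the outgoing-interface rule. Two addresses on the same interface still resolve in favour of the non-optimistic one.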
* [PATCH net] inet: frags: fix a race between inet_evict_bucket and inet_frag_kill
From: Nikolay Aleksandrov @ 2014-10-28 9:30 UTC (permalink / raw)
To: netdev; +Cc: Nikolay Aleksandrov, Florian Westphal, Eric Dumazet,
Patrick McLean
In-Reply-To: <1414455409.4845.1.camel@edumazet-glaptop2.roam.corp.google.com>
When the evictor is running, it adds the frags it has chosen to a local
list to be evicted once the chain lock has been released. At the same
time, *frag_queue can be running for some of the same queues and may
call inet_frag_kill, which will wait on the chain lock and then delete
the queue from the wrong list, since the queue has already been moved
to the eviction list. The fix is simple: check under the chain lock
whether the queue has the evict flag set before deleting it. This is
safe because the evict flag is set only under that lock, and having the
flag set also means that the queue has been detached from the chain
list, so there is no need to delete it again.
An important note to make is that we're safe w.r.t refcnt because
inet_frag_kill and inet_evict_bucket will sync on the del_timer operation
where only one of the two can succeed (or if the timer is executing -
none of them), the cases are:
1. inet_frag_kill succeeds in del_timer
- then the timer ref is removed, but inet_evict_bucket will not add
this queue to its expire list but will restart eviction in that chain
2. inet_evict_bucket succeeds in del_timer
- then the timer ref is kept until the evictor "expires" the queue, but
inet_frag_kill will remove the initial ref and will set
INET_FRAG_COMPLETE which will make the frag_expire fn just remove
its ref.
In the end all of the queue users will do an inet_frag_put and the one
that reaches 0 will free it. The refcount balance should be okay.
CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>
Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Patrick McLean <chutzpah@gentoo.org>
Tested-by: Patrick McLean <chutzpah@gentoo.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
A few more eyes to confirm all of this would be much appreciated.
net/ipv4/inet_fragment.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 9eb89f3f0ee4..894ec30c5896 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -285,7 +285,8 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
struct inet_frag_bucket *hb;
hb = get_frag_bucket_locked(fq, f);
- hlist_del(&fq->list);
+ if (!(fq->flags & INET_FRAG_EVICTED))
+ hlist_del(&fq->list);
spin_unlock(&hb->chain_lock);
}
--
1.9.3
^ permalink raw reply related
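The list handling behind the fix above can be sketched in userspace. This is a model under stated assumptions, not the kernel code: model_queue, model_evict() and model_fq_unlink() are hypothetical names, the list is a simplified hlist-style chain, the same link node is reused for the chain list and the evictor's local list, and the chain lock is assumed to be held around each call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FRAG_EVICTED 0x1u /* stands in for INET_FRAG_EVICTED */

/* Hypothetical frag queue with a single hlist-style link. */
struct model_queue {
	struct model_queue *next;
	struct model_queue **pprev;
	unsigned int flags;
};

static void model_list_add(struct model_queue **head, struct model_queue *q)
{
	q->next = *head;
	if (*head)
		(*head)->pprev = &q->next;
	*head = q;
	q->pprev = head;
}

static void model_list_del(struct model_queue *q)
{
	*q->pprev = q->next;
	if (q->next)
		q->next->pprev = q->pprev;
	q->next = NULL;
	q->pprev = NULL;
}

/* Evictor step: flag the queue and move it to the local expired list. */
static void model_evict(struct model_queue *q, struct model_queue **expired)
{
	q->flags |= MODEL_FRAG_EVICTED;
	model_list_del(q);
	model_list_add(expired, q);
}

/*
 * Unlink step with the fix: an evicted queue is already off the chain,
 * so it is left alone and stays on the evictor's list.
 * Returns true if the queue was removed from the chain here.
 */
static bool model_fq_unlink(struct model_queue *q)
{
	if (q->flags & MODEL_FRAG_EVICTED)
		return false;
	model_list_del(q);
	return true;
}
```

Without the flag check, the unlink step would remove an evicted queue from the evictor's local list while the evictor is still going to walk that list; with the check, the evicted queue stays where the evictor put it.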
* [PATCH net] inet: frags: remove the WARN_ON from inet_evict_bucket
From: Nikolay Aleksandrov @ 2014-10-28 9:44 UTC (permalink / raw)
To: netdev; +Cc: Nikolay Aleksandrov, Florian Westphal, Eric Dumazet,
Patrick McLean
In-Reply-To: <1414455409.4845.1.camel@edumazet-glaptop2.roam.corp.google.com>
The WARN_ON in inet_evict_bucket can be triggered by a valid case:
inet_frag_kill and inet_evict_bucket can be running in parallel on the
same queue, which means that at least one more ref has been added by a
previous inet_frag_find call. inet_frag_kill can then delete the timer
before inet_evict_bucket does, which causes the WARN_ON() there to
trigger since we'll have refcnt != 1. This case is valid because the
queue is being "killed" for some reason (removed from the chain list and
its timer deleted), so it will get destroyed in the end by whichever
inet_frag_put() call brings the refcnt to 0, i.e. the refcnt is still valid.
CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>
Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
Reported-by: Patrick McLean <chutzpah@gentoo.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
I'm sending this as a separate patch so the race fix doesn't get blocked
in case I'm wrong and also it's a different issue.
net/ipv4/inet_fragment.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 894ec30c5896..19419b60cb37 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -146,7 +146,6 @@ evict_again:
atomic_inc(&fq->refcnt);
spin_unlock(&hb->chain_lock);
del_timer_sync(&fq->timer);
- WARN_ON(atomic_read(&fq->refcnt) != 1);
inet_frag_put(fq, f);
goto evict_again;
}
--
1.9.3
^ permalink raw reply related