All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Florian Westphal <fw@strlen.de>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 4.12 36/52] Revert "net: use lib/percpu_counter API for fragmentation mem accounting"
Date: Mon, 18 Sep 2017 11:11:30 +0200	[thread overview]
Message-ID: <20170918091021.994120452@linuxfoundation.org> (raw)
In-Reply-To: <20170918091016.620101134@linuxfoundation.org>

4.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jesper Dangaard Brouer <brouer@redhat.com>


[ Upstream commit fb452a1aa3fd4034d7999e309c5466ff2d7005aa ]

This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.

There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.

We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:

1) On systems with CPUs > 24, the heavier fully locked
   __percpu_counter_sum() is always invoked, which will be more
   expensive than the atomic_t that is reverted to.

Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option.  To mitigate this, the batch size could be
decreased and thresh be increased.

2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
   CPU, before SKBs are pushed into sockets on remote CPUs.  Given
   NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
   likely be limited.  Thus, a fair chance that atomic add+dec happen
   on the same CPU.

Revert note that commit 1d6119baf061 ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.

Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/net/inet_frag.h  |   30 +++++++++---------------------
 net/ipv4/inet_fragment.c |    4 +---
 2 files changed, 10 insertions(+), 24 deletions(-)

--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,14 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
-#include <linux/percpu_counter.h>
-
 struct netns_frags {
-	/* The percpu_counter "mem" need to be cacheline aligned.
-	 *  mem.count must not share cacheline with other writers
-	 */
-	struct percpu_counter   mem ____cacheline_aligned_in_smp;
-
+	/* Keep atomic mem on separate cachelines in structs that include it */
+	atomic_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
 	int			timeout;
 	int			high_thresh;
@@ -110,11 +105,11 @@ void inet_frags_fini(struct inet_frags *
 
 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
-	return percpu_counter_init(&nf->mem, 0, GFP_KERNEL);
+	atomic_set(&nf->mem, 0);
+	return 0;
 }
 static inline void inet_frags_uninit_net(struct netns_frags *nf)
 {
-	percpu_counter_destroy(&nf->mem);
 }
 
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
@@ -140,31 +135,24 @@ static inline bool inet_frag_evicting(st
 
 /* Memory Tracking Functions. */
 
-/* The default percpu_counter batch size is not big enough to scale to
- * fragmentation mem acct sizes.
- * The mem size of a 64K fragment is approx:
- *  (44 fragments * 2944 truesize) + frag_queue struct(200) = 129736 bytes
- */
-static unsigned int frag_percpu_counter_batch = 130000;
-
 static inline int frag_mem_limit(struct netns_frags *nf)
 {
-	return percpu_counter_read(&nf->mem);
+	return atomic_read(&nf->mem);
 }
 
 static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	percpu_counter_add_batch(&nf->mem, -i, frag_percpu_counter_batch);
+	atomic_sub(i, &nf->mem);
 }
 
 static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	percpu_counter_add_batch(&nf->mem, i, frag_percpu_counter_batch);
+	atomic_add(i, &nf->mem);
 }
 
-static inline unsigned int sum_frag_mem_limit(struct netns_frags *nf)
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
-	return percpu_counter_sum_positive(&nf->mem);
+	return atomic_read(&nf->mem);
 }
 
 /* RFC 3168 support :
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -234,10 +234,8 @@ evict_again:
 	cond_resched();
 
 	if (read_seqretry(&f->rnd_seqlock, seq) ||
-	    percpu_counter_sum(&nf->mem))
+	    sum_frag_mem_limit(nf))
 		goto evict_again;
-
-	percpu_counter_destroy(&nf->mem);
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
 

  parent reply	other threads:[~2017-09-18  9:14 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-18  9:10 [PATCH 4.12 00/52] 4.12.14-stable review Greg Kroah-Hartman
2017-09-18  9:10 ` [PATCH 4.12 01/52] ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt() Greg Kroah-Hartman
2017-09-18  9:10 ` [PATCH 4.12 02/52] ipv6: add rcu grace period before freeing fib6_node Greg Kroah-Hartman
2017-09-18  9:10 ` [PATCH 4.12 03/52] ipv6: fix sparse warning on rt6i_node Greg Kroah-Hartman
2017-09-18  9:10 ` [PATCH 4.12 04/52] macsec: add genl family module alias Greg Kroah-Hartman
2017-09-18  9:10 ` [PATCH 4.12 05/52] udp: on peeking bad csum, drop packets even if not at head Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 06/52] bpf: fix map value attribute for hash of maps Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 07/52] fsl/man: Inherit parent device and of_node Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 08/52] sctp: Avoid out-of-bounds reads from address storage Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 09/52] qlge: avoid memcpy buffer overflow Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 10/52] tipc: Fix tipc_sk_reinit handling of -EAGAIN Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 11/52] net: systemport: Be drop monitor friendly Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 12/52] net: bcmgenet: " Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 13/52] net: systemport: Free DMA coherent descriptors on errors Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 14/52] netvsc: fix deadlock betwen link status and removal Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 15/52] udp6: set rx_dst_cookie on rx_dst updates Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 16/52] net: mvpp2: fix the mac address used when using PPv2.2 Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 17/52] cxgb4: Fix stack out-of-bounds read due to wrong size to t4_record_mbox() Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 18/52] ipv6: set dst.obsolete when a cached route has expired Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 19/52] ipv6: do not set sk_destruct in IPV6_ADDRFORM sockopt Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 20/52] packet: Dont write vnet header beyond end of buffer Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 21/52] kcm: do not attach PF_KCM sockets to avoid deadlock Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 22/52] net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278 Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 23/52] net/mlx5e: Check for qos capability in dcbnl_initialize Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 24/52] net/mlx5e: Fix DCB_CAP_ATTR_DCBX capability for DCBNL getcap Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 25/52] net/mlx5: Fix arm SRQ command for ISSI version 0 Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 26/52] net/mlx5e: Fix dangling page pointer on DMA mapping error Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 27/52] net/mlx5e: Dont override user RSS upon set channels Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 28/52] net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 29/52] net/mlx5: E-Switch, Unload the representors in the correct order Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 30/52] net/mlx5e: Fix inline header size for small packets Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 31/52] net/mlx5e: Fix CQ moderation mode not set properly Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 32/52] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 33/52] net: fec: Allow reception of frames bigger than 1522 bytes Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 34/52] mlxsw: spectrum: Forbid linking to devices that have uppers Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 35/52] bridge: switchdev: Clear forward mark when transmitting packet Greg Kroah-Hartman
2017-09-18  9:11 ` Greg Kroah-Hartman [this message]
2017-09-18  9:11 ` [PATCH 4.12 37/52] Revert "net: fix percpu memory leaks" Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 38/52] gianfar: Fix Tx flow control deactivation Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 39/52] vhost_net: correctly check tx avail during rx busy polling Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 40/52] ip6_gre: update mtu properly in ip6gre_err Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 41/52] ipv6: fix memory leak with multiple tables during netns destruction Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 42/52] ipv6: fix typo in fib6_net_exit() Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 43/52] sctp: fix missing wake ups in some situations Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 44/52] f2fs: let fill_super handle roll-forward errors Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 45/52] f2fs: check hot_data for roll-forward recovery Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 46/52] x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 47/52] x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 48/52] x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 49/52] fuse: allow server to run in different pid_ns Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 50/52] idr: remove WARN_ON_ONCE() when trying to replace negative ID Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 51/52] md/raid1/10: reset bio allocated from mempool Greg Kroah-Hartman
2017-09-18  9:11 ` [PATCH 4.12 52/52] md/raid5: release/flush io in raid5_do_work() Greg Kroah-Hartman
2017-09-18 14:22 ` [PATCH 4.12 00/52] 4.12.14-stable review Sudip Mukherjee
2017-09-19  6:34   ` Greg Kroah-Hartman
2017-09-20 12:15     ` Sudip Mukherjee
2017-09-18 19:28 ` Guenter Roeck
2017-09-18 20:14 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170918091021.994120452@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=brouer@redhat.com \
    --cc=davem@davemloft.net \
    --cc=fw@strlen.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.