Date: Sun, 18 Feb 2024 11:25:29 -0800 (PST)
From: David Rientjes
To: Jianfeng Wang
cc: cl@linux.com, penberg@kernel.org, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, vbabka@suse.cz, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] slub: avoid scanning all partial slabs in get_slabinfo()
In-Reply-To: <20240215211457.32172-1-jianfeng.w.wang@oracle.com>
Message-ID: <6b58d81f-8e8f-3732-a5d4-40eece75013b@google.com>
References: <20240215211457.32172-1-jianfeng.w.wang@oracle.com>

On Thu, 15 Feb 2024, Jianfeng Wang wrote:

> When reading "/proc/slabinfo", the kernel needs to report the number of
> free objects for each kmem_cache. The current implementation relies on
> count_partial() that counts the number of free objects by scanning each
> kmem_cache_node's partial slab list and summing free objects from all
> partial slabs in the list. This process must hold per kmem_cache_node
> spinlock and disable IRQ. Consequently, it can block slab allocation
> requests on other CPU cores and cause timeouts for network devices etc.,
> if the partial slab list is long. In production, even NMI watchdog can
> be triggered because some slab caches have a long partial list: e.g.,
> for "buffer_head", the number of partial slabs was observed to be ~1M
> in one kmem_cache_node. This problem was also observed by several
> others [1-2] in the past.
>
> The fix is to maintain a counter of free objects for each kmem_cache.
> Then, in get_slabinfo(), use the counter rather than count_partial()
> when reporting the number of free objects for a slab cache. per-cpu
> counter is used to minimize atomic or lock operation.
>
> Benchmark: run hackbench on a dual-socket 72-CPU bare metal machine
> with 256 GB memory and Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.3 GHz.
> The command is "hackbench 18 thread 20000". Each group gets 10 runs.
>

This seems particularly intrusive for the common path to optimize for
reading of /proc/slabinfo, and that's shown in the benchmark result.

Could you discuss the /proc/slabinfo usage model a bit? It's not clear if
this is being continuously read, or whether even a single read in
isolation is problematic.

That said, optimizing for reading /proc/slabinfo at the cost of runtime
performance degradation doesn't sound like the right trade-off.

> Results:
> - Mainline:
> 21.0381 +- 0.0325 seconds time elapsed ( +- 0.15% )
> - Mainline w/ this patch:
> 21.1878 +- 0.0239 seconds time elapsed ( +- 0.11% )
>
> [1] https://lore.kernel.org/linux-mm/
> alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
> [2] https://lore.kernel.org/lkml/
> alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
>
> Signed-off-by: Jianfeng Wang
> ---
>  mm/slab.h | 4 ++++
>  mm/slub.c | 31 +++++++++++++++++++++++++++++--
>  2 files changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 54deeb0428c6..a0e7672ba648 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * Internal slab definitions
> @@ -277,6 +278,9 @@ struct kmem_cache {
>  	unsigned int red_left_pad;	/* Left redzone padding size */
>  	const char *name;	/* Name (only for display!) */
>  	struct list_head list;	/* List of slab caches */
> +#ifdef CONFIG_SLUB_DEBUG
> +	struct percpu_counter free_objects;
> +#endif
>  #ifdef CONFIG_SYSFS
>  	struct kobject kobj;	/* For sysfs */
>  #endif
> diff --git a/mm/slub.c b/mm/slub.c
> index 2ef88bbf56a3..44f8ded96574 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -736,6 +736,12 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
>  static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
>  static DEFINE_SPINLOCK(object_map_lock);
>
> +static inline void
> +__update_kmem_cache_free_objs(struct kmem_cache *s, s64 delta)
> +{
> +	percpu_counter_add_batch(&s->free_objects, delta, INT_MAX);
> +}
> +
>  static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
>  		       struct slab *slab)
>  {
> @@ -1829,6 +1835,9 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
>  	return flags | slub_debug_local;
>  }
>  #else /* !CONFIG_SLUB_DEBUG */
> +static inline void
> +__update_kmem_cache_free_objs(struct kmem_cache *s, s64 delta) {}
> +
>  static inline void setup_object_debug(struct kmem_cache *s, void *object) {}
>  static inline
>  void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
> @@ -2369,6 +2378,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	slab->inuse = 0;
>  	slab->frozen = 0;
>
> +	__update_kmem_cache_free_objs(s, slab->objects);
>  	account_slab(slab, oo_order(oo), s, flags);
>
>  	slab->slab_cache = s;
> @@ -2445,6 +2455,7 @@ static void free_slab(struct kmem_cache *s, struct slab *slab)
>  		call_rcu(&slab->rcu_head, rcu_free_slab);
>  	else
>  		__free_slab(s, slab);
> +	__update_kmem_cache_free_objs(s, -slab->objects);
>  }
>
>  static void discard_slab(struct kmem_cache *s, struct slab *slab)
> @@ -3859,6 +3870,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>  	 */
>  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
>
> +	if (object)
> +		__update_kmem_cache_free_objs(s, -1);
>  	return object;
>  }
>
> @@ -4235,6 +4248,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	unsigned long tid;
>  	void **freelist;
>
> +	__update_kmem_cache_free_objs(s, cnt);
>  redo:
>  	/*
>  	 * Determine the currently cpus per cpu slab.
> @@ -4286,6 +4300,7 @@ static void do_slab_free(struct kmem_cache *s,
>  				struct slab *slab, void *head, void *tail,
>  				int cnt, unsigned long addr)
>  {
> +	__update_kmem_cache_free_objs(s, cnt);
>  	__slab_free(s, slab, head, tail, cnt, addr);
>  }
>  #endif /* CONFIG_SLUB_TINY */
> @@ -4658,6 +4673,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  		memcg_slab_alloc_error_hook(s, size, objcg);
>  	}
>
> +	__update_kmem_cache_free_objs(s, -i);
>  	return i;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
> @@ -4899,6 +4915,9 @@ void __kmem_cache_release(struct kmem_cache *s)
>  	cache_random_seq_destroy(s);
>  #ifndef CONFIG_SLUB_TINY
>  	free_percpu(s->cpu_slab);
> +#endif
> +#ifdef CONFIG_SLUB_DEBUG
> +	percpu_counter_destroy(&s->free_objects);
>  #endif
>  	free_kmem_cache_nodes(s);
>  }
> @@ -5109,6 +5128,14 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  	s->random = get_random_long();
>  #endif
>
> +#ifdef CONFIG_SLUB_DEBUG
> +	int ret;
> +
> +	ret = percpu_counter_init(&s->free_objects, 0, GFP_KERNEL);
> +	if (ret)
> +		return ret;
> +#endif
> +
>  	if (!calculate_sizes(s))
>  		goto error;
>  	if (disable_higher_order_debug) {
> @@ -7100,15 +7127,15 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  {
>  	unsigned long nr_slabs = 0;
>  	unsigned long nr_objs = 0;
> -	unsigned long nr_free = 0;
> +	unsigned long nr_free;
>  	int node;
>  	struct kmem_cache_node *n;
>
>  	for_each_kmem_cache_node(s, node, n) {
>  		nr_slabs += node_nr_slabs(n);
>  		nr_objs += node_nr_objs(n);
> -		nr_free += count_partial(n, count_free);
>  	}
> +	nr_free = percpu_counter_sum_positive(&s->free_objects);
>
>  	sinfo->active_objs = nr_objs - nr_free;
>  	sinfo->num_objs = nr_objs;
> --
> 2.42.1
>
>
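
The trade-off debated in this thread can be illustrated outside the kernel. What follows is a minimal userspace sketch in C, not kernel code: toy_cache, toy_slab, and every toy_* function are hypothetical names, and a single long stands in for the patch's per-CPU percpu_counter. It contrasts counting free objects by walking a slab list (read cost proportional to list length, as with count_partial()) with keeping a running counter (constant-time read, but one extra update on every allocation and free). Unlike SLUB, the toy keeps every slab on one list, so both methods see the same slabs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_slab {
	unsigned int objects;	/* total objects in this slab */
	unsigned int inuse;	/* objects currently allocated */
	struct toy_slab *next;	/* next slab on the cache's list */
};

struct toy_cache {
	struct toy_slab *slabs;	/* list of all slabs (simplification) */
	long free_objects;	/* running count of free objects */
};

/* Read-side scan: cost grows with the number of slabs on the list. */
static unsigned long toy_count_by_scan(const struct toy_cache *c)
{
	unsigned long nr_free = 0;
	const struct toy_slab *s;

	for (s = c->slabs; s != NULL; s = s->next)
		nr_free += s->objects - s->inuse;
	return nr_free;
}

/* A new slab contributes all of its objects to the free count. */
static void toy_add_slab(struct toy_cache *c, struct toy_slab *s,
			 unsigned int objects)
{
	s->objects = objects;
	s->inuse = 0;
	s->next = c->slabs;
	c->slabs = s;
	c->free_objects += objects;
}

/* Allocation path: the counter update is extra work on every call. */
static int toy_alloc(struct toy_cache *c, struct toy_slab *s)
{
	if (s->inuse == s->objects)
		return -1;	/* slab is full */
	s->inuse++;
	c->free_objects--;
	return 0;
}

/* Free path: likewise pays for one counter update per call. */
static void toy_free(struct toy_cache *c, struct toy_slab *s)
{
	assert(s->inuse > 0);
	s->inuse--;
	c->free_objects++;
}

/* Returns 1 if the scan and the running counter agree. */
static int toy_demo(void)
{
	struct toy_cache c = { NULL, 0 };
	struct toy_slab a, b;

	toy_add_slab(&c, &a, 4);
	toy_add_slab(&c, &b, 8);
	toy_alloc(&c, &a);
	toy_alloc(&c, &b);
	toy_alloc(&c, &b);
	toy_free(&c, &b);

	return toy_count_by_scan(&c) == (unsigned long)c.free_objects;
}
```

In this sketch the scan pays only when the count is read, while the counter pays on every toy_alloc() and toy_free(). That per-operation cost on the common path is what the reply above questions.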