From: Bharata B Rao <bharata@linux.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
aneesh.kumar@linux.ibm.com, Kirill Tkhai <ktkhai@virtuozzo.com>,
Roman Gushchin <guro@fb.com>, Shakeel Butt <shakeelb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: High kmalloc-32 slab cache consumption with 10k containers
Date: Wed, 7 Apr 2021 10:35:41 +0530
Message-ID: <20210407050541.GC1354243@in.ibm.com>
In-Reply-To: <20210406222807.GD1990290@dread.disaster.area>
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server (160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G), and within slab, the majority
> > comes from the kmalloc-32 cache (102G).
> >
> > The major consumer of the kmalloc-32 slab cache happens to be the
> > list_lru_one list allocations. These lists are created whenever an
> > FS mount happens. Specifically, two such lists are registered by alloc_super(),
> > one for the dentry and another for the inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids of them, to be precise).
> >
> > If,
> >
> > A = Nr allocation requests per mount: 2 (one each for the dentry and inode lists)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for all possible NUMA nodes.
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id()) and additional
> > list_lrus are created when it grows. Thus we end up creating list_lru_one
> > list_heads even for those memcgs which are yet to be created.
> > For example, when 10000 memcgs are created, memcg_nr_cache_ids reaches
> > a value of 12286.
>
> So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.
>
> So for that to make up 100GB of RAM, you must have somewhere over
> 500,000 mounted superblocks on the machine?
>
> That implies 50+ unique mounted superblocks per container, which
> seems like an awful lot.
Here is how the calculation works out in my setup:
Number of possible NUMA nodes = 2
Number of mounts per container = 7 (see below for which these are)
Number of list creation requests per mount = 2
Number of containers = 10000
memcg_nr_cache_ids for 10k containers = 12286
size of each kmalloc-32 object = 32 bytes
2*7*2*10000*12286*32 = 110082560000 bytes = 102.5G
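For reference, each of those 32-byte objects is a struct list_lru_one, which
in kernels of this vintage looks roughly like the below (24 bytes on 64-bit,
so it is served from the kmalloc-32 cache):

struct list_lru_one {
	struct list_head	list;		/* 16 bytes on 64-bit */
	/* may become negative during memcg reparenting */
	long			nr_items;	/* 8 bytes */
};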
>
> > - When a memcg goes offline, the list elements are drained to the parent
> > memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> > for non-existing memcgs remain and continue to contribute to the
> > kmalloc-32 allocation. This is presumably done for performance
> > reasons as they get reused when new memcgs are created, but they end up
> > consuming slab memory until then.
> > - In the case of containers, a few file systems get mounted and are specific
> > to the container namespace and hence to a particular memcg, but we
> > end up creating lists for all the memcgs.
> > As an example, if 7 FS mounts are done for every container, then when
> > 10k containers are created, we end up creating 2*7*12286 list_lru_one
> > lists per container for each NUMA node. In the container case, it appears
> > that elements will get added to only 2*7=14 of them.
>
> Yeah, at first glance this doesn't strike me as a problem with the
> list_lru structure, it smells more like a problem resulting from a
> huge number of superblock instantiations on the machine. Which,
> probably, mostly have no significant need for anything other than a
> single memcg awareness?
>
> Can you post a typical /proc/self/mounts output from one of these
> idle/empty containers so we can see exactly how many mounts and
> their type are being instantiated in each container?
Tracing type->name in alloc_super() shows the following 7 mounts for
every container:
3-2691 [041] .... 222.761041: alloc_super: fstype: mqueue
3-2692 [072] .... 222.812187: alloc_super: fstype: proc
3-2692 [072] .... 222.812261: alloc_super: fstype: tmpfs
3-2692 [072] .... 222.812329: alloc_super: fstype: devpts
3-2692 [072] .... 222.812392: alloc_super: fstype: tmpfs
3-2692 [072] .... 222.813102: alloc_super: fstype: tmpfs
3-2692 [072] .... 222.813159: alloc_super: fstype: tmpfs
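The above was collected with a quick debug hack along these lines, i.e. just
a trace_printk() at the top of alloc_super(), shown only for reference:

static struct super_block *alloc_super(struct file_system_type *type, int flags,
				       struct user_namespace *user_ns)
{
	/* quick hack: log every superblock allocation by fstype */
	trace_printk("fstype: %s\n", type->name);

	/* ... rest of alloc_super() unchanged ... */
}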
>
> > One straightforward way to prevent these excessive list_lru_one
> > allocations is to limit list_lru_one creation to only the
> > relevant memcg. However, I don't see an easy way to figure out
> > that relevant memcg from the FS mount path (alloc_super()).
>
> Superblocks have to support an unknown number of memcgs after they
> have been mounted. Bind mounts, child memcgs, etc. all mean that we
> can't just have a static, single mount-time memcg instantiation.
>
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists, nor about its performance impact,
> > the overall savings look good.
>
> Avoiding memory allocation in list_lru_add() was one of the main
> reasons for up-front static allocation of memcg lists. We cannot do
> memory allocation while callers are holding multiple spinlocks in
> core system algorithms (e.g. dentry_kill -> retain_dentry ->
> d_lru_add -> list_lru_add), let alone while holding an internal
> spinlock.
>
> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> path we know might have memory demand in the *hundreds of GB* range
> gets a NACK from me. It's a great idea, but it's just not a
> feasible, robust solution as proposed. Work out how to put the
> memory allocation outside all the locks (including caller locks) and
> it might be ok, but that's messy.
Ok, I see the problem, and it looks hard to move those allocations
outside of the locks.
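For reference, the lazy creation in my hack boils down to something like the
below (a simplified sketch, not the actual patch; the lazy_ name is mine and
the surrounding structures are only approximated). By the time we look up the
per-memcg list we already hold nlru->lock, and callers like d_lru_add() hold
dentry->d_lock on top of that, so the allocation is forced to be GFP_ATOMIC
and can fail:

static struct list_lru_one *
lazy_list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
{
	struct list_lru_memcg *memcg_lrus;
	struct list_lru_one *l;

	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
					       lockdep_is_held(&nlru->lock));
	l = memcg_lrus->lru[idx];
	if (!l) {
		/* nlru->lock (and typically the caller's locks) is held here,
		 * so we cannot sleep and may fail under memory pressure */
		l = kmalloc(sizeof(*l), GFP_ATOMIC);
		if (!l)
			return NULL;
		INIT_LIST_HEAD(&l->list);
		l->nr_items = 0;
		memcg_lrus->lru[idx] = l;
	}
	return l;
}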
>
> Another approach may be to identify filesystem types that do not
> need memcg awareness and feed that into alloc_super() to set/clear
> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> virtual filesystems that expose system information do not really
> need full memcg awareness because they are generally only visible to
> a single memcg instance...
This, however, seems like a feasible approach; let me check on this.
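For my own notes, the shape of that change might be roughly the below;
FS_MEMCG_UNAWARE is a hypothetical fs_flags bit used purely for illustration:

/* Hypothetical opt-out for simple, effectively single-memcg virtual
 * filesystems (mqueue, devpts, ...): skip the memcg-aware LRUs so the
 * per-memcg kmalloc-32 arrays are never allocated for them. */
#define FS_MEMCG_UNAWARE	(1 << 8)	/* hypothetical, for illustration */

	/* in alloc_super(), roughly: */
	bool memcg_aware = !(type->fs_flags & FS_MEMCG_UNAWARE);

	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
	if (memcg_aware)
		s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;

	if (memcg_aware) {
		if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
			goto fail;
		if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
			goto fail;
	} else {
		/* plain per-node lists, no per-memcg arrays */
		if (list_lru_init(&s->s_dentry_lru))
			goto fail;
		if (list_lru_init(&s->s_inode_lru))
			goto fail;
	}

The open question is how to decide which fstypes can safely opt out, given
the bind mount and child memcg concerns you raised above.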
Regards,
Bharata.