From: Feng Tang <feng.tang@intel.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Vlastimil Babka <vbabka@suse.cz>,
David Rientjes <rientjes@google.com>
Cc: <yu.c.chen@intel.com>, "Sang, Oliver" <oliver.sang@intel.com>,
Jay Patel <jaypatel@linux.ibm.com>,
"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
lkp <lkp@intel.com>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
"Huang, Ying" <ying.huang@intel.com>,
"Yin, Fengwei" <fengwei.yin@intel.com>,
"cl@linux.com" <cl@linux.com>,
"penberg@kernel.org" <penberg@kernel.org>,
"iamjoonsoo.kim@lge.com" <iamjoonsoo.kim@lge.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
"tsahu@linux.ibm.com" <tsahu@linux.ibm.com>,
"piyushs@linux.ibm.com" <piyushs@linux.ibm.com>
Subject: Re: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage
Date: Tue, 29 Aug 2023 16:30:17 +0800
Message-ID: <ZO2smdi83wWwZBsm@feng-clx>
In-Reply-To: <ZL+R5kJpnHMUgGY2@feng-clx>
On Tue, Jul 25, 2023 at 05:20:01PM +0800, Tang, Feng wrote:
> On Tue, Jul 25, 2023 at 12:13:56PM +0900, Hyeonggon Yoo wrote:
> [...]
> > >
> > > I ran the reproduce command on a local 2-socket box:
> > >
> > > "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> > >
> > > And found 2 kmem_caches are heavily stressed: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 by the patch, while its 'cpu_partial_slabs' was bumped
> > > from 2 to 4. The settings of 'skbuff_head_cache' were kept unchanged.
> > >
> > > And this tallies with the perf-profile info from 0Day's report, which
> > > shows that the 'list_lock' contention is increased with the patch:
> > >
> > > 13.71% 13.70% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
> > > 5.80% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > > 5.56% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
> >
> > Oh... neither of the assumptions was true.
> > AFAICS it's a case where decreasing the slab order increases lock contention.
> >
> > The number of cached objects per CPU is mostly the same (not exactly the
> > same, because the cpu slab is not accounted for),
>
> Yes, this makes sense!
>
> > but the lower slab order increases the number of slabs that must be
> > processed when taking partial slabs (get_partial_node()) and when flushing
> > the current cpu partial list (put_cpu_partial() -> __unfreeze_partials()).
> >
> > Can we do better in this situation? improve __unfreeze_partials()?
>
> We can check that. IMHO, the current MIN_PARTIAL and MAX_PARTIAL are too
> small as global parameters, especially for server platforms with
> hundreds of GB or even TBs of memory.
>
> As for 'list_lock', I'm thinking of bumping the number of per-cpu
> objects in set_cpu_partial(), or at least giving users an option to do
> that for server platforms with a huge amount of memory. Will do some
> tests around it, and let 0Day's performance testing framework monitor
> for any regression.
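
For reference, if I remember the current code correctly, set_cpu_partial()
sizes the per-cpu partial list purely by object size, with no regard to
how many CPUs or how much memory the system has, roughly like below (a
sketch from memory, the exact numbers may differ in your tree):

        /* Sketch of the current size-based heuristic in mm/slub.c */
        static void set_cpu_partial(struct kmem_cache *s)
        {
        #ifdef CONFIG_SLUB_CPU_PARTIAL
                unsigned int nr_objects;

                if (!kmem_cache_has_cpu_partial(s))
                        nr_objects = 0;
                else if (s->size >= PAGE_SIZE)
                        nr_objects = 6;
                else if (s->size >= 1024)
                        nr_objects = 24;
                else if (s->size >= 256)
                        nr_objects = 52;
                else
                        nr_objects = 120;

                slub_set_cpu_partial(s, nr_objects);
        #endif
        }

so scaling those nr_objects thresholds up on big machines is one obvious
knob to turn.
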
Before this 'hackbench' performance regression, I had already noticed
other cases where the per-node 'list_lock' is contended. As one processor
(socket/node) can now have more and more CPUs (100+ or 200+), the
scalability problem could get much worse. So we may need to tackle it
sooner or later, and we will likely need to separate the handling of
large platforms, which suffer from the scalability issue, from that of
small platforms, which care more about memory footprint.
To address the scalability issue on large systems with many CPUs and a
lot of memory, I tried 3 hacky patches for a quick measurement:

1) increase MIN_PARTIAL and MAX_PARTIAL so that each node can keep
   more (up to 64) partial slabs
2) increase the order of each slab (including raising the max slub
   order to 4)
3) increase the number of per-cpu partial slabs

These patches are mostly independent of each other; a rough sketch of
what they touch follows below.
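
To be concrete, the hacks are little more than tweaks of constants and
heuristics in mm/slub.c, along these lines (the values below are only
illustrative, not the exact diffs; IIRC the upstream defaults are
MIN_PARTIAL 5, MAX_PARTIAL 10 and slub_max_order PAGE_ALLOC_COSTLY_ORDER,
i.e. 3):

        /* patch-1 (sketch): let each node keep many more partial slabs */
        #define MIN_PARTIAL     20      /* illustrative value, upstream is 5 */
        #define MAX_PARTIAL     64      /* upstream is 10 */

        /* patch-2 (sketch): allow higher-order slabs */
        static unsigned int slub_max_order = 4; /* upstream default is 3 */

        /*
         * patch-3 (sketch): keep more objects on the per-cpu partial lists,
         * e.g. by scaling the nr_objects thresholds in set_cpu_partial()
         * up by some factor.
         */
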
I ran the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores, 224 threads) with 256 GB DRAM,
in 3 configurations with parallel test threads at 25%, 50% and
100% of the number of CPUs. The data is below (base is the vanilla
v6.5 kernel):
base base + patch-1 base + patch-1,2 base + patch-1,2,3
config-25% 223670 -0.0% 223641 +24.2% 277734 +37.7% 307991 per_process_ops
config-50% 186172 +12.9% 210108 +42.4% 265028 +59.8% 297495 per_process_ops
config-100% 89289 +11.3% 99363 +47.4% 131571 +78.1% 158991 per_process_ops
And from perf-profile data, the spinlock contention has been
greatly reduced:
43.65 -5.8 37.81 -25.9 17.78 -34.4 9.24 self.native_queued_spin_lock_slowpath
Some more perf backtrace stack changes are:
50.86 -4.7 46.16 -9.2 41.65 -16.3 34.57 bt.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
52.99 -4.4 48.55 -8.1 44.93 -14.6 38.35 bt.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
53.79 -4.4 49.44 -7.6 46.17 -14.0 39.75 bt.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
54.11 -4.3 49.78 -7.5 46.65 -13.8 40.33 bt.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
54.21 -4.3 49.89 -7.4 46.81 -13.7 40.50 bt.entry_SYSCALL_64_after_hwframe.__mmap
55.21 -4.2 51.00 -6.8 48.40 -13.0 42.23 bt.__mmap
19.59 -4.1 15.44 -10.3 9.30 -12.6 7.00 bt.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate
20.25 -4.1 16.16 -9.8 10.40 -12.1 8.15 bt.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region
20.52 -4.1 16.46 -9.7 10.80 -11.9 8.60 bt.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap
21.27 -4.0 17.25 -9.4 11.87 -11.4 9.83 bt.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff
21.34 -4.0 17.33 -9.4 11.97 -11.4 9.95 bt.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64
2.60 -2.6 0.00 -2.6 0.00 -2.6 0.00 bt.get_partial_node.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk
2.77 -2.4 0.35 ± 70% -2.8 0.00 -2.8 0.00 bt.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes
Yu Chen also saw similar slub lock contention in a scheduler-related
'hackbench' test, and with these debug patches the contention was
also reduced: https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
I'll think about how to apply the changes only to big systems, e.g.
along the lines of the sketch below, and post them as RFC patches.
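
One rough direction is to pick bigger per-node partial limits at boot
only on large machines and keep today's small defaults otherwise. A
completely untested sketch, where the 128-CPU cutoff and the 20/64
values are arbitrary placeholders:

        /*
         * Idea sketch only: scale the per-node partial limits by machine
         * size instead of using one global constant.
         */
        static unsigned int slub_min_partial __read_mostly = MIN_PARTIAL;
        static unsigned int slub_max_partial __read_mostly = MAX_PARTIAL;

        static void __init init_partial_limits(void)
        {
                /* the "big system" threshold is arbitrary here */
                if (num_possible_cpus() >= 128) {
                        slub_min_partial = 20;
                        slub_max_partial = 64;
                }
        }

plus maybe a boot parameter so users of small machines can keep the
current behaviour.
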
Thanks,
Feng