Re: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jay Patel <jaypatel@linux.ibm.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Feng Tang <feng.tang@intel.com>
Cc: "Sang, Oliver" <oliver.sang@intel.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	lkp <lkp@intel.com>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>,
	"cl@linux.com" <cl@linux.com>,
	"penberg@kernel.org" <penberg@kernel.org>,
	"rientjes@google.com" <rientjes@google.com>,
	"iamjoonsoo.kim@lge.com" <iamjoonsoo.kim@lge.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
	"tsahu@linux.ibm.com" <tsahu@linux.ibm.com>,
	"piyushs@linux.ibm.com" <piyushs@linux.ibm.com>
Subject: Re: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage
Date: Thu, 10 Aug 2023 16:08:56 +0530	[thread overview]
Message-ID: <5b07232a4bdbf99cdd117c595eb897bb4eeb02ce.camel@linux.ibm.com> (raw)
In-Reply-To: <91bd907e-adc0-d7c7-7eaa-da199689c99c@suse.cz>

On Wed, 2023-07-26 at 12:06 +0200, Vlastimil Babka wrote:
> On 7/25/23 05:13, Hyeonggon Yoo wrote:
> > On Mon, Jul 24, 2023 at 11:43 PM Feng Tang <feng.tang@intel.com>
> > wrote:
> > > On Thu, Jul 20, 2023 at 11:05:17PM +0800, Hyeonggon Yoo wrote:
> > > > > > > let me introduce our test process.
> > > > > > > 
> > > > > > > we make sure the tests upon commit and its parent have
> > > > > > > exact same environment
> > > > > > > except the kernel difference, and we also make sure the
> > > > > > > config to build the
> > > > > > > commit and its parent are identical.
> > > > > > > 
> > > > > > > we run tests for one commit at least 6 times to make sure
> > > > > > > the data is stable.
> > > > > > > 
> > > > > > > such like for this case, we rebuild the commit and its
> > > > > > > parent's kernel, the
> > > > > > > config is attached FYI.
> > > > > > 
> > > > > > Hello Oliver,
> > > > > > 
> > > > > > Thank you for confirming the testing environment is totally
> > > > > > fine.
> > > > > > and I'm sorry. I didn't mean to offend that your tests were
> > > > > > bad.
> > > > > > 
> > > > > > It was more like  "oh, the data totally doesn't make sense
> > > > > > to me"
> > > > > > and I blamed the tests rather than my poor understanding of
> > > > > > the data ;)
> > > > > > 
> > > > > > Anyway,
> > > > > > as the data shows a repeatable regression,
> > > > > > let's think more about the possible scenario:
> > > > > > 
> > > > > > I can't stop thinking that the patch must've affected the
> > > > > > system's
> > > > > > reclamation behavior in some way.
> > > > > > (I think more active anon pages with a similar number total
> > > > > > of anon
> > > > > > pages implies the kernel scanned more pages)
> > > > > > 
> > > > > > It might be because kswapd was more frequently woken up
> > > > > > (possible if
> > > > > > skbs were allocated with GFP_ATOMIC)
> > > > > > But the data provided is not enough to support this
> > > > > > argument.
> > > > > > 
> > > > > > >  2.43 ± 7% +4.5 6.90 ± 11% perf-profile.children.cycles-
> > > > > > > pp.get_partial_node
> > > > > > >  3.23 ±  5%      +4.5        7.77 ±  9%  perf-
> > > > > > > profile.children.cycles-pp.___slab_alloc
> > > > > > >  7.51 ±  2%      +4.6       12.11 ±  5%  perf-
> > > > > > > profile.children.cycles-pp.kmalloc_reserve
> > > > > > > 6.94 ±  2%      +4.7       11.62 ±  6%  perf-
> > > > > > > profile.children.cycles-pp.__kmalloc_node_track_caller
> > > > > > > 6.46 ±  2%      +4.8       11.22 ±  6%  perf-
> > > > > > > profile.children.cycles-pp.__kmem_cache_alloc_node
> > > > > > >  8.48 ±  4%      +7.9       16.42 ±  8%  perf-
> > > > > > > profile.children.cycles-pp._raw_spin_lock_irqsave
> > > > > > >  6.12 ±  6%      +8.6       14.74 ±  9%  perf-
> > > > > > > profile.children.cycles-
> > > > > > > pp.native_queued_spin_lock_slowpath
> > > > > > 
> > > > > > And this increased cycles in the SLUB slowpath implies that
> > > > > > the actual
> > > > > > number of objects available in
> > > > > > the per cpu partial list has been decreased, possibly
> > > > > > because of
> > > > > > inaccuracy in the heuristic?
> > > > > > (cuz the assumption that slabs cached per are half-filled,
> > > > > > and that
> > > > > > slabs' order is s->oo)
> > > > > 
> > > > > From the patch:
> > > > > 
> > > > >  static unsigned int slub_max_order =
> > > > > -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 :
> > > > > PAGE_ALLOC_COSTLY_ORDER;
> > > > > +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > > > 
> > > > > Could this be related? that it reduces the order for some
> > > > > slab cache,
> > > > > so each per-cpu slab will has less objects, which makes the
> > > > > contention
> > > > > for per-node spinlock 'list_lock' more severe when the slab
> > > > > allocation
> > > > > is under pressure from many concurrent threads.
> > > > 
> > > > hackbench uses skbuff_head_cache intensively. So we need to
> > > > check if
> > > > skbuff_head_cache's
> > > > order was increased or decreased. On my desktop
> > > > skbuff_head_cache's
> > > > order is 1 and I roughly
> > > > guessed it was increased, (but it's still worth checking in the
> > > > testing env)
> > > > 
> > > > But decreased slab order does not necessarily mean decreased
> > > > number
> > > > of cached objects per CPU, because when oo_order(s->oo) is
> > > > smaller,
> > > > then it caches
> > > > more slabs into the per cpu slab list.
> > > > 
> > > > I think more problematic situation is when oo_order(s->oo) is
> > > > higher,
> > > > because the heuristic
> > > > in SLUB assumes that each slab has order of oo_order(s->oo) and
> > > > it's
> > > > half-filled. if it allocates
> > > > slabs with order lower than oo_order(s->oo), the number of
> > > > cached
> > > > objects per CPU
> > > > decreases drastically due to the inaccurate assumption.
> > > > 
> > > > So yeah, decreased number of cached objects per CPU could be
> > > > the cause
> > > > of the regression due to the heuristic.
> > > > 
> > > > And I have another theory: it allocated high order slabs from
> > > > remote node
> > > > even if there are slabs with lower order in the local node.
> > > > 
> > > > ofc we need further experiment, but I think both improving the
> > > > accuracy of heuristic and
> > > > avoiding allocating high order slabs from remote nodes would
> > > > make SLUB
> > > > more robust.
> > > 
> > > I run the reproduce command in a local 2-socket box:
> > > 
> > > "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l"
> > > "30000" "-s" "100"
> > > 
> > > And found 2 kmem_cache has been boost: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was
> > > bumped
> > > from 2 to 4. The setting of 'skbuff_head_cache' was kept
> > > unchanged.
> > > 
> > > And this compiled with the perf-profile info from 0Day's report,
> > > that the
> > > 'list_lock' contention is increased with the patch:
> > > 
> > >     13.71%    13.70%  [kernel.kallsyms]         [k]
> > > native_queued_spin_lock_slowpath                            -    
> > >   -
> > > 5.80%
> > > native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreez
> > > e_partials;skb_release_data;consume_skb;unix_stream_read_generic;
> > > unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_rea
> > > d;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > > 5.56%
> > > native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_parti
> > > al_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;
> > > __kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb
> > > _with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_i
> > > ter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwf
> > > rame;__libc_write
> > 
> > Oh... neither of the assumptions were not true.
> > AFAICS it's a case of decreasing slab order increases lock
> > contention,
> 
> Oh good, that would be the least surprising result, at least :) Yeah
> I've
> pointed out in my reply to this v2 that this patch should not result
> in
> decreasing slab order, at least for 4k pages.
> 
> The v3/v4 is indeed different in that it only affects 64k pages. But
> the
> inital goal from v1 to increase the order for 4k is also no longer
> there.
> Maybe that's fine as there's two things to consider here IMHO. 1) the
> order
> could be increased for 4k pages for some cache sizes to minimize
> waste
> (that's what v1 did, but also for 64k where it was not an
> improvement) 2)
> the orders we have might be too large for 64k pages. Now v4 addresses
> 2)
> AFAICS. We could return also to 1) separately if it shows benefits.
> 
Yes, so with V4 currently targeting larger page size for slub memory
wastage reduction, but will also work on point 1 later on as it shows
some benefits :) 
  
> In any case it means the benchmark results on v2 are no longer
> applicable,
> so we could move the discussion to v4:
> 
> https://lore.kernel.org/all/20230720102337.2069722-1-jaypatel@linux.ibm.com/
> 
So any reviews/feedbacks for V4.
 
> Now I noticed in v4 there's only M: folks from the MAINTAINERS slab
> section
> on Cc: but not R: folks. Next time please Cc: also R: (Hyeonggon and
> Roman).
> Thanks!
> 
Sure next time will also add R: floks :) 

Thanks 
Jay Patel
> > The number of cached objects per CPU is mostly the same (not
> > exactly same,
> > because the cpu slab is not accounted for), but only increases the
> > number of slabs
> > to process while taking slabs (get_partial_node()), and flushing
> > the current
> > cpu partial list. (put_cpu_partial() -> __unfreeze_partials())
> > 
> > Can we do better in this situation? improve __unfreeze_partials()?
> > 
> > > Also I tried to restore the slub_max_order to 3, and the
> > > regression was
> > > gone.
> > > 
> > >  static unsigned int slub_max_order =
> > > -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 3;
> > >  static unsigned int slub_min_objects;
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > > > I don't have direct data to backup it, and I can try some
> > > > > experiment.
> > > > 
> > > > Thank you for taking time for experiment!
> > > > 
> > > > Thanks,
> > > > Hyeonggon
> > > > 
> > > > > > > then retest on this test machine:
> > > > > > > 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @
> > > > > > > 2.00GHz (Ice Lake) with 256G memory

     prev parent reply	other threads:[~2023-08-10 10:47 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-28  9:57 [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage Jay Patel
2023-07-03  0:13 ` David Rientjes
2023-07-03  8:39   ` Jay Patel
2023-07-09 14:42   ` Hyeonggon Yoo
2023-07-12 13:06 ` Vlastimil Babka
2023-07-20 10:30   ` Jay Patel
2023-07-17 13:41 ` kernel test robot
2023-07-18  6:43   ` Hyeonggon Yoo
2023-07-20  3:00     ` Oliver Sang
2023-07-20 12:59       ` Hyeonggon Yoo
2023-07-20 13:46         ` Hyeonggon Yoo
2023-07-20 14:15           ` Hyeonggon Yoo
2023-07-24  2:39             ` Oliver Sang
2023-07-31  9:49               ` Hyeonggon Yoo
2023-07-20 13:49         ` Feng Tang
2023-07-20 15:05           ` Hyeonggon Yoo
2023-07-21 14:50             ` Binder Makin
2023-07-21 15:39               ` Hyeonggon Yoo
2023-07-21 18:31                 ` Binder Makin
2023-07-24 14:35             ` Feng Tang
2023-07-25  3:13               ` Hyeonggon Yoo
2023-07-25  9:12                 ` Feng Tang
2023-08-29  8:30                   ` Feng Tang
2023-07-26 10:06                 ` Vlastimil Babka
2023-08-10 10:38                   ` Jay Patel [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5b07232a4bdbf99cdd117c595eb897bb4eeb02ce.camel@linux.ibm.com \
    --to=jaypatel@linux.ibm.com \
    --cc=42.hyeyoo@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=cl@linux.com \
    --cc=feng.tang@intel.com \
    --cc=fengwei.yin@intel.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=penberg@kernel.org \
    --cc=piyushs@linux.ibm.com \
    --cc=rientjes@google.com \
    --cc=tsahu@linux.ibm.com \
    --cc=vbabka@suse.cz \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.