All of lore.kernel.org
 help / color / mirror / Atom feed
From: Roman Gushchin <guro@fb.com>
To: Wen Yang <wenyang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Xunlei Pang <xlpang@linux.alibaba.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
Date: Tue, 18 Feb 2020 12:53:12 -0800	[thread overview]
Message-ID: <20200218205312.GA3156@carbon> (raw)
In-Reply-To: <b42f7daa-4aea-1cf8-5bbb-2cd5d48b4e9a@linux.alibaba.com>

On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> 
> 
> On 2020/2/13 6:52 上午, Andrew Morton wrote:
> > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> > 
> > > The lock, protecting the node partial list, is taken when couting the free
> > > objects resident in that list. It introduces locking contention when the
> > > page(s) is moved between CPU and node partial lists in allocation path
> > > on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > slab object is to carry network packet, targeting the far-end disk array,
> > > it causes block IO jitter issue.
> > > 
> > > This fixes the block IO jitter issue by caching the total inuse objects in
> > > the node in advance. The value is retrieved without taking the node partial
> > > list lock on reading "/proc/slabinfo".
> > > 
> > > ...
> > > 
> > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > >   static void discard_slab(struct kmem_cache *s, struct page *page)
> > >   {
> > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > +	int inuse = page->objects;
> > > +
> > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > 
> > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > 
> > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > and save a function argument.
> > 
> > If yes then why?
> > 
> 
> Thanks for your comments.
> We are happy to improve this patch based on your suggestions.
> 
> 
> When the user reads /proc/slabinfo, in order to obtain the active_objs
> information, the kernel traverses all slabs and executes the following code
> snippet:
> static unsigned long count_partial(struct kmem_cache_node *n,
>                                         int (*get_count)(struct page *))
> {
>         unsigned long flags;
>         unsigned long x = 0;
>         struct page *page;
> 
>         spin_lock_irqsave(&n->list_lock, flags);
>         list_for_each_entry(page, &n->partial, slab_list)
>                 x += get_count(page);
>         spin_unlock_irqrestore(&n->list_lock, flags);
>         return x;
> }
> 
> It may cause performance issues.
> 
> Christoph suggested "you could cache the value in the userspace application?
> Why is this value read continually?", But reading the /proc/slabinfo is
> initiated by the user program. As a cloud provider, we cannot control user
> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> may affect other user programs.
> 
> As Christoph said: "The count is not needed for any operations. Just for the
> slabinfo output. The value has no operational value for the allocator
> itself. So why use extra logic to track it in potentially performance
> critical paths?"
> 
> In this way, could we show the approximate value of active_objs in the
> /proc/slabinfo?
> 
> Based on the following information:
> In the discard_slab() function, page->inuse is equal to page->total_objects;
> In the allocate_slab() function, page->inuse is also equal to
> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
> equals 1);
> page->inuse will only change continuously when the obj is constantly
> allocated or released. (This should be the performance critical path
> emphasized by Christoph)
> 
> When users query the global slabinfo information, we may use total_objects
> to approximate active_objs.

Well, from one point of view, it makes no sense, because the ratio between
these two numbers is very meaningful: it's the slab utilization rate.

On the other side, with enabled per-cpu partial lists active_objs has
nothing to do with the reality anyway, so I agree with you, calling
count_partial() is almost useless.

That said, I wonder if the right thing to do is something like the patch below?

Thanks!

Roman

--

diff --git a/mm/slub.c b/mm/slub.c
index 1d644143f93e..ba0505e75ecc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 static unsigned long count_partial(struct kmem_cache_node *n,
                                        int (*get_count)(struct page *))
 {
-       unsigned long flags;
        unsigned long x = 0;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+       unsigned long flags;
        struct page *page;
 
        spin_lock_irqsave(&n->list_lock, flags);
        list_for_each_entry(page, &n->partial, slab_list)
                x += get_count(page);
        spin_unlock_irqrestore(&n->list_lock, flags);
+#endif
        return x;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */



  reply	other threads:[~2020-02-18 20:53 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-01  3:15 [PATCH] mm/slub: Detach node lock from counting free objects Wen Yang
2020-02-08  3:03 ` Wen Yang
2020-02-08 21:41   ` Christopher Lameter
2020-02-12 22:56     ` Andrew Morton
2020-02-14  2:16       ` Christopher Lameter
2020-02-12 22:52 ` Andrew Morton
2020-02-16  4:15   ` Wen Yang
2020-02-18 20:53     ` Roman Gushchin [this message]
2020-02-20 13:53       ` Wen Yang
2020-02-20 15:40         ` Roman Gushchin
2020-02-22  6:55           ` Wen Yang
2020-02-24 17:01             ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200218205312.GA3156@carbon \
    --to=guro@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=wenyang@linux.alibaba.com \
    --cc=xlpang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.