Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes


From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Yosry Ahmed <yosryahmed@google.com>,
	Shakeel Butt <shakeel.butt@linux.dev>
Cc: tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org,
	lizefan.x@bytedance.com, longman@redhat.com,
	kernel-team@cloudflare.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
Date: Sat, 20 Jul 2024 17:05:53 +0200
Message-ID: <74c53382-5c31-41e9-94a2-0a7f88c0d2a5@kernel.org>
In-Reply-To: <CAJD7tkaypFa3Nk0jh_ZYJX8YB0i7h9VY2YFXMg7GKzSS+f8H5g@mail.gmail.com>



On 20/07/2024 06.52, Yosry Ahmed wrote:
> On Fri, Jul 19, 2024 at 9:52 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 19/07/2024 02.40, Shakeel Butt wrote:
>>>>> Hi Jesper,
>>>>>
>>>>> On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
>>>>>>
>>>>> [...]
>>>>>>
>>>>>>
>>>>>> Looking at the production numbers for the time the lock is held for level 0:
>>>>>>
>>>>>> @locked_time_level[0]:
>>>>>> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
>>>>>> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>>>> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
>>>>>> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>>>>>>
>>>>>
>>>>> Is it possible to get the above histogram for other levels as well?
>>>>
>>>> Data from other levels available in [1]:
>>>>   [1]
>>>> https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
>>>>
>>>> IMHO the data shows we will get most out of skipping level-0 root-cgroup
>>>> flushes.
>>>>
>>>
>>> Thanks a lot for the data. Are all or most of these locked_time_level[0]
>>> from kswapds? This just motivates me to strongly push the ratelimited
>>> flush patch of mine (which would be orthogonal to your patch series).
>>

There are also others flushing level 0.
I extended the bpftrace one-liner to also capture the process 'comm' name.
(I reduced 'kworker' to a single entry below; the full names follow a
pattern like 'kworker/u392:19'.)

grep 'level\[' out01.bpf_oneliner_locked_time | awk -F/ '{print $1}' | sort | uniq
@locked_time_level[0, cadvisor]:
@locked_time_level[0, consul]:
@locked_time_level[0, kswapd0]:
@locked_time_level[0, kswapd10]:
@locked_time_level[0, kswapd11]:
@locked_time_level[0, kswapd1]:
@locked_time_level[0, kswapd2]:
@locked_time_level[0, kswapd3]:
@locked_time_level[0, kswapd4]:
@locked_time_level[0, kswapd5]:
@locked_time_level[0, kswapd6]:
@locked_time_level[0, kswapd7]:
@locked_time_level[0, kswapd8]:
@locked_time_level[0, kswapd9]:
@locked_time_level[0, kworker
@locked_time_level[0, lassen]:
@locked_time_level[0, thunderclap-san]:
@locked_time_level[0, xdpd]:
@locked_time_level[1, cadvisor]:
@locked_time_level[2, cadvisor]:
@locked_time_level[2, kworker
@locked_time_level[2, memory-saturati]:
@locked_time_level[2, systemd]:
@locked_time_level[2, thread-saturati]:
@locked_time_level[3, cadvisor]:
@locked_time_level[3, cat]:
@locked_time_level[3, kworker
@locked_time_level[3, memory-saturati]:
@locked_time_level[3, systemd]:
@locked_time_level[3, thread-saturati]:
@locked_time_level[4, cadvisor]:


>> Jesper and I were discussing a better ratelimiting approach, whether
>> it's measuring the time since the last flush, or only skipping if we
>> have a lot of flushes in a specific time frame (using __ratelimit()).
>> I believe this would be better than the current memcg ratelimiting
>> approach, and we can remove the latter.
>>
>> WDYT?
> 
> Forgot to link this:
> https://lore.kernel.org/lkml/CAJD7tkZ5nxoa7aCpAix1bYOoYiLVfn+aNkq7jmRAZqsxruHYLw@mail.gmail.com/
> 

I agree that ratelimiting is orthogonal to this patch, and that we
really need to address it in a follow-up patchset.

The proposed mem_cgroup_flush_stats_ratelimited patch[1] helps, but it is
limited to the memory controller.  I'm proposing a more generic solution
in [2] that helps all users of rstat.

It is time based, because it makes sense to observe the time it takes to
flush the root (which gives the service rate), and then limit how soon
afterwards another flusher can run (limiting the arrival rate). From
practical queueing theory we intuitively know that we should keep the
arrival rate below the service rate, else queuing happens.
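
To illustrate the idea only, here is a minimal userspace sketch of such a
time-based limiter. It is not the code from [2]; the struct and function
names (flush_limiter, flush_limiter_record, flush_limiter_allow) are
hypothetical, and the policy of requiring an idle gap of one full flush
duration is an assumption chosen to keep the example simple.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state; not a kernel data structure. */
struct flush_limiter {
	uint64_t last_end_ns;      /* timestamp when the previous flush finished */
	uint64_t last_duration_ns; /* how long the previous flush took */
	bool     has_flushed;      /* false until the first flush is recorded */
};

/* Record a completed flush: its duration is the observed service time. */
static void flush_limiter_record(struct flush_limiter *fl,
				 uint64_t start_ns, uint64_t end_ns)
{
	fl->last_duration_ns = end_ns - start_ns;
	fl->last_end_ns = end_ns;
	fl->has_flushed = true;
}

/*
 * Return true if a new flush may start at now_ns.
 *
 * Assumed policy: wait at least one observed flush duration after the
 * previous flush ended. Flush starts are then spaced at least two service
 * times apart, so the arrival rate stays below the service rate.
 */
static bool flush_limiter_allow(const struct flush_limiter *fl, uint64_t now_ns)
{
	if (!fl->has_flushed)
		return true;             /* nothing observed yet, allow */
	if (now_ns < fl->last_end_ns)
		return false;            /* clock went backwards, be conservative */
	return now_ns - fl->last_end_ns >= fl->last_duration_ns;
}
```

A caller would check flush_limiter_allow() before flushing, take timestamps
around the flush, and pass them to flush_limiter_record(). A skipped caller
simply reads the slightly stale stats. A real implementation would also
need to handle concurrent access to this state, which the sketch ignores.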

--Jesper

[1] "memcg: use ratelimited stats flush in the reclaim"
  - https://lore.kernel.org/all/20240615081257.3945587-1-shakeel.butt@linux.dev/

[2] "cgroup/rstat: introduce ratelimited rstat flushing"
  - https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
