linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosryahmed@google.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev,
	 hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com,
	 kernel-team@cloudflare.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
Date: Wed, 17 Jul 2024 09:31:48 -0700	[thread overview]
Message-ID: <CAJD7tkbFPt-eTHkqtLxuOoV59eaqauodz008btEECT--x3VcBA@mail.gmail.com> (raw)
In-Reply-To: <8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org>

[..]
>
> I agree that description should be updated as patch have evolved, and
> you suggestions make sense, thanks.
>
> But I'm not sure we are finished evolving this patch, because the scheme
> of allowing all sub-cgroups to become the ongoing flusher (and only
> avoiding lock-spinning when cgroup_is_descendant), is causing issues in
> production.

Sure, I just wanted to point out the the commit log should be updated
going forward, not implying that we are done here :)

>
> We are using cadvisor in production for collecting metrics, which walks
> all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat
> (doing 3 flushes of same cgroup in short timespan), but individual
> cgroups don't overlap much to benefit from ongoing_flusher scheme.
>
> Below is (1 sec) production data, where cadvisor and kswapd collide:
>
>  > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1
>  > @ongoing_flusher_cnt[kswapd5]: 3
>  > @ongoing_flusher_cnt[kswapd7]: 4
>  > @ongoing_flusher_cnt[kswapd6]: 4
>  > @ongoing_flusher_cnt[kswapd8]: 4
>  > @ongoing_flusher_cnt[kswapd9]: 5
>  > @ongoing_flusher_cnt[kswapd2]: 5
>  > @ongoing_flusher_cnt[kswapd10]: 5
>  > @ongoing_flusher_cnt[kswapd3]: 5
>  > @ongoing_flusher_cnt[kswapd11]: 5
>  > @ongoing_flusher_cnt[kswapd0]: 5
>  > @ongoing_flusher_cnt[kswapd4]: 6
>  > @ongoing_flusher_cnt[kswapd1]: 7
>  > @ongoing_flusher_cnt[cadvisor]: 9
>
> For cadvisor ongoing_flusher only helps 9 times to avoid flush.
>
>  > @ongoing_flusher_cnt[handled_race]: 18
>  > @ongoing_flusher_cnt[all]: 61
>
> Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes.
>
>  > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61
>  > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462
>  > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032
>
> This is bad: Lock is contended 9032 time within this 1 sec period.
> Below, lock_contended cases is captured in more detail.
>
>  > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435
>  > @lock_contended[normal, 4]: 1
>  > @lock_contended[normal, 1]: 2
>  > @lock_contended[normal, 3]: 6
>  > @lock_contended[normal, 2]: 15
>  > @lock_contended[yield, 4]: 49
>  > @lock_contended[after_obtaining_lock, 4]: 49
>  > @lock_contended[normal, 0]: 59
>
> The "normal" lock_contended for level 0,
> meaning the cgroup root is contended 59 times within this 1 sec period.
>
>  > @lock_contended[after_obtaining_lock, 1]: 117
>  > @lock_contended[yield, 1]: 118
>  > @lock_contended[yield, 3]: 258
>  > @lock_contended[after_obtaining_lock, 3]: 258
>  > @lock_contended[after_obtaining_lock, 2]: 946
>  > @lock_contended[yield, 2]: 946
>
> Lock contention for 'yielded' case for level 2.
> Goes crazy with 946/sec.
>
>  > @lock_contended[yield, 0]: 7579
>
> Lock contention for 'yielded' case for level 0, the root.
> Is really crazy with 7579/sec lock spin cases.
>
>  > @lock_contended[after_obtaining_lock, 0]: 7579
>
>
> IMHO this shows that, due to lock yielding, the scheme of
> ongoing_flusher for sub-trees only cause issues.

Just to make sure I understand correctly, allowing non-root to become
ongoing_flushers is problematic because we only support a single
ongoing_flusher, so if we have a subsequent root flush, it won't be
the ongoing_flusher and we won't benefit from skipping other flushes.
Correct?

I honestly think the problem here is not that we support
ongoing_flusher for subtrees, which imo is nice to have at a low
additional cost, and does have its benefits (see below). I think the
problem is that lock yielding means that we can have multiple ongoing
flushers, while the current scheme only records one.

What I think we should be doing is either supporting multiple
ongoing_flushers (by replacing the global variable with a per-cgroup
flag, perhaps), or biting the bullet and start using a mutex to drop
lock yielding. If there aren't any concerns beyond priority inversion
(raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I
suggested could fix this issue (and also put a bound on the flushing
latency in general)?

(I suppose the latter is preferrable)

>
> IMHO we should go back to only doing ongoing_flusher for root-cgroup.
> There is really low chance of sub-trees flushes being concurrent enough
> to benefit from this, and it cause issues and needs (ugly) race handling.
>
> Further more sub-tree flushing doesn't take as long time as level 0
> flushing, which further lower the chance of concurrent flushes.

I agree that the race handling is not pretty, and we can try to
improve the implementation or handle the race in other ways. However,
I think that skipping flushes for subtrees is valuable. I am not sure
about the cgroup arrangement in your use case, but we do have cgroups
with a lot of tasks in them (and/or child cgroups). If there is memory
pressure (or hit the cgroup limit), they all may go into direct
reclaim concurrently, so skipping flushes could really be beneficial.

Of course, if the difference in complexity is not justified, we can go
back to only supporting root cgroups for ongoing_flusher for now. But
as I mentioned in the v4 email thread, some of the complexity may
still remain anyway as we have multiple root cgroups in v1.

>
> Let's get some quick data on flush times from production, to support my
> claim:

Thanks for the data. I agree that in general root flushes will be a
lot more expensive than subtree flushes, but keep in mind that the
data may look really different depends on the cgroup setup. As I
mentioned, I think we should still support subtree flushes unless the
difference in complexity is not justified.

>
> The bpftrace onliner:
>   sudo bpftrace -e '
>    tracepoint:cgroup:cgroup_rstat_locked {
>     if (args->cpu == -1) { @start[tid]=nsecs; }
>     @cnt[probe]=count();
>     if (args->contended) {
>       @lock_contended["after_obtaining_lock", args->level]=count();
>     }}
>    tracepoint:cgroup:cgroup_rstat_unlock {
>     if (args->cpu == -1) {
>      $now=nsecs; $start=@start[tid]; $diff=$now-$start;
>      @locked_time_level[args->level]=hist($diff);
>     }
>     @cnt[probe]=count()}
>    kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()}
>    interval:s:1 {time("%H:%M:%S ");
>     print(@cnt);
>     print(@lock_contended);
>     print(@locked_time_level);
>     clear (@cnt);
>     clear (@lock_contended); }
>    END {clear(@start)}'
>
> Time below is in nanosec.
>
> @locked_time_level[0]:
> [4M, 8M)             623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
>       |
> [8M, 16M)            860
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16M, 32M)           295 |@@@@@@@@@@@@@@@@@
>       |
> [32M, 64M)           275 |@@@@@@@@@@@@@@@@
>       |
>
>
> @locked_time_level[1]:
> [4K, 8K)               6 |@@@@
>       |
> [8K, 16K)             65
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)            52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
>       |
> [32K, 64K)            23 |@@@@@@@@@@@@@@@@@@
>       |
> [64K, 128K)           15 |@@@@@@@@@@@@
>       |
> [128K, 256K)          10 |@@@@@@@@
>       |
> [256K, 512K)           6 |@@@@
>       |
> [512K, 1M)            15 |@@@@@@@@@@@@
>       |
> [1M, 2M)               2 |@
>       |
> [2M, 4M)              14 |@@@@@@@@@@@
>       |
> [4M, 8M)               6 |@@@@
>       |
> [8M, 16M)              7 |@@@@@
>       |
> [16M, 32M)             1 |
>       |
>
>
> @locked_time_level[2]:
> [2K, 4K)               1 |
>       |
> [4K, 8K)             160 |@@@@@@@@@
>       |
> [8K, 16K)            733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
>       |
> [16K, 32K)           901
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)           191 |@@@@@@@@@@@
>       |
> [64K, 128K)          115 |@@@@@@
>       |
> [128K, 256K)          61 |@@@
>       |
> [256K, 512K)          70 |@@@@
>       |
> [512K, 1M)            59 |@@@
>       |
> [1M, 2M)              27 |@
>       |
> [2M, 4M)               9 |
>       |
>
>
> @locked_time_level[3]:
> [1K, 2K)               3 |
>       |
> [2K, 4K)               2 |
>       |
> [4K, 8K)               5 |
>       |
> [8K, 16K)            147 |@@@@@@
>       |
> [16K, 32K)          1222
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)           266 |@@@@@@@@@@@
>       |
> [64K, 128K)          199 |@@@@@@@@
>       |
> [128K, 256K)         146 |@@@@@@
>       |
> [256K, 512K)         124 |@@@@@
>       |
> [512K, 1M)            17 |
>       |
> [1M, 2M)               0 |
>       |
> [2M, 4M)               0 |
>       |
> [4M, 8M)               1 |
>       |
>
>
> @locked_time_level[4]:
> [4K, 8K)               2 |@@
>       |
> [8K, 16K)             17 |@@@@@@@@@@@@@@@@@@@@@@
>       |
> [16K, 32K)            40
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)             4 |@@@@@
>       |
>
> --Jesper
>


  reply	other threads:[~2024-07-17 16:32 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-11 13:28 [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes Jesper Dangaard Brouer
2024-07-11 13:29 ` [PATCH V7 2/2 RFC] cgroup/rstat: add tracepoint for ongoing flusher waits Jesper Dangaard Brouer
2024-07-16  8:42 ` [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes Jesper Dangaard Brouer
2024-07-17  0:35   ` Yosry Ahmed
2024-07-17  3:00     ` Waiman Long
2024-07-17 16:05       ` Yosry Ahmed
2024-07-17 16:36     ` Jesper Dangaard Brouer
2024-07-17 16:49       ` Yosry Ahmed
2024-07-18  8:12         ` Jesper Dangaard Brouer
2024-07-18 15:55           ` Yosry Ahmed
2024-07-19  0:40       ` Shakeel Butt
2024-07-19  3:11         ` Yosry Ahmed
2024-07-19 23:01           ` Shakeel Butt
2024-07-19  7:54         ` Jesper Dangaard Brouer
2024-07-19 22:47           ` Shakeel Butt
2024-07-20  4:52             ` Yosry Ahmed
     [not found]               ` <CAJD7tkaypFa3Nk0jh_ZYJX8YB0i7h9VY2YFXMg7GKzSS+f8H5g@mail.gmail.com>
2024-07-20 15:05                 ` Jesper Dangaard Brouer
2024-07-22 20:02               ` Shakeel Butt
2024-07-22 20:12                 ` Yosry Ahmed
2024-07-22 21:32                   ` Shakeel Butt
2024-07-22 22:58                     ` Shakeel Butt
2024-07-23  6:24                       ` Yosry Ahmed
2024-07-17  0:30 ` Yosry Ahmed
2024-07-17  7:32   ` Jesper Dangaard Brouer
2024-07-17 16:31     ` Yosry Ahmed [this message]
2024-07-17 18:17       ` Jesper Dangaard Brouer
2024-07-17 18:43         ` Yosry Ahmed
2024-07-19 15:07   ` Jesper Dangaard Brouer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJD7tkbFPt-eTHkqtLxuOoV59eaqauodz008btEECT--x3VcBA@mail.gmail.com \
    --to=yosryahmed@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hawk@kernel.org \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=longman@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).