Re: Advice on cgroup rstat lock

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Waiman Long <longman@redhat.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>,
	Jesper Dangaard Brouer <jesper@cloudflare.com>,
	"David S. Miller" <davem@davemloft.net>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Shakeel Butt <shakeelb@google.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	kernel-team <kernel-team@cloudflare.com>,
	cgroups@vger.kernel.org, Linux-MM <linux-mm@kvack.org>,
	Netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Ivan Babrou <ivan@cloudflare.com>
Subject: Re: Advice on cgroup rstat lock
Date: Tue, 9 Apr 2024 11:37:31 -0400	[thread overview]
Message-ID: <96728c6d-3863-48c7-986b-b0b37689849e@redhat.com> (raw)
In-Reply-To: <ac4cf07f-52dd-454f-b897-2a4b3796a4d9@kernel.org>

On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
> Let move this discussion upstream.
>
> On 22/03/2024 19.32, Yosry Ahmed wrote:
>> [..]
>>>> There was a couple of series that made all calls to
>>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>>> problem that we used to face (except in our case, we saw hard lockups
>>>> in extreme scenarios):
>>>> https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/ 
>>>>
>>>> https://lore.kernel.org/lkml/20230421174020.2994750-1-yosryahmed@google.com/ 
>>>>
>>>
>>> I've only done the 6.6 backport, and these were in 6.5/6.6.
>
> Given I have these in my 6.6 kernel. You are basically saying I should
> be able to avoid IRQ-disable for the lock, right?
>
> My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
> and (thereby also) BH/softirq (spin_lock_irq).  This cause production
> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
> latency issues (debug via softirq_net_latency.bt [5]).
>
>   [3] 
> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
>   [5] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt 
>
>
>>> And between 6.1 to 6.6 we did observe an improvement in this area.
>>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan 
>>> progress)
>>>
>>> I've had a chance to get running in prod for 6.6 backport.
>>> As you can see in attached grafana heatmap pictures, we do observe an
>>> improved/reduced softirq wait time.
>>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
>>> observed.  As you can see, I still have other areas to improve/fix.
>>
>> I am not very familiar with such heatmaps, but I am glad there is an
>> improvement with 6.6 and the backports. Let me know if there is
>> anything I could do to help with your effort.
>
> The heatmaps give me an overview, but I needed a debugging tool, so I
> developed some bpftrace scripts [1][2] I'm running on production.
> To measure how long time we hold the cgroup rstat lock (results below).
> Adding ACME and Daniel as I hope there is an easier way to measure lock
> hold time and congestion. Notice tricky release/yield in
> cgroup_rstat_flush_locked[4].
>
> My production results on 6.6 with backported patches (below signature)
> vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
> shows how long time the lock+IRQs were disabled (taking into account it
> can be released in the loop [4]).
>
> Patched kernel:
>
> 21:49:02  time elapsed: 43200 sec
> @lock_time_hist_ns:
> [2K, 4K)              61 |      |
> [4K, 8K)             734 |      |
> [8K, 16K)         121500 |@@@@@@@@@@@@@@@@      |
> [16K, 32K)        385714 
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)        145600 |@@@@@@@@@@@@@@@@@@@      |
> [64K, 128K)       156873 |@@@@@@@@@@@@@@@@@@@@@      |
> [128K, 256K)      261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256K, 512K)      291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [512K, 1M)        101859 |@@@@@@@@@@@@@      |
> [1M, 2M)           19866 |@@      |
> [2M, 4M)           10146 |@      |
> [4M, 8M)           30633 |@@@@      |
> [8M, 16M)          40365 |@@@@@      |
> [16M, 32M)         21650 |@@      |
> [32M, 64M)          5842 |      |
> [64M, 128M)            8 |      |
>
> And normal 6.6 kernel:
>
> 21:48:32  time elapsed: 43200 sec
> @lock_time_hist_ns:
> [1K, 2K)              25 |      |
> [2K, 4K)            1146 |      |
> [4K, 8K)           59397 |@@@@      |
> [8K, 16K)         571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [16K, 32K)        542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32K, 64K)        202810 |@@@@@@@@@@@@@      |
> [64K, 128K)       134564 |@@@@@@@@@      |
> [128K, 256K)       72870 |@@@@@      |
> [256K, 512K)       56914 |@@@      |
> [512K, 1M)         83140 |@@@@@      |
> [1M, 2M)          170514 |@@@@@@@@@@@      |
> [2M, 4M)          396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [4M, 8M)          755537 
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8M, 16M)         231222 |@@@@@@@@@@@@@@@      |
> [16M, 32M)         76370 |@@@@@      |
> [32M, 64M)          1043 |      |
> [64M, 128M)           12 |      |
>
>
> For the unpatched kernel we see more events in 4ms to 8ms bucket than
> any other bucket.
> For patched kernel, we clearly see a significant reduction of events in
> the 4 ms to 64 ms area, but we still have some events in this area.  I'm
> very happy to see these patches improves the situation.  But for network
> processing I'm not happy to see events in area 16ms to 128ms area.  If
> we can just avoid disabling IRQs/softirq for the lock, I would be happy.
>
> How far can we go... could cgroup_rstat_lock be converted to a mutex?

The cgroup_rstat_lock was originally a mutex. It was converted to a 
spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with 
a spinlock"). Irq was disabled to enable calling from atomic context. 
Since commit 0a2dc6ac3329 ("cgroup: remove 
cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from 
atomic context anymore. Theoretically, we could change it back to a 
mutex or not disabling interrupt. That will require that the API cannot 
be called from atomic context going forward.

Cheers,
Longman

next prev parent reply	other threads:[~2024-04-09 15:37 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <7cd05fac-9d93-45ca-aa15-afd1a34329c6@kernel.org>
     [not found] ` <20240319154437.GA144716@cmpxchg.org>
     [not found]   ` <56556042-5269-4c7e-99ed-1a1ab21ac27f@kernel.org>
     [not found]     ` <CAJD7tkYbO7MdKUBsaOiSp6-qnDesdmVsTCiZApN_ncS3YkDqGQ@mail.gmail.com>
     [not found]       ` <bf94f850-fab4-4171-8dfe-b19ada22f3be@kernel.org>
     [not found]         ` <CAJD7tkbn-wFEbhnhGWTy0-UsFoosr=m7wiJ+P96XnDoFnSH7Zg@mail.gmail.com>
2024-04-09 11:08           ` Advice on cgroup rstat lock Jesper Dangaard Brouer
2024-04-09 15:37             ` Waiman Long [this message]
2024-04-09 16:45               ` Yosry Ahmed
2024-04-09 16:59                 ` Waiman Long
2024-04-11 10:17                   ` Jesper Dangaard Brouer
2024-04-11 17:22                     ` Yosry Ahmed
2024-04-12 19:26                       ` Jesper Dangaard Brouer
2024-04-12 19:51                         ` Yosry Ahmed
2024-04-16 14:22                           ` Jesper Dangaard Brouer
2024-04-16 18:41                             ` Shakeel Butt
2024-04-18  2:04                               ` Yosry Ahmed
2024-04-11 10:52             ` Arnaldo Carvalho de Melo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=96728c6d-3863-48c7-986b-b0b37689849e@redhat.com \
    --to=longman@redhat.com \
    --cc=acme@kernel.org \
    --cc=bigeasy@linutronix.de \
    --cc=bpf@vger.kernel.org \
    --cc=bristot@redhat.com \
    --cc=cgroups@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=hannes@cmpxchg.org \
    --cc=hawk@kernel.org \
    --cc=ivan@cloudflare.com \
    --cc=jesper@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=shakeelb@google.com \
    --cc=tj@kernel.org \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).