Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: "Greg Thelen" <gthelen@google.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Eric Dumazet" <edumzaet@google.com>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	"Eric Dumazet" <edumazet@google.com>
Subject: Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
Date: Wed, 19 Mar 2025 17:18:05 +0000	[thread overview]
Message-ID: <Z9r8TX0WiPWVffI0@google.com> (raw)
In-Reply-To: <u5kcjffhyrjsxagpdzas7q463ldgqtptaafozea3bv64odn2xt@agx42ih5m76l>

On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> > iterating all possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for a long enough time to drop network packets. On 400+ CPU machines
> > I've seen interrupt disabled for over 40 msec.
> > 
> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> > approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though this was in the context of an non-irq rstat spin lock.
> > 
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
> > 
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >   #  => started at: _raw_spin_lock_irq
> >   #  => ended at:   cgroup_rstat_flush
> >   #
> >   #
> >   #                    _------=> CPU#
> >   #                   / _-----=> irqs-off/BH-disabled
> >   #                  | / _----=> need-resched
> >   #                  || / _---=> hardirq/softirq
> >   #                  ||| / _--=> preempt-depth
> >   #                  |||| / _-=> migrate-disable
> >   #                  ||||| /     delay
> >   #  cmd     pid     |||||| time  |   caller
> >   #     \   /        ||||||  \    |    /
> >   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45343us : <stack trace>
> >    => memcg1_stat_format
> >    => memory_stat_format
> >    => memory_stat_show
> >    => seq_read_iter
> >    => vfs_read
> >    => ksys_read
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> > 
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> > 
> > Running stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > -    with memory hogs: 9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> > 
> > The mode of memory.stat access latency after grouping by of 2 buckets:
> > - without memory hogs: 64 usec => 16 usec
> > -    with memory hogs: 64 usec =>  8 usec
> > The memory.stat latency improves.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
> > ---
> >  kernel/cgroup/rstat.c | 12 +++++-------
> >  1 file changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index aac91466279f..976c24b3671a 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> >  			rcu_read_unlock();
> >  		}
> >  
> > -		/* play nice and yield if necessary */
> > -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> > -			__cgroup_rstat_unlock(cgrp, cpu);
> > -			if (!cond_resched())
> > -				cpu_relax();
> > -			__cgroup_rstat_lock(cgrp, cpu);
> > -		}
> > +		/* play nice and avoid disabling interrupts for a long time */
> > +		__cgroup_rstat_unlock(cgrp, cpu);
> > +		if (!cond_resched())
> > +			cpu_relax();
> > +		__cgroup_rstat_lock(cgrp, cpu);
> >  	}
> >  }
> 
> Is not this going a little too far?
> 
> the lock + irq trip is quite expensive in its own right and now is
> going to be paid for each cpu, as in the total time spent executing
> cgroup_rstat_flush_locked is going to go up.
> 
> Would your problem go away toggling this every -- say -- 8 cpus?

I was concerned about this too, and about more lock bouncing, but the
testing suggests that this actually overall improves the latency of
cgroup_rstat_flush_locked() (at least on tested HW).

So I don't think we need to do something like this unless a regression
is observed.

> 
> Just a suggestion.
>

next prev parent reply	other threads:[~2025-03-19 17:18 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-19  7:13 [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu) Greg Thelen
2025-03-19  7:17 ` Greg Thelen
2025-03-19 10:20 ` Michal Koutný
2025-03-19 10:47 ` Mateusz Guzik
2025-03-19 17:18   ` Yosry Ahmed [this message]
2025-03-27 14:38     ` Mateusz Guzik
2025-03-27 17:17       ` Yosry Ahmed
2025-03-27 17:47         ` Mateusz Guzik
2025-04-01 15:00           ` Michal Koutný
2025-04-01 15:46             ` Mateusz Guzik
2025-04-01 16:59               ` Michal Koutný
2025-03-19 17:26   ` Tejun Heo
2025-03-26 23:57     ` Mateusz Guzik
2025-03-19 17:16 ` Yosry Ahmed
2025-03-19 18:06   ` Johannes Weiner
2025-03-19 18:35     ` Yosry Ahmed
2025-03-19 19:10       ` Tejun Heo
2025-03-19 19:16       ` Johannes Weiner
2025-03-19 19:46         ` Johannes Weiner
2025-03-19 17:26 ` Tejun Heo
2025-03-19 17:35 ` Johannes Weiner
2025-03-19 17:53 ` Shakeel Butt
2025-03-19 18:37   ` Eric Dumazet
2025-03-20 14:43   ` Greg Thelen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z9r8TX0WiPWVffI0@google.com \
    --to=yosry.ahmed@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=edumzaet@google.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjguzik@gmail.com \
    --cc=mkoutny@suse.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.