From: Waiman Long <longman@redhat.com>
To: Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
Joe Mario <jmario@redhat.com>, Sebastian Jug <sejug@redhat.com>,
Yosry Ahmed <yosryahmed@google.com>,
Waiman Long <longman@redhat.com>
Subject: [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
Date: Wed, 1 Nov 2023 20:53:10 -0400 [thread overview]
Message-ID: <20231102005310.439588-1-longman@redhat.com> (raw)
When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happens when there is a page fault even though the task is running in
user space most of the time.
The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed
when cgroup_rstat_cpu_pop_updated() is being called. The subsequent
flushing operation which can take a much longer time does not need
that protection.
To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front with the lock
released afterward before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.
By adding some instrumentation code to measure the maximum elapsed times
of the new cgroup_rstat_updated_list() function and each cpu iteration of
cgroup_rstat_updated_locked() around the old cpu_lock lock/unlock pair
on a 2-socket x86-64 server running parallel kernel build, the maximum
elapsed times are 27us and 88us respectively. The maximum cpu_lock hold
time is now reduced to about 30% of the original.
Below were the run time distribution of cgroup_rstat_updated_list()
during the same period:
Run time Count
-------- -----
t <= 1us 12,574,302
1us < t <= 5us 2,127,482
5us < t <= 10us 8,445
10us < t <= 20us 6,425
20us < t <= 30us 50
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/cgroup-defs.h | 6 +++++
kernel/cgroup/rstat.c | 45 ++++++++++++++++++++++++-------------
2 files changed, 36 insertions(+), 15 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..daaf6d4eb8b6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,12 @@ struct cgroup {
struct cgroup_rstat_cpu __percpu *rstat_cpu;
struct list_head rstat_css_list;
+ /*
+ * A singly-linked list of cgroup structures to be rstat flushed.
+ * Protected by cgroup_rstat_lock.
+ */
+ struct cgroup *rstat_flush_next;
+
/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..a86d40ed8bda 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,34 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
return pos;
}
+/*
+ * Return a list of updated cgroups to be flushed
+ */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+ raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+ struct cgroup *head, *tail, *next;
+ unsigned long flags;
+
+ /*
+ * The _irqsave() is needed because cgroup_rstat_lock is
+ * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+ * this lock with the _irq() suffix only disables interrupts on
+ * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+ * interrupts on both configurations. The _irqsave() ensures
+ * that interrupts are always disabled and later restored.
+ */
+ raw_spin_lock_irqsave(cpu_lock, flags);
+ head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+ while (tail) {
+ next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+ tail->rstat_flush_next = next;
+ tail = next;
+ }
+ raw_spin_unlock_irqrestore(cpu_lock, flags);
+ return head;
+}
+
/*
* A hook for bpf stat collectors to attach to and flush their stats.
* Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +207,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
- cpu);
- struct cgroup *pos = NULL;
- unsigned long flags;
+ struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
- /*
- * The _irqsave() is needed because cgroup_rstat_lock is
- * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
- * this lock with the _irq() suffix only disables interrupts on
- * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
- * interrupts on both configurations. The _irqsave() ensures
- * that interrupts are always disabled and later restored.
- */
- raw_spin_lock_irqsave(cpu_lock, flags);
- while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+ for (; pos; pos = pos->rstat_flush_next) {
struct cgroup_subsys_state *css;
cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +221,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
css->ss->css_rstat_flush(css, cpu);
rcu_read_unlock();
}
- raw_spin_unlock_irqrestore(cpu_lock, flags);
/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
--
2.39.3
next reply other threads:[~2023-11-02 0:53 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-02 0:53 Waiman Long [this message]
2023-11-02 4:35 ` [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() Yosry Ahmed
2023-11-02 19:07 ` Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231102005310.439588-1-longman@redhat.com \
--to=longman@redhat.com \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=jmario@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lizefan.x@bytedance.com \
--cc=sejug@redhat.com \
--cc=tj@kernel.org \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox