public inbox for cgroups@vger.kernel.org
* [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
@ 2023-11-02  0:53 Waiman Long
  2023-11-02  4:35 ` Yosry Ahmed
  0 siblings, 1 reply; 3+ messages in thread
From: Waiman Long @ 2023-11-02  0:53 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed,
	Waiman Long

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happens when there is a page fault even though the task is running in
user space most of the time.

The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed
when cgroup_rstat_cpu_pop_updated() is being called. The subsequent
flushing operation which can take a much longer time does not need
that protection.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front with the lock
released afterward before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.

By adding some instrumentation code to measure the maximum elapsed times
of the new cgroup_rstat_updated_list() function and each cpu iteration of
cgroup_rstat_flush_locked() around the old cpu_lock lock/unlock pair
on a 2-socket x86-64 server running parallel kernel build, the maximum
elapsed times are 27us and 88us respectively. The maximum cpu_lock hold
time is now reduced to about 30% of the original.

Below is the run time distribution of cgroup_rstat_updated_list()
during the same period:

      Run time             Count
      --------             -----
         t <= 1us       12,574,302
   1us < t <= 5us        2,127,482
   5us < t <= 10us           8,445
  10us < t <= 20us           6,425
  20us < t <= 30us              50

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h |  6 +++++
 kernel/cgroup/rstat.c       | 45 ++++++++++++++++++++++++-------------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..daaf6d4eb8b6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,12 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * Protected by cgroup_rstat_lock.
+	 */
+	struct cgroup	*rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..a86d40ed8bda 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,34 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/*
+ * Return a list of updated cgroups to be flushed
+ */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup *head, *tail, *next;
+	unsigned long flags;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	raw_spin_lock_irqsave(cpu_lock, flags);
+	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+	while (tail) {
+		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+		tail->rstat_flush_next = next;
+		tail = next;
+	}
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +207,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-						       cpu);
-		struct cgroup *pos = NULL;
-		unsigned long flags;
+		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-		/*
-		 * The _irqsave() is needed because cgroup_rstat_lock is
-		 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-		 * this lock with the _irq() suffix only disables interrupts on
-		 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-		 * interrupts on both configurations. The _irqsave() ensures
-		 * that interrupts are always disabled and later restored.
-		 */
-		raw_spin_lock_irqsave(cpu_lock, flags);
-		while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +221,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-		raw_spin_unlock_irqrestore(cpu_lock, flags);
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-- 
2.39.3



* Re: [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  2023-11-02  0:53 [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() Waiman Long
@ 2023-11-02  4:35 ` Yosry Ahmed
  2023-11-02 19:07   ` Waiman Long
  0 siblings, 1 reply; 3+ messages in thread
From: Yosry Ahmed @ 2023-11-02  4:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-kernel,
	Joe Mario, Sebastian Jug

On Wed, Nov 1, 2023 at 5:53 PM Waiman Long <longman@redhat.com> wrote:
>
> When cgroup_rstat_updated() isn't being called concurrently with
> cgroup_rstat_flush_locked(), its run time is pretty short. When
> both are called concurrently, the cgroup_rstat_updated() run time
> can spike to a pretty high value due to high cpu_lock hold time in
> cgroup_rstat_flush_locked(). This can be problematic if the task calling
> cgroup_rstat_updated() is a realtime task running on an isolated CPU
> with a strict latency requirement. The cgroup_rstat_updated() call can
> happens when there is a page fault even though the task is running in

s/happens/happen

> user space most of the time.
>
> The percpu cpu_lock is used to protect the update tree -
> updated_next and updated_children. This protection is only needed
> when cgroup_rstat_cpu_pop_updated() is being called. The subsequent
> flushing operation which can take a much longer time does not need
> that protection.

nit: add: as it is already protected by cgroup_rstat_lock.

>
> To reduce the cpu_lock hold time, we need to perform all the
> cgroup_rstat_cpu_pop_updated() calls up front with the lock
> released afterward before doing any flushing. This patch adds a new
> cgroup_rstat_updated_list() function to return a singly linked list of
> cgroups to be flushed.
>
> By adding some instrumentation code to measure the maximum elapsed times
> of the new cgroup_rstat_updated_list() function and each cpu iteration of
> cgroup_rstat_flush_locked() around the old cpu_lock lock/unlock pair
> on a 2-socket x86-64 server running parallel kernel build, the maximum
> elapsed times are 27us and 88us respectively. The maximum cpu_lock hold
> time is now reduced to about 30% of the original.
>
> Below is the run time distribution of cgroup_rstat_updated_list()
> during the same period:
>
>       Run time             Count
>       --------             -----
>          t <= 1us       12,574,302
>    1us < t <= 5us        2,127,482
>    5us < t <= 10us           8,445
>   10us < t <= 20us           6,425
>   20us < t <= 30us              50
>
> Signed-off-by: Waiman Long <longman@redhat.com>

LGTM with some nits.

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  include/linux/cgroup-defs.h |  6 +++++
>  kernel/cgroup/rstat.c       | 45 ++++++++++++++++++++++++-------------
>  2 files changed, 36 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 265da00a1a8b..daaf6d4eb8b6 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -491,6 +491,12 @@ struct cgroup {
>         struct cgroup_rstat_cpu __percpu *rstat_cpu;
>         struct list_head rstat_css_list;
>
> +       /*
> +        * A singly-linked list of cgroup structures to be rstat flushed.
> +        * Protected by cgroup_rstat_lock.

Do you think we should mention that this is a scratch area for
cgroup_rstat_flush_locked()? IOW, this field will be invalid or may
contain garbage otherwise.

It might be also useful to mention that the scope of usage for this is
for each percpu flushing iteration. The cgroup_rstat_lock can be
dropped between percpu flushing iterations, so different flushers can
reuse this field safely because it is re-initialized in every
iteration and only used there.

> +        */
> +       struct cgroup   *rstat_flush_next;
> +
>         /* cgroup basic resource statistics */
>         struct cgroup_base_stat last_bstat;
>         struct cgroup_base_stat bstat;
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index d80d7a608141..a86d40ed8bda 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -145,6 +145,34 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
>         return pos;
>  }
>
> +/*
> + * Return a list of updated cgroups to be flushed
> + */

Why not just on a single line?
/* Return a list of updated cgroups to be flushed */

> +static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
> +{
> +       raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> +       struct cgroup *head, *tail, *next;
> +       unsigned long flags;
> +
> +       /*
> +        * The _irqsave() is needed because cgroup_rstat_lock is
> +        * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> +        * this lock with the _irq() suffix only disables interrupts on
> +        * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> +        * interrupts on both configurations. The _irqsave() ensures
> +        * that interrupts are always disabled and later restored.
> +        */
> +       raw_spin_lock_irqsave(cpu_lock, flags);
> +       head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
> +       while (tail) {
> +               next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
> +               tail->rstat_flush_next = next;
> +               tail = next;
> +       }
> +       raw_spin_unlock_irqrestore(cpu_lock, flags);
> +       return head;
> +}
> +
>  /*
>   * A hook for bpf stat collectors to attach to and flush their stats.
>   * Together with providing bpf kfuncs for cgroup_rstat_updated() and
> @@ -179,21 +207,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>         lockdep_assert_held(&cgroup_rstat_lock);
>
>         for_each_possible_cpu(cpu) {
> -               raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
> -                                                      cpu);
> -               struct cgroup *pos = NULL;
> -               unsigned long flags;
> +               struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
>
> -               /*
> -                * The _irqsave() is needed because cgroup_rstat_lock is
> -                * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> -                * this lock with the _irq() suffix only disables interrupts on
> -                * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> -                * interrupts on both configurations. The _irqsave() ensures
> -                * that interrupts are always disabled and later restored.
> -                */
> -               raw_spin_lock_irqsave(cpu_lock, flags);
> -               while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> +               for (; pos; pos = pos->rstat_flush_next) {
>                         struct cgroup_subsys_state *css;
>
>                         cgroup_base_stat_flush(pos, cpu);
> @@ -205,7 +221,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>                                 css->ss->css_rstat_flush(css, cpu);
>                         rcu_read_unlock();
>                 }
> -               raw_spin_unlock_irqrestore(cpu_lock, flags);
>
>                 /* play nice and yield if necessary */
>                 if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> --
> 2.39.3
>


* Re: [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  2023-11-02  4:35 ` Yosry Ahmed
@ 2023-11-02 19:07   ` Waiman Long
  0 siblings, 0 replies; 3+ messages in thread
From: Waiman Long @ 2023-11-02 19:07 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-kernel,
	Joe Mario, Sebastian Jug


On 11/2/23 00:35, Yosry Ahmed wrote:
> On Wed, Nov 1, 2023 at 5:53 PM Waiman Long <longman@redhat.com> wrote:
>> When cgroup_rstat_updated() isn't being called concurrently with
>> cgroup_rstat_flush_locked(), its run time is pretty short. When
>> both are called concurrently, the cgroup_rstat_updated() run time
>> can spike to a pretty high value due to high cpu_lock hold time in
>> cgroup_rstat_flush_locked(). This can be problematic if the task calling
>> cgroup_rstat_updated() is a realtime task running on an isolated CPU
>> with a strict latency requirement. The cgroup_rstat_updated() call can
>> happens when there is a page fault even though the task is running in
> s/happens/happen
>
>> user space most of the time.
>>
>> The percpu cpu_lock is used to protect the update tree -
>> updated_next and updated_children. This protection is only needed
>> when cgroup_rstat_cpu_pop_updated() is being called. The subsequent
>> flushing operation which can take a much longer time does not need
>> that protection.
> nit: add: as it is already protected by cgroup_rstat_lock.
>
>> To reduce the cpu_lock hold time, we need to perform all the
>> cgroup_rstat_cpu_pop_updated() calls up front with the lock
>> released afterward before doing any flushing. This patch adds a new
>> cgroup_rstat_updated_list() function to return a singly linked list of
>> cgroups to be flushed.
>>
>> By adding some instrumentation code to measure the maximum elapsed times
>> of the new cgroup_rstat_updated_list() function and each cpu iteration of
>> cgroup_rstat_flush_locked() around the old cpu_lock lock/unlock pair
>> on a 2-socket x86-64 server running parallel kernel build, the maximum
>> elapsed times are 27us and 88us respectively. The maximum cpu_lock hold
>> time is now reduced to about 30% of the original.
>>
>> Below is the run time distribution of cgroup_rstat_updated_list()
>> during the same period:
>>
>>        Run time             Count
>>        --------             -----
>>           t <= 1us       12,574,302
>>     1us < t <= 5us        2,127,482
>>     5us < t <= 10us           8,445
>>    10us < t <= 20us           6,425
>>    20us < t <= 30us              50
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> LGTM with some nits.
>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>
>> ---
>>   include/linux/cgroup-defs.h |  6 +++++
>>   kernel/cgroup/rstat.c       | 45 ++++++++++++++++++++++++-------------
>>   2 files changed, 36 insertions(+), 15 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 265da00a1a8b..daaf6d4eb8b6 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -491,6 +491,12 @@ struct cgroup {
>>          struct cgroup_rstat_cpu __percpu *rstat_cpu;
>>          struct list_head rstat_css_list;
>>
>> +       /*
>> +        * A singly-linked list of cgroup structures to be rstat flushed.
>> +        * Protected by cgroup_rstat_lock.
> Do you think we should mention that this is a scratch area for
> cgroup_rstat_flush_locked()? IOW, this field will be invalid or may
> contain garbage otherwise.
I can certainly add that into the comment.
>
> It might be also useful to mention that the scope of usage for this is
> for each percpu flushing iteration. The cgroup_rstat_lock can be
> dropped between percpu flushing iterations, so different flushers can
> reuse this field safely because it is re-initialized in every
> iteration and only used there.
>
>> +        */
>> +       struct cgroup   *rstat_flush_next;
>> +
>>          /* cgroup basic resource statistics */
>>          struct cgroup_base_stat last_bstat;
>>          struct cgroup_base_stat bstat;
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index d80d7a608141..a86d40ed8bda 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -145,6 +145,34 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
>>          return pos;
>>   }
>>
>> +/*
>> + * Return a list of updated cgroups to be flushed
>> + */
> Why not just on a single line?
> /* Return a list of updated cgroups to be flushed */

Yes, it can be compressed into a one liner.

Thanks for the review and suggestion.

Cheers,
Longman

