public inbox for linux-kernel@vger.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
Date: Wed, 27 Nov 2024 08:41:01 +0100	[thread overview]
Message-ID: <Z0bNDfGufRZLfAP6@gpd3> (raw)
In-Reply-To: <Z0aFZJZ85LUwapZE@yury-ThinkPad>

On Tue, Nov 26, 2024 at 06:35:16PM -0800, Yury Norov wrote:
> On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote:
> > Using a single global idle cpumask can lead to inefficiencies and put a
> > lot of stress on the cache coherency protocol on large systems with
> > multiple NUMA nodes, since all CPUs generate intense read/write traffic
> > on that one shared cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> 
> This all is definitely a complication of the code. Have you any numbers to
> justify it?

I'll include some numbers in v2. For now I just wanted to send the patch
out to start some discussion. :)

> 
> > [Open issue]
> >
> > The scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to return
> > a single cpumask covering all CPUs, have been changed to report only the
> > cpumask of the current NUMA node (based on the current CPU); this breaks
> > the old behavior and can potentially introduce regressions in some scx
> > schedulers.
> >
> > An alternative approach could be to construct a global cpumask
> > on-the-fly, but this would add significant overhead to ops.select_cpu()
> > for schedulers relying on these kfuncs. It would also be less reliable
> > than accessing the actual cpumasks, as the copy could quickly go out of
> > sync and no longer represent the actual idle state.
> >
> > Probably a better way to solve this issue is to introduce new kfuncs to
> > explicitly select a specific per-NUMA cpumask and modify the scx
> > schedulers to transition to this new API, for example:
> >
> >   const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
> >   const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
...
> > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> >        */
> >       if (sched_smt_active()) {
> >               const struct cpumask *smt = cpu_smt_mask(cpu);
> > +             struct cpumask *idle_smt = get_idle_smtmask(cpu);
> >
> >               /*
> >                * If offline, @cpu is not its own sibling and
> >                * scx_pick_idle_cpu() can get caught in an infinite loop as
> > -              * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > -              * is eventually cleared.
> > +              * @cpu is never cleared from the idle SMT mask. Ensure that
> > +              * @cpu is eventually cleared.
> >                */
> > -             if (cpumask_intersects(smt, idle_masks.smt))
> > -                     cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > -             else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > -                     __cpumask_clear_cpu(cpu, idle_masks.smt);
> > +             if (cpumask_intersects(smt, idle_smt))
> > +                     cpumask_andnot(idle_smt, idle_smt, smt);
> 
> cpumask_andnot() is a no-op if masks don't intersect.
> cpumask_intersects() is O(N), and you can safely drop it and
> save some cycles.
> 
> > +             else if (cpumask_test_cpu(cpu, idle_smt))
> > +                     __cpumask_clear_cpu(cpu, idle_smt);
> 
> Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op.
> I feel like you can just clear all that CPUs unconditionally. In
> the worst case, you'll clear them twice, which is harmless. Or I
> misunderstand something?

Yeah, makes sense to me: clearing the CPUs unconditionally is both simpler
and more efficient, since it drops the O(N) cpumask_intersects() check.
Thanks for noticing it.

> 
> >       }
> >  #endif
> > -     return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > +     return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
> >  }
> >
> >  static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> >  {
> > -     int cpu;
> > +     int start = cpu_to_node(smp_processor_id());
> > +     int node, cnt, cpu;
> >
> >  retry:
> >       if (sched_smt_active()) {
> > -             cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > -             if (cpu < nr_cpu_ids)
> > -                     goto found;
> > +             for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > +                     cpu = cpumask_any_and_distribute(idle_masks[node]->smt,
> > +                                                      cpus_allowed);
> 
> Nit: no need to break the line here.

Ok.

> 
> > +                     if (cpu < nr_cpu_ids)
> > +                             goto found;
> > +             }
> >
> >               if (flags & SCX_PICK_IDLE_CORE)
> >                       return -EBUSY;
> >       }
> >
> > -     cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > -     if (cpu >= nr_cpu_ids)
> > -             return -EBUSY;
> > +     for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > +             cpu = cpumask_any_and_distribute(idle_masks[node]->cpu,
> > +                                              cpus_allowed);
> > +             if (cpu < nr_cpu_ids)
> > +                     goto found;
> > +     }
> > +     return -EBUSY;
> >
> >  found:
> >       if (test_and_clear_cpu_idle(cpu))
> > @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >                * piled up on it even if there is an idle core elsewhere on
> >                * the system.
> >                */
> > -             if (!cpumask_empty(idle_masks.cpu) &&
> > +             if (!cpumask_empty(get_idle_cpumask(cpu)) &&
> >                   !(current->flags & PF_EXITING) &&
> >                   cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> >                       if (cpumask_test_cpu(cpu, p->cpus_ptr))
> 
> cpumask_empty() is O(N), and the other checks are O(1). If you reorder
> them such that cpumask_empty() would be the last check, you can
> increase probability that you will not check it at all.

Ok.

> 
> > @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >               /*
> >                * Keep using @prev_cpu if it's part of a fully idle core.
> >                */
> > -             if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> > +             if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
> >                   test_and_clear_cpu_idle(prev_cpu)) {
> >                       cpu = prev_cpu;
> >                       goto cpu_found;
> > @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> >  static void reset_idle_masks(void)
> >  {
> > +     int node;
> > +
> >       /*
> >        * Consider all online cpus idle. Should converge to the actual state
> >        * quickly.
> >        */
> > -     cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -     cpumask_copy(idle_masks.smt, cpu_online_mask);
> > +     for_each_node_state(node, N_POSSIBLE) {
> > +             const struct cpumask *node_mask = cpumask_of_node(node);
> > +
> > +             cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> > +             cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask);
> 
> cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu)
> 

Ok.

> > +     }
> >  }
> >
> >  void __scx_update_idle(struct rq *rq, bool idle)
> > @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle)
> >       }
> >
> >       if (idle)
> > -             cpumask_set_cpu(cpu, idle_masks.cpu);
> > +             cpumask_set_cpu(cpu, get_idle_cpumask(cpu));
> >       else
> > -             cpumask_clear_cpu(cpu, idle_masks.cpu);
> > +             cpumask_clear_cpu(cpu, get_idle_cpumask(cpu));
> 
> assign_cpu(cpu, get_idle_cpumask(cpu), idle)

Ok.

> 
> Thanks,
> Yury

Thanks for the review! Will include all these changes in v2.

-Andrea
