From: David Vernet <void@manifault.com>
To: Chen Yu <yu.c.chen@intel.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, tj@kernel.org,
	roman.gushchin@linux.dev, gautham.shenoy@amd.com,
	kprateek.nayak@amd.com, aaron.lu@intel.com,
	wuyun.abel@bytedance.com, kernel-team@meta.com,
	tim.c.chen@intel.com
Subject: Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
Date: Thu, 31 Aug 2023 14:14:44 -0500
Message-ID: <20230831191444.GD531917@maniforge>
In-Reply-To: <ZPBvN9W6RYOr51zQ@chenyu5-mobl2>

On Thu, Aug 31, 2023 at 06:45:11PM +0800, Chen Yu wrote:
> On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> > On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> > > 
> > > 5. Check the L2 cache miss rate.
> > > perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> > > The results show that the L2 cache miss rate is nearly the same with/without
> > > shared_runqueue enabled.
> > 
> > As mentioned below, I expect it would be interesting to also collect
> > icache / iTLB numbers. In my experience, poor uop cache locality will
> > also result in poor icache locality, though of course that depends on a
> > lot of other factors like alignment, how many (un)conditional branches
> > you have within some byte window, etc. If alignment, etc. were the
> > issue, though, we'd likely observe this also without SHARED_RUNQ.
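> >
> > Something like the following should work for grabbing generic icache /
> > iTLB counters (the exact event names vary by platform and perf version,
> > so treat this as a sketch):
> >
> > perf stat -e L1-icache-load-misses,iTLB-load-misses,iTLB-loads sleep 10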
> >
> 
> [snip...] 
> 
> > 
> > Interesting. As mentioned above, I expect we also see an increase in
> > iTLB and icache misses?
> > 
> 
> This is a good point. According to perf topdown:
> 
> SHARED_RUNQ is disabled:
> 
>      13.0 %  tma_frontend_bound
>       6.7 %  tma_fetch_latency
>        0.3 %  tma_itlb_misses
>        0.7 %  tma_icache_misses
> 
> itlb miss ratio is 13.0% * 6.7% * 0.3%
> icache miss ratio is 13.0% * 6.7% * 0.7%
> 
> SHARED_RUNQ is enabled:
> 
>      20.0 %  tma_frontend_bound
>       11.6 %  tma_fetch_latency
>        0.9 %  tma_itlb_misses
>        0.5 %  tma_icache_misses
> 
> itlb miss ratio is 20.0% * 11.6% * 0.9%
> icache miss ratio is 20.0% * 11.6% * 0.5%
> 
> So both the icache and iTLB miss ratios increase, and the iTLB miss ratio
> increases more, although the bottleneck is neither the icache nor the iTLB.
> And as you mentioned below, it depends on other factors, including the
> hardware configuration: icache size, TLB size, DSB size, etc.

Thanks for collecting these stats. Good to know that things are making
sense and the data we're collecting are telling a consistent story.
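
For reference, multiplying the topdown levels out:

  disabled: 13.0% * 6.7% * 0.3% ~= 0.0026% (iTLB)
            13.0% * 6.7% * 0.7% ~= 0.0061% (icache)
  enabled:  20.0% * 11.6% * 0.9% ~= 0.021%  (iTLB)
            20.0% * 11.6% * 0.5% ~= 0.012%  (icache)

So the iTLB contribution grows by roughly 8x and the icache contribution
roughly doubles, which matches your reading that the iTLB is hit harder
even though neither is the actual bottleneck.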

> > This is something we deal with in HHVM. Like any other JIT engine /
> > compiler, it is heavily front-end CPU bound, and has very poor icache,
> > iTLB, and uop cache locality (also lots of branch resteers, etc).
> > SHARED_RUNQ actually helps this workload quite a lot, as explained in
> > the cover letter for the patch series. It makes sense that it would: uop
> > locality is really bad even without increasing CPU util. So we have no
> > reason not to migrate the task and hop on a CPU.
> >
> 
> I see, this makes sense.
>  
> > > I wonder if SHARED_RUNQ could consider that: if a task is a
> > > long-duration one, say, p->avg_runtime >= sysctl_migration_cost, maybe
> > > we should not put such a task on the per-LLC shared runqueue? In this
> > > way it will not be migrated too often, so as to keep its locality
> > > (both in terms of L1/L2 cache and DSB).
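> > >
> > > For concreteness, a rough sketch of the idea (hypothetical code, not
> > > from the posted series: the avg_runtime field and the hook name are
> > > made up, and the in-tree knob is spelled sysctl_sched_migration_cost):
> > >
> > > 	/* Hypothetical duration filter at shared-runqueue enqueue time. */
> > > 	static void shared_runq_push(struct rq *rq, struct task_struct *p)
> > > 	{
> > > 		/* Long-duration tasks keep their L1/L2 and DSB locality. */
> > > 		if (p->avg_runtime >= sysctl_sched_migration_cost)
> > > 			return;
> > > 		/* ...existing enqueue into the per-LLC shared runqueue... */
> > > 	}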
> > 
> > I'm hesitant to apply such heuristics to the feature. As mentioned
> > above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
> > icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
> > time, sometimes upwards of 20ms. They also tend to have poor L1 cache
> > locality in general even when they're scheduled on the same core they
> > were on before they were descheduled, so we observe better performance
> > if the task is migrated to a fully idle core rather than e.g. its
> > hypertwin if it's available. That's not something we can guarantee with
> > SHARED_RUNQ, but it hopefully illustrates the point that it's an example
> > of a workload that would suffer with such a heuristic.
> >
> 
> OK, hackbench is just a microbenchmark to help us evaluate what impact
> SHARED_RUNQ could bring. If such a heuristic helps hackbench but hurts
> real workloads, then we can consider another direction.
>  
> > Another point to consider is that performance implications that are a
> > result of Intel micro architectural details don't necessarily apply to
> > everyone. I'm not as familiar with the instruction decode pipeline on
> > AMD chips like Zen4. I'm sure they have a uop cache, but the size of
> > that cache, alignment requirements, the way that cache interfaces with
> > e.g. their version of the MITE / decoder, etc, are all going to be quite
> > different.
> >
> 
> Yes, this is true.
>  
> > In general, I think it's difficult for heuristics like this to suit all
> > possible workloads or situations (not that you're claiming it is). My
> > preference is to keep it as is so that it's easier for users to build a
> > mental model of what outcome they should expect if they use the feature.
> > Put another way: As a user of this feature, I'd be a lot more surprised
> > to see that I enabled it and CPU util stayed low, vs. enabling it and
> > seeing higher CPU util, but also degraded icache / iTLB locality.
> >
> 
> Understand.
>  
> > Let me know what you think, and thanks again for investing your time
> > into this.
> > 
> 
> Let me run other benchmarks to see whether they are sensitive to
> resource locality.

Great, thank you, Chen Yu.

FYI, I'll be on vacation for over a week starting later today, so my
responses may be delayed.

Thanks in advance for working on this. Looking forward to seeing the
results when I'm back at work.

Thanks,
David
