Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>,
	Aaron Tomlin <atomlin@atomlin.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Russell King <linux@armlinux.org.uk>,
	Huacai Chen <chenhuacai@kernel.org>,
	Heiko Carstens <hca@linux.ibm.com>,
	x86@kernel.org, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
Date: Wed, 22 Mar 2023 11:20:55 -0300
Message-ID: <ZBsOx1abWfBTdGFl@tpad>
In-Reply-To: <ZBsEGMEfEI98Wpwq@dhcp22.suse.cz>

On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote:
> > On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> > > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > > > This patch series addresses the following two problems:
> > > > > > 
> > > > > > 1. A customer provided evidence indicating that a process
> > > > > >    was stalled in direct reclaim:
> > > > > > 
> > > > > This is addressed by the trivial patch 1.
> > > > > 
> > > > > [...]
> > > > > >  2. With a task that busy loops on a given CPU,
> > > > > >     the kworker interruption to execute vmstat_update
> > > > > >     is undesired and may exceed latency thresholds
> > > > > >     for certain applications.
> > > > > 
> > > > > Yes it can but why does that matter?
> > > > 
> > > > It matters for the application that is executing and expects
> > > > not to be interrupted.
> > > 
> > > Those workloads shouldn't enter the kernel in the first place, no?
> > 
> > It depends on the latency requirements and individual system calls.
> > 
> > > Otherwise the in-kernel execution with all the direct or indirect
> > > dependencies (e.g. via locks) can throw any latency expectations out of
> > > the window.
> > > 
> > > > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > > > global counters from remote CPUs.
> > > > > > 
> > > > > > This is done using cmpxchg to manipulate the counters,
> > > > > > both CPU locally (via the account functions),
> > > > > > and remotely (via cpu_vm_stats_fold).
> > > > > > 
> > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > > > the initial patch series.
> > > > > > 
> > > > > > 
> > > > > > Performance details for the kworker interruption:
> > > > > > 
> > > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > >  
> > > > > > The example above shows an additional 7us for the
> > > > > > 
> > > > > >         oslat -> kworker -> oslat
> > > > > > 
> > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > latency threshold for certain applications.
> > > > > 
> > > > > > I do not think we have ever promised any specific latency guarantees
> > > > > > for vmstat. These statistics have been mostly used for debugging
> > > > > > purposes AFAIK. I am not aware of any specific user space use case that
> > > > > > would be latency sensitive. Your changelog doesn't go into details there
> > > > > > either.
> > > > 
> > > > There is a class of workloads for which response time can be
> > > > of interest. MAC scheduler is an example:
> > > > 
> > > > https://par.nsf.gov/servlets/purl/10090368
> > > 
> > > Yes, I am not disputing low latency workloads in general. I am just
> > > saying that you haven't really established a very sound justification
> > > here.
> > 
> > The -v7 cover letter was updated with additional details, 
> > as you requested (perhaps you missed it):
> > 
> > "Performance details for the kworker interruption:
> > 
> > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > 
> > The example above shows an additional 7us for the
> > 
> >         oslat -> kworker -> oslat
> > 
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications."
> 
> Yes, I have seen that but it doesn't really give a wider context to
> understand why those numbers matter.

OK.

"In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
interruption observed in a guest (which is above the safety
threshold for this application)."

Is that OK?


> > > Of course there are workloads which do not want to conflict with
> > > any in kernel house keeping. Those have to be configured and implemented
> > > very carefully though. Vmstat as such should not collide with those
> > > workloads as long as they do not interact with the kernel in a way
> > > counters are updated. Is this hard or impossible to avoid? 
> > 
> > The practical problem we have been seeing is -RT app initialization.
> > For example:
> > 
> > 1) mlock();
> > 2) enter loop without system calls
> 
> OK, that is what I have kinda expected. Would have been better to
> mention it explicitly.
> 
> I expect this to be a very common pattern and vmstat might not be the
> only subsystem that could interfere later on. Would it make more sense
> to address this by a more generic solution? E.g. a syscall to flush all
> per-cpu caches so they won't interfere later unless userspace hits the
> kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
> The above pattern could then be implemented as
> 
> 	do_initial_setup()
> 	sched_setaffinity(getpid(), cpumask);
> 	flush_cpu_caches(cpumask, 0);
> 	do_userspace_loop()
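
For concreteness, the pattern quoted above could look roughly like the
following in userspace C. This is only a sketch under assumptions:
flush_cpu_caches() does not exist as a syscall, so it is stubbed out here,
and first_allowed_cpu(), isolated_loop() and the buffer size are names and
values made up for illustration. The loop relies on clock_gettime() usually
being serviced through the vDSO, which is common on Linux but not guaranteed
on every configuration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

/* HYPOTHETICAL: stand-in for the proposed syscall, which does not exist. */
static int flush_cpu_caches(const cpu_set_t *mask, int flags)
{
	(void)mask;
	(void)flags;
	return 0;
}

/* Return the lowest CPU this task is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t mask;

	if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &mask))
			return cpu;
	return -1;
}

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Illustrative buffer that the setup phase locks into memory. */
static char locked_buf[4096];

/*
 * Setup (pin, mlock, hypothetical flush), then spin for duration_ns.
 * Returns the largest gap in ns seen between two consecutive clock
 * reads, or -1 if the task could not be pinned.
 */
static int64_t isolated_loop(int cpu, int64_t duration_ns)
{
	cpu_set_t mask;

	if (cpu < 0)
		return -1;
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
		return -1;

	/* mlock() may fail under a low RLIMIT_MEMLOCK; treat as non-fatal. */
	if (mlock(locked_buf, sizeof(locked_buf)) != 0)
		perror("mlock");

	flush_cpu_caches(&mask, 0);	/* hypothetical, see stub above */

	int64_t start = now_ns(), prev = start, max_gap = 0;

	for (;;) {
		int64_t t = now_ns();

		if (t - prev > max_gap)
			max_gap = t - prev;
		prev = t;
		if (t - start >= duration_ns)
			break;
	}
	return max_gap;
}
```

A caller would invoke isolated_loop(first_allowed_cpu(), duration) and
compare the returned gap against its latency budget; any deferred kernel
work that preempts the loop shows up as a larger gap.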

I would argue that fixing this without introducing a userspace tunable 
is more generic as all programs (modified to use a syscall or not)
benefit from the improvement. HPC workloads, for example.

But it might be necessary to do what you suggest for other
reasons (where you'd want a behaviour to be enabled which
is undesired for other application types).
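
As a rough illustration of the approach in the series (counters modified
with cmpxchg locally, folded with xchg from a remote CPU), here is a
userspace model using C11 atomics. It is a sketch only: NR_CPUS, pcp_diff,
global_count, mod_state() and fold_all() are names invented for this
example, the array merely stands in for real per-CPU data, and the kernel
code differs in detail (thresholds, per-zone and per-node counters).

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4	/* illustrative value, not taken from the patches */

/* Model of the 32-bit per-CPU deltas and the global counter. */
static _Atomic int32_t pcp_diff[NR_CPUS];
static _Atomic long global_count;

/*
 * Local update: a cmpxchg loop, so that a concurrent remote fold
 * cannot lose this increment.
 */
static void mod_state(int cpu, int32_t delta)
{
	int32_t old = atomic_load(&pcp_diff[cpu]);

	/* On failure 'old' is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(&pcp_diff[cpu], &old,
					     old + delta))
		;
}

/*
 * Remote fold: atomically take each per-CPU delta (xchg with 0) and
 * add it to the global counter. Runs on any CPU, so the owning CPU
 * is never interrupted.
 */
static void fold_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int32_t v = atomic_exchange(&pcp_diff[cpu], 0);

		if (v)
			atomic_fetch_add(&global_count, v);
	}
}
```

The point of the model is that the folding side only needs atomic access
to the remote delta, which is why no work item has to be queued on the
CPU that owns the counter.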
