From: Marcelo Tosatti <mtosatti@redhat.com>
To: Christoph Lameter <cl@linux.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>,
Frederic Weisbecker <frederic@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
Marcelo Tosatti <mtosatti@redhat.com>
Subject: [PATCH v3 2/3] vmstat: skip periodic vmstat update for isolated CPUs
Date: Mon, 05 Jun 2023 15:56:29 -0300 [thread overview]
Message-ID: <20230605190132.059270652@redhat.com> (raw)
In-Reply-To: 20230605185627.923698377@redhat.com
Problem: The interruption caused by vmstat_update is undesirable
for certain applications.
With workloads that are running on isolated cpus with nohz full mode to
shield off any kernel interruption. For example, a VM running a
time sensitive application with a 50us maximum acceptable interruption
(use case: soft PLC).
oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
The example above shows an additional 7us for the
oslat -> kworker -> oslat
switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.
The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:
CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback
CPU 11/KVM-9540 [001] dNh1. 2314.248585: <stack trace>
=> 0xffffffffc042b083
=> mod_zone_page_state
=> __folio_end_writeback
=> folio_end_writeback
=> iomap_finish_ioend
=> blk_mq_end_request_batch
=> nvme_irq
=> __handle_irq_event_percpu
=> handle_irq_event
=> handle_edge_irq
=> __common_interrupt
=> common_interrupt
=> asm_common_interrupt
=> vmx_do_interrupt_nmi_irqoff
=> vmx_handle_exit_irqoff
=> vcpu_enter_guest
=> vcpu_run
=> kvm_arch_vcpu_ioctl_run
=> kvm_vcpu_ioctl
=> __x64_sys_ioctl
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
In kernel users of vmstat counters either require the precise value and
they are using zone_page_state_snapshot interface or they can live with
an imprecision as the regular flushing can happen at arbitrary time and
cumulative error can grow (see calculate_normal_threshold).
>From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.
Suggested by Michal Hocko.
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
v3: improve changelog (Michal Hocko)
v2: use cpu_is_isolated (Michal Hocko)
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -28,6 +28,7 @@
#include <linux/mm_inline.h>
#include <linux/page_ext.h>
#include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
#include "internal.h"
@@ -2022,6 +2023,20 @@ static void vmstat_shepherd(struct work_
for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+ /*
+ * In kernel users of vmstat counters either require the precise value and
+ * they are using zone_page_state_snapshot interface or they can live with
+ * an imprecision as the regular flushing can happen at arbitrary time and
+ * cumulative error can grow (see calculate_normal_threshold).
+ *
+ * From that POV the regular flushing can be postponed for CPUs that have
+ * been isolated from the kernel interference without critical
+ * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
+ * for all isolated CPUs to avoid interference with the isolated workload.
+ */
+ if (cpu_is_isolated(cpu))
+ continue;
+
if (!delayed_work_pending(dw) && need_update(cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
next prev parent reply other threads:[~2023-06-05 19:04 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-05 18:56 [PATCH v3 0/3] vmstat bug fixes for nohz_full and isolated CPUs Marcelo Tosatti
2023-06-05 18:56 ` [PATCH v3 1/3] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
2023-06-05 18:56 ` Marcelo Tosatti [this message]
2023-06-05 18:56 ` [PATCH v3 3/3] mm/vmstat: do not refresh stats for isolated CPUs Marcelo Tosatti
2023-06-05 19:20 ` Michal Hocko
2023-06-05 19:53 ` Marcelo Tosatti
2023-06-05 20:22 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230605190132.059270652@redhat.com \
--to=mtosatti@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=atomlin@atomlin.com \
--cc=cl@linux.com \
--cc=frederic@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.