public inbox for kvm@vger.kernel.org
* [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
@ 2024-04-03 14:01 Vineeth Pillai (Google)
  2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

Double scheduling is a concern with virtualization hosts: the host
schedules vcpus without knowing what is running on the vcpu, and the
guest schedules tasks without knowing where the vcpu is physically
running. This causes issues related to latency, power consumption,
resource utilization, etc. An ideal solution would be a cooperative
scheduling framework where the guest and host share scheduling-related
information and make educated scheduling decisions to optimally handle
the workloads. As a first step, we are taking a stab at reducing
latencies for latency-sensitive workloads in the guest.

v1 RFC[1] was posted in December 2023. The main disagreement was with
the implementation: the patch made scheduling policy decisions in kvm,
and kvm is not the right place to do that. The suggestion was to move
the policy decisions outside of kvm and let kvm handle only the
notifications needed to make those decisions. This patch series is an
iterative step towards implementing the feature as a layered design
where the policy can be implemented outside of kvm as a kernel
built-in, a kernel module or a bpf program.

This design mainly comprises 4 components:

- pvsched driver: Implements the scheduling policies. Registers with
    the host with a set of callbacks that the hypervisor (kvm) can use
    to notify it of the vcpu events it is interested in. The callbacks
    are passed the address of the shared memory so that the driver can
    read the scheduling information shared by the guest and also update
    the scheduling policies it sets.
- kvm component: Selects the pvsched driver for a guest and notifies
    the driver via callbacks for the events that the driver is
    interested in. Also interfaces with the guest to retrieve the
    shared memory region used for sharing the scheduling information.
- host kernel component: Implements the APIs for:
    - the pvsched driver to register/unregister with the host kernel, and
    - the hypervisor to assign/unassign a driver for guests.
- guest component: Implements a framework for sharing the scheduling
    information with the pvsched driver through kvm.

There is another component that we refer to as the pvsched protocol.
It defines the details of the shared memory layout, the information
sharing and the scheduling policy decisions. The protocol need not be
part of the kernel and can be defined separately based on the use case
and requirements. Both the guest and the selected pvsched driver need
to match the protocol for the feature to work. A protocol shall be
identified by a name and a possible versioning scheme. The guest will
advertise the protocol and the hypervisor can then assign a driver
implementing that protocol if one is registered in the host kernel.

This patch series implements only the first 3 components. The
guest-side implementation and the protocol framework shall come as a
separate series once we finalize the rest of the design.

This series also implements a sample bpf program and a kernel-builtin
pvsched driver. They do not do anything useful yet; they are just
skeletons to demonstrate the feature.

Rebased on 6.8.2.

[1]: https://lwn.net/Articles/955145/

Vineeth Pillai (Google) (5):
  pvsched: paravirt scheduling framework
  kvm: Implement the paravirt sched framework for kvm
  kvm: interface for managing pvsched driver for guest VMs
  pvsched: bpf support for pvsched
  selftests/bpf: sample implementation of a bpf pvsched driver.

 Kconfig                                       |   2 +
 arch/x86/kvm/Kconfig                          |  13 +
 arch/x86/kvm/x86.c                            |   3 +
 include/linux/kvm_host.h                      |  32 +++
 include/linux/pvsched.h                       | 102 +++++++
 include/uapi/linux/kvm.h                      |   6 +
 kernel/bpf/bpf_struct_ops_types.h             |   4 +
 kernel/sysctl.c                               |  27 ++
 .../testing/selftests/bpf/progs/bpf_pvsched.c |  37 +++
 virt/Makefile                                 |   2 +-
 virt/kvm/kvm_main.c                           | 265 ++++++++++++++++++
 virt/pvsched/Kconfig                          |  12 +
 virt/pvsched/Makefile                         |   2 +
 virt/pvsched/pvsched.c                        | 215 ++++++++++++++
 virt/pvsched/pvsched_bpf.c                    | 141 ++++++++++
 15 files changed, 862 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pvsched.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
 create mode 100644 virt/pvsched/Kconfig
 create mode 100644 virt/pvsched/Makefile
 create mode 100644 virt/pvsched/pvsched.c
 create mode 100644 virt/pvsched/pvsched_bpf.c

-- 
2.40.1



* [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
@ 2024-04-03 14:01 ` Vineeth Pillai (Google)
  2024-04-08 13:57   ` Vineeth Remanan Pillai
  2024-04-03 14:01 ` [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm Vineeth Pillai (Google)
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

Implement a paravirt scheduling framework for the Linux kernel.

The framework allows a pvsched driver to register with the kernel and
receive callbacks from the hypervisor (e.g. kvm) for the vcpu events it
is interested in, such as VMENTER, VMEXIT, etc.

The framework also allows the hypervisor to select a pvsched driver
(from the list of registered drivers) for each guest.

Also implement a sysctl for listing the available pvsched drivers.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 Kconfig                 |   2 +
 include/linux/pvsched.h | 102 +++++++++++++++++++
 kernel/sysctl.c         |  27 +++++
 virt/Makefile           |   2 +-
 virt/pvsched/Kconfig    |  12 +++
 virt/pvsched/Makefile   |   2 +
 virt/pvsched/pvsched.c  | 215 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 361 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pvsched.h
 create mode 100644 virt/pvsched/Kconfig
 create mode 100644 virt/pvsched/Makefile
 create mode 100644 virt/pvsched/pvsched.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..4a52eaa21166 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "virt/pvsched/Kconfig"
+
 source "Documentation/Kconfig"
diff --git a/include/linux/pvsched.h b/include/linux/pvsched.h
new file mode 100644
index 000000000000..59df6b44aacb
--- /dev/null
+++ b/include/linux/pvsched.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2024 Google  */
+
+#ifndef _LINUX_PVSCHED_H
+#define _LINUX_PVSCHED_H 1
+
+/*
+ * List of events for which hypervisor calls back into pvsched driver.
+ * Driver can specify the events it is interested in.
+ */
+enum pvsched_vcpu_events {
+	PVSCHED_VCPU_VMENTER = 0x1,
+	PVSCHED_VCPU_VMEXIT = 0x2,
+	PVSCHED_VCPU_HALT = 0x4,
+	PVSCHED_VCPU_INTR_INJ = 0x8,
+};
+
+#define PVSCHED_NAME_MAX	32
+#define PVSCHED_MAX		8
+#define PVSCHED_DRV_BUF_MAX	(PVSCHED_NAME_MAX * PVSCHED_MAX + PVSCHED_MAX)
+
+/*
+ * pvsched driver callbacks.
+ * TODO: versioning support for better compatibility with the guest
+ *       component implementing this feature.
+ */
+struct pvsched_vcpu_ops {
+	/*
+	 * pvsched_vcpu_register() - Register the vcpu with pvsched driver.
+	 * @pid: pid of the vcpu task.
+	 *
+	 * pvsched driver can store the pid internally and initialize
+	 * itself to prepare for receiving callbacks from this vcpu.
+	 */
+	int (*pvsched_vcpu_register)(struct pid *pid);
+
+	/*
+	 * pvsched_vcpu_unregister() - Un-register the vcpu with pvsched driver.
+	 * @pid: pid of the vcpu task.
+	 */
+	void (*pvsched_vcpu_unregister)(struct pid *pid);
+
+	/*
+	 * pvsched_vcpu_notify_event() - Callback for pvsched events
+	 * @addr: Address of the memory region shared with guest
+	 * @pid: pid of the vcpu task.
+	 * @events: bit mask of the events that hypervisor wants to notify.
+	 */
+	void (*pvsched_vcpu_notify_event)(void *addr, struct pid *pid, u32 events);
+
+	char name[PVSCHED_NAME_MAX];
+	struct module *owner;
+	struct list_head list;
+	u32 events;
+	u32 key;
+};
+
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+int pvsched_get_available_drivers(char *buf, size_t maxlen);
+
+int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
+void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);
+
+struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name);
+void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops);
+
+static inline int pvsched_validate_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+	/*
+	 * All callbacks are mandatory.
+	 */
+	if (!ops->pvsched_vcpu_register || !ops->pvsched_vcpu_unregister ||
+			!ops->pvsched_vcpu_notify_event)
+		return -EINVAL;
+
+	return 0;
+}
+#else
+static inline int pvsched_get_available_drivers(char *buf, size_t maxlen)
+{
+	return -ENOTSUPP;
+}
+
+static inline int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+	return -ENOTSUPP;
+}
+
+static inline void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+}
+
+static inline struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
+{
+	return NULL;
+}
+
+static inline void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+}
+#endif
+
+#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 157f7ce2942d..10a18a791b4f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -63,6 +63,7 @@
 #include <linux/mount.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/pid.h>
+#include <linux/pvsched.h>
 
 #include "../lib/kstrtox.h"
 
@@ -1615,6 +1616,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
 	return ret;
 }
 
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+static int proc_pvsched_available_drivers(struct ctl_table *ctl,
+						 int write, void *buffer,
+						 size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table tbl = { .maxlen = PVSCHED_DRV_BUF_MAX, };
+	int ret;
+
+	tbl.data = kmalloc(tbl.maxlen, GFP_USER);
+	if (!tbl.data)
+		return -ENOMEM;
+	pvsched_get_available_drivers(tbl.data, PVSCHED_DRV_BUF_MAX);
+	ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+	kfree(tbl.data);
+	return ret;
+}
+#endif
+
 static struct ctl_table kern_table[] = {
 	{
 		.procname	= "panic",
@@ -2033,6 +2052,14 @@ static struct ctl_table kern_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= SYSCTL_INT_MAX,
 	},
+#endif
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+	{
+		.procname	= "pvsched_available_drivers",
+		.maxlen		= PVSCHED_DRV_BUF_MAX,
+		.mode		= 0444,
+		.proc_handler   = proc_pvsched_available_drivers,
+	},
 #endif
 	{ }
 };
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..9d0f32d775a1 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y	+= lib/
+obj-y	+= lib/ pvsched/
diff --git a/virt/pvsched/Kconfig b/virt/pvsched/Kconfig
new file mode 100644
index 000000000000..5ca2669060cb
--- /dev/null
+++ b/virt/pvsched/Kconfig
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config PARAVIRT_SCHED_HOST
+	bool "Paravirt scheduling framework in the host kernel"
+	default n
+	help
+	  Paravirtualized scheduling facilitates the exchange of scheduling
+	  related information between the host and guest through shared memory,
+	  enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+	  An illustrative use case involves dynamically boosting the priority of
+	  a vCPU thread when the guest is executing a latency-sensitive workload
+	  on that specific vCPU.
+	  This config enables paravirt scheduling framework in the host kernel.
diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
new file mode 100644
index 000000000000..4ca38e30479b
--- /dev/null
+++ b/virt/pvsched/Makefile
@@ -0,0 +1,2 @@
+
+obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
diff --git a/virt/pvsched/pvsched.c b/virt/pvsched/pvsched.c
new file mode 100644
index 000000000000..610c85cf90d2
--- /dev/null
+++ b/virt/pvsched/pvsched.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Google  */
+
+/*
+ *  Paravirt scheduling framework
+ *
+ */
+
+/*
+ * Heavily inspired from tcp congestion avoidance implementation.
+ * (net/ipv4/tcp_cong.c)
+ */
+
+#define pr_fmt(fmt) "PVSCHED: " fmt
+
+#include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <linux/pvsched.h>
+
+static DEFINE_SPINLOCK(pvsched_drv_list_lock);
+static int nr_pvsched_drivers = 0;
+static LIST_HEAD(pvsched_drv_list);
+
+/*
+ * Retrieve pvsched_vcpu_ops given the name.
+ */
+static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_name(char *name)
+{
+	struct pvsched_vcpu_ops *ops;
+
+	list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+		if (strcmp(ops->name, name) == 0)
+			return ops;
+	}
+
+	return NULL;
+}
+
+/*
+ * Retrieve pvsched_vcpu_ops given the hash key.
+ */
+static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_key(u32 key)
+{
+	struct pvsched_vcpu_ops *ops;
+
+	list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+		if (ops->key == key)
+			return ops;
+	}
+
+	return NULL;
+}
+
+/*
+ * pvsched_get_available_drivers() - Copy space separated list of pvsched
+ * driver names.
+ * @buf: buffer to store the list of driver names
+ * @maxlen: size of the buffer
+ *
+ * Return: 0 on success, negative value on error.
+ */
+int pvsched_get_available_drivers(char *buf, size_t maxlen)
+{
+	struct pvsched_vcpu_ops *ops;
+	size_t offs = 0;
+
+	if (!buf)
+		return -EINVAL;
+
+	if (maxlen > PVSCHED_DRV_BUF_MAX)
+		maxlen = PVSCHED_DRV_BUF_MAX;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+		offs += snprintf(buf + offs, maxlen - offs,
+				 "%s%s",
+				 offs == 0 ? "" : " ", ops->name);
+
+		if (WARN_ON_ONCE(offs >= maxlen))
+			break;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pvsched_get_available_drivers);
+
+/*
+ * pvsched_register_vcpu_ops() - Register the driver in the kernel.
+ * @ops: Driver data (callbacks)
+ *
+ * After registration, the driver will be exposed to the hypervisor
+ * for assignment to the guest VMs.
+ *
+ * Return: 0 on success, negative value on error.
+ */
+int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+	int ret = 0;
+
+	ops->key = jhash(ops->name, strlen(ops->name), 0);
+	spin_lock(&pvsched_drv_list_lock);
+	if (nr_pvsched_drivers >= PVSCHED_MAX) {
+		ret = -ENOSPC;
+	} else if (pvsched_find_vcpu_ops_key(ops->key)) {
+		ret = -EEXIST;
+	} else if (!(ret = pvsched_validate_vcpu_ops(ops))) {
+		list_add_tail_rcu(&ops->list, &pvsched_drv_list);
+		nr_pvsched_drivers++;
+	}
+	spin_unlock(&pvsched_drv_list_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvsched_register_vcpu_ops);
+
+/*
+ * pvsched_unregister_vcpu_ops() - Un-register the driver from the kernel.
+ * @ops: Driver data (callbacks)
+ *
+ * After un-registration, driver will not be visible to hypervisor.
+ */
+void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+	spin_lock(&pvsched_drv_list_lock);
+	list_del_rcu(&ops->list);
+	nr_pvsched_drivers--;
+	spin_unlock(&pvsched_drv_list_lock);
+
+	synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(pvsched_unregister_vcpu_ops);
+
+/*
+ * pvsched_get_vcpu_ops() - Acquire the driver.
+ * @name: Name of the driver to be acquired.
+ *
+ * Hypervisor can use this API to get the driver structure for
+ * assigning it to guest VMs. This API takes a reference on the
+ * module/bpf program so that driver doesn't vanish under the
+ * hypervisor.
+ *
+ * Return: driver structure if found, else NULL.
+ */
+struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
+{
+	struct pvsched_vcpu_ops *ops;
+
+	if (!name || (strlen(name) >= PVSCHED_NAME_MAX))
+		return NULL;
+
+	rcu_read_lock();
+	ops = pvsched_find_vcpu_ops_name(name);
+	if (!ops)
+		goto out;
+
+	if (unlikely(!bpf_try_module_get(ops, ops->owner))) {
+		ops = NULL;
+		goto out;
+	}
+
+out:
+	rcu_read_unlock();
+	return ops;
+}
+EXPORT_SYMBOL_GPL(pvsched_get_vcpu_ops);
+
+/*
+ * pvsched_put_vcpu_ops() - Release the driver.
+ * @ops: Driver to be released.
+ *
+ * Hypervisor can use this API to release the driver.
+ */
+void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+	bpf_module_put(ops, ops->owner);
+}
+EXPORT_SYMBOL_GPL(pvsched_put_vcpu_ops);
+
+/*
+ * Sample NOP implementation of pvsched_vcpu_ops.
+ * This driver doesn't do anything other than registering itself.
+ * Placeholder for adding some default logic when the feature is
+ * complete.
+ */
+static int nop_pvsched_vcpu_register(struct pid *pid)
+{
+	return 0;
+}
+static void nop_pvsched_vcpu_unregister(struct pid *pid)
+{
+}
+static void nop_pvsched_notify_event(void *addr, struct pid *pid, u32 event)
+{
+}
+
+struct pvsched_vcpu_ops nop_vcpu_ops = {
+	.events = PVSCHED_VCPU_VMENTER | PVSCHED_VCPU_VMEXIT | PVSCHED_VCPU_HALT,
+	.pvsched_vcpu_register = nop_pvsched_vcpu_register,
+	.pvsched_vcpu_unregister = nop_pvsched_vcpu_unregister,
+	.pvsched_vcpu_notify_event = nop_pvsched_notify_event,
+	.name = "pvsched_nop",
+	.owner = THIS_MODULE,
+};
+
+static int __init pvsched_init(void)
+{
+	return WARN_ON(pvsched_register_vcpu_ops(&nop_vcpu_ops));
+}
+
+late_initcall(pvsched_init);
-- 
2.40.1



* [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
  2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
@ 2024-04-03 14:01 ` Vineeth Pillai (Google)
  2024-04-08 13:58   ` Vineeth Remanan Pillai
  2024-04-03 14:01 ` [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs Vineeth Pillai (Google)
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

kvm uses the kernel's paravirt sched framework to assign an available
pvsched driver to a guest. Guest vcpus register with the pvsched
driver, and kvm calls into the driver's callbacks to notify it of the
events that the driver is interested in.

This PoC doesn't do the callback on interrupt injection yet; it will be
implemented in subsequent iterations.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/kvm/Kconfig     |  13 ++++
 arch/x86/kvm/x86.c       |   3 +
 include/linux/kvm_host.h |  32 +++++++++
 virt/kvm/kvm_main.c      | 148 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 196 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 65ed14b6540b..c1776cdb5b65 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -189,4 +189,17 @@ config KVM_MAX_NR_VCPUS
 	  the memory footprint of each KVM guest, regardless of how many vCPUs are
 	  created for a given VM.
 
+config PARAVIRT_SCHED_KVM
+	bool "Enable paravirt scheduling capability for kvm"
+	depends on KVM
+	default n
+	help
+	  Paravirtualized scheduling facilitates the exchange of scheduling
+	  related information between the host and guest through shared memory,
+	  enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+	  An illustrative use case involves dynamically boosting the priority of
+	  a vCPU thread when the guest is executing a latency-sensitive workload
+	  on that specific vCPU.
+	  This config enables paravirt scheduling in the kvm hypervisor.
+
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffe580169c93..d0abc2c64d47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10896,6 +10896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	preempt_disable();
 
+	kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMENTER);
+
 	static_call(kvm_x86_prepare_switch_to_guest)(vcpu);
 
 	/*
@@ -11059,6 +11061,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	guest_timing_exit_irqoff();
 
 	local_irq_enable();
+	kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMEXIT);
 	preempt_enable();
 
 	kvm_vcpu_srcu_read_lock(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 179df96b20f8..6381569f3de8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -45,6 +45,8 @@
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
 
+#include <linux/pvsched.h>
+
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
 #endif
@@ -832,6 +834,11 @@ struct kvm {
 	bool vm_bugged;
 	bool vm_dead;
 
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+	spinlock_t pvsched_ops_lock;
+	struct pvsched_vcpu_ops __rcu *pvsched_ops;
+#endif
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
@@ -2413,4 +2420,29 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events);
+int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu);
+void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu);
+
+int kvm_replace_pvsched_ops(struct kvm *kvm, char *name);
+#else
+static inline int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
+{
+	return 0;
+}
+static inline int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+static inline void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
+{
+}
+
+static inline int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
+{
+	return 0;
+}
+#endif
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f50960b0e3a..0546814e4db7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -170,6 +170,142 @@ bool kvm_is_zone_device_page(struct page *page)
 	return is_zone_device_page(page);
 }
 
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+typedef enum {
+	PVSCHED_CB_REGISTER = 1,
+	PVSCHED_CB_UNREGISTER = 2,
+	PVSCHED_CB_NOTIFY = 3
+} pvsched_vcpu_callback_t;
+
+/*
+ * Helper function to invoke the pvsched driver callback.
+ */
+static int __vcpu_pvsched_callback(struct kvm_vcpu *vcpu, u32 events,
+		pvsched_vcpu_callback_t action)
+{
+	int ret = 0;
+	struct pid *pid;
+	struct pvsched_vcpu_ops *ops;
+
+	rcu_read_lock();
+	ops = rcu_dereference(vcpu->kvm->pvsched_ops);
+	if (!ops) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	pid = rcu_dereference(vcpu->pid);
+	if (WARN_ON_ONCE(!pid)) {
+		ret = -EINVAL;
+		goto out;
+	}
+	get_pid(pid);
+	switch (action) {
+		case PVSCHED_CB_REGISTER:
+			ops->pvsched_vcpu_register(pid);
+			break;
+		case PVSCHED_CB_UNREGISTER:
+			ops->pvsched_vcpu_unregister(pid);
+			break;
+		case PVSCHED_CB_NOTIFY:
+			if (ops->events & events) {
+				ops->pvsched_vcpu_notify_event(
+					NULL, /* TODO: Pass guest allocated sharedmem addr */
+					pid,
+					ops->events & events);
+			}
+			break;
+		default:
+			WARN_ON_ONCE(1);
+	}
+	put_pid(pid);
+
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
+{
+	return __vcpu_pvsched_callback(vcpu, events, PVSCHED_CB_NOTIFY);
+}
+
+int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * TODO: Action if the registration fails?
+	 */
+	return __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_REGISTER);
+}
+
+void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
+{
+	__vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_UNREGISTER);
+}
+
+/*
+ * Replaces the VM's current pvsched driver.
+ * If name is NULL or an empty string, unassign the
+ * current driver.
+ */
+int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
+{
+	int ret = 0;
+	unsigned long i;
+	struct kvm_vcpu *vcpu = NULL;
+	struct pvsched_vcpu_ops *ops = NULL, *prev_ops;
+
+
+	spin_lock(&kvm->pvsched_ops_lock);
+
+	prev_ops = rcu_dereference_protected(kvm->pvsched_ops, lockdep_is_held(&kvm->pvsched_ops_lock));
+
+	/*
+	 * Unassign operation if the passed in value is
+	 * NULL or an empty string.
+	 */
+	if (name && *name) {
+		ops = pvsched_get_vcpu_ops(name);
+		if (!ops) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (prev_ops) {
+		/*
+		 * Unregister current pvsched driver.
+		 */
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			kvm_vcpu_pvsched_unregister(vcpu);
+		}
+
+		pvsched_put_vcpu_ops(prev_ops);
+	}
+
+
+	rcu_assign_pointer(kvm->pvsched_ops, ops);
+	if (ops) {
+		/*
+		 * Register new pvsched driver.
+		 */
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			WARN_ON_ONCE(kvm_vcpu_pvsched_register(vcpu));
+		}
+	}
+
+out:
+	spin_unlock(&kvm->pvsched_ops_lock);
+
+	if (ret)
+		return ret;
+
+	synchronize_rcu();
+
+	return 0;
+}
+#endif
+
 /*
  * Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
  * page, NULL otherwise.  Note, the list of refcounted PG_reserved page types
@@ -508,6 +644,8 @@ static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_destroy(vcpu);
 	kvm_dirty_ring_free(&vcpu->dirty_ring);
 
+	kvm_vcpu_pvsched_unregister(vcpu);
+
 	/*
 	 * No need for rcu_read_lock as VCPU_RUN is the only place that changes
 	 * the vcpu->pid pointer, and at destruction time all file descriptors
@@ -1221,6 +1359,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+	spin_lock_init(&kvm->pvsched_ops_lock);
+#endif
+
 	/*
 	 * Force subsequent debugfs file creations to fail if the VM directory
 	 * is not created (by kvm_create_vm_debugfs()).
@@ -1343,6 +1485,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	kvm_replace_pvsched_ops(kvm, NULL);
+
 	kvm_destroy_pm_notifier(kvm);
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
@@ -3779,6 +3923,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		if (kvm_vcpu_check_block(vcpu) < 0)
 			break;
 
+		kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_HALT);
+
 		waited = true;
 		schedule();
 	}
@@ -4434,6 +4580,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			/* The thread running this VCPU changed. */
 			struct pid *newpid;
 
+			kvm_vcpu_pvsched_unregister(vcpu);
 			r = kvm_arch_vcpu_run_pid_change(vcpu);
 			if (r)
 				break;
@@ -4442,6 +4589,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			rcu_assign_pointer(vcpu->pid, newpid);
 			if (oldpid)
 				synchronize_rcu();
+			kvm_vcpu_pvsched_register(vcpu);
 			put_pid(oldpid);
 		}
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
-- 
2.40.1



* [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
  2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
  2024-04-03 14:01 ` [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm Vineeth Pillai (Google)
@ 2024-04-03 14:01 ` Vineeth Pillai (Google)
  2024-04-08 13:59   ` Vineeth Remanan Pillai
  2024-04-03 14:01 ` [RFC PATCH v2 4/5] pvsched: bpf support for pvsched Vineeth Pillai (Google)
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

Implement ioctls for assigning and unassigning a pvsched driver for a
guest. VMMs would need to adopt these ioctls to support the feature.
Also add a temporary debugfs interface for managing this.

Ideally, the hypervisor would be able to determine the pvsched driver
based on the information received from the guest. Guest VMs with the
feature enabled would request the hypervisor to select a pvsched
driver. The ioctl API is an override mechanism that gives more control
to the admin.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/uapi/linux/kvm.h |   6 ++
 virt/kvm/kvm_main.c      | 117 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c3308536482b..4b29bdad4188 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,4 +2227,10 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+struct kvm_pvsched_ops {
+	__u8 ops_name[32]; /* PVSCHED_NAME_MAX */
+};
+
+#define KVM_GET_PVSCHED_OPS		_IOR(KVMIO, 0xe4, struct kvm_pvsched_ops)
+#define KVM_REPLACE_PVSCHED_OPS		_IOWR(KVMIO, 0xe5, struct kvm_pvsched_ops)
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0546814e4db7..b3d9c362d2e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1223,6 +1223,79 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
 	}
 }
 
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static int pvsched_vcpu_ops_show(struct seq_file *m, void *data)
+{
+	char ops_name[PVSCHED_NAME_MAX] = "";
+	struct pvsched_vcpu_ops *ops;
+	struct kvm *kvm = (struct kvm *) m->private;
+
+	rcu_read_lock();
+	ops = rcu_dereference(kvm->pvsched_ops);
+	if (ops)
+		strncpy(ops_name, ops->name, PVSCHED_NAME_MAX);
+	rcu_read_unlock();
+
+	seq_printf(m, "%s\n", ops_name);
+
+	return 0;
+}
+
+static ssize_t
+pvsched_vcpu_ops_write(struct file *filp, const char __user *ubuf,
+		size_t cnt, loff_t *ppos)
+{
+	int ret;
+	char *cmp;
+	char buf[PVSCHED_NAME_MAX] = { 0 };
+	struct inode *inode;
+	struct kvm *kvm;
+
+	if (cnt >= PVSCHED_NAME_MAX)
+		return -EINVAL;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	cmp = strstrip(buf);
+
+	inode = file_inode(filp);
+	inode_lock(inode);
+	kvm = (struct kvm *)inode->i_private;
+	ret = kvm_replace_pvsched_ops(kvm, cmp);
+	inode_unlock(inode);
+
+	if (ret)
+		return ret;
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pvsched_vcpu_ops_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pvsched_vcpu_ops_show, inode->i_private);
+}
+
+static const struct file_operations pvsched_vcpu_ops_fops = {
+	.open		= pvsched_vcpu_ops_open,
+	.write		= pvsched_vcpu_ops_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
+{
+	debugfs_create_file("pvsched_vcpu_ops", 0644, kvm->debugfs_dentry, kvm,
+			    &pvsched_vcpu_ops_fops);
+}
+#else
+static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
+{
+}
+#endif
+
 static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
 {
 	static DEFINE_MUTEX(kvm_debugfs_lock);
@@ -1288,6 +1361,8 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
 				    &stat_fops_per_vm);
 	}
 
+	kvm_create_vm_pvsched_debugfs(kvm);
+
 	ret = kvm_arch_create_vm_debugfs(kvm);
 	if (ret)
 		goto out_err;
@@ -5474,6 +5549,48 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_gmem_create(kvm, &guest_memfd);
 		break;
 	}
+#endif
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+	case KVM_REPLACE_PVSCHED_OPS:
+		struct pvsched_vcpu_ops *ops;
+		struct kvm_pvsched_ops in_ops, out_ops;
+
+		r = -EFAULT;
+		if (copy_from_user(&in_ops, argp, sizeof(in_ops)))
+			goto out;
+
+		out_ops.ops_name[0] = 0;
+
+		rcu_read_lock();
+		ops = rcu_dereference(kvm->pvsched_ops);
+		if (ops)
+			strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
+		rcu_read_unlock();
+
+		r = kvm_replace_pvsched_ops(kvm, (char *)in_ops.ops_name);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
+			goto out;
+
+		r = 0;
+		break;
+	case KVM_GET_PVSCHED_OPS:
+		out_ops.ops_name[0] = 0;
+		rcu_read_lock();
+		ops = rcu_dereference(kvm->pvsched_ops);
+		if (ops)
+			strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
+		rcu_read_unlock();
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
+			goto out;
+
+		r = 0;
+		break;
 #endif
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 4/5] pvsched: bpf support for pvsched
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
                   ` (2 preceding siblings ...)
  2024-04-03 14:01 ` [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs Vineeth Pillai (Google)
@ 2024-04-03 14:01 ` Vineeth Pillai (Google)
  2024-04-08 14:00   ` Vineeth Remanan Pillai
  2024-04-03 14:01 ` [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver Vineeth Pillai (Google)
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

Add support for implementing pvsched drivers in bpf. bpf programs can
use struct_ops to define the callbacks of a pvsched driver.

This is only a skeleton of the bpf framework for pvsched. Some
verification details are not implemented yet.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/bpf/bpf_struct_ops_types.h |   4 +
 virt/pvsched/Makefile             |   2 +-
 virt/pvsched/pvsched_bpf.c        | 141 ++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 virt/pvsched/pvsched_bpf.c

diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 5678a9ddf817..9d5e4d1a331a 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
 #include <net/tcp.h>
 BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
 #endif
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+#include <linux/pvsched.h>
+BPF_STRUCT_OPS_TYPE(pvsched_vcpu_ops)
+#endif
 #endif
diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
index 4ca38e30479b..02bc072cd806 100644
--- a/virt/pvsched/Makefile
+++ b/virt/pvsched/Makefile
@@ -1,2 +1,2 @@
 
-obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
+obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o pvsched_bpf.o
diff --git a/virt/pvsched/pvsched_bpf.c b/virt/pvsched/pvsched_bpf.c
new file mode 100644
index 000000000000..b125089abc3b
--- /dev/null
+++ b/virt/pvsched/pvsched_bpf.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Google  */
+
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/pvsched.h>
+
+
+/* "extern" is to avoid sparse warning.  It is only used in bpf_struct_ops.c. */
+extern struct bpf_struct_ops bpf_pvsched_vcpu_ops;
+
+static int bpf_pvsched_vcpu_init(struct btf *btf)
+{
+	return 0;
+}
+
+static bool bpf_pvsched_vcpu_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+
+	if (!btf_ctx_access(off, size, type, prog, info))
+		return false;
+
+	return true;
+}
+
+static int bpf_pvsched_vcpu_btf_struct_access(struct bpf_verifier_log *log,
+					const struct bpf_reg_state *reg,
+					int off, int size)
+{
+	/*
+	 * TODO: Enable write access to Guest shared mem.
+	 */
+	return -EACCES;
+}
+
+static const struct bpf_func_proto *
+bpf_pvsched_vcpu_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static const struct bpf_verifier_ops bpf_pvsched_vcpu_verifier_ops = {
+	.get_func_proto		= bpf_pvsched_vcpu_get_func_proto,
+	.is_valid_access	= bpf_pvsched_vcpu_is_valid_access,
+	.btf_struct_access	= bpf_pvsched_vcpu_btf_struct_access,
+};
+
+static int bpf_pvsched_vcpu_init_member(const struct btf_type *t,
+				  const struct btf_member *member,
+				  void *kdata, const void *udata)
+{
+	const struct pvsched_vcpu_ops *uvm_ops;
+	struct pvsched_vcpu_ops *vm_ops;
+	u32 moff;
+
+	uvm_ops = (const struct pvsched_vcpu_ops *)udata;
+	vm_ops = (struct pvsched_vcpu_ops *)kdata;
+
+	moff = __btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct pvsched_vcpu_ops, events):
+		vm_ops->events = *(u32 *)(udata + moff);
+		return 1;
+	case offsetof(struct pvsched_vcpu_ops, name):
+		if (bpf_obj_name_cpy(vm_ops->name, uvm_ops->name,
+					sizeof(vm_ops->name)) <= 0)
+			return -EINVAL;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int bpf_pvsched_vcpu_check_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   const struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int bpf_pvsched_vcpu_reg(void *kdata)
+{
+	return pvsched_register_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static void bpf_pvsched_vcpu_unreg(void *kdata)
+{
+	pvsched_unregister_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static int bpf_pvsched_vcpu_validate(void *kdata)
+{
+	return pvsched_validate_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static int bpf_pvsched_vcpu_update(void *kdata, void *old_kdata)
+{
+	return -EOPNOTSUPP;
+}
+
+static int __pvsched_vcpu_register(struct pid *pid)
+{
+	return 0;
+}
+static void __pvsched_vcpu_unregister(struct pid *pid)
+{
+}
+static void __pvsched_notify_event(void *addr, struct pid *pid, u32 event)
+{
+}
+
+static struct pvsched_vcpu_ops __bpf_ops_pvsched_vcpu_ops = {
+	.pvsched_vcpu_register = __pvsched_vcpu_register,
+	.pvsched_vcpu_unregister = __pvsched_vcpu_unregister,
+	.pvsched_vcpu_notify_event = __pvsched_notify_event,
+};
+
+struct bpf_struct_ops bpf_pvsched_vcpu_ops = {
+	.init = &bpf_pvsched_vcpu_init,
+	.validate = bpf_pvsched_vcpu_validate,
+	.update = bpf_pvsched_vcpu_update,
+	.verifier_ops = &bpf_pvsched_vcpu_verifier_ops,
+	.reg = bpf_pvsched_vcpu_reg,
+	.unreg = bpf_pvsched_vcpu_unreg,
+	.check_member = bpf_pvsched_vcpu_check_member,
+	.init_member = bpf_pvsched_vcpu_init_member,
+	.name = "pvsched_vcpu_ops",
+	.cfi_stubs = &__bpf_ops_pvsched_vcpu_ops,
+};
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver.
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
                   ` (3 preceding siblings ...)
  2024-04-03 14:01 ` [RFC PATCH v2 4/5] pvsched: bpf support for pvsched Vineeth Pillai (Google)
@ 2024-04-03 14:01 ` Vineeth Pillai (Google)
  2024-04-08 14:01   ` Vineeth Remanan Pillai
  2024-04-08 13:54 ` [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Remanan Pillai
  2024-05-01 15:29 ` Sean Christopherson
  6 siblings, 1 reply; 42+ messages in thread
From: Vineeth Pillai (Google) @ 2024-04-03 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Vineeth Pillai (Google), Steven Rostedt, Joel Fernandes,
	Suleiman Souhlal, Masami Hiramatsu, himadrics, kvm, linux-kernel,
	x86

A dummy skeleton of a bpf pvsched driver. This is just for demonstration
purposes and would need more work before it could be included as a test
for this feature.

Not-Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
---
 .../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c

diff --git a/tools/testing/selftests/bpf/progs/bpf_pvsched.c b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
new file mode 100644
index 000000000000..a653baa3034b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/pvsched_vcpu_reg")
+int BPF_PROG(pvsched_vcpu_reg, struct pid *pid)
+{
+	bpf_printk("pvsched_vcpu_reg: pid: %p", pid);
+	return 0;
+}
+
+SEC("struct_ops/pvsched_vcpu_unreg")
+void BPF_PROG(pvsched_vcpu_unreg, struct pid *pid)
+{
+	bpf_printk("pvsched_vcpu_unreg: pid: %p", pid);
+}
+
+SEC("struct_ops/pvsched_vcpu_notify_event")
+void BPF_PROG(pvsched_vcpu_notify_event, void *addr, struct pid *pid, __u32 event)
+{
+	bpf_printk("pvsched_vcpu_notify: pid: %p, event:%u", pid, event);
+}
+
+SEC(".struct_ops")
+struct pvsched_vcpu_ops pvsched_ops = {
+	.pvsched_vcpu_register		= (void *)pvsched_vcpu_reg,
+	.pvsched_vcpu_unregister	= (void *)pvsched_vcpu_unreg,
+	.pvsched_vcpu_notify_event	= (void *)pvsched_vcpu_notify_event,
+	.events				= 0x6,
+	.name				= "bpf_pvsched_ops",
+};
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
                   ` (4 preceding siblings ...)
  2024-04-03 14:01 ` [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver Vineeth Pillai (Google)
@ 2024-04-08 13:54 ` Vineeth Remanan Pillai
  2024-05-01 15:29 ` Sean Christopherson
  6 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 13:54 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Josh Don, Barret Rhoden, David Vernet

Sorry I missed sched_ext folks, adding them as well.

Thanks,
Vineeth


On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing what's run by the vcpu and guest schedules
> tasks without knowing where the vcpu is physically running. This causes
> issues related to latencies, power consumption, resource utilization
> etc. An ideal solution would be to have a cooperative scheduling
> framework where the guest and host shares scheduling related information
> and makes an educated scheduling decision to optimally handle the
> workloads. As a first step, we are taking a stab at reducing latencies
> for latency sensitive workloads in the guest.
>
> v1 RFC[1] was posted in December 2023. The main disagreement was with the
> implementation: the patch was making scheduling policy decisions in kvm,
> and kvm is not the right place to do that. The suggestion was to move the
> policy decisions outside of kvm and let kvm only handle the
> notifications needed to make the policy decisions. This patch series is
> an iterative step towards implementing the feature as a layered
> design where the policy could be implemented outside of kvm as a
> kernel built-in, a kernel module or a bpf program.
>
> This design comprises mainly of 4 components:
>
> - pvsched driver: Implements the scheduling policies. Registers with the
>     host a set of callbacks that the hypervisor (kvm) can use to notify
>     vcpu events that the driver is interested in. The callback will be
>     passed in the address of shared memory so that the driver can get
>     scheduling information shared by the guest and also update the
>     scheduling policies set by the driver.
> - kvm component: Selects the pvsched driver for a guest and notifies
>     the driver via callbacks for events that the driver is interested
> in. Also interfaces with the guest to retrieve the shared memory
>     region for sharing the scheduling information.
> - host kernel component: Implements the APIs for:
>     - pvsched driver for register/unregister to the host kernel, and
>     - hypervisor for assigning/unassigning a driver for guests.
> - guest component: Implements a framework for sharing the scheduling
>     information with the pvsched driver through kvm.
>
> There is another component that we refer to as pvsched protocol. This
> defines the details about shared memory layout, information sharing and
> scheduling policy decisions. The protocol need not be part of the kernel
> and can be defined separately based on the use case and requirements.
> Both guest and the selected pvsched driver need to match the protocol
> for the feature to work. Protocol shall be identified by a name and a
> possible versioning scheme. Guest will advertise the protocol and then
> the hypervisor can assign the driver implementing the protocol if it is
> registered in the host kernel.
>
> This patch series only implements the first 3 components. Guest side
> implementation and the protocol framework shall come as a separate
> series once we finalize the rest of the design.
>
> This series also implements a sample bpf program and a kernel-builtin
> pvsched driver. Neither does anything real yet; they are just skeletons
> to demonstrate the feature.
>
> Rebased on 6.8.2.
>
> [1]: https://lwn.net/Articles/955145/
>
> Vineeth Pillai (Google) (5):
>   pvsched: paravirt scheduling framework
>   kvm: Implement the paravirt sched framework for kvm
>   kvm: interface for managing pvsched driver for guest VMs
>   pvsched: bpf support for pvsched
>   selftests/bpf: sample implementation of a bpf pvsched driver.
>
>  Kconfig                                       |   2 +
>  arch/x86/kvm/Kconfig                          |  13 +
>  arch/x86/kvm/x86.c                            |   3 +
>  include/linux/kvm_host.h                      |  32 +++
>  include/linux/pvsched.h                       | 102 +++++++
>  include/uapi/linux/kvm.h                      |   6 +
>  kernel/bpf/bpf_struct_ops_types.h             |   4 +
>  kernel/sysctl.c                               |  27 ++
>  .../testing/selftests/bpf/progs/bpf_pvsched.c |  37 +++
>  virt/Makefile                                 |   2 +-
>  virt/kvm/kvm_main.c                           | 265 ++++++++++++++++++
>  virt/pvsched/Kconfig                          |  12 +
>  virt/pvsched/Makefile                         |   2 +
>  virt/pvsched/pvsched.c                        | 215 ++++++++++++++
>  virt/pvsched/pvsched_bpf.c                    | 141 ++++++++++
>  15 files changed, 862 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/pvsched.h
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
>  create mode 100644 virt/pvsched/Kconfig
>  create mode 100644 virt/pvsched/Makefile
>  create mode 100644 virt/pvsched/pvsched.c
>  create mode 100644 virt/pvsched/pvsched_bpf.c
>
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework
  2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
@ 2024-04-08 13:57   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 13:57 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Josh Don, Barret Rhoden, David Vernet

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> Implement a paravirt scheduling framework for the Linux kernel.
>
> The framework allows a pvsched driver to register with the kernel and
> receive callbacks from the hypervisor (e.g. kvm) for vcpu events it is
> interested in, like VMENTER, VMEXIT etc.
>
> The framework also allows the hypervisor to select a pvsched driver (from
> the list of registered drivers) for each guest.
>
> Also implement a sysctl for listing the available pvsched drivers.
>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  Kconfig                 |   2 +
>  include/linux/pvsched.h | 102 +++++++++++++++++++
>  kernel/sysctl.c         |  27 +++++
>  virt/Makefile           |   2 +-
>  virt/pvsched/Kconfig    |  12 +++
>  virt/pvsched/Makefile   |   2 +
>  virt/pvsched/pvsched.c  | 215 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 361 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/pvsched.h
>  create mode 100644 virt/pvsched/Kconfig
>  create mode 100644 virt/pvsched/Makefile
>  create mode 100644 virt/pvsched/pvsched.c
>
> diff --git a/Kconfig b/Kconfig
> index 745bc773f567..4a52eaa21166 100644
> --- a/Kconfig
> +++ b/Kconfig
> @@ -29,4 +29,6 @@ source "lib/Kconfig"
>
>  source "lib/Kconfig.debug"
>
> +source "virt/pvsched/Kconfig"
> +
>  source "Documentation/Kconfig"
> diff --git a/include/linux/pvsched.h b/include/linux/pvsched.h
> new file mode 100644
> index 000000000000..59df6b44aacb
> --- /dev/null
> +++ b/include/linux/pvsched.h
> @@ -0,0 +1,102 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2024 Google  */
> +
> +#ifndef _LINUX_PVSCHED_H
> +#define _LINUX_PVSCHED_H 1
> +
> +/*
> + * List of events for which hypervisor calls back into pvsched driver.
> + * Driver can specify the events it is interested in.
> + */
> +enum pvsched_vcpu_events {
> +       PVSCHED_VCPU_VMENTER = 0x1,
> +       PVSCHED_VCPU_VMEXIT = 0x2,
> +       PVSCHED_VCPU_HALT = 0x4,
> +       PVSCHED_VCPU_INTR_INJ = 0x8,
> +};
> +
> +#define PVSCHED_NAME_MAX       32
> +#define PVSCHED_MAX            8
> +#define PVSCHED_DRV_BUF_MAX    (PVSCHED_NAME_MAX * PVSCHED_MAX + PVSCHED_MAX)
> +
> +/*
> + * pvsched driver callbacks.
> + * TODO: versioning support for better compatibility with the guest
> + *       component implementing this feature.
> + */
> +struct pvsched_vcpu_ops {
> +       /*
> +        * pvsched_vcpu_register() - Register the vcpu with pvsched driver.
> +        * @pid: pid of the vcpu task.
> +        *
> +        * pvsched driver can store the pid internally and initialize
> +        * itself to prepare for receiving callbacks from this vcpu.
> +        */
> +       int (*pvsched_vcpu_register)(struct pid *pid);
> +
> +       /*
> +        * pvsched_vcpu_unregister() - Un-register the vcpu with pvsched driver.
> +        * @pid: pid of the vcpu task.
> +        */
> +       void (*pvsched_vcpu_unregister)(struct pid *pid);
> +
> +       /*
> +        * pvsched_vcpu_notify_event() - Callback for pvsched events
> +        * @addr: Address of the memory region shared with guest
> +        * @pid: pid of the vcpu task.
> +        * @events: bit mask of the events that hypervisor wants to notify.
> +        */
> +       void (*pvsched_vcpu_notify_event)(void *addr, struct pid *pid, u32 event);
> +
> +       char name[PVSCHED_NAME_MAX];
> +       struct module *owner;
> +       struct list_head list;
> +       u32 events;
> +       u32 key;
> +};
> +
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +int pvsched_get_available_drivers(char *buf, size_t maxlen);
> +
> +int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +
> +struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name);
> +void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +
> +static inline int pvsched_validate_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +       /*
> +        * All callbacks are mandatory.
> +        */
> +       if (!ops->pvsched_vcpu_register || !ops->pvsched_vcpu_unregister ||
> +                       !ops->pvsched_vcpu_notify_event)
> +               return -EINVAL;
> +
> +       return 0;
> +}
> +#else
> +static inline void pvsched_get_available_drivers(char *buf, size_t maxlen)
> +{
> +}
> +
> +static inline int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +       return -ENOTSUPP;
> +}
> +
> +static inline void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +}
> +
> +static inline struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
> +{
> +       return NULL;
> +}
> +
> +static inline void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +}
> +#endif
> +
> +#endif
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 157f7ce2942d..10a18a791b4f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -63,6 +63,7 @@
>  #include <linux/mount.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/pid.h>
> +#include <linux/pvsched.h>
>
>  #include "../lib/kstrtox.h"
>
> @@ -1615,6 +1616,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
>         return ret;
>  }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +static int proc_pvsched_available_drivers(struct ctl_table *ctl,
> +                                                int write, void *buffer,
> +                                                size_t *lenp, loff_t *ppos)
> +{
> +       struct ctl_table tbl = { .maxlen = PVSCHED_DRV_BUF_MAX, };
> +       int ret;
> +
> +       tbl.data = kmalloc(tbl.maxlen, GFP_USER);
> +       if (!tbl.data)
> +               return -ENOMEM;
> +       pvsched_get_available_drivers(tbl.data, PVSCHED_DRV_BUF_MAX);
> +       ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
> +       kfree(tbl.data);
> +       return ret;
> +}
> +#endif
> +
>  static struct ctl_table kern_table[] = {
>         {
>                 .procname       = "panic",
> @@ -2033,6 +2052,14 @@ static struct ctl_table kern_table[] = {
>                 .extra1         = SYSCTL_ONE,
>                 .extra2         = SYSCTL_INT_MAX,
>         },
> +#endif
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +       {
> +               .procname       = "pvsched_available_drivers",
> +               .maxlen         = PVSCHED_DRV_BUF_MAX,
> +               .mode           = 0444,
> +               .proc_handler   = proc_pvsched_available_drivers,
> +       },
>  #endif
>         { }
>  };
> diff --git a/virt/Makefile b/virt/Makefile
> index 1cfea9436af9..9d0f32d775a1 100644
> --- a/virt/Makefile
> +++ b/virt/Makefile
> @@ -1,2 +1,2 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -obj-y  += lib/
> +obj-y  += lib/ pvsched/
> diff --git a/virt/pvsched/Kconfig b/virt/pvsched/Kconfig
> new file mode 100644
> index 000000000000..5ca2669060cb
> --- /dev/null
> +++ b/virt/pvsched/Kconfig
> @@ -0,0 +1,12 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config PARAVIRT_SCHED_HOST
> +       bool "Paravirt scheduling framework in the host kernel"
> +       default n
> +       help
> +         Paravirtualized scheduling facilitates the exchange of scheduling
> +         related information between the host and guest through shared memory,
> +         enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> +         An illustrative use case involves dynamically boosting the priority of
> +         a vCPU thread when the guest is executing a latency-sensitive workload
> +         on that specific vCPU.
> +         This config enables paravirt scheduling framework in the host kernel.
> diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
> new file mode 100644
> index 000000000000..4ca38e30479b
> --- /dev/null
> +++ b/virt/pvsched/Makefile
> @@ -0,0 +1,2 @@
> +
> +obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
> diff --git a/virt/pvsched/pvsched.c b/virt/pvsched/pvsched.c
> new file mode 100644
> index 000000000000..610c85cf90d2
> --- /dev/null
> +++ b/virt/pvsched/pvsched.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Google  */
> +
> +/*
> + *  Paravirt scheduling framework
> + *
> + */
> +
> +/*
> + * Heavily inspired from tcp congestion avoidance implementation.
> + * (net/ipv4/tcp_cong.c)
> + */
> +
> +#define pr_fmt(fmt) "PVSCHED: " fmt
> +
> +#include <linux/module.h>
> +#include <linux/bpf.h>
> +#include <linux/gfp.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/jhash.h>
> +#include <linux/pvsched.h>
> +
> +static DEFINE_SPINLOCK(pvsched_drv_list_lock);
> +static int nr_pvsched_drivers = 0;
> +static LIST_HEAD(pvsched_drv_list);
> +
> +/*
> + * Retrieve pvsched_vcpu_ops given the name.
> + */
> +static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_name(char *name)
> +{
> +       struct pvsched_vcpu_ops *ops;
> +
> +       list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> +               if (strcmp(ops->name, name) == 0)
> +                       return ops;
> +       }
> +
> +       return NULL;
> +}
> +
> +/*
> + * Retrieve pvsched_vcpu_ops given the hash key.
> + */
> +static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_key(u32 key)
> +{
> +       struct pvsched_vcpu_ops *ops;
> +
> +       list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> +               if (ops->key == key)
> +                       return ops;
> +       }
> +
> +       return NULL;
> +}
> +
> +/*
> + * pvsched_get_available_drivers() - Copy a space-separated list of pvsched
> + * driver names.
> + * @buf: buffer to store the list of driver names
> + * @maxlen: size of the buffer
> + *
> + * Return: 0 on success, negative value on error.
> + */
> +int pvsched_get_available_drivers(char *buf, size_t maxlen)
> +{
> +       struct pvsched_vcpu_ops *ops;
> +       size_t offs = 0;
> +
> +       if (!buf)
> +               return -EINVAL;
> +
> +       if (maxlen > PVSCHED_DRV_BUF_MAX)
> +               maxlen = PVSCHED_DRV_BUF_MAX;
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> +               offs += snprintf(buf + offs, maxlen - offs,
> +                                "%s%s",
> +                                offs == 0 ? "" : " ", ops->name);
> +
> +               if (WARN_ON_ONCE(offs >= maxlen))
> +                       break;
> +       }
> +       rcu_read_unlock();
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_get_available_drivers);
> +
> +/*
> + * pvsched_register_vcpu_ops() - Register the driver in the kernel.
> + * @ops: Driver data(callbacks)
> + *
> + * After the registration, driver will be exposed to the hypervisor
> + * for assignment to the guest VMs.
> + *
> + * Return: 0 on success, negative value on error.
> + */
> +int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +       int ret = 0;
> +
> +       ops->key = jhash(ops->name, sizeof(ops->name), strlen(ops->name));
> +       spin_lock(&pvsched_drv_list_lock);
> +       if (nr_pvsched_drivers >= PVSCHED_MAX) {
> +               ret = -ENOSPC;
> +       } else if (pvsched_find_vcpu_ops_key(ops->key)) {
> +               ret = -EEXIST;
> +       } else if (!(ret = pvsched_validate_vcpu_ops(ops))) {
> +               list_add_tail_rcu(&ops->list, &pvsched_drv_list);
> +               nr_pvsched_drivers++;
> +       }
> +       spin_unlock(&pvsched_drv_list_lock);
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_register_vcpu_ops);
> +
> +/*
> + * pvsched_unregister_vcpu_ops() - Un-register the driver from the kernel.
> + * @ops: Driver data(callbacks)
> + *
> + * After un-registration, driver will not be visible to hypervisor.
> + */
> +void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +       spin_lock(&pvsched_drv_list_lock);
> +       list_del_rcu(&ops->list);
> +       nr_pvsched_drivers--;
> +       spin_unlock(&pvsched_drv_list_lock);
> +
> +       synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(pvsched_unregister_vcpu_ops);
> +
> +/*
> + * pvsched_get_vcpu_ops: Acquire the driver.
> + * @name: Name of the driver to be acquired.
> + *
> + * Hypervisor can use this API to get the driver structure for
> + * assigning it to guest VMs. This API takes a reference on the
> + * module/bpf program so that the driver doesn't vanish under the
> + * hypervisor.
> + *
> + * Return: driver structure if found, else NULL.
> + */
> +struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
> +{
> +       struct pvsched_vcpu_ops *ops;
> +
> +       if (!name || (strlen(name) >= PVSCHED_NAME_MAX))
> +               return NULL;
> +
> +       rcu_read_lock();
> +       ops = pvsched_find_vcpu_ops_name(name);
> +       if (!ops)
> +               goto out;
> +
> +       if (unlikely(!bpf_try_module_get(ops, ops->owner))) {
> +               ops = NULL;
> +               goto out;
> +       }
> +
> +out:
> +       rcu_read_unlock();
> +       return ops;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_get_vcpu_ops);
> +
> +/*
> + * pvsched_put_vcpu_ops: Release the driver.
> + * @ops: Driver structure to be released.
> + *
> + * Hypervisor can use this API to release the driver.
> + */
> +void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +       bpf_module_put(ops, ops->owner);
> +}
> +EXPORT_SYMBOL_GPL(pvsched_put_vcpu_ops);
> +
> +/*
> + * NOP vm_ops Sample implementation.
> + * This driver doesn't do anything other than registering itself.
> + * Placeholder for adding some default logic when the feature is
> + * complete.
> + */
> +static int nop_pvsched_vcpu_register(struct pid *pid)
> +{
> +       return 0;
> +}
> +static void nop_pvsched_vcpu_unregister(struct pid *pid)
> +{
> +}
> +static void nop_pvsched_notify_event(void *addr, struct pid *pid, u32 event)
> +{
> +}
> +
> +struct pvsched_vcpu_ops nop_vcpu_ops = {
> +       .events = PVSCHED_VCPU_VMENTER | PVSCHED_VCPU_VMEXIT | PVSCHED_VCPU_HALT,
> +       .pvsched_vcpu_register = nop_pvsched_vcpu_register,
> +       .pvsched_vcpu_unregister = nop_pvsched_vcpu_unregister,
> +       .pvsched_vcpu_notify_event = nop_pvsched_notify_event,
> +       .name = "pvsched_nop",
> +       .owner = THIS_MODULE,
> +};
> +
> +static int __init pvsched_init(void)
> +{
> +       return WARN_ON(pvsched_register_vcpu_ops(&nop_vcpu_ops));
> +}
> +
> +late_initcall(pvsched_init);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm
  2024-04-03 14:01 ` [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm Vineeth Pillai (Google)
@ 2024-04-08 13:58   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 13:58 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Josh Don, Barret Rhoden, David Vernet

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> kvm uses the kernel's paravirt sched framework to assign an available
> pvsched driver to a guest. Guest vcpus register with the pvsched
> driver and call into the driver callbacks to notify it of the events that
> the driver is interested in.
>
> This PoC doesn't do the callback on interrupt injection yet; that will be
> implemented in subsequent iterations.
>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  arch/x86/kvm/Kconfig     |  13 ++++
>  arch/x86/kvm/x86.c       |   3 +
>  include/linux/kvm_host.h |  32 +++++++++
>  virt/kvm/kvm_main.c      | 148 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 196 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 65ed14b6540b..c1776cdb5b65 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -189,4 +189,17 @@ config KVM_MAX_NR_VCPUS
>           the memory footprint of each KVM guest, regardless of how many vCPUs are
>           created for a given VM.
>
> +config PARAVIRT_SCHED_KVM
> +       bool "Enable paravirt scheduling capability for kvm"
> +       depends on KVM
> +       default n
> +       help
> +         Paravirtualized scheduling facilitates the exchange of scheduling
> +         related information between the host and guest through shared memory,
> +         enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> +         An illustrative use case involves dynamically boosting the priority of
> +         a vCPU thread when the guest is executing a latency-sensitive workload
> +         on that specific vCPU.
> +         This config enables paravirt scheduling in the kvm hypervisor.
> +
>  endif # VIRTUALIZATION
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ffe580169c93..d0abc2c64d47 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10896,6 +10896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
>         preempt_disable();
>
> +       kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMENTER);
> +
>         static_call(kvm_x86_prepare_switch_to_guest)(vcpu);
>
>         /*
> @@ -11059,6 +11061,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>         guest_timing_exit_irqoff();
>
>         local_irq_enable();
> +       kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMEXIT);
>         preempt_enable();
>
>         kvm_vcpu_srcu_read_lock(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 179df96b20f8..6381569f3de8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -45,6 +45,8 @@
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
>
> +#include <linux/pvsched.h>
> +
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
>  #endif
> @@ -832,6 +834,11 @@ struct kvm {
>         bool vm_bugged;
>         bool vm_dead;
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +       spinlock_t pvsched_ops_lock;
> +       struct pvsched_vcpu_ops __rcu *pvsched_ops;
> +#endif
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>         struct notifier_block pm_notifier;
>  #endif
> @@ -2413,4 +2420,29 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events);
> +int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu);
> +
> +int kvm_replace_pvsched_ops(struct kvm *kvm, char *name);
> +#else
> +static inline int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
> +{
> +       return 0;
> +}
> +static inline int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
> +{
> +       return 0;
> +}
> +static inline void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
> +{
> +}
> +
> +static inline int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
> +{
> +       return 0;
> +}
> +#endif
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f50960b0e3a..0546814e4db7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -170,6 +170,142 @@ bool kvm_is_zone_device_page(struct page *page)
>         return is_zone_device_page(page);
>  }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +typedef enum {
> +       PVSCHED_CB_REGISTER = 1,
> +       PVSCHED_CB_UNREGISTER = 2,
> +       PVSCHED_CB_NOTIFY = 3
> +} pvsched_vcpu_callback_t;
> +
> +/*
> + * Helper function to invoke the pvsched driver callback.
> + */
> +static int __vcpu_pvsched_callback(struct kvm_vcpu *vcpu, u32 events,
> +               pvsched_vcpu_callback_t action)
> +{
> +       int ret = 0;
> +       struct pid *pid;
> +       struct pvsched_vcpu_ops *ops;
> +
> +       rcu_read_lock();
> +       ops = rcu_dereference(vcpu->kvm->pvsched_ops);
> +       if (!ops) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       pid = rcu_dereference(vcpu->pid);
> +       if (WARN_ON_ONCE(!pid)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       get_pid(pid);
> +       switch(action) {
> +               case PVSCHED_CB_REGISTER:
> +                       ops->pvsched_vcpu_register(pid);
> +                       break;
> +               case PVSCHED_CB_UNREGISTER:
> +                       ops->pvsched_vcpu_unregister(pid);
> +                       break;
> +               case PVSCHED_CB_NOTIFY:
> +                       if (ops->events & events) {
> +                               ops->pvsched_vcpu_notify_event(
> +                                       NULL, /* TODO: Pass guest allocated sharedmem addr */
> +                                       pid,
> +                                       ops->events & events);
> +                       }
> +                       break;
> +               default:
> +                       WARN_ON_ONCE(1);
> +       }
> +       put_pid(pid);
> +
> +out:
> +       rcu_read_unlock();
> +       return ret;
> +}
> +
> +int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
> +{
> +       return __vcpu_pvsched_callback(vcpu, events, PVSCHED_CB_NOTIFY);
> +}
> +
> +int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
> +{
> +       return __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_REGISTER);
> +       /*
> +        * TODO: Action if the registration fails?
> +        */
> +}
> +
> +void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
> +{
> +       __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_UNREGISTER);
> +}
> +
> +/*
> + * Replaces the VM's current pvsched driver.
> + * if name is NULL or empty string, unassign the
> + * current driver.
> + */
> +int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
> +{
> +       int ret = 0;
> +       unsigned long i;
> +       struct kvm_vcpu *vcpu = NULL;
> +       struct pvsched_vcpu_ops *ops = NULL, *prev_ops;
> +
> +
> +       spin_lock(&kvm->pvsched_ops_lock);
> +
> +       prev_ops = rcu_dereference(kvm->pvsched_ops);
> +
> +       /*
> +        * Unassign operation if the passed in value is
> +        * NULL or an empty string.
> +        */
> +       if (name && *name) {
> +               ops = pvsched_get_vcpu_ops(name);
> +               if (!ops) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
> +       if (prev_ops) {
> +               /*
> +                * Unregister current pvsched driver.
> +                */
> +               kvm_for_each_vcpu(i, vcpu, kvm) {
> +                       kvm_vcpu_pvsched_unregister(vcpu);
> +               }
> +
> +               pvsched_put_vcpu_ops(prev_ops);
> +       }
> +
> +
> +       rcu_assign_pointer(kvm->pvsched_ops, ops);
> +       if (ops) {
> +               /*
> +                * Register new pvsched driver.
> +                */
> +               kvm_for_each_vcpu(i, vcpu, kvm) {
> +                       WARN_ON_ONCE(kvm_vcpu_pvsched_register(vcpu));
> +               }
> +       }
> +
> +out:
> +       spin_unlock(&kvm->pvsched_ops_lock);
> +
> +       if (ret)
> +               return ret;
> +
> +       synchronize_rcu();
> +
> +       return 0;
> +}
> +#endif
> +
>  /*
>   * Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
>   * page, NULL otherwise.  Note, the list of refcounted PG_reserved page types
> @@ -508,6 +644,8 @@ static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
>         kvm_arch_vcpu_destroy(vcpu);
>         kvm_dirty_ring_free(&vcpu->dirty_ring);
>
> +       kvm_vcpu_pvsched_unregister(vcpu);
> +
>         /*
>          * No need for rcu_read_lock as VCPU_RUN is the only place that changes
>          * the vcpu->pid pointer, and at destruction time all file descriptors
> @@ -1221,6 +1359,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>
>         BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +       spin_lock_init(&kvm->pvsched_ops_lock);
> +#endif
> +
>         /*
>          * Force subsequent debugfs file creations to fail if the VM directory
>          * is not created (by kvm_create_vm_debugfs()).
> @@ -1343,6 +1485,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         int i;
>         struct mm_struct *mm = kvm->mm;
>
> +       kvm_replace_pvsched_ops(kvm, NULL);
> +
>         kvm_destroy_pm_notifier(kvm);
>         kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>         kvm_destroy_vm_debugfs(kvm);
> @@ -3779,6 +3923,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
>                 if (kvm_vcpu_check_block(vcpu) < 0)
>                         break;
>
> +               kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_HALT);
> +
>                 waited = true;
>                 schedule();
>         }
> @@ -4434,6 +4580,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
>                         /* The thread running this VCPU changed. */
>                         struct pid *newpid;
>
> +                       kvm_vcpu_pvsched_unregister(vcpu);
>                         r = kvm_arch_vcpu_run_pid_change(vcpu);
>                         if (r)
>                                 break;
> @@ -4442,6 +4589,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
>                         rcu_assign_pointer(vcpu->pid, newpid);
>                         if (oldpid)
>                                 synchronize_rcu();
> +                       kvm_vcpu_pvsched_register(vcpu);
>                         put_pid(oldpid);
>                 }
>                 r = kvm_arch_vcpu_ioctl_run(vcpu);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs
  2024-04-03 14:01 ` [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs Vineeth Pillai (Google)
@ 2024-04-08 13:59   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 13:59 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Barret Rhoden, David Vernet, Josh Don

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> Implement ioctls for assigning and unassigning the pvsched driver for a
> guest. VMMs would need to adopt these ioctls to support the feature.
> Also add a temporary debugfs interface for managing this.
>
> Ideally, the hypervisor would be able to determine the pvsched driver
> based on the information received from the guest. Guest VMs with the
> feature enabled would request the hypervisor to select a pvsched driver.
> The ioctl API is an override mechanism to give more control to the admin.
>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/uapi/linux/kvm.h |   6 ++
>  virt/kvm/kvm_main.c      | 117 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 123 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c3308536482b..4b29bdad4188 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -2227,4 +2227,10 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>
> +struct kvm_pvsched_ops {
> +       __u8 ops_name[32]; /* PVSCHED_NAME_MAX */
> +};
> +
> +#define KVM_GET_PVSCHED_OPS            _IOR(KVMIO, 0xe4, struct kvm_pvsched_ops)
> +#define KVM_REPLACE_PVSCHED_OPS                _IOWR(KVMIO, 0xe5, struct kvm_pvsched_ops)
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0546814e4db7..b3d9c362d2e3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1223,6 +1223,79 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
>         }
>  }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +static int pvsched_vcpu_ops_show(struct seq_file *m, void *data)
> +{
> +       char ops_name[PVSCHED_NAME_MAX] = "";
> +       struct pvsched_vcpu_ops *ops;
> +       struct kvm *kvm = (struct kvm *) m->private;
> +
> +       rcu_read_lock();
> +       ops = rcu_dereference(kvm->pvsched_ops);
> +       if (ops)
> +               strncpy(ops_name, ops->name, PVSCHED_NAME_MAX);
> +       rcu_read_unlock();
> +
> +       seq_printf(m, "%s\n", ops_name);
> +
> +       return 0;
> +}
> +
> +static ssize_t
> +pvsched_vcpu_ops_write(struct file *filp, const char __user *ubuf,
> +               size_t cnt, loff_t *ppos)
> +{
> +       int ret;
> +       char *cmp;
> +       char buf[PVSCHED_NAME_MAX];
> +       struct inode *inode;
> +       struct kvm *kvm;
> +
> +       if (cnt > PVSCHED_NAME_MAX)
> +               return -EINVAL;
> +
> +       if (copy_from_user(&buf, ubuf, cnt))
> +               return -EFAULT;
> +
> +       cmp = strstrip(buf);
> +
> +       inode = file_inode(filp);
> +       inode_lock(inode);
> +       kvm = (struct kvm *)inode->i_private;
> +       ret = kvm_replace_pvsched_ops(kvm, cmp);
> +       inode_unlock(inode);
> +
> +       if (ret)
> +               return ret;
> +
> +       *ppos += cnt;
> +       return cnt;
> +}
> +
> +static int pvsched_vcpu_ops_open(struct inode *inode, struct file *filp)
> +{
> +       return single_open(filp, pvsched_vcpu_ops_show, inode->i_private);
> +}
> +
> +static const struct file_operations pvsched_vcpu_ops_fops = {
> +       .open           = pvsched_vcpu_ops_open,
> +       .write          = pvsched_vcpu_ops_write,
> +       .read           = seq_read,
> +       .llseek         = seq_lseek,
> +       .release        = single_release,
> +};
> +
> +static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
> +{
> +       debugfs_create_file("pvsched_vcpu_ops", 0644, kvm->debugfs_dentry, kvm,
> +                           &pvsched_vcpu_ops_fops);
> +}
> +#else
> +static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
> +{
> +}
> +#endif
> +
>  static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
>  {
>         static DEFINE_MUTEX(kvm_debugfs_lock);
> @@ -1288,6 +1361,8 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
>                                     &stat_fops_per_vm);
>         }
>
> +       kvm_create_vm_pvsched_debugfs(kvm);
> +
>         ret = kvm_arch_create_vm_debugfs(kvm);
>         if (ret)
>                 goto out_err;
> @@ -5474,6 +5549,48 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_gmem_create(kvm, &guest_memfd);
>                 break;
>         }
> +#endif
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +       case KVM_REPLACE_PVSCHED_OPS:
> +               struct pvsched_vcpu_ops *ops;
> +               struct kvm_pvsched_ops in_ops, out_ops;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&in_ops, argp, sizeof(in_ops)))
> +                       goto out;
> +
> +               out_ops.ops_name[0] = 0;
> +
> +               rcu_read_lock();
> +               ops = rcu_dereference(kvm->pvsched_ops);
> +               if (ops)
> +                       strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
> +               rcu_read_unlock();
> +
> +               r = kvm_replace_pvsched_ops(kvm, (char *)in_ops.ops_name);
> +               if (r)
> +                       goto out;
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       case KVM_GET_PVSCHED_OPS:
> +               out_ops.ops_name[0] = 0;
> +               rcu_read_lock();
> +               ops = rcu_dereference(kvm->pvsched_ops);
> +               if (ops)
> +                       strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
> +               rcu_read_unlock();
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
> +                       goto out;
> +
> +               r = 0;
> +               break;
>  #endif
>         default:
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 4/5] pvsched: bpf support for pvsched
  2024-04-03 14:01 ` [RFC PATCH v2 4/5] pvsched: bpf support for pvsched Vineeth Pillai (Google)
@ 2024-04-08 14:00   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 14:00 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Josh Don, Barret Rhoden, David Vernet

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> Add support for implementing bpf pvsched drivers. bpf programs can use
> struct_ops to define the callbacks of pvsched drivers.
>
> This is only a skeleton of the bpf framework for pvsched. Some
> verification details are not implemented yet.
>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/bpf/bpf_struct_ops_types.h |   4 +
>  virt/pvsched/Makefile             |   2 +-
>  virt/pvsched/pvsched_bpf.c        | 141 ++++++++++++++++++++++++++++++
>  3 files changed, 146 insertions(+), 1 deletion(-)
>  create mode 100644 virt/pvsched/pvsched_bpf.c
>
> diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
> index 5678a9ddf817..9d5e4d1a331a 100644
> --- a/kernel/bpf/bpf_struct_ops_types.h
> +++ b/kernel/bpf/bpf_struct_ops_types.h
> @@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
>  #include <net/tcp.h>
>  BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
>  #endif
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +#include <linux/pvsched.h>
> +BPF_STRUCT_OPS_TYPE(pvsched_vcpu_ops)
> +#endif
>  #endif
> diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
> index 4ca38e30479b..02bc072cd806 100644
> --- a/virt/pvsched/Makefile
> +++ b/virt/pvsched/Makefile
> @@ -1,2 +1,2 @@
>
> -obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
> +obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o pvsched_bpf.o
> diff --git a/virt/pvsched/pvsched_bpf.c b/virt/pvsched/pvsched_bpf.c
> new file mode 100644
> index 000000000000..b125089abc3b
> --- /dev/null
> +++ b/virt/pvsched/pvsched_bpf.c
> @@ -0,0 +1,141 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Google  */
> +
> +#include <linux/types.h>
> +#include <linux/bpf_verifier.h>
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/filter.h>
> +#include <linux/pvsched.h>
> +
> +
> +/* "extern" is to avoid sparse warning.  It is only used in bpf_struct_ops.c. */
> +extern struct bpf_struct_ops bpf_pvsched_vcpu_ops;
> +
> +static int bpf_pvsched_vcpu_init(struct btf *btf)
> +{
> +       return 0;
> +}
> +
> +static bool bpf_pvsched_vcpu_is_valid_access(int off, int size,
> +                                      enum bpf_access_type type,
> +                                      const struct bpf_prog *prog,
> +                                      struct bpf_insn_access_aux *info)
> +{
> +       if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
> +               return false;
> +       if (type != BPF_READ)
> +               return false;
> +       if (off % size != 0)
> +               return false;
> +
> +       if (!btf_ctx_access(off, size, type, prog, info))
> +               return false;
> +
> +       return true;
> +}
> +
> +static int bpf_pvsched_vcpu_btf_struct_access(struct bpf_verifier_log *log,
> +                                       const struct bpf_reg_state *reg,
> +                                       int off, int size)
> +{
> +       /*
> +        * TODO: Enable write access to Guest shared mem.
> +        */
> +       return -EACCES;
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_pvsched_vcpu_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       return bpf_base_func_proto(func_id);
> +}
> +
> +static const struct bpf_verifier_ops bpf_pvsched_vcpu_verifier_ops = {
> +       .get_func_proto         = bpf_pvsched_vcpu_get_func_proto,
> +       .is_valid_access        = bpf_pvsched_vcpu_is_valid_access,
> +       .btf_struct_access      = bpf_pvsched_vcpu_btf_struct_access,
> +};
> +
> +static int bpf_pvsched_vcpu_init_member(const struct btf_type *t,
> +                                 const struct btf_member *member,
> +                                 void *kdata, const void *udata)
> +{
> +       const struct pvsched_vcpu_ops *uvm_ops;
> +       struct pvsched_vcpu_ops *vm_ops;
> +       u32 moff;
> +
> +       uvm_ops = (const struct pvsched_vcpu_ops *)udata;
> +       vm_ops = (struct pvsched_vcpu_ops *)kdata;
> +
> +       moff = __btf_member_bit_offset(t, member) / 8;
> +       switch (moff) {
> +       case offsetof(struct pvsched_vcpu_ops, events):
> +               vm_ops->events = *(u32 *)(udata + moff);
> +               return 1;
> +       case offsetof(struct pvsched_vcpu_ops, name):
> +               if (bpf_obj_name_cpy(vm_ops->name, uvm_ops->name,
> +                                       sizeof(vm_ops->name)) <= 0)
> +                       return -EINVAL;
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int bpf_pvsched_vcpu_check_member(const struct btf_type *t,
> +                                  const struct btf_member *member,
> +                                  const struct bpf_prog *prog)
> +{
> +       return 0;
> +}
> +
> +static int bpf_pvsched_vcpu_reg(void *kdata)
> +{
> +       return pvsched_register_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static void bpf_pvsched_vcpu_unreg(void *kdata)
> +{
> +       pvsched_unregister_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static int bpf_pvsched_vcpu_validate(void *kdata)
> +{
> +       return pvsched_validate_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static int bpf_pvsched_vcpu_update(void *kdata, void *old_kdata)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int __pvsched_vcpu_register(struct pid *pid)
> +{
> +       return 0;
> +}
> +static void __pvsched_vcpu_unregister(struct pid *pid)
> +{
> +}
> +static void __pvsched_notify_event(void *addr, struct pid *pid, u32 event)
> +{
> +}
> +
> +static struct pvsched_vcpu_ops __bpf_ops_pvsched_vcpu_ops = {
> +       .pvsched_vcpu_register = __pvsched_vcpu_register,
> +       .pvsched_vcpu_unregister = __pvsched_vcpu_unregister,
> +       .pvsched_vcpu_notify_event = __pvsched_notify_event,
> +};
> +
> +struct bpf_struct_ops bpf_pvsched_vcpu_ops = {
> +       .init = &bpf_pvsched_vcpu_init,
> +       .validate = bpf_pvsched_vcpu_validate,
> +       .update = bpf_pvsched_vcpu_update,
> +       .verifier_ops = &bpf_pvsched_vcpu_verifier_ops,
> +       .reg = bpf_pvsched_vcpu_reg,
> +       .unreg = bpf_pvsched_vcpu_unreg,
> +       .check_member = bpf_pvsched_vcpu_check_member,
> +       .init_member = bpf_pvsched_vcpu_init_member,
> +       .name = "pvsched_vcpu_ops",
> +       .cfi_stubs = &__bpf_ops_pvsched_vcpu_ops,
> +};
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver.
  2024-04-03 14:01 ` [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver Vineeth Pillai (Google)
@ 2024-04-08 14:01   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-04-08 14:01 UTC (permalink / raw)
  To: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li
  Cc: Steven Rostedt, Joel Fernandes, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, Tejun Heo,
	Josh Don, Barret Rhoden, David Vernet

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
>
> A dummy skeleton of a bpf pvsched driver. This is just for demonstration
> purposes and would need more work to be included as a test for this
> feature.
>
> Not-Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> ---
>  .../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++++++++++++++++++
>  1 file changed, 37 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
>
> diff --git a/tools/testing/selftests/bpf/progs/bpf_pvsched.c b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
> new file mode 100644
> index 000000000000..a653baa3034b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +
> +#include "vmlinux.h"
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("struct_ops/pvsched_vcpu_reg")
> +int BPF_PROG(pvsched_vcpu_reg, struct pid *pid)
> +{
> +       bpf_printk("pvsched_vcpu_reg: pid: %p", pid);
> +       return 0;
> +}
> +
> +SEC("struct_ops/pvsched_vcpu_unreg")
> +void BPF_PROG(pvsched_vcpu_unreg, struct pid *pid)
> +{
> +       bpf_printk("pvsched_vcpu_unreg: pid: %p", pid);
> +}
> +
> +SEC("struct_ops/pvsched_vcpu_notify_event")
> +void BPF_PROG(pvsched_vcpu_notify_event, void *addr, struct pid *pid, __u32 event)
> +{
> +       bpf_printk("pvsched_vcpu_notify: pid: %p, event:%u", pid, event);
> +}
> +
> +SEC(".struct_ops")
> +struct pvsched_vcpu_ops pvsched_ops = {
> +       .pvsched_vcpu_register          = (void *)pvsched_vcpu_reg,
> +       .pvsched_vcpu_unregister        = (void *)pvsched_vcpu_unreg,
> +       .pvsched_vcpu_notify_event      = (void *)pvsched_vcpu_notify_event,
> +       .events                         = 0x6,
> +       .name                           = "bpf_pvsched_ops",
> +};
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
                   ` (5 preceding siblings ...)
  2024-04-08 13:54 ` [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Remanan Pillai
@ 2024-05-01 15:29 ` Sean Christopherson
  2024-05-02 13:42   ` Vineeth Remanan Pillai
  6 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-05-01 15:29 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt,
	Joel Fernandes, Suleiman Souhlal, Masami Hiramatsu, himadrics,
	kvm, linux-kernel, x86

On Wed, Apr 03, 2024, Vineeth Pillai (Google) wrote:
> Double scheduling is a concern with virtualization hosts, where the host
> schedules vcpus without knowing what's run by the vcpu and the guest
> schedules tasks without knowing where the vcpu is physically running.
> This causes issues related to latencies, power consumption, resource
> utilization, etc. An ideal solution would be a cooperative scheduling
> framework where the guest and host share scheduling-related information
> and make educated scheduling decisions to optimally handle the
> workloads. As a first step, we are taking a stab at reducing latencies
> for latency-sensitive workloads in the guest.
> 
> v1 RFC[1] was posted in December 2023. The main disagreement was with the
> implementation, where the patch was making scheduling policy decisions
> in kvm, and kvm is not the right place to do that. The suggestion was to
> move the policy decisions outside of kvm and let kvm handle only the
> notifications needed to make the policy decisions. This patch series is
> an iterative step towards implementing the feature as a layered
> design where the policy could be implemented outside of kvm as a
> kernel built-in, a kernel module or a bpf program.
> 
> This design mainly comprises 4 components:
> 
> - pvsched driver: Implements the scheduling policies. Registers with
>     the host a set of callbacks that the hypervisor (kvm) can use to
>     notify it of vcpu events that the driver is interested in. The
>     callbacks are passed the address of shared memory so that the
>     driver can read scheduling information shared by the guest and
>     also update the scheduling policies set by the driver.
> - kvm component: Selects the pvsched driver for a guest and notifies
>     the driver via callbacks for events that the driver is interested
>     in. Also interfaces with the guest to retrieve the shared memory
>     region used for sharing the scheduling information.
> - host kernel component: Implements the APIs for:
>     - the pvsched driver to register/unregister with the host kernel, and
>     - the hypervisor to assign/unassign drivers for guests.
> - guest component: Implements a framework for sharing the scheduling
>     information with the pvsched driver through kvm.

Roughly summarizing an off-list discussion.
 
 - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
   similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.

 - "Negotiating" features/hooks should also be handled outside of the kernel,
   e.g. similar to how VirtIO devices negotiate features between host and guest.

 - Pushing PV scheduler entities to KVM should either be done through an exported
   API, e.g. if the scheduler is provided by a separate kernel module, or by a
   KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).

I think those were the main takeaways?  Vineeth and Joel, please chime in on
anything I've missed or misremembered.
 
The other reason I'm bringing this discussion back on-list is that I (very) briefly
discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
that would allow userspace to request an extended time slice[*], and that if that
landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.

I see that you're both on that thread, so presumably you're already aware of the
idea, but I wanted to bring it up here to make sure that we aren't trying to
design something that's more complex than is needed.

Specifically, if the guest has a generic way to request an extended time slice
(or boost its priority?), would that address your use cases?  Or rather, how close
does it get you?  E.g. the guest will have no way of requesting a larger time
slice or boosting priority when an event is _pending_ but not yet received by
the guest, but is that actually problematic in practice?

[*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-05-01 15:29 ` Sean Christopherson
@ 2024-05-02 13:42   ` Vineeth Remanan Pillai
  2024-06-24 11:01     ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-05-02 13:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt,
	Joel Fernandes, Suleiman Souhlal, Masami Hiramatsu, himadrics,
	kvm, linux-kernel, x86

> > This design mainly comprises 4 components:
> >
> > - pvsched driver: Implements the scheduling policies. Registers with
> >     the host a set of callbacks that the hypervisor (kvm) can use to
> >     notify it of vcpu events that the driver is interested in. The
> >     callbacks are passed the address of shared memory so that the
> >     driver can read scheduling information shared by the guest and
> >     also update the scheduling policies set by the driver.
> > - kvm component: Selects the pvsched driver for a guest and notifies
> >     the driver via callbacks for events that the driver is interested
> >     in. Also interfaces with the guest to retrieve the shared memory
> >     region used for sharing the scheduling information.
> > - host kernel component: Implements the APIs for:
> >     - the pvsched driver to register/unregister with the host kernel, and
> >     - the hypervisor to assign/unassign drivers for guests.
> > - guest component: Implements a framework for sharing the scheduling
> >     information with the pvsched driver through kvm.
>
> Roughly summarizing an off-list discussion.
>
>  - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
>    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
>
>  - "Negotiating" features/hooks should also be handled outside of the kernel,
>    e.g. similar to how VirtIO devices negotiate features between host and guest.
>
>  - Pushing PV scheduler entities to KVM should either be done through an exported
>    API, e.g. if the scheduler is provided by a separate kernel module, or by a
>    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
>
> I think those were the main takeaways?  Vineeth and Joel, please chime in on
> anything I've missed or misremembered.
>
Thanks for the summary of the offlist discussion; all the points are
captured, with just some minor additions. The v2 implementation moved
the scheduling policies out of kvm into a separate entity called the
pvsched driver, which could be implemented as a kernel module or a bpf
program. But the handshake between guest and host to decide which
pvsched driver to attach was still going through kvm. So it was
suggested to move this handshake (discovery and negotiation) outside
of kvm as well. The idea is to have a virtual device exposed by the
VMM which would take care of the handshake. The guest driver for this
device would talk to the device to learn the pvsched details on the
host and pass along the shared memory details. Once the handshake is
completed, the device is responsible for loading the pvsched driver
(the bpf program or kernel module implementing the policies). The
pvsched driver then registers with the tracepoints exported by kvm and
handles the callbacks from then on. The actual scheduling is still
done by the host scheduler; the pvsched driver on the host is
responsible only for setting the policies (placement, priorities,
etc.).

With the above approach, the only change in kvm would be the internal
tracepoints for pvsched. The host kernel is also unchanged, and all
the complexity moves to the VMM and the pvsched driver. The guest
kernel gains a new driver that talks to the virtual pvsched device and
hooks into the guest kernel to pass scheduling information to the
host (via tracepoints).

> The other reason I'm bringing this discussion back on-list is that I (very) briefly
> discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
> that would allow userspace to request an extended time slice[*], and that if that
> landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.
>
> I see that you're both on that thread, so presumably you're already aware of the
> idea, but I wanted to bring it up here to make sure that we aren't trying to
> design something that's more complex than is needed.
>
> Specifically, if the guest has a generic way to request an extended time slice
> (or boost its priority?), would that address your use cases?  Or rather, how close
> does it get you?  E.g. the guest will have no way of requesting a larger time
> slice or boosting priority when an event is _pending_ but not yet received by
> the guest, but is that actually problematic in practice?
>
> [*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home
>
Thanks for bringing this up. We were also very much interested in this
feature and were planning to use the pvmem shared memory instead of
the rseq framework for guests. The motivation for the paravirt
scheduling framework was a bit broader than the latency issues, and
hence we were proposing a somewhat more complex design. Other than the
use case for temporarily extending the time slice of vcpus, we were
also looking at vcpu placement on physical cpus and at educated
decisions the guest scheduler could make if it had a picture of the
host cpu load. Having a paravirt mechanism to share scheduling
information would benefit such cases. Once we have this framework set
up, the policy implementation on guest and host could be taken care of
by other entities like BPF programs, modules or schedulers like
sched_ext.

We are working on a v3 incorporating the above ideas and will be
posting a design RFC soon. Thanks for all the help and inputs on
this.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-05-02 13:42   ` Vineeth Remanan Pillai
@ 2024-06-24 11:01     ` Vineeth Remanan Pillai
  2024-07-12 12:57       ` Joel Fernandes
  0 siblings, 1 reply; 42+ messages in thread
From: Vineeth Remanan Pillai @ 2024-06-24 11:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt,
	Joel Fernandes, Suleiman Souhlal, Masami Hiramatsu, himadrics,
	kvm, linux-kernel, x86, graf, drjunior.org

> > Roughly summarizing an off-list discussion.
> >
> >  - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
> >    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
> >
> >  - "Negotiating" features/hooks should also be handled outside of the kernel,
> >    e.g. similar to how VirtIO devices negotiate features between host and guest.
> >
> >  - Pushing PV scheduler entities to KVM should either be done through an exported
> >    API, e.g. if the scheduler is provided by a separate kernel module, or by a
> >    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
> >
> > I think those were the main takeaways?  Vineeth and Joel, please chime in on
> > anything I've missed or misremembered.
> >
> Thanks for the summary of the offlist discussion; all the points are
> captured, with just some minor additions. The v2 implementation moved
> the scheduling policies out of kvm into a separate entity called the
> pvsched driver, which could be implemented as a kernel module or a bpf
> program. But the handshake between guest and host to decide which
> pvsched driver to attach was still going through kvm. So it was
> suggested to move this handshake (discovery and negotiation) outside
> of kvm as well. The idea is to have a virtual device exposed by the
> VMM which would take care of the handshake. The guest driver for this
> device would talk to the device to learn the pvsched details on the
> host and pass along the shared memory details. Once the handshake is
> completed, the device is responsible for loading the pvsched driver
> (the bpf program or kernel module implementing the policies). The
> pvsched driver then registers with the tracepoints exported by kvm and
> handles the callbacks from then on. The actual scheduling is still
> done by the host scheduler; the pvsched driver on the host is
> responsible only for setting the policies (placement, priorities,
> etc.).
>
> With the above approach, the only change in kvm would be the internal
> tracepoints for pvsched. The host kernel is also unchanged, and all
> the complexity moves to the VMM and the pvsched driver. The guest
> kernel gains a new driver that talks to the virtual pvsched device and
> hooks into the guest kernel to pass scheduling information to the
> host (via tracepoints).
>
Noting down the recent offlist discussion and details of our response.

Based on the previous discussions, we had come up with a modified
design focusing on minimal kvm changes. The design is as follows:
- Guest and host share scheduling information via shared memory
region. Details of the layout of the memory region, information shared
and actions and policies are defined by the pvsched protocol. And this
protocol is implemented by a BPF program or a kernel module.
- Host exposes a virtual device (the pvsched device) to the guest. This
device is the mechanism for host and guest for handshake and
negotiation to reach a decision on the pvsched protocol to use. The
virtual device is implemented in the VMM in userland as it doesn't
come in the performance critical path.
- Guest loads a pvsched driver during device enumeration. The driver
initiates the protocol handshake and negotiation with the host and
decides on the protocol. This driver creates a per-cpu shared memory
region and shares the GFN with the device in the host. Guest also
loads the BPF program that implements the protocol in the guest.
- Once the VMM has all the information needed (per-cpu shared memory
GFN, vcpu task pids etc), it loads the BPF program which implements
the protocol on the host.
- BPF program on the host registers with the tracepoints in kvm to get
callbacks on interested events like VMENTER, VMEXIT, interrupt
injection etc. Similarly, the guest BPF program registers tracepoints
in the guest kernel for interested events like sched wakeup, sched
switch, enqueue, dequeue, irq entry/exit etc.

The above design is minimally invasive to kvm and the core kernel: the
protocol is implemented as loadable programs, and the protocol
handshake and negotiation go through the virtual device framework. The
protocol implementation takes care of information sharing and policy
enforcement, while the scheduler handles the actual scheduling
decisions. A sample policy implementation (boosting for latency
sensitive workloads, as an example) could be included in the kernel
for reference.

We had an offlist discussion about the above design and a couple of
ideas were suggested as an alternative. We had taken an action item to
study the alternatives for feasibility. The rest of this mail lists
the use cases (not exhaustive) and our feasibility investigations.

Existing use cases
-------------------------

- A latency sensitive workload on the guest might need more than one
time slice to complete, but should not block any higher priority task
in the host. In our design, the latency sensitive workload shares its
priority requirements with the host (RT priority, cfs nice value etc). Host
implementation of the protocol sets the priority of the vcpu task
accordingly so that the host scheduler can make an educated decision
on the next task to run. This makes sure that host processes and vcpu
tasks compete fairly for the cpu resource.
- Guest should be able to notify the host that it is running a lower
priority task so that the host can reschedule it if needed. As
mentioned before, the guest shares the priority with the host and the
host takes a better scheduling decision.
- Proactive vcpu boosting for events like interrupt injection.
Depending on the guest for a boost request might be too late as the vcpu
might not be scheduled to run even after interrupt injection. Host
implementation of the protocol boosts the vcpu task's priority so that
it gets a better chance of immediately being scheduled and guest can
handle the interrupt with minimal latency. Once the guest is done
handling the interrupt, it can notify the host and lower the priority
of the vcpu task.
- Guests which assign specialized tasks to specific vcpus can share
that information with the host so that the host can try to avoid
colocation of those vcpus on a single physical cpu. For example, there are
interrupt pinning use cases where specific cpus are chosen to handle
critical interrupts and passing this information to the host could be
useful.
- Another use case is the sharing of cpu capacity details between
guest and host. Sharing the host cpu's load with the guest will enable
the guest to schedule latency sensitive tasks on the best possible
vcpu. This could be partially achievable by steal time, but steal time
is more apparent on busy vcpus. There are workloads which are mostly
sleepers, but wake up intermittently to serve short latency sensitive
workloads. Input event handlers in Chrome are one such example.

Data from the prototype implementation shows promising improvement in
reducing latencies. Data was shared in the v1 cover letter. We have
not implemented the capacity based placement policies yet, but plan to
do that soon and have some real numbers to share.

Ideas brought up during offlist discussion
-------------------------------------------------------

1. rseq based timeslice extension mechanism[1]

While the rseq based mechanism helps in giving the vcpu task one more
time slice, it will not help with the other use cases. We had a chat
with Steve, and the rseq mechanism was mainly for improving lock
contention; it would not work best for vcpu boosting considering all
the use cases above. RT or high priority tasks in the VM would often
need more than one time slice to complete their work and, at the same
time, should not hurt the host workloads. The goal for the above use
cases is not requesting an extra slice, but modifying the priority in
such a way that host processes and guest processes compete fairly for
cpu resources. This also means that a vcpu task can request a lower
priority when it is running lower priority tasks in the VM.

2. vDSO approach
Regarding the vDSO approach, we had a look at that and feel that
without a major redesign of vDSO, it might be difficult to achieve the
requirements. vDSO is currently implemented as a read-only memory
region shared with processes. For this to work with
virtualization, we would need to map a similar region to the guest and
it has to be read-write. This is more or less what we are also
proposing, but with minimal changes in the core kernel. With the
current design, the shared memory region would be the responsibility
of the virtual pvsched device framework.

Sorry for the long mail. Please have a look and let us know your thoughts :-)

Thanks,

[1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-06-24 11:01     ` Vineeth Remanan Pillai
@ 2024-07-12 12:57       ` Joel Fernandes
  2024-07-12 14:09         ` Mathieu Desnoyers
  0 siblings, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2024-07-12 12:57 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Sean Christopherson, Ben Segall, Borislav Petkov,
	Daniel Bristot de Oliveira, Dave Hansen, Dietmar Eggemann,
	H . Peter Anvin, Ingo Molnar, Juri Lelli, Mel Gorman,
	Paolo Bonzini, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, mathieu.desnoyers, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> > > Roughly summarizing an off-list discussion.
> > >
> > >  - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
> > >    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
> > >
> > >  - "Negotiating" features/hooks should also be handled outside of the kernel,
> > >    e.g. similar to how VirtIO devices negotiate features between host and guest.
> > >
> > >  - Pushing PV scheduler entities to KVM should either be done through an exported
> > >    API, e.g. if the scheduler is provided by a separate kernel module, or by a
> > >    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
> > >
> > > I think those were the main takeaways?  Vineeth and Joel, please chime in on
> > > anything I've missed or misremembered.
> > >
> > Thanks for the summary of the offlist discussion; all the points are
> > captured, with just some minor additions. The v2 implementation moved
> > the scheduling policies out of kvm into a separate entity called the
> > pvsched driver, which could be implemented as a kernel module or a bpf
> > program. But the handshake between guest and host to decide which
> > pvsched driver to attach was still going through kvm. So it was
> > suggested to move this handshake (discovery and negotiation) outside
> > of kvm as well. The idea is to have a virtual device exposed by the
> > VMM which would take care of the handshake. The guest driver for this
> > device would talk to the device to learn the pvsched details on the
> > host and pass along the shared memory details. Once the handshake is
> > completed, the device is responsible for loading the pvsched driver
> > (the bpf program or kernel module implementing the policies). The
> > pvsched driver then registers with the tracepoints exported by kvm and
> > handles the callbacks from then on. The actual scheduling is still
> > done by the host scheduler; the pvsched driver on the host is
> > responsible only for setting the policies (placement, priorities,
> > etc.).
> >
> > With the above approach, the only change in kvm would be the internal
> > tracepoints for pvsched. The host kernel is also unchanged, and all
> > the complexity moves to the VMM and the pvsched driver. The guest
> > kernel gains a new driver that talks to the virtual pvsched device and
> > hooks into the guest kernel to pass scheduling information to the
> > host (via tracepoints).
> >
> Noting down the recent offlist discussion and details of our response.
> 
> Based on the previous discussions, we had come up with a modified
> design focusing on minimal kvm changes. The design is as follows:
> - Guest and host share scheduling information via shared memory
> region. Details of the layout of the memory region, information shared
> and actions and policies are defined by the pvsched protocol. And this
> protocol is implemented by a BPF program or a kernel module.
> - Host exposes a virtual device (the pvsched device) to the guest. This
> device is the mechanism for host and guest for handshake and
> negotiation to reach a decision on the pvsched protocol to use. The
> virtual device is implemented in the VMM in userland as it doesn't
> come in the performance critical path.
> - Guest loads a pvsched driver during device enumeration. The driver
> initiates the protocol handshake and negotiation with the host and
> decides on the protocol. This driver creates a per-cpu shared memory
> region and shares the GFN with the device in the host. Guest also
> loads the BPF program that implements the protocol in the guest.
> - Once the VMM has all the information needed (per-cpu shared memory
> GFN, vcpu task pids etc), it loads the BPF program which implements
> the protocol on the host.
> - BPF program on the host registers with the tracepoints in kvm to get
> callbacks on interested events like VMENTER, VMEXIT, interrupt
> injection etc. Similarly, the guest BPF program registers tracepoints
> in the guest kernel for interested events like sched wakeup, sched
> switch, enqueue, dequeue, irq entry/exit etc.
> 
> The above design is minimally invasive to kvm and the core kernel: the
> protocol is implemented as loadable programs, and the protocol
> handshake and negotiation go through the virtual device framework. The
> protocol implementation takes care of information sharing and policy
> enforcement, while the scheduler handles the actual scheduling
> decisions. A sample policy implementation (boosting for latency
> sensitive workloads, as an example) could be included in the kernel
> for reference.
> 
> We had an offlist discussion about the above design and a couple of
> ideas were suggested as an alternative. We had taken an action item to
> study the alternatives for feasibility. The rest of this mail lists
> the use cases (not exhaustive) and our feasibility investigations.
> 
> Existing use cases
> -------------------------
> 
> - A latency sensitive workload on the guest might need more than one
> time slice to complete, but should not block any higher priority task
> in the host. In our design, the latency sensitive workload shares its
> priority requirements with the host (RT priority, cfs nice value etc). Host
> implementation of the protocol sets the priority of the vcpu task
> accordingly so that the host scheduler can make an educated decision
> on the next task to run. This makes sure that host processes and vcpu
> tasks compete fairly for the cpu resource.
> - Guest should be able to notify the host that it is running a lower
> priority task so that the host can reschedule it if needed. As
> mentioned before, the guest shares the priority with the host and the
> host takes a better scheduling decision.
> - Proactive vcpu boosting for events like interrupt injection.
> Depending on the guest for a boost request might be too late as the vcpu
> might not be scheduled to run even after interrupt injection. Host
> implementation of the protocol boosts the vcpu task's priority so that
> it gets a better chance of immediately being scheduled and guest can
> handle the interrupt with minimal latency. Once the guest is done
> handling the interrupt, it can notify the host and lower the priority
> of the vcpu task.
> - Guests which assign specialized tasks to specific vcpus can share
> that information with the host so that the host can try to avoid
> colocation of those vcpus on a single physical cpu. For example, there are
> interrupt pinning use cases where specific cpus are chosen to handle
> critical interrupts and passing this information to the host could be
> useful.
> - Another use case is the sharing of cpu capacity details between
> guest and host. Sharing the host cpu's load with the guest will enable
> the guest to schedule latency sensitive tasks on the best possible
> vcpu. This could be partially achievable by steal time, but steal time
> is more apparent on busy vcpus. There are workloads which are mostly
> sleepers, but wake up intermittently to serve short latency sensitive
> workloads. Input event handlers in Chrome are one such example.
> 
> Data from the prototype implementation shows promising improvement in
> reducing latencies. Data was shared in the v1 cover letter. We have
> not implemented the capacity based placement policies yet, but plan to
> do that soon and have some real numbers to share.
> 
> Ideas brought up during offlist discussion
> -------------------------------------------------------
> 
> 1. rseq based timeslice extension mechanism[1]
> 
> While the rseq based mechanism helps in giving the vcpu task one more
> time slice, it will not help with the other use cases. We had a chat
> with Steve, and the rseq mechanism was mainly for improving lock
> contention; it would not work best for vcpu boosting considering all
> the use cases above. RT or high priority tasks in the VM would often
> need more than one time slice to complete their work and, at the same
> time, should not hurt the host workloads. The goal for the above use
> cases is not requesting an extra slice, but modifying the priority in
> such a way that host processes and guest processes compete fairly for
> cpu resources. This also means that a vcpu task can request a lower
> priority when it is running lower priority tasks in the VM.

I was looking at rseq on request from the KVM call; however, it does not
yet make sense to me how to expose the rseq area via the guest VA to the
host kernel. rseq is for userspace-to-kernel, not VM-to-kernel.

Steven Rostedt said as much as well; thoughts? Adding Mathieu as well.

This idea seems to suffer from the same over-engineering as the vDSO
approach below; rseq does not seem to fit.

Steven Rostedt told me that what we instead need is a tracepoint callback
in a driver that does the boosting.

 - Joel



> 
> 2. vDSO approach
> Regarding the vDSO approach, we had a look at that and feel that
> without a major redesign of vDSO, it might be difficult to achieve the
> requirements. vDSO is currently implemented as a read-only memory
> region shared with processes. For this to work with
> virtualization, we would need to map a similar region to the guest and
> it has to be read-write. This is more or less what we are also
> proposing, but with minimal changes in the core kernel. With the
> current design, the shared memory region would be the responsibility
> of the virtual pvsched device framework.
> 
> Sorry for the long mail. Please have a look and let us know your thoughts :-)
> 
> Thanks,
> 
> [1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 12:57       ` Joel Fernandes
@ 2024-07-12 14:09         ` Mathieu Desnoyers
  2024-07-12 14:48           ` Sean Christopherson
                             ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2024-07-12 14:09 UTC (permalink / raw)
  To: Joel Fernandes, Vineeth Remanan Pillai
  Cc: Sean Christopherson, Ben Segall, Borislav Petkov,
	Daniel Bristot de Oliveira, Dave Hansen, Dietmar Eggemann,
	H . Peter Anvin, Ingo Molnar, Juri Lelli, Mel Gorman,
	Paolo Bonzini, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li,
	Steven Rostedt, Suleiman Souhlal, Masami Hiramatsu, himadrics,
	kvm, linux-kernel, x86, graf, drjunior.org

On 2024-07-12 08:57, Joel Fernandes wrote:
> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
[...]
>> Existing use cases
>> -------------------------
>>
>> - A latency sensitive workload on the guest might need more than one
>> time slice to complete, but should not block any higher priority task
>> in the host. In our design, the latency sensitive workload shares its
>> priority requirements with the host (RT priority, cfs nice value etc). Host
>> implementation of the protocol sets the priority of the vcpu task
>> accordingly so that the host scheduler can make an educated decision
>> on the next task to run. This makes sure that host processes and vcpu
>> tasks compete fairly for the cpu resource.

AFAIU, the information you need to convey to achieve this is the priority
of the task within the guest. This information needs to reach the host
scheduler so it can make an informed decision.

One thing that is unclear about this is what the acceptable
overhead/latency is for pushing this information from guest to host.
Is a hypercall OK, or does it need to be exchanged over a memory
mapping shared between guest and host?

Hypercalls provide simple ABIs across guest/host, and they allow
the guest to immediately notify the host (similar to an interrupt).

Shared memory mapping will require a carefully crafted ABI layout,
and will only allow the host to use the information provided when
the host runs. Therefore, if the choice is to share this information
only through shared memory, the host scheduler will only be able to
read it when it runs, e.g. on a hypercall, an interrupt, and so on.

>> - Guest should be able to notify the host that it is running a lower
>> priority task so that the host can reschedule it if needed. As
>> mentioned before, the guest shares the priority with the host and the
>> host takes a better scheduling decision.

It is unclear to me whether this information needs to be "pushed"
from guest to host (e.g. hypercall) in a way that allows the host
to immediately act on this information, or if it is OK to have the
host read this information when its scheduler happens to run.

>> - Proactive vcpu boosting for events like interrupt injection.
>> Depending on the guest for a boost request might be too late as the vcpu
>> might not be scheduled to run even after interrupt injection. Host
>> implementation of the protocol boosts the vcpu tasks priority so that
>> it gets a better chance of immediately being scheduled and guest can
>> handle the interrupt with minimal latency. Once the guest is done
>> handling the interrupt, it can notify the host and lower the priority
>> of the vcpu task.

This appears to be a scenario where the host sets a "high priority", and
the guest clears it when it is done with the irq handler. I guess it can
be done either way (hypercall or shared memory), but the choice would
depend on the parameters identified above: acceptable overhead vs acceptable
latency to inform the host scheduler.

>> - Guests which assign specialized tasks to specific vcpus can share
>> that information with the host so that host can try to avoid
>> colocation of those cpus in a single physical cpu. for eg: there are
>> interrupt pinning use cases where specific cpus are chosen to handle
>> critical interrupts and passing this information to the host could be
>> useful.

How frequently is this topology expected to change ? Is it something that
is set once when the guest starts and then is fixed ? How often it changes
will likely affect the tradeoffs here.

>> - Another use case is the sharing of cpu capacity details between
>> guest and host. Sharing the host cpu's load with the guest will enable
>> the guest to schedule latency sensitive tasks on the best possible
>> vcpu. This could be partially achievable by steal time, but steal time
>> is more apparent on busy vcpus. There are workloads which are mostly
>> sleepers, but wake up intermittently to serve short latency sensitive
>> workloads. input event handlers in chrome is one such example.

OK, so for this use-case the information goes the other way around: from host
to guest. Here the shared mapping seems better than polling the state
through a hypercall.

>>
>> Data from the prototype implementation shows promising improvement in
>> reducing latencies. Data was shared in the v1 cover letter. We have
>> not implemented the capacity based placement policies yet, but plan to
>> do that soon and have some real numbers to share.
>>
>> Ideas brought up during offlist discussion
>> -------------------------------------------------------
>>
>> 1. rseq based timeslice extension mechanism[1]
>>
>> While the rseq based mechanism helps in giving the vcpu task one more
>> time slice, it will not help in the other use cases. We had a chat
>> with Steve and the rseq mechanism was mainly for improving lock
>> contention and would not work best with vcpu boosting considering all
>> the use cases above. RT or high priority tasks in the VM would often
>> need more than one time slice to complete its work and at the same,
>> should not be hurting the host workloads. The goal for the above use
>> cases is not requesting an extra slice, but to modify the priority in
>> such a way that host processes and guest processes get a fair way to
>> compete for cpu resources. This also means that vcpu task can request
>> a lower priority when it is running lower priority tasks in the VM.
> 
> I was looking at the rseq on request from the KVM call, however it does not
> make sense to me yet how to expose the rseq area via the Guest VA to the host
> kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Steven Rostedt said as much as well, thoughts? Add Mathieu as well.

I'm not sure that rseq would help at all here, but I think we may want to
borrow concepts of data sitting in shared memory across privilege levels
and apply them to VMs.

If some of the ideas end up being useful *outside* of the context of VMs,
then I'd be willing to consider adding fields to rseq. But as long as it is
VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
you can safely share across host/guest kernels.

> 
> This idea seems to suffer from the same vDSO over-engineering below, rseq
> does not seem to fit.
> 
> Steven Rostedt told me, what we instead need is a tracepoint callback in a
> driver, that does the boosting.

I utterly dislike changing the system behavior through tracepoints. They were
designed to observe the system, not modify its behavior. If people start abusing
them, then subsystem maintainers will stop adding them. Please don't do that.
Add a notifier or think about integrating what you are planning to add into the
driver instead.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 14:09         ` Mathieu Desnoyers
@ 2024-07-12 14:48           ` Sean Christopherson
  2024-07-12 15:32             ` Mathieu Desnoyers
  2024-07-12 16:24           ` Steven Rostedt
  2024-07-12 16:24           ` Joel Fernandes
  2 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-07-12 14:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joel Fernandes, Vineeth Remanan Pillai, Ben Segall,
	Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
	Dietmar Eggemann, H . Peter Anvin, Ingo Molnar, Juri Lelli,
	Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> > > Existing use cases
> > > -------------------------
> > > 
> > > - A latency sensitive workload on the guest might need more than one
> > > time slice to complete, but should not block any higher priority task
> > > in the host. In our design, the latency sensitive workload shares its
> > > priority requirements to host(RT priority, cfs nice value etc). Host
> > > implementation of the protocol sets the priority of the vcpu task
> > > accordingly so that the host scheduler can make an educated decision
> > > on the next task to run. This makes sure that host processes and vcpu
> > > tasks compete fairly for the cpu resource.
> 
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information need to reach the host
> scheduler to make informed decision.
> 
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?
> 
> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).

Hypercalls have myriad problems.  They require a VM-Exit, which largely defeats
the purpose of boosting the vCPU priority for performance reasons.  They don't
allow for delegation as there's no way for the hypervisor to know if a hypercall
from guest userspace should be allowed, versus anything memory based where the
ability for guest userspace to access the memory demonstrates permission (else
the guest kernel wouldn't have mapped the memory into userspace).

> > > Ideas brought up during offlist discussion
> > > -------------------------------------------------------
> > > 
> > > 1. rseq based timeslice extension mechanism[1]
> > > 
> > > While the rseq based mechanism helps in giving the vcpu task one more
> > > time slice, it will not help in the other use cases. We had a chat
> > > with Steve and the rseq mechanism was mainly for improving lock
> > > contention and would not work best with vcpu boosting considering all
> > > the use cases above. RT or high priority tasks in the VM would often
> > > need more than one time slice to complete its work and at the same,
> > > should not be hurting the host workloads. The goal for the above use
> > > cases is not requesting an extra slice, but to modify the priority in
> > > such a way that host processes and guest processes get a fair way to
> > > compete for cpu resources. This also means that vcpu task can request
> > > a lower priority when it is running lower priority tasks in the VM.

Then figure out a way to let userspace boost a task's priority without needing a
syscall.  vCPUs are not directly schedulable entities; the task doing KVM_RUN
on the vCPU fd is what the scheduler sees.  Any scheduling enhancement that
benefits vCPUs by definition can benefit userspace tasks.

> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel.  rseq is for userspace to kernel, not VM to kernel.

Any memory that is exposed to host userspace can be exposed to the guest.  Things
like this are implemented via "overlay" pages, where the guest asks host userspace
to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
address of the page containing the rseq structure associated with the vCPU (in
pretty much every modern VMM, each vCPU has a dedicated task/thread).

At that point, the vCPU can read/write the rseq structure directly.

The reason us KVM folks are pushing y'all towards something like rseq is that
(again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
is actually just priority boosting a task.  So rather than invent something
virtualization specific, invent a mechanism for priority boosting from userspace
without a syscall, and then extend it to the virtualization use case.

> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
> 
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
> 
> If some of the ideas end up being useful *outside* of the context of VMs,

Modulo the assertion above that this is about boosting priority instead of
requesting an extended time slice, this is essentially the same thing as the
"delay resched" discussion[*].  The only difference is that the vCPU is in a
critical section, e.g. an IRQ handler, versus the userspace task being in a
critical section.

[*] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home

> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
> you can safely share across host/guest kernels.


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 14:48           ` Sean Christopherson
@ 2024-07-12 15:32             ` Mathieu Desnoyers
  2024-07-12 16:14               ` Sean Christopherson
  2024-07-12 16:30               ` Steven Rostedt
  0 siblings, 2 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2024-07-12 15:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joel Fernandes, Vineeth Remanan Pillai, Ben Segall,
	Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
	Dietmar Eggemann, H . Peter Anvin, Ingo Molnar, Juri Lelli,
	Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On 2024-07-12 10:48, Sean Christopherson wrote:
> On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
>> On 2024-07-12 08:57, Joel Fernandes wrote:
>>> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
>> [...]
>>>> Existing use cases
>>>> -------------------------
>>>>
>>>> - A latency sensitive workload on the guest might need more than one
>>>> time slice to complete, but should not block any higher priority task
>>>> in the host. In our design, the latency sensitive workload shares its
>>>> priority requirements to host(RT priority, cfs nice value etc). Host
>>>> implementation of the protocol sets the priority of the vcpu task
>>>> accordingly so that the host scheduler can make an educated decision
>>>> on the next task to run. This makes sure that host processes and vcpu
>>>> tasks compete fairly for the cpu resource.
>>
>> AFAIU, the information you need to convey to achieve this is the priority
>> of the task within the guest. This information need to reach the host
>> scheduler to make informed decision.
>>
>> One thing that is unclear about this is what is the acceptable
>> overhead/latency to push this information from guest to host ?
>> Is an hypercall OK or does it need to be exchanged over a memory
>> mapping shared between guest and host ?
>>
>> Hypercalls provide simple ABIs across guest/host, and they allow
>> the guest to immediately notify the host (similar to an interrupt).
> 
> Hypercalls have myriad problems.  They require a VM-Exit, which largely defeats
> the purpose of boosting the vCPU priority for performance reasons.  They don't
> allow for delegation as there's no way for the hypervisor to know if a hypercall
> from guest userspace should be allowed, versus anything memory based where the
> ability for guest userspace to access the memory demonstrates permission (else
> the guest kernel wouldn't have mapped the memory into userspace).

OK, this answers my question above: the overhead of the hypercall pretty
much defeats the purpose of this priority boosting.

> 
>>>> Ideas brought up during offlist discussion
>>>> -------------------------------------------------------
>>>>
>>>> 1. rseq based timeslice extension mechanism[1]
>>>>
>>>> While the rseq based mechanism helps in giving the vcpu task one more
>>>> time slice, it will not help in the other use cases. We had a chat
>>>> with Steve and the rseq mechanism was mainly for improving lock
>>>> contention and would not work best with vcpu boosting considering all
>>>> the use cases above. RT or high priority tasks in the VM would often
>>>> need more than one time slice to complete its work and at the same,
>>>> should not be hurting the host workloads. The goal for the above use
>>>> cases is not requesting an extra slice, but to modify the priority in
>>>> such a way that host processes and guest processes get a fair way to
>>>> compete for cpu resources. This also means that vcpu task can request
>>>> a lower priority when it is running lower priority tasks in the VM.
> 
> Then figure out a way to let userspace boost a task's priority without needing a
> syscall.  vCPUs are not directly schedulable entities, the task doing KVM_RUN
> on the vCPU fd is what the scheduler sees.  Any scheduling enhancement that
> benefits vCPUs by definition can benefit userspace tasks.

Yes.

> 
>>> I was looking at the rseq on request from the KVM call, however it does not
>>> make sense to me yet how to expose the rseq area via the Guest VA to the host
>>> kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Any memory that is exposed to host userspace can be exposed to the guest.  Things
> like this are implemented via "overlay" pages, where the guest asks host userspace
> to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> address of the page containing the rseq structure associated with the vCPU (in
> pretty much every modern VMM, each vCPU has a dedicated task/thread).
> 
> At that point, the vCPU can read/write the rseq structure directly.

This helps me understand what you are trying to achieve. I disagree with
some aspects of the design you present above: mainly the lack of
isolation between the guest kernel and the host task doing the KVM_RUN.
We do not want to let the guest kernel store to rseq fields that would
result in getting the host task killed (e.g. a bogus rseq_cs pointer).
But this is something we can improve upon once we understand what we
are trying to achieve.

> 
> The reason us KVM folks are pushing y'all towards something like rseq is that
> (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> is actually just priority boosting a task.  So rather than invent something
> virtualization specific, invent a mechanism for priority boosting from userspace
> without a syscall, and then extend it to the virtualization use case.
> 
[...]

OK, so how about we expose "offsets" tuning the base values ?

- The task doing KVM_RUN, just like any other task, has its "priority"
   value as set by setpriority(2).

- We introduce two new fields in the per-thread struct rseq, which is
   mapped in the host task doing KVM_RUN and readable from the scheduler:

   - __s32 prio_offset; /* Priority offset to apply on the current task priority. */

   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
     which would be typically NULL except for tasks doing KVM_RUN. This would
     sit in its own pages per vcpu, which takes care of isolation between guest
     kernel and host process. Those would be RW by the guest kernel as
     well and contain e.g.:

     struct vcpu_sched {
         __u32 len;  /* Length of active fields. */

         __s32 prio_offset;
         __s32 cpu_capacity_offset;
         [...]
     };

So when the host kernel tries to calculate the effective priority of a task
doing KVM_RUN, it would basically start from its current priority and offset
it by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).

The cpu_capacity_offset would be populated by the host kernel and read by the
guest kernel scheduler for scheduling/migration decisions.

I'm certainly missing details about how priority offsets should be bounded for
given tasks. This could be an extension to setrlimit(2).

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 15:32             ` Mathieu Desnoyers
@ 2024-07-12 16:14               ` Sean Christopherson
  2024-07-12 16:30               ` Steven Rostedt
  1 sibling, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-07-12 16:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joel Fernandes, Vineeth Remanan Pillai, Ben Segall,
	Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
	Dietmar Eggemann, H . Peter Anvin, Ingo Molnar, Juri Lelli,
	Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 10:48, Sean Christopherson wrote:
> > > > I was looking at the rseq on request from the KVM call, however it does not
> > > > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > > > kernel.  rseq is for userspace to kernel, not VM to kernel.
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > At that point, the vCPU can read/write the rseq structure directly.
> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).

Yeah, exposing the full rseq structure to the guest is probably a terrible idea.
The above isn't intended to be a design, the goal is just to illustrate how an
rseq-like mechanism can be extended to the guest without needing virtualization
specific ABI and without needing new KVM functionality.

> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> > 
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>   value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>   mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>   - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

Ideally, there won't be a virtualization specific structure.  A vCPU specific
field might make sense (or it might not), but I really want to avoid defining a
structure that is unique to virtualization.  E.g. a userspace doing M:N scheduling
can likely benefit from any capacity hooks/information that would benefit a guest
scheduler.  I.e. rather than a vcpu_sched structure, have a user_sched structure
(or whatever name makes sense), and then have two struct pointers in rseq.

Though I'm skeptical that having two structs in play would be necessary or sane.
E.g. if both userspace and guest can adjust priority, then they'll need to coordinate
in order to avoid unexpected results.  I can definitely see wanting to let the
userspace VMM bound the priority of a vCPU, but that should be a relatively static
decision, i.e. can be done via syscall or something similarly "slow".

>     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>     which would be typically NULL except for tasks doing KVM_RUN. This would
>     sit in its own pages per vcpu, which takes care of isolation between guest
>     kernel and host process. Those would be RW by the guest kernel as
>     well and contain e.g.:
> 
>     struct vcpu_sched {
>         __u32 len;  /* Length of active fields. */
> 
>         __s32 prio_offset;
>         __s32 cpu_capacity_offset;
>         [...]
>     };
> 
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 14:09         ` Mathieu Desnoyers
  2024-07-12 14:48           ` Sean Christopherson
@ 2024-07-12 16:24           ` Steven Rostedt
  2024-07-12 16:44             ` Sean Christopherson
  2024-07-12 16:24           ` Joel Fernandes
  2 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-12 16:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joel Fernandes, Vineeth Remanan Pillai, Sean Christopherson,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, 12 Jul 2024 10:09:03 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > driver, that does the boosting.  
> 
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

I tend to agree that a notifier would be much better than using
tracepoints, but then I also think eBPF has already let that cat out of
the bag. :-p

All we need is a notifier that gets called at every VMEXIT.

The main issue that this is trying to solve is to boost the priority of
the guest without making the hypercall, so that it can react quickly
(lower the latency of reaction to an event). When the task is
unboosted, there's no avoiding the hypercall, as there's no other way
to tell the host that this vCPU should no longer run at a higher
priority (the high priority may prevent schedules, or even prevent
checking the new prio in the shared memory).

If there's a way to have a shared memory, via virtio or whatever, and
any notifier that gets called at any VMEXIT, then this is trivial to
implement.

-- Steve


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 14:09         ` Mathieu Desnoyers
  2024-07-12 14:48           ` Sean Christopherson
  2024-07-12 16:24           ` Steven Rostedt
@ 2024-07-12 16:24           ` Joel Fernandes
  2024-07-12 17:28             ` Mathieu Desnoyers
  2 siblings, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2024-07-12 16:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Vineeth Remanan Pillai, Sean Christopherson, Ben Segall,
	Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
	Dietmar Eggemann, H . Peter Anvin, Ingo Molnar, Juri Lelli,
	Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> >> Existing use cases
> >> -------------------------
> >>
> >> - A latency sensitive workload on the guest might need more than one
> >> time slice to complete, but should not block any higher priority task
> >> in the host. In our design, the latency sensitive workload shares its
> >> priority requirements to host(RT priority, cfs nice value etc). Host
> >> implementation of the protocol sets the priority of the vcpu task
> >> accordingly so that the host scheduler can make an educated decision
> >> on the next task to run. This makes sure that host processes and vcpu
> >> tasks compete fairly for the cpu resource.
>
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information need to reach the host
> scheduler to make informed decision.
>
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?

Shared memory for the boost (the host can pick it up later, e.g. on
host preemption). But for unboost, we possibly need a hypercall in
addition to the shared memory.

>
> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).
>
> Shared memory mapping will require a carefully crafted ABI layout,
> and will only allow the host to use the information provided when
> the host runs. Therefore, if the choice is to share this information
> only through shared memory, the host scheduler will only be able to
> read it when it next runs, i.e. when something causes the host to run:
> a hypercall, an interrupt, and so on.

The initial idea was to handle the details/format/allocation of the
shared memory out-of-band in a driver, but then later the rseq idea
came up.

> >> - Guest should be able to notify the host that it is running a lower
> >> priority task so that the host can reschedule it if needed. As
> >> mentioned before, the guest shares the priority with the host and the
> >> host takes a better scheduling decision.
>
> It is unclear to me whether this information needs to be "pushed"
> from guest to host (e.g. hypercall) in a way that allows the host
> to immediately act on this information, or if it is OK to have the
> host read this information when its scheduler happens to run.

For boosting, there is no need to push immediately; the host only needs
to see it on preemption.

> >> - Proactive vcpu boosting for events like interrupt injection.
> >> Depending on the guest for boost request might be too late as the vcpu
> >> might not be scheduled to run even after interrupt injection. Host
> >> implementation of the protocol boosts the vcpu tasks priority so that
> >> it gets a better chance of immediately being scheduled and guest can
> >> handle the interrupt with minimal latency. Once the guest is done
> >> handling the interrupt, it can notify the host and lower the priority
> >> of the vcpu task.
>
> This appears to be a scenario where the host sets a "high priority", and
> the guest clears it when it is done with the irq handler. I guess it can
> be done either ways (hypercall or shared memory), but the choice would
> depend on the parameters identified above: acceptable overhead vs acceptable
> latency to inform the host scheduler.

Yes, we have found ways to reduce/make fewer hypercalls on unboost.

> >> - Guests which assign specialized tasks to specific vcpus can share
> >> that information with the host so that host can try to avoid
> >> colocation of those cpus in a single physical cpu. for eg: there are
> >> interrupt pinning use cases where specific cpus are chosen to handle
> >> critical interrupts and passing this information to the host could be
> >> useful.
>
> How frequently is this topology expected to change ? Is it something that
> is set once when the guest starts and then is fixed ? How often it changes
> will likely affect the tradeoffs here.

Yes, it will be fixed.

> >> - Another use case is the sharing of cpu capacity details between
> >> guest and host. Sharing the host cpu's load with the guest will enable
> >> the guest to schedule latency sensitive tasks on the best possible
> >> vcpu. This could be partially achievable by steal time, but steal time
> >> is more apparent on busy vcpus. There are workloads which are mostly
> >> sleepers, but wake up intermittently to serve short latency sensitive
> >> workloads. input event handlers in chrome is one such example.
>
> OK so for this use-case information goes the other way around: from host
> to guest. Here the shared mapping seems better than polling the state
> through an hypercall.

Yes, FWIW this particular part is for future and not initially required per-se.

> >> Data from the prototype implementation shows promising improvement in
> >> reducing latencies. Data was shared in the v1 cover letter. We have
> >> not implemented the capacity based placement policies yet, but plan to
> >> do that soon and have some real numbers to share.
> >>
> >> Ideas brought up during offlist discussion
> >> -------------------------------------------------------
> >>
> >> 1. rseq based timeslice extension mechanism[1]
> >>
> >> While the rseq based mechanism helps in giving the vcpu task one more
> >> time slice, it will not help in the other use cases. We had a chat
> >> with Steve and the rseq mechanism was mainly for improving lock
> >> contention and would not work best with vcpu boosting considering all
> >> the use cases above. RT or high priority tasks in the VM would often
> >> need more than one time slice to complete their work and, at the same
> >> time, should not be hurting the host workloads. The goal for the above
> >> use cases is not requesting an extra slice, but to modify the priority
> >> in such a way that host processes and guest processes get a fair chance
> >> to compete for cpu resources. This also means that the vcpu task can
> >> request a lower priority when it is running lower priority tasks in the VM.
> >
> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel.  rseq is for userspace to kernel, not VM to kernel.
> >
> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
>
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
>
> If some of the ideas end up being useful *outside* of the context of VMs,
> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
> you can safely share across host/guest kernels.

Yes, this was the initial plan. I also feel rseq cannot be applied here.

> > This idea seems to suffer from the same vDSO over-engineering below, rseq
> > does not seem to fit.
> >
> > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > driver, that does the boosting.
>
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

Well, we do have "raw" tracepoints not accessible from userspace, so
you're saying even those are off limits for adding callbacks?

 - Joel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 15:32             ` Mathieu Desnoyers
  2024-07-12 16:14               ` Sean Christopherson
@ 2024-07-12 16:30               ` Steven Rostedt
  2024-07-12 16:39                 ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-12 16:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Sean Christopherson, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, 12 Jul 2024 11:32:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >>> I was looking at the rseq on request from the KVM call, however it does not
> >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> >>> kernel.  rseq is for userspace to kernel, not VM to kernel.  
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > At that point, the vCPU can read/write the rseq structure directly.  

So basically, the vCPU thread can just create a virtio device that
exposes the rseq memory to the guest kernel?

One other issue we need to worry about is that IIUC rseq memory is
allocated by the guest/user, not the host kernel. This means it can be
swapped out. The code that handles this needs to be able to handle user
page faults.

> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).
> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> >   
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>    value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>    mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>    - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>    - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */
> 
>      vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>      which would be typically NULL except for tasks doing KVM_RUN. This would
>      sit in its own pages per vcpu, which takes care of isolation between guest
>      kernel and host process. Those would be RW by the guest kernel as
>      well and contain e.g.:

Hmm, maybe don't make this only vcpu specific; perhaps it could also be
useful for user space tasks that want to dynamically change their
priority without a system call. It could do the same thing. Yeah, yeah,
I may be coming up with a solution in search of a problem ;-)

-- Steve

> 
>      struct vcpu_sched {
>          __u32 len;  /* Length of active fields. */
> 
>          __s32 prio_offset;
>          __s32 cpu_capacity_offset;
>          [...]
>      };
> 
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> it by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 



* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:30               ` Steven Rostedt
@ 2024-07-12 16:39                 ` Sean Christopherson
  2024-07-12 17:02                   ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-07-12 16:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 11:32:30 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > >>> I was looking at the rseq on request from the KVM call, however it does not
> > >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> > >>> kernel.  rseq is for userspace to kernel, not VM to kernel.  
> > > 
> > > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > > like this are implemented via "overlay" pages, where the guest asks host userspace
> > > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > > address of the page containing the rseq structure associated with the vCPU (in
> > > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > > 
> > > At that point, the vCPU can read/write the rseq structure directly.  
> 
> So basically, the vCPU thread can just create a virtio device that
> exposes the rseq memory to the guest kernel?
> 
> One other issue we need to worry about is that IIUC rseq memory is
> allocated by the guest/user, not the host kernel. This means it can be
> swapped out. The code that handles this needs to be able to handle user
> page faults.

This is a non-issue; it will Just Work, the same as any other memory that is
exposed to the guest and can be reclaimed/swapped/migrated.

If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will
unmap the page from the guest.  If/when the page is accessed by the guest, KVM
will fault the page back into the host's primary MMU, and then map the new pfn
into the guest.


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:24           ` Steven Rostedt
@ 2024-07-12 16:44             ` Sean Christopherson
  2024-07-12 16:50               ` Joel Fernandes
  2024-07-12 17:12               ` Steven Rostedt
  0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-07-12 16:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 10:09:03 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > > 
> > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > driver, that does the boosting.  
> > 
> > I utterly dislike changing the system behavior through tracepoints. They were
> > designed to observe the system, not modify its behavior. If people start abusing
> > them, then subsystem maintainers will stop adding them. Please don't do that.
> > Add a notifier or think about integrating what you are planning to add into the
> > driver instead.
> 
> I tend to agree that a notifier would be much better than using
> tracepoints, but then I also think eBPF has already let that cat out of
> the bag. :-p
> 
> All we need is a notifier that gets called at every VMEXIT.

Why?  The only argument I've seen for needing to hook VM-Exit is so that the
host can speculatively boost the priority of the vCPU when delivering an IRQ,
but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
_before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
modern hardware that supports posted interrupts and IPI virtualization, i.e. for
which there will be no VM-Exit.


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:44             ` Sean Christopherson
@ 2024-07-12 16:50               ` Joel Fernandes
  2024-07-12 17:08                 ` Sean Christopherson
  2024-07-12 17:12               ` Steven Rostedt
  1 sibling, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2024-07-12 16:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Rostedt, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote:
> On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > On Fri, 12 Jul 2024 10:09:03 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > > > 
> > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > > driver, that does the boosting.  
> > > 
> > > I utterly dislike changing the system behavior through tracepoints. They were
> > > designed to observe the system, not modify its behavior. If people start abusing
> > > them, then subsystem maintainers will stop adding them. Please don't do that.
> > > Add a notifier or think about integrating what you are planning to add into the
> > > driver instead.
> > 
> > I tend to agree that a notifier would be much better than using
> > tracepoints, but then I also think eBPF has already let that cat out of
> > the bag. :-p
> > 
> > All we need is a notifier that gets called at every VMEXIT.
> 
> Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> host can speculatively boost the priority of the vCPU when delivering an IRQ,
> but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> which there will be no VM-Exit.

I am a bit confused by your statement, Sean, because if a higher prio HOST
thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
happen. That has nothing to do with IRQ delivery.  What am I missing?

thanks,

 - Joel


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:39                 ` Sean Christopherson
@ 2024-07-12 17:02                   ` Steven Rostedt
  0 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2024-07-12 17:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, 12 Jul 2024 09:39:52 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > 
> > One other issue we need to worry about is that IIUC rseq memory is
> > allocated by the guest/user, not the host kernel. This means it can be
> > swapped out. The code that handles this needs to be able to handle user
> > page faults.  
> 
> This is a non-issue; it will Just Work, the same as any other memory that is
> exposed to the guest and can be reclaimed/swapped/migrated.
> 
> If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will
> unmap the page from the guest.  If/when the page is accessed by the guest, KVM
> will fault the page back into the host's primary MMU, and then map the new pfn
> into the guest.

My comment is that, in the host kernel, accesses to this memory need to
be user-page-fault safe. They can't be made from atomic context.

-- Steve


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:50               ` Joel Fernandes
@ 2024-07-12 17:08                 ` Sean Christopherson
  2024-07-12 17:14                   ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-07-12 17:08 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Joel Fernandes wrote:
> On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > > On Fri, 12 Jul 2024 10:09:03 -0400
> > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > 
> > > > > 
> > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > > > driver, that does the boosting.  
> > > > 
> > > > I utterly dislike changing the system behavior through tracepoints. They were
> > > > designed to observe the system, not modify its behavior. If people start abusing
> > > > them, then subsystem maintainers will stop adding them. Please don't do that.
> > > > Add a notifier or think about integrating what you are planning to add into the
> > > > driver instead.
> > > 
> > > I tend to agree that a notifier would be much better than using
> > > tracepoints, but then I also think eBPF has already let that cat out of
> > > the bag. :-p
> > > 
> > > All we need is a notifier that gets called at every VMEXIT.
> > 
> > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > which there will be no VM-Exit.
> 
> I am a bit confused by your statement, Sean, because if a higher prio HOST
> thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
> happen. That has nothing to do with IRQ delivery.  What am I missing?

Why does that require hooking VM-Exit?


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:44             ` Sean Christopherson
  2024-07-12 16:50               ` Joel Fernandes
@ 2024-07-12 17:12               ` Steven Rostedt
  2024-07-16 23:44                 ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-12 17:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, 12 Jul 2024 09:44:16 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > All we need is a notifier that gets called at every VMEXIT.  
> 
> Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> host can speculatively boost the priority of the vCPU when delivering an IRQ,
> but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> which there will be no VM-Exit.

No. The speculative boost was for something else, but slightly
related. I guess the idea there was to have an incoming interrupt
boost the vCPU, because the interrupt could be waking an RT task. It may
still be something that is needed, but that's not what I'm talking about here.

The idea here is that when an RT task is scheduled in on the guest, we want
to lazily boost the vCPU. As long as the vCPU is running on the CPU, we do
not need to do anything. If the RT task is scheduled for a very short
time, it should not need to make any hypercall. The guest would set the
shared memory to the new priority when the RT task is scheduled in, and then
put back the lower priority when it is scheduled out and a SCHED_OTHER task
is scheduled in.

Now if the vCPU gets preempted, it is this moment that we need the host
kernel to look at the current priority of the task thread running on
the vCPU. If it is an RT task, we need to boost the vCPU to that
priority, so that a lower priority host thread does not interrupt it.

The host should also set a bit in the shared memory to tell the guest
that it was boosted. Then when the vCPU schedules a lower priority task
than what is in shared memory, and the bit is set that tells the guest
the host boosted the vCPU, it needs to make a hypercall to tell the
host that it can lower its priority again.

The incoming irq is to handle the race between the event that wakes the
RT task, and the RT task getting a chance to run. If the preemption
happens there, the vCPU may never have a chance to notify the host that
it wants to run an RT task.

-- Steve


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 17:08                 ` Sean Christopherson
@ 2024-07-12 17:14                   ` Steven Rostedt
  0 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2024-07-12 17:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joel Fernandes, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, 12 Jul 2024 10:08:43 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > I am a bit confused by your statement, Sean, because if a higher prio HOST
> > thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
> > happen. That has nothing to do with IRQ delivery.  What am I missing?  
> 
> Why does that require hooking VM-Exit?

To do the lazy boosting. That's when the host can see the priority of
the currently running task.

-- Steve


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 16:24           ` Joel Fernandes
@ 2024-07-12 17:28             ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2024-07-12 17:28 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Vineeth Remanan Pillai, Sean Christopherson, Ben Segall,
	Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
	Dietmar Eggemann, H . Peter Anvin, Ingo Molnar, Juri Lelli,
	Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Vitaly Kuznetsov, Wanpeng Li, Steven Rostedt, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On 2024-07-12 12:24, Joel Fernandes wrote:
> On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
[...]
>>>
>>> Steven Rostedt told me, what we instead need is a tracepoint callback in a
>>> driver, that does the boosting.
>>
>> I utterly dislike changing the system behavior through tracepoints. They were
>> designed to observe the system, not modify its behavior. If people start abusing
>> them, then subsystem maintainers will stop adding them. Please don't do that.
>> Add a notifier or think about integrating what you are planning to add into the
>> driver instead.
> 
> Well, we do have "raw" tracepoints not accessible from userspace, so
> you're saying even those are off limits for adding callbacks?

Yes. Even the "raw" tracepoints were designed as an "observation only"
API. Using them in lieu of notifiers is really repurposing them for
something they were not meant to do.

Just in terms of maintainability at the caller site, we should be
allowed to consider _all_ tracepoints as mostly exempt from side-effects
outside of the data structures within the attached tracers. This is not
true anymore if they are repurposed as notifiers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-12 17:12               ` Steven Rostedt
@ 2024-07-16 23:44                 ` Sean Christopherson
  2024-07-17  0:13                   ` Steven Rostedt
  2024-07-17  5:16                   ` Joel Fernandes
  0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-07-16 23:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 09:44:16 -0700
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > > All we need is a notifier that gets called at every VMEXIT.  
> > 
> > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > which there will be no VM-Exit.
> 
> No. The speculative boost was for something else, but slightly
> related. I guess the idea there was to have an incoming interrupt
> boost the vCPU, because the interrupt could be waking an RT task. It may
> still be something that is needed, but that's not what I'm talking about here.
> 
> The idea here is that when an RT task is scheduled in on the guest, we want
> to lazily boost the vCPU. As long as the vCPU is running on the CPU, we do
> not need to do anything. If the RT task is scheduled for a very short
> time, it should not need to make any hypercall. The guest would set the
> shared memory to the new priority when the RT task is scheduled in, and then
> put back the lower priority when it is scheduled out and a SCHED_OTHER task
> is scheduled in.
> 
> Now if the vCPU gets preempted, it is this moment that we need the host
> kernel to look at the current priority of the task thread running on
> the vCPU. If it is an RT task, we need to boost the vCPU to that
> priority, so that a lower priority host thread does not interrupt it.

I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
preempted, the host scheduler is already getting "notified", otherwise the vCPU
would still be scheduled in, i.e. wouldn't have been preempted.

> The host should also set a bit in the shared memory to tell the guest
> that it was boosted. Then when the vCPU schedules a lower priority task
> than what is in shared memory, and the bit is set that tells the guest
> the host boosted the vCPU, it needs to make a hypercall to tell the
> host that it can lower its priority again.

Which again doesn't _need_ a dedicated/manual VM-Exit.  E.g. why force the host
to reassess the priority instead of simply waiting until the next reschedule?  If
the host is running tickless, then presumably there is a scheduling entity running
on a different pCPU, i.e. that can react to vCPU priority changes without needing
a VM-Exit.


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-16 23:44                 ` Sean Christopherson
@ 2024-07-17  0:13                   ` Steven Rostedt
  2024-07-17  5:16                   ` Joel Fernandes
  1 sibling, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2024-07-17  0:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mathieu Desnoyers, Joel Fernandes, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Tue, 16 Jul 2024 16:44:05 -0700
Sean Christopherson <seanjc@google.com> wrote:
> > 
> > Now if the vCPU gets preempted, it is this moment that we need the host
> > kernel to look at the current priority of the task thread running on
> > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > priority, so that a lower priority host thread does not interrupt it.  
> 
> I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> preempted, the host scheduler is already getting "notified", otherwise the vCPU
> would still be scheduled in, i.e. wouldn't have been preempted.

The guest wants to lazily raise its priority when needed. So it changes its
priority in the shared memory, but the host doesn't know about the raised
priority and decides to preempt it (where it would not if it knew the
priority was raised). Then it exits into the host via VMEXIT. When else is
the host going to learn of this priority change?

> 
> > The host should also set a bit in the shared memory to tell the guest
> > that it was boosted. Then when the vCPU schedules a lower priority task
> > than what is in shared memory, and the bit is set that tells the guest
> > the host boosted the vCPU, it needs to make a hypercall to tell the
> > host that it can lower its priority again.  
> 
> Which again doesn't _need_ a dedicated/manual VM-Exit.  E.g. why force the host
> to reasses the priority instead of simply waiting until the next reschedule?  If
> the host is running tickless, then presumably there is a scheduling entity running
> on a different pCPU, i.e. that can react to vCPU priority changes without needing
> a VM-Exit.

This is done in a shared memory location. The guest can raise and lower its
priority by writing into the shared memory. It may raise it and lower it back
without the host ever knowing. No hypercall needed.

But if it raises its priority, and the host, unaware of the raised
priority, preempts it, then when the vCPU exits into the host (via VMEXIT)
that is the first time the host will know that the priority was raised, and
we can then call something like rt_mutex_setprio() to lazily change the
vCPU thread's priority. The host would then also set a bit to inform the
guest that the host knows of the change, and when the guest lowers its
priority again, it will need to make a hypercall to tell the host that its
priority is low again and it's OK to preempt it normally.

This is similar to how some architectures do lazy irq disabling, where they
only set some memory that says interrupts are disabled. Interrupts only
actually get disabled if an interrupt arrives and the code sees it is "soft
disabled". When interrupts are enabled again, the pending interrupt handler
is then called.
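To make that handshake concrete, here is a toy user-space model of it; all the names (SharedPage, guest_sched_in(), the hypercall counter) are invented for illustration and are not a proposed ABI:

```python
# Toy model of the lazy-boost handshake over shared memory. A real
# implementation would live in the guest scheduler and a virtio driver;
# here a plain object stands in for the shared page and a counter
# stands in for guest->host exits.

class SharedPage:
    def __init__(self):
        self.guest_prio = 0        # written by the guest scheduler
        self.host_boosted = False  # set by the host once it acts on a boost

hypercalls = 0  # stand-in for real hypercalls / VM-exits

def guest_sched_in(page, prio):
    # RT task scheduled in: a plain store to the shared page, no exit.
    page.guest_prio = prio

def guest_sched_out(page, normal_prio):
    # RT task scheduled out: only if the host already boosted the vCPU
    # (bit set) do we need a hypercall so it can lower the thread again.
    global hypercalls
    page.guest_prio = normal_prio
    if page.host_boosted:
        page.host_boosted = False
        hypercalls += 1

page = SharedPage()

# Short RT section, vCPU never preempted: zero hypercalls.
guest_sched_in(page, 90)
guest_sched_out(page, 0)
assert hypercalls == 0

# This time the host preempted the vCPU, saw guest_prio on the VM-exit,
# boosted the vCPU thread, and set the acknowledgment bit.
guest_sched_in(page, 90)
page.host_boosted = True
guest_sched_out(page, 0)
assert hypercalls == 1
```

In the common case the guest pays only two stores to the shared page, which is the same shape as the lazy irq disabling trick: the expensive operation happens only when the slow path actually fires.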

What are you suggesting to do for this fast way of increasing and
decreasing the priority of tasks?

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-16 23:44                 ` Sean Christopherson
  2024-07-17  0:13                   ` Steven Rostedt
@ 2024-07-17  5:16                   ` Joel Fernandes
  2024-07-17 14:14                     ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2024-07-17  5:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Rostedt, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > On Fri, 12 Jul 2024 09:44:16 -0700
> > Sean Christopherson <seanjc@google.com> wrote:
> >
> > > > All we need is a notifier that gets called at every VMEXIT.
> > >
> > > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > > which there will be no VM-Exit.
> >
> > No. The speculative boost was for something else, but slightly
> > related. I guess the ideal there was to have the interrupt coming in
> > boost the vCPU because the interrupt could be waking an RT task. It may
> > still be something needed, but that's not what I'm talking about here.
> >
> > The idea here is when an RT task is scheduled in on the guest, we want
> > to lazily boost it. As long as the vCPU is running on the CPU, we do
> > not need to do anything. If the RT task is scheduled for a very short
> > time, it should not need to call any hypercall. It would set the shared
> > memory to the new priority when the RT task is scheduled, and then put
> > back the lower priority when it is scheduled out and a SCHED_OTHER task
> > is scheduled in.
> >
> > Now if the vCPU gets preempted, it is this moment that we need the host
> > kernel to look at the current priority of the task thread running on
> > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > priority, so that a lower priority host thread does not interrupt it.
>
> I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> preempted, the host scheduler is already getting "notified", otherwise the vCPU
> would still be scheduled in, i.e. wouldn't have been preempted.

What you're saying is the scheduler should change the priority of the
vCPU thread dynamically. That's really not the job of the scheduler.
The user of the scheduler is what changes the priority of threads, not
the scheduler itself.

Joel

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17  5:16                   ` Joel Fernandes
@ 2024-07-17 14:14                     ` Sean Christopherson
  2024-07-17 14:36                       ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-07-17 14:14 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, Jul 17, 2024, Joel Fernandes wrote:
> On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > > On Fri, 12 Jul 2024 09:44:16 -0700
> > > Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > > All we need is a notifier that gets called at every VMEXIT.
> > > >
> > > > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > > > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > > > which there will be no VM-Exit.
> > >
> > > No. The speculative boost was for something else, but slightly
> > > related. I guess the ideal there was to have the interrupt coming in
> > > boost the vCPU because the interrupt could be waking an RT task. It may
> > > still be something needed, but that's not what I'm talking about here.
> > >
> > > The idea here is when an RT task is scheduled in on the guest, we want
> > > to lazily boost it. As long as the vCPU is running on the CPU, we do
> > > not need to do anything. If the RT task is scheduled for a very short
> > > time, it should not need to call any hypercall. It would set the shared
> > > memory to the new priority when the RT task is scheduled, and then put
> > > back the lower priority when it is scheduled out and a SCHED_OTHER task
> > > is scheduled in.
> > >
> > > Now if the vCPU gets preempted, it is this moment that we need the host
> > > kernel to look at the current priority of the task thread running on
> > > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > > priority, so that a lower priority host thread does not interrupt it.
> >
> > I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> > preempted, the host scheduler is already getting "notified", otherwise the vCPU
> > would still be scheduled in, i.e. wouldn't have been preempted.
> 
> What you're saying is the scheduler should change the priority of the
> vCPU thread dynamically. That's really not the job of the scheduler.
> The user of the scheduler is what changes the priority of threads, not
> the scheduler itself.

No.  If we go the proposed route[*] of adding a data structure that lets userspace
and/or the guest express/adjust the task's priority, then the scheduler simply
checks that data structure when querying the priority of a task.

[*] https://lore.kernel.org/all/ZpFWfInsXQdPJC0V@google.com

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 14:14                     ` Sean Christopherson
@ 2024-07-17 14:36                       ` Steven Rostedt
  2024-07-17 14:52                         ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-17 14:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joel Fernandes, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, 17 Jul 2024 07:14:59 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > What you're saying is the scheduler should change the priority of the
> > vCPU thread dynamically. That's really not the job of the scheduler.
> > The user of the scheduler is what changes the priority of threads, not
> > the scheduler itself.  
> 
> No.  If we go the proposed route[*] of adding a data structure that lets userspace
> and/or the guest express/adjust the task's priority, then the scheduler simply
> checks that data structure when querying the priority of a task.

The problem with that is the only use case for such a feature is for
vCPUs. There's no use case for a single thread to up and down its
priority. I work a lot in RT applications (well, not as much anymore,
but my career was heavy into it), and I can't see any use case where a
single thread would bounce its priority around. In fact, if I did see
that, I would complain that it was a poorly designed system.

Now for a guest kernel, that's very different. It has to handle things
like priority inheritance and such, where bouncing a thread's (or its
own vCPU thread's) priority most definitely makes sense.

So you are requesting that we add a bad user space interface to allow
lazy priority management from a thread so that we can use it in the
proper use case of a vCPU?

-- Steve


* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 14:36                       ` Steven Rostedt
@ 2024-07-17 14:52                         ` Steven Rostedt
  2024-07-17 15:20                           ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-17 14:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joel Fernandes, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, 17 Jul 2024 10:36:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The problem with that is the only use case for such a feature is for
> vCPUs. There's no use case for a single thread to up and down its
> priority. I work a lot in RT applications (well, not as much anymore,
> but my career was heavy into it). And I can't see any use case where a
> single thread would bounce its priority around. In fact, if I did see
> that, I would complain that it was a poorly designed system.
> 
> Now for a guest kernel, that's very different. It has to handle things
> like priority inheritance and such, where bouncing a thread's (or its
> own vCPU thread's) priority most definitely makes sense.
> 
> So you are requesting that we add a bad user space interface to allow
> lazy priority management from a thread so that we can use it in the
> proper use case of a vCPU?

Now I stated the above thinking you wanted to add a generic interface
for all user space. But perhaps there is a way to get this to be done
by the scheduler itself. But its use case is still only for VMs.

We could possibly add a new sched class that has a dynamic priority.
That is, it can switch between other sched classes. A vCPU thread could
be assigned to this class from inside the kernel (via a virtio device)
where this is not exposed to user space at all. Then the virtio device
would control the mapping of a page between the vCPU thread and the
host kernel. When this task gets scheduled, it can call into the code
that handles the dynamic priority. This will require buy-in from the
scheduler folks.

This could also handle the case of a vCPU being woken up by an
interrupt, as the hooks could be there on the wakeup side as well.

Thoughts?

-- Steve

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 14:52                         ` Steven Rostedt
@ 2024-07-17 15:20                           ` Steven Rostedt
  2024-07-17 17:03                             ` Suleiman Souhlal
  2024-07-17 20:57                             ` Joel Fernandes
  0 siblings, 2 replies; 42+ messages in thread
From: Steven Rostedt @ 2024-07-17 15:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joel Fernandes, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, 17 Jul 2024 10:52:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We could possibly add a new sched class that has a dynamic priority.

It wouldn't need to be a new sched class. This could work with just a
task_struct flag.

It would only need to be checked in pick_next_task() and
try_to_wake_up(). It would require that the shared memory be allocated
by the host kernel and always be present (unlike rseq). But with this
coming from a virtio device driver, that shouldn't be a problem.

If this flag is set on current, then the first thing that
pick_next_task() should do is see if it needs to change current's
priority and policy (via a callback to the driver). Then it can decide
which task to pick; if current was boosted, it could very well be the
next task again.

In try_to_wake_up(), if the task waking up has this flag set, it could
be boosted via an option set by the virtio device. This would allow it
to preempt the current process if necessary and get on the CPU. Then
the guest would be required to lower its priority if the boost was not
needed.

Hmm, this could work.
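A rough model of that flag-based flow (the field names, the dict standing in for the shared page, and the driver callback are all made up for illustration):

```python
# Toy model: a per-task flag makes pick_next_task() refresh the task's
# priority from the driver (i.e. the guest-written shared page) before
# deciding what to run. Only the control flow is modeled here.

class Task:
    def __init__(self, name, prio, pvsched=False):
        self.name, self.prio, self.pvsched = name, prio, pvsched

def driver_current_prio(shared_page):
    # Stand-in for the virtio driver reading the guest's shared page.
    return shared_page["guest_prio"]

def pick_next_task(current, runqueue, shared_page):
    # First: if current carries the flag, let the driver update its
    # priority/policy before anything else.
    if current.pvsched:
        current.prio = driver_current_prio(shared_page)
    # Then pick normally; if current was boosted it may win again.
    return max([current] + runqueue, key=lambda t: t.prio)

vcpu = Task("vcpu0", prio=0, pvsched=True)
other = Task("kworker", prio=10)
page = {"guest_prio": 50}  # guest raised its priority in shared memory

# The boost is observed at pick time, so the vCPU stays on the CPU.
assert pick_next_task(vcpu, [other], page).name == "vcpu0"

# Once the guest lowers it again, the ordinary candidate wins.
page["guest_prio"] = 0
assert pick_next_task(vcpu, [other], page).name == "kworker"
```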

-- Steve



* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 15:20                           ` Steven Rostedt
@ 2024-07-17 17:03                             ` Suleiman Souhlal
  2024-07-17 20:57                             ` Joel Fernandes
  1 sibling, 0 replies; 42+ messages in thread
From: Suleiman Souhlal @ 2024-07-17 17:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sean Christopherson, Joel Fernandes, Mathieu Desnoyers,
	Vineeth Remanan Pillai, Ben Segall, Borislav Petkov,
	Daniel Bristot de Oliveira, Dave Hansen, Dietmar Eggemann,
	H . Peter Anvin, Ingo Molnar, Juri Lelli, Mel Gorman,
	Paolo Bonzini, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Thu, Jul 18, 2024 at 12:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 10:52:33 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > We could possibly add a new sched class that has a dynamic priority.
>
> It wouldn't need to be a new sched class. This could work with just a
> task_struct flag.
>
> It would only need to be checked in pick_next_task() and
> try_to_wake_up(). It would require that the shared memory has to be
> allocated by the host kernel and always present (unlike rseq). But this
> coming from a virtio device driver, that shouldn't be a problem.
>
> If this flag is set on current, then the first thing that
> pick_next_task() should do is to see if it needs to change current's
> priority and policy (via a callback to the driver). And then it can
> decide what task to pick, as if current was boosted, it could very well
> be the next task again.
>
> In try_to_wake_up(), if the task waking up has this flag set, it could
> boost it via an option set by the virtio device. This would allow it to
> preempt the current process if necessary and get on the CPU. Then the
> guest would be required to lower its priority if the boost was not
> needed.
>
> Hmm, this could work.

For what it's worth, I proposed something somewhat conceptually similar before:
https://lore.kernel.org/kvm/CABCjUKBXCFO4-cXAUdbYEKMz4VyvZ5hD-1yP9H7S7eL8XsqO-g@mail.gmail.com/T/

Guest vCPUs would report their preempt_count to the host and the host
would use that to try not to preempt a vCPU that was in a critical
section (with some simple safeguards in case the guest was not well
behaved).
(It worked by adding a "may_preempt" notifier that would get called in
schedule(), whose return value would determine whether we'd try to
schedule away from current or not.)

It was VM specific, but the same idea could be made to work for
generic userspace tasks.
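A toy model of that scheme, with a hypothetical deferral cap standing in for the safeguards (the real patch linked above differs in detail):

```python
# Sketch of the may_preempt idea: the guest publishes its preempt_count,
# and the host declines to preempt a vCPU that is in a critical section,
# with a cap so a misbehaving guest cannot dodge preemption forever.

PREEMPT_DEFER_LIMIT = 3  # hypothetical safeguard value

class VcpuState:
    def __init__(self):
        self.preempt_count = 0  # written by the guest into shared memory
        self.deferrals = 0      # host-side safeguard counter

def may_preempt(v):
    # Called from schedule(): returns whether to schedule away from v.
    if v.preempt_count == 0:
        v.deferrals = 0
        return True   # not in a critical section, preempt normally
    if v.deferrals >= PREEMPT_DEFER_LIMIT:
        return True   # guest held the CPU too long, preempt anyway
    v.deferrals += 1
    return False      # defer this preemption

v = VcpuState()
assert may_preempt(v)       # outside a critical section: preemptible

v.preempt_count = 1         # guest enters a critical section
assert not may_preempt(v)   # the first few attempts are deferred
assert not may_preempt(v)
assert not may_preempt(v)
assert may_preempt(v)       # the safeguard kicks in on the 4th attempt
```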

-- Suleiman

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 15:20                           ` Steven Rostedt
  2024-07-17 17:03                             ` Suleiman Souhlal
@ 2024-07-17 20:57                             ` Joel Fernandes
  2024-07-17 21:00                               ` Steven Rostedt
  1 sibling, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2024-07-17 20:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sean Christopherson, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 10:52:33 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > We could possibly add a new sched class that has a dynamic priority.
>
> It wouldn't need to be a new sched class. This could work with just a
> task_struct flag.
>
> It would only need to be checked in pick_next_task() and
> try_to_wake_up(). It would require that the shared memory has to be
> allocated by the host kernel and always present (unlike rseq). But this
> coming from a virtio device driver, that shouldn't be a problem.

Problem is it's not only about preemption: if we boost the vCPU to the
RT class and another RT task is already running on the same CPU, then
the vCPU thread should get migrated to a different CPU. I don't think
we can do that if we just change the priority in place without doing a
proper sched_setscheduler() / sched_setattr() and letting the
scheduler handle things. Vineeth's patches were doing that in VMEXIT.
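For reference, the "set it externally and let the scheduler deal with it" path is just an ordinary policy change on the vCPU thread, which the scheduler then follows up with preemption and load balancing as usual. A minimal sketch via Python's wrapper around sched_setscheduler(2) (Linux only; SCHED_OTHER is used here because switching to an RT class needs CAP_SYS_NICE):

```python
import os

# Changing policy/priority through the normal API means the scheduler's
# own machinery (preemption, migration, load balancing) reacts to the
# change, which a bare in-place priority tweak would bypass.

pid = 0  # 0 means the calling thread

os.sched_setscheduler(pid, os.SCHED_OTHER, os.sched_param(0))
assert os.sched_getscheduler(pid) == os.SCHED_OTHER

# Boosting would be the same call with os.SCHED_FIFO and a nonzero
# sched_param, done by the host on the vCPU thread at VM-exit time;
# it is not run here because it requires privilege.
```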

 - Joel

* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 20:57                             ` Joel Fernandes
@ 2024-07-17 21:00                               ` Steven Rostedt
  2024-07-17 21:09                                 ` Joel Fernandes
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2024-07-17 21:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Sean Christopherson, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, 17 Jul 2024 16:57:43 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 17 Jul 2024 10:52:33 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > > We could possibly add a new sched class that has a dynamic priority.  
> >
> > It wouldn't need to be a new sched class. This could work with just a
> > task_struct flag.
> >
> > It would only need to be checked in pick_next_task() and
> > try_to_wake_up(). It would require that the shared memory has to be
> > allocated by the host kernel and always present (unlike rseq). But this
> > coming from a virtio device driver, that shouldn't be a problem.  
> 
> Problem is its not only about preemption, if we set the vCPU boosted
> to RT class, and another RT task is already running on the same CPU,

That can only happen on wakeup (interrupt). As the point of lazy
priority changing, it is only done when the vCPU is running.

-- Steve


> then the vCPU thread should get migrated to different CPU. We can't do
> that I think if we just did it without doing a proper
> sched_setscheduler() / sched_setattr() and let the scheduler handle
> things.  Vineeth's patches was doing that in VMEXIT..



* Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
  2024-07-17 21:00                               ` Steven Rostedt
@ 2024-07-17 21:09                                 ` Joel Fernandes
  0 siblings, 0 replies; 42+ messages in thread
From: Joel Fernandes @ 2024-07-17 21:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sean Christopherson, Mathieu Desnoyers, Vineeth Remanan Pillai,
	Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira,
	Dave Hansen, Dietmar Eggemann, H . Peter Anvin, Ingo Molnar,
	Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
	Masami Hiramatsu, himadrics, kvm, linux-kernel, x86, graf,
	drjunior.org

On Wed, Jul 17, 2024 at 5:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 16:57:43 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 17 Jul 2024 10:52:33 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > We could possibly add a new sched class that has a dynamic priority.
> > >
> > > It wouldn't need to be a new sched class. This could work with just a
> > > task_struct flag.
> > >
> > > It would only need to be checked in pick_next_task() and
> > > try_to_wake_up(). It would require that the shared memory has to be
> > > allocated by the host kernel and always present (unlike rseq). But this
> > > coming from a virtio device driver, that shouldn't be a problem.
> >
> > Problem is its not only about preemption, if we set the vCPU boosted
> > to RT class, and another RT task is already running on the same CPU,
>
> That can only happen on wakeup (interrupt). As the point of lazy
> priority changing, it is only done when the vCPU is running.

True, but I think it will miss stuff related to load balancing, say if
the "boost" is a higher CFS priority. Then someone has to pull the
running vCPU thread to another CPU, etc. IMO it is better to set the
priority/class externally and let the scheduler deal with it. Let me
think some more about your idea though.

thanks,

 - Joel

Thread overview: 42+ messages
2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
2024-04-08 13:57   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm Vineeth Pillai (Google)
2024-04-08 13:58   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs Vineeth Pillai (Google)
2024-04-08 13:59   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 4/5] pvsched: bpf support for pvsched Vineeth Pillai (Google)
2024-04-08 14:00   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver Vineeth Pillai (Google)
2024-04-08 14:01   ` Vineeth Remanan Pillai
2024-04-08 13:54 ` [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Remanan Pillai
2024-05-01 15:29 ` Sean Christopherson
2024-05-02 13:42   ` Vineeth Remanan Pillai
2024-06-24 11:01     ` Vineeth Remanan Pillai
2024-07-12 12:57       ` Joel Fernandes
2024-07-12 14:09         ` Mathieu Desnoyers
2024-07-12 14:48           ` Sean Christopherson
2024-07-12 15:32             ` Mathieu Desnoyers
2024-07-12 16:14               ` Sean Christopherson
2024-07-12 16:30               ` Steven Rostedt
2024-07-12 16:39                 ` Sean Christopherson
2024-07-12 17:02                   ` Steven Rostedt
2024-07-12 16:24           ` Steven Rostedt
2024-07-12 16:44             ` Sean Christopherson
2024-07-12 16:50               ` Joel Fernandes
2024-07-12 17:08                 ` Sean Christopherson
2024-07-12 17:14                   ` Steven Rostedt
2024-07-12 17:12               ` Steven Rostedt
2024-07-16 23:44                 ` Sean Christopherson
2024-07-17  0:13                   ` Steven Rostedt
2024-07-17  5:16                   ` Joel Fernandes
2024-07-17 14:14                     ` Sean Christopherson
2024-07-17 14:36                       ` Steven Rostedt
2024-07-17 14:52                         ` Steven Rostedt
2024-07-17 15:20                           ` Steven Rostedt
2024-07-17 17:03                             ` Suleiman Souhlal
2024-07-17 20:57                             ` Joel Fernandes
2024-07-17 21:00                               ` Steven Rostedt
2024-07-17 21:09                                 ` Joel Fernandes
2024-07-12 16:24           ` Joel Fernandes
2024-07-12 17:28             ` Mathieu Desnoyers
