[RFC PATCH 0/4] Scheduler time slice extension

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Scheduler time slice extension
@ 2024-11-13  0:01 Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 1/4] Introduce per thread user-kernel shared structure Prakash Sangappa
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13  0:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan, prakash.sangappa

A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, can have undesirable affect on
performance. Having a way for the thread to request additional execution
time on cpu so that it can complete the critical section will be useful in
such scenario.  The request can be made by setting a bit in mapped memory,
such that the kernel can also access to check and grant extra execution time
on the cpu. 

There have been couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
In patch thread [1] posted by Steven Rostedt, there is ample discussion about
need for this feature. 

However, the concern has been that this can lead to abuse. One extra tick can
be a long time(about a millisec or more). Peter Zijlstra in response posted a 
prototype solution[3], which grants 50us execution time extension only.
This is achieved with the help of a timer started on that cpu at the time of
granting extra execution time. When the timer fires the thread will be
preempted, if still running. 

This patch set implements the above mentioned 50us extension time as posted
by Peter. But instead of using restartable sequences as API to set the flag
to request the extension, this patch proposes a new API with use of a per
thread shared structure implementation described below. This shared structure
is accessible in both users pace and kernel. The user thread will set the
flag in this shared structure to request execution time extension.

We tried this change with a database workload. Here are the results, comparing
runs with and without use of the 50us scheduler time slice extension. This
clearly shows there is benefit.

Test results:
=============
	Test system 2 socket AMD Genoa

	Lock table test:- a simple database test to grab table lock(spin lock).
	   Simulates sql query executions.
	   300 clients + 400 cpu hog tasks to generate load.

		Without extension : 182K SQL exec/sec
		With extension    : 262K SQL exec/sec
	   44% improvement.

	Swingbench - standard database benchmark 
	   Cached(database files on tmpfs) run, with 1000 clients.

		Without extension : 99K SQL exec/sec
		with extension    : 153K SQL exec/sec
	   55% improvement in throughput.

Shared structure mechanism:
==========================

A per thread structure is allocated from a page shared mapped between user
space and kernel. This will be useful in sharing thread specific information
between user space and kernel without the need for making system calls in
latency sensitive code path.

Implementation:

A new system call is added to request use of a shared structure by a user
thread. Kernel will allocate page(s), shared mapped with user space in
which per-thread shared structures will be allocated. These structures
are padded to 128 bytes. Multiple such shared structures will be allocated
from that page(upto 32 per 4k page) to accommodate requests from multiple
threads in a process, thus avoiding the need to allocate one page per thread. 
Additional pages are allocated as needed to accommodate more thread's
requesting the shared structure. The number of pages required will depend
on the number of threads of the process requesting/using shared structure.

These pages are pinned and so the kernel can access/update the shared
structure thru the kernel address, without the need for
copy_from_user/copy_to_user() calls. 

The system call will return a pointer(user address) to the per thread shared
structure. Application threads could save this per thread pointer in a TLS
variable and reference it.

Request for scheduler time extension described above, will be a use case 
of this shared structure API. The user thread will request execution time 
extension by setting a flag in the shared structure.

Additional members can be appended to the shared structure to implement 
new features that address other use cases. For example sharing thread's
time spent off cpu and on run queue(described in [4]). These would help
the user thread measure cpu time consumption accross some operation,
without having to call getrusage() system call or read /proc/pid/schedstat
frequently.

Another use case is to share user thread's state 'on' or 'off' cpu thru the
shared structure - which can be useful in implementing adaptive waits in
user space(as discussed in [1]). The waiter thread checks if the owner 
of the resource(lock) is 'on' cpu to continue spinning.

API:
===
	The system call 
	int task_getshared(int option, int flags, void __user *uaddr)

	Only supports TASK_SHAREDINFO option for now, 'flags' are not used.

	/* option */
	#define TASK_SHAREDINFO 1

	struct task_sharedinfo {
		volatile unsigned short sched_delay;    
	};

	#define TASK_PREEMPT_DELAY_REQ     1
	#define TASK_PREEMPT_DELAY_GRANTED 2
	#define TASK_PREEMPT_DELAY_DENIED  3

	Following call:
	__thread struct task_sharedinfo *ts;
	task_getshared(TASK_SHAREDINFO, 0, &ts);

	User task sets 'sched_delay' member to TASK_PREEMPT_DELAY_REQ to
	request scheduler time extension. Kernel sets 'sched_delay' to 
	TASK_PREEMPT_DELAY_GRANTED to indicate if the request for execution 
	time extension was granted or TASK_PREEMPT_DELAY_DENIED if denied.

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (4):
  Introduce per thread user-kernel shared structure
  Scheduler time extention
  Indicate if schedular preemption delay request is granted
  Add scheduler preemption delay granted stats

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/entry-common.h           |  10 +-
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |  30 ++
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  29 ++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |  15 +-
 kernel/fork.c                          |  12 +
 kernel/sched/core.c                    |  28 ++
 kernel/sched/debug.c                   |   4 +
 kernel/sched/syscalls.c                |   7 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 +
 mm/task_shared.c                       | 366 +++++++++++++++++++++++++
 19 files changed, 593 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.43.5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/4] Introduce per thread user-kernel shared structure
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
@ 2024-11-13  0:01 ` Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 2/4] Scheduler time extention Prakash Sangappa
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13  0:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan, prakash.sangappa

A structure per thread is allocated from a page that is shared mapped
between user space and kernel, to be used as a means for faster
communication. This will facilitate sharing thread specific information
between user space and kernel, which can be accessed by the application
without requiring system calls in latency sensitive code path.

This change adds a new system call, which will allocate the shared
structure and return its mapped user address. Multiple such structures
will be allocated on a page to accommodate requests from different
threads of a multithreaded process. Available space on a page is managed
using a bitmap. When a thread exits, the shared structure is freed and
can get reused for another thread that requests it. More pages will be
allocated and used as needed based on the number of threads requesting
use of shared structures. These pages are all freed when the process
exits.

Each of these per thread shared structures are rounded to 128 bytes.
Available space in this structure can be used to add structure members
to implement new features.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |   5 +
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  18 ++
 init/Kconfig                           |  10 +
 kernel/fork.c                          |  12 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 ++
 mm/task_shared.c                       | 306 +++++++++++++++++++++++++
 14 files changed, 441 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 534c74b14fab..3838fdc3d292 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -468,3 +468,4 @@
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
 462	i386	mseal 			sys_mseal
+463	i386	task_getshared		sys_task_getshared
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7093ee21c0d1..5bc4ecd74117 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -386,6 +386,7 @@
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
 462 	common  mseal			sys_mseal
+463	common	task_getshared		sys_task_getshared
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..d32d92a47c34 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1030,6 +1030,10 @@ struct mm_struct {
 #endif
 		} lru_gen;
 #endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_TASKSHARED
+		/* user shared pages */
+		void *usharedpg;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee9d8aecc5b5..1ca7d4efa932 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1593,6 +1593,11 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef	CONFIG_TASKSHARED
+	/* user shared struct */
+	void *task_ushrd;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 5758104921e6..3ca79244aa0b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -997,6 +997,8 @@ asmlinkage long sys_spu_create(const char __user *name,
 		unsigned int flags, umode_t mode, int fd);
 
 
+asmlinkage long sys_task_getshared(long opt, long flags, void __user *uaddr);
+
 /*
  * Deprecated system calls which are still defined in
  * include/uapi/asm-generic/unistd.h and wanted by >= 1 arch
diff --git a/include/linux/task_shared.h b/include/linux/task_shared.h
new file mode 100644
index 000000000000..983fdae47308
--- /dev/null
+++ b/include/linux/task_shared.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef	__TASK_SHARED_H__
+#define	__TASK_SHARED_H__
+
+#include <linux/mm_types.h>
+#include <uapi/linux/task_shared.h>
+
+#ifdef CONFIG_TASKSHARED
+/*
+ * Track user-kernel shared pages referred by mm_struct
+ */
+struct ushared_pages {
+	struct list_head plist;
+	struct list_head frlist;
+	unsigned long pcount;
+};
+
+
+/*
+ * Following is used for cacheline aligned allocations of shared structures
+ * within a page.
+ */
+union task_shared {
+	struct task_sharedinfo ts;
+	char    s[128];
+};
+
+/*
+ * Struct to track per page slots
+ */
+struct ushared_pg {
+	struct list_head list;
+	struct list_head fr_list;
+	struct page *pages[2];
+	u64 bitmap; /* free slots */
+	int slot_count;
+	unsigned long kaddr;
+	unsigned long vaddr; /* user address */
+	struct vm_special_mapping ushrd_mapping;
+};
+
+/*
+ * Following struct is referred by struct task_struct, contains mapped address
+ * of per thread shared structure allocated.
+ */
+struct task_ushrd_struct {
+	union task_shared *kaddr; /* kernel address */
+	union task_shared *uaddr; /* user address */
+	struct ushared_pg *upg;
+};
+
+extern void task_ushared_free(struct task_struct *t);
+extern void mm_ushared_clear(struct mm_struct *mm);
+#else	/* !CONFIG_TASKSHARED */
+static inline void task_ushared_free(struct task_struct *t)
+{
+}
+
+static inline void mm_ushared_clear(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_TASKSHARED */
+#endif /* __TASK_SHARED_H__ */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bf6148cac2b..7f6367616fb5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -840,9 +840,11 @@ __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
 #define __NR_mseal 462
 __SYSCALL(__NR_mseal, sys_mseal)
+#define __NR_task_getshared 463
+__SYSCALL(__NR_task_getshared, sys_task_getshared)
 
 #undef __NR_syscalls
-#define __NR_syscalls 463
+#define __NR_syscalls 464
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/task_shared.h b/include/uapi/linux/task_shared.h
new file mode 100644
index 000000000000..a07902c57380
--- /dev/null
+++ b/include/uapi/linux/task_shared.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef LINUX_TASK_SHARED_H
+#define LINUX_TASK_SHARED_H
+
+/*
+ * Per task user-kernel mapped structure
+ */
+
+/*
+ * Option to request allocation of struct task_sharedinfo shared structure,
+ * used for sharing per thread information between userspace and kernel.
+ */
+#define TASK_SHAREDINFO 1
+
+struct task_sharedinfo {
+		int version;
+};
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index a7666e186064..1f84851d1b7e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1803,6 +1803,16 @@ config DEBUG_RSEQ
 	  Enable extra debugging checks for the rseq system call.
 
 	  If unsure, say N.
+config TASKSHARED
+	bool "Enable task getshared syscall" if EXPERT
+	default y
+	help
+	  Enable mechanism to provide per thread shared structure mapped
+	  between userspace<->kernel for faster communication. Used
+	  for sharing per thread information.
+
+	  If unsure, say Y.
+
 
 config CACHESTAT_SYSCALL
 	bool "Enable cachestat() system call" if EXPERT
diff --git a/kernel/fork.c b/kernel/fork.c
index 22f43721d031..b40792c84718 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -112,6 +112,7 @@
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <linux/task_shared.h>
 
 #include <trace/events/sched.h>
 
@@ -1126,6 +1127,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_stack;
 
+#ifdef CONFIG_TASKSHARED
+	/* task's shared structures are not inherited across fork */
+	tsk->task_ushrd =  NULL;
+#endif
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -1282,6 +1288,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
 	mm->pmd_huge_pte = NULL;
 #endif
+
+#ifdef CONFIG_TASKSHARED
+	mm->usharedpg = NULL;
+#endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
 
@@ -1346,6 +1356,7 @@ static inline void __mmput(struct mm_struct *mm)
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
 	exit_mmap(mm);
+	mm_ushared_clear(mm);
 	mm_put_huge_zero_folio(mm);
 	set_mm_exe_file(mm, NULL);
 	if (!list_empty(&mm->mmlist)) {
@@ -1605,6 +1616,7 @@ static int wait_for_vfork_done(struct task_struct *child,
 static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
 	uprobe_free_utask(tsk);
+	task_ushared_free(tsk);
 
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..9039a69e95ac 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,5 +390,7 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+/* task shared */
+COND_SYSCALL(task_getshared);
 
 COND_SYSCALL(uretprobe);
diff --git a/mm/Makefile b/mm/Makefile
index d5639b036166..007743b40f87 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_TASKSHARED) += task_shared.o
diff --git a/mm/mmap.c b/mm/mmap.c
index 79d541f1502b..05b947fac55b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2086,6 +2086,18 @@ static int special_mapping_split(struct vm_area_struct *vma, unsigned long addr)
 	return -EINVAL;
 }
 
+/*
+ * XXX: Having a mkwrite hook prevents attempt to update file time on
+ * page write fault, these mappings do not have a file structure associated.
+ */
+static vm_fault_t special_mapping_page_mkwrite(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	lock_page(page);
+
+	return VM_FAULT_LOCKED;
+}
+
 static const struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
@@ -2094,6 +2106,7 @@ static const struct vm_operations_struct special_mapping_vmops = {
 	/* vDSO code relies that VVAR can't be accessed remotely */
 	.access = NULL,
 	.may_split = special_mapping_split,
+	.page_mkwrite = special_mapping_page_mkwrite,
 };
 
 static vm_fault_t special_mapping_fault(struct vm_fault *vmf)
diff --git a/mm/task_shared.c b/mm/task_shared.c
new file mode 100644
index 000000000000..cea45d913b91
--- /dev/null
+++ b/mm/task_shared.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/uio.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/highmem.h>
+#include <linux/ptrace.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/task_shared.h>
+
+/*
+ * Per thread shared structure mechanism
+ */
+
+#define TASK_USHARED_SLOTS (PAGE_SIZE/sizeof(union task_shared))
+
+/*
+ * Called once to init struct ushared_pages pointer.
+ */
+static int init_mm_ushared(struct mm_struct *mm)
+{
+	struct ushared_pages *usharedpg;
+
+	usharedpg = kmalloc(sizeof(struct ushared_pages), GFP_KERNEL);
+	if (usharedpg == NULL)
+		return 1;
+
+	INIT_LIST_HEAD(&usharedpg->plist);
+	INIT_LIST_HEAD(&usharedpg->frlist);
+	usharedpg->pcount = 0;
+	mmap_write_lock(mm);
+	if (mm->usharedpg == NULL) {
+		mm->usharedpg = usharedpg;
+		usharedpg = NULL;
+	}
+	mmap_write_unlock(mm);
+	if (usharedpg != NULL)
+		kfree(usharedpg);
+	return 0;
+}
+
+static int init_task_ushrd(struct task_struct *t)
+{
+	struct task_ushrd_struct *ushrd;
+
+	ushrd = kzalloc(sizeof(struct task_ushrd_struct), GFP_KERNEL);
+	if (ushrd == NULL)
+		return 1;
+
+	mmap_write_lock(t->mm);
+	if (t->task_ushrd == NULL) {
+		t->task_ushrd = ushrd;
+		ushrd = NULL;
+	}
+	mmap_write_unlock(t->mm);
+	if (ushrd != NULL)
+		kfree(ushrd);
+	return 0;
+}
+
+/*
+ * Called from __mmput(), mm is going away
+ */
+void mm_ushared_clear(struct mm_struct *mm)
+{
+	struct ushared_pg *upg;
+	struct ushared_pg *tmp;
+	struct ushared_pages *usharedpg;
+
+	if (mm == NULL || mm->usharedpg == NULL)
+		return;
+
+	usharedpg = mm->usharedpg;
+	if (list_empty(&usharedpg->frlist))
+		goto out;
+
+	list_for_each_entry_safe(upg, tmp, &usharedpg->frlist, fr_list) {
+		list_del(&upg->fr_list);
+		put_page(upg->pages[0]);
+		kfree(upg);
+	}
+out:
+	kfree(mm->usharedpg);
+	mm->usharedpg = NULL;
+
+}
+
+void task_ushared_free(struct task_struct *t)
+{
+	struct task_ushrd_struct *ushrd = t->task_ushrd;
+	struct mm_struct *mm = t->mm;
+	struct ushared_pages *usharedpg;
+	int slot;
+
+	if (mm == NULL || mm->usharedpg == NULL || ushrd == NULL)
+		return;
+
+	usharedpg = mm->usharedpg;
+	mmap_write_lock(mm);
+
+	if (ushrd->upg == NULL)
+		goto out;
+
+	slot = (unsigned long)((unsigned long)ushrd->uaddr
+		 - ushrd->upg->vaddr) / sizeof(union task_shared);
+	clear_bit(slot, (unsigned long *)(&ushrd->upg->bitmap));
+
+	/* move to head */
+	if (ushrd->upg->slot_count == 0) {
+		list_del(&ushrd->upg->fr_list);
+		list_add(&ushrd->upg->fr_list, &usharedpg->frlist);
+	}
+
+	ushrd->upg->slot_count++;
+
+	ushrd->uaddr = ushrd->kaddr = NULL;
+	ushrd->upg = NULL;
+
+out:
+	t->task_ushrd = NULL;
+	mmap_write_unlock(mm);
+	kfree(ushrd);
+}
+
+/* map shared page */
+static int task_shared_add_vma(struct ushared_pg *pg)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm =  current->mm;
+	unsigned long ret = 1;
+
+
+	if (!pg->vaddr) {
+		/* Try to map as high as possible, this is only a hint. */
+		pg->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+					PAGE_SIZE, 0, 0);
+		if (pg->vaddr & ~PAGE_MASK) {
+			ret = 0;
+			goto fail;
+		}
+	}
+
+	vma = _install_special_mapping(mm, pg->vaddr, PAGE_SIZE,
+			VM_SHARED|VM_READ|VM_MAYREAD|
+			VM_WRITE|VM_MAYWRITE|VM_DONTCOPY|VM_IO,
+			&pg->ushrd_mapping);
+	if (IS_ERR(vma)) {
+		ret = 0;
+		pg->vaddr = 0;
+		goto fail;
+	}
+
+	pg->kaddr = (unsigned long)page_address(pg->pages[0]);
+fail:
+	return ret;
+}
+
+/*
+ * Allocate a page, map user address and add to freelist
+ */
+static struct ushared_pg *ushared_allocpg(void)
+{
+
+	struct ushared_pg *pg;
+	struct mm_struct *mm = current->mm;
+	struct ushared_pages *usharedpg = mm->usharedpg;
+
+	if (usharedpg == NULL)
+		return NULL;
+	pg = kzalloc(sizeof(*pg), GFP_KERNEL);
+
+	if (unlikely(!pg))
+		return NULL;
+	pg->ushrd_mapping.name = "[task_shared]";
+	pg->ushrd_mapping.fault = NULL;
+	pg->ushrd_mapping.pages = pg->pages;
+	pg->pages[0] = alloc_page(GFP_KERNEL);
+	if (!pg->pages[0])
+		goto out;
+	pg->pages[1] = NULL;
+	pg->bitmap = 0;
+
+	/*
+	 * page size should be 4096 or 8192
+	 */
+	pg->slot_count = TASK_USHARED_SLOTS;
+
+	mmap_write_lock(mm);
+	if (task_shared_add_vma(pg)) {
+		list_add(&pg->fr_list, &usharedpg->frlist);
+		usharedpg->pcount++;
+		mmap_write_unlock(mm);
+		return pg;
+	}
+	mmap_write_unlock(mm);
+
+	__free_page(pg->pages[0]);
+out:
+	kfree(pg);
+	return NULL;
+}
+
+
+/*
+ * Allocate task_shared struct for calling thread.
+ */
+static int task_ushared_alloc(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct ushared_pg *ent = NULL;
+	struct task_ushrd_struct *ushrd;
+	struct ushared_pages *usharedpg;
+	int tryalloc = 0;
+	int slot = -1;
+	int ret = -ENOMEM;
+
+	if (mm->usharedpg == NULL && init_mm_ushared(mm))
+		return ret;
+
+	if (current->task_ushrd == NULL && init_task_ushrd(current))
+		return ret;
+
+	usharedpg = mm->usharedpg;
+	ushrd = current->task_ushrd;
+repeat:
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	ent = list_empty(&usharedpg->frlist) ? NULL :
+		list_entry(usharedpg->frlist.next,
+		struct ushared_pg, fr_list);
+
+	if (ent == NULL || ent->slot_count == 0) {
+		if (tryalloc == 0) {
+			mmap_write_unlock(mm);
+			(void)ushared_allocpg();
+			tryalloc = 1;
+			goto repeat;
+		} else {
+			ent = NULL;
+		}
+	}
+
+	if (ent) {
+		slot = find_first_zero_bit((unsigned long *)(&ent->bitmap),
+		  TASK_USHARED_SLOTS);
+		BUG_ON(slot >=  TASK_USHARED_SLOTS);
+
+		set_bit(slot, (unsigned long *)(&ent->bitmap));
+
+		ushrd->uaddr = (union task_shared *)(ent->vaddr +
+		  (slot * sizeof(union task_shared)));
+		ushrd->kaddr = (union task_shared *)(ent->kaddr +
+		  (slot * sizeof(union task_shared)));
+		ushrd->upg = ent;
+		ent->slot_count--;
+		/* move it to tail */
+		if (ent->slot_count == 0) {
+			list_del(&ent->fr_list);
+			list_add_tail(&ent->fr_list, &usharedpg->frlist);
+		}
+
+	       ret = 0;
+	}
+
+	mmap_write_unlock(mm);
+	return ret;
+}
+
+
+/*
+ * Get Task Shared structure, allocate if needed and return mapped user address.
+ */
+static long task_getshared(u64 opt, u64 flags, void __user *uaddr)
+{
+	struct task_ushrd_struct *ushrd = current->task_ushrd;
+
+	/* currently only TASK_SHAREDINFO supported */
+	if (opt != TASK_SHAREDINFO)
+		return (-EINVAL);
+
+	/* if a shared structure is already allocated, return address */
+	if (ushrd != NULL && ushrd->upg != NULL) {
+		if (copy_to_user(uaddr, &ushrd->uaddr,
+			sizeof(struct  task_sharedinfo *)))
+			return (-EFAULT);
+		return 0;
+	}
+
+	task_ushared_alloc();
+	ushrd = current->task_ushrd;
+	if (ushrd != NULL && ushrd->upg != NULL) {
+		if (copy_to_user(uaddr, &ushrd->uaddr,
+			sizeof(struct task_sharedinfo *)))
+			return (-EFAULT);
+		return 0;
+	}
+	return (-ENOMEM);
+}
+
+
+SYSCALL_DEFINE3(task_getshared, u64, opt, u64, flags, void __user *, uaddr)
+{
+	return task_getshared(opt, flags, uaddr);
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/4] Scheduler time extention
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 1/4] Introduce per thread user-kernel shared structure Prakash Sangappa
@ 2024-11-13  0:01 ` Prakash Sangappa
  2024-11-13  3:57   ` K Prateek Nayak
  2024-11-13  0:01 ` [RFC PATCH 3/4] Indicate if schedular preemption delay request is granted Prakash Sangappa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13  0:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan, prakash.sangappa

Introduce support for a thread to request extending execution time on
the cpu, when holding locks in user space. Adds a member 'sched_delay' to
the per thread shared mapped structure. Request for cpu execution time
extention is made by the thread by updating 'sched_delay' member.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/entry-common.h     | 10 +++++--
 include/linux/sched.h            | 17 +++++++++++
 include/uapi/linux/task_shared.h |  2 +-
 kernel/entry/common.c            | 15 ++++++----
 kernel/sched/core.c              | 16 ++++++++++
 kernel/sched/syscalls.c          |  7 +++++
 mm/task_shared.c                 | 50 ++++++++++++++++++++++++++++++++
 7 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..904f5cdfe0b7 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -302,7 +302,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  */
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+				     unsigned long ti_work, bool irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
@@ -314,7 +314,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs,
+						bool irq)
 {
 	unsigned long ti_work;
 
@@ -325,7 +326,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
+
+	if (irq)
+		taskshrd_delay_resched_fini();
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1ca7d4efa932..b53e7a878a01 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -326,6 +326,7 @@ extern int __must_check io_schedule_prepare(void);
 extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
+extern void hrtick_local_start(u64 delay);
 
 /**
  * struct prev_cputime - snapshot of system and user cputime
@@ -957,6 +958,9 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_TASKSHARED
+	unsigned			taskshrd_sched_delay:1;
+#endif
 #ifdef CONFIG_RT_MUTEXES
 	unsigned			sched_rt_mutex:1;
 #endif
@@ -2186,6 +2190,19 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_TASKSHARED
+
+extern bool taskshrd_delay_resched(void);
+extern void taskshrd_delay_resched_fini(void);
+extern void taskshrd_delay_resched_tick(void);
+#else
+
+static inline bool taskshrd_delay_resched(void) { return false; }
+static inline void taskshrd_delay_resched_fini(void) { }
+static inline void taskshrd_delay_resched_tick(void) { }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/task_shared.h b/include/uapi/linux/task_shared.h
index a07902c57380..6e4c664eea60 100644
--- a/include/uapi/linux/task_shared.h
+++ b/include/uapi/linux/task_shared.h
@@ -13,6 +13,6 @@
 #define TASK_SHAREDINFO 1
 
 struct task_sharedinfo {
-		int version;
+		volatile unsigned short sched_delay;
 };
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 11ec8320b59d..0e0360e8c127 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -89,7 +89,8 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
  * @ti_work:	TIF work flags as read by the caller
  */
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+						     unsigned long ti_work,
+						     bool irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -99,8 +100,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+		if (ti_work & _TIF_NEED_RESCHED) {
+			if (irq && taskshrd_delay_resched())
+				clear_tsk_need_resched(current);
+			else
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -208,7 +213,7 @@ static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *reg
 {
 	syscall_exit_to_user_mode_prepare(regs);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, false);
 }
 
 void syscall_exit_to_user_mode_work(struct pt_regs *regs)
@@ -232,7 +237,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, true);
 	instrumentation_end();
 	exit_to_user_mode();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71b6396db118..713c43491403 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -815,6 +815,7 @@ void update_rq_clock(struct rq *rq)
 
 static void hrtick_clear(struct rq *rq)
 {
+	taskshrd_delay_resched_tick();
 	if (hrtimer_active(&rq->hrtick_timer))
 		hrtimer_cancel(&rq->hrtick_timer);
 }
@@ -830,6 +831,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	taskshrd_delay_resched_tick();
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
@@ -903,6 +906,16 @@ void hrtick_start(struct rq *rq, u64 delay)
 
 #endif /* CONFIG_SMP */
 
+void hrtick_local_start(u64 delay)
+{
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	hrtick_start(rq, delay);
+	rq_unlock(rq, &rf);
+}
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -6645,6 +6658,9 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+#ifdef CONFIG_TASKSHARED
+	prev->taskshrd_sched_delay = 0;
+#endif
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
 #endif
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index d23c34b8b3eb..0904667924d8 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1419,6 +1419,13 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
+
+#ifdef  CONFIG_TASKSHARED
+	if (current->taskshrd_sched_delay) {
+		schedule();
+		return 0;
+	}
+#endif
 	do_sched_yield();
 	return 0;
 }
diff --git a/mm/task_shared.c b/mm/task_shared.c
index cea45d913b91..575b335d6879 100644
--- a/mm/task_shared.c
+++ b/mm/task_shared.c
@@ -268,6 +268,56 @@ static int task_ushared_alloc(void)
 	return ret;
 }
 
+bool taskshrd_delay_resched(void)
+{
+	struct task_struct *t = current;
+	struct task_ushrd_struct *shrdp = t->task_ushrd;
+
+	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+		return false;
+
+	if(shrdp == NULL || shrdp->kaddr == NULL)
+		return false;
+
+	if (t->taskshrd_sched_delay)
+		return false;
+
+	if (!(shrdp->kaddr->ts.sched_delay))
+		return false;
+
+	shrdp->kaddr->ts.sched_delay = 0;
+	t->taskshrd_sched_delay = 1;
+
+	return true;
+}
+
+void taskshrd_delay_resched_fini(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+	/*
+	* IRQs off, guaranteed to return to userspace, start timer on this CPU
+	* to limit the resched-overdraft.
+	*
+	* If your critical section is longer than 50 us you get to keep the
+	* pieces.
+	*/
+	if (t->taskshrd_sched_delay)
+		hrtick_local_start(50 * NSEC_PER_USEC);
+#endif
+}
+
+void taskshrd_delay_resched_tick(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+
+	if (t->taskshrd_sched_delay) {
+		set_tsk_need_resched(t);
+	}
+#endif
+}
+
 
 /*
  * Get Task Shared structure, allocate if needed and return mapped user address.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] Scheduler time extention
  2024-11-13  0:01 ` [RFC PATCH 2/4] Scheduler time extention Prakash Sangappa
@ 2024-11-13  3:57   ` K Prateek Nayak
  2024-11-13 17:40     ` Prakash Sangappa
  0 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2024-11-13  3:57 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan

Hello Prakash,

Full disclaimer: I haven't looked closely at the complete series but ...

On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
> [..snip..]
> @@ -99,8 +100,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   
>   		local_irq_enable_exit_to_user(ti_work);
>   
> -		if (ti_work & _TIF_NEED_RESCHED)
> -			schedule();
> +		if (ti_work & _TIF_NEED_RESCHED) {
> +			if (irq && taskshrd_delay_resched())
> +				clear_tsk_need_resched(current);

Suppose the current task had requested for a delayed resched but an RT
task's wakeup sets the TIF_NEED_RESCHED flag via an IPI, doesn't this
clear the flag indiscriminately and allow the task to run for an
extended amount of time? Am I missing something?

> +			else
> +				schedule();
> +		}
>   
>   		if (ti_work & _TIF_UPROBE)
>   			uprobe_notify_resume(regs);
> @@ -208,7 +213,7 @@ static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *reg
>   {
>   	syscall_exit_to_user_mode_prepare(regs);
>   	local_irq_disable_exit_to_user();
> -	exit_to_user_mode_prepare(regs);
> +	exit_to_user_mode_prepare(regs, false);
>   }
>   
>   void syscall_exit_to_user_mode_work(struct pt_regs *regs)
> @@ -232,7 +237,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
>   noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
>   {
>   	instrumentation_begin();
> -	exit_to_user_mode_prepare(regs);
> +	exit_to_user_mode_prepare(regs, true);
>   	instrumentation_end();
>   	exit_to_user_mode();
>   }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71b6396db118..713c43491403 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -815,6 +815,7 @@ void update_rq_clock(struct rq *rq)
>   
>   static void hrtick_clear(struct rq *rq)
>   {
> +	taskshrd_delay_resched_tick();
>   	if (hrtimer_active(&rq->hrtick_timer))
>   		hrtimer_cancel(&rq->hrtick_timer);
>   }
> @@ -830,6 +831,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>   
>   	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>   
> +	taskshrd_delay_resched_tick();
> +
>   	rq_lock(rq, &rf);
>   	update_rq_clock(rq);
>   	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
> @@ -903,6 +906,16 @@ void hrtick_start(struct rq *rq, u64 delay)
>   
>   #endif /* CONFIG_SMP */
>   
> +void hrtick_local_start(u64 delay)
> +{
> +	struct rq *rq = this_rq();
> +	struct rq_flags rf;
> +
> +	rq_lock(rq, &rf);

You can use guard(rq_lock)(rq) and avoid declaring rf.

> +	hrtick_start(rq, delay);
> +	rq_unlock(rq, &rf);
> +}
> +
>   static void hrtick_rq_init(struct rq *rq)
>   {
>   #ifdef CONFIG_SMP
> @@ -6645,6 +6658,9 @@ static void __sched notrace __schedule(int sched_mode)
>   picked:
>   	clear_tsk_need_resched(prev);
>   	clear_preempt_need_resched();
> +#ifdef CONFIG_TASKSHARED
> +	prev->taskshrd_sched_delay = 0;
> +#endif
>   #ifdef CONFIG_SCHED_DEBUG
>   	rq->last_seen_need_resched_ns = 0;
>   #endif
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index d23c34b8b3eb..0904667924d8 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -1419,6 +1419,13 @@ static void do_sched_yield(void)
>    */
>   SYSCALL_DEFINE0(sched_yield)
>   {
> +
> +#ifdef  CONFIG_TASKSHARED
> +	if (current->taskshrd_sched_delay) {
> +		schedule();
> +		return 0;
> +	}
> +#endif
>   	do_sched_yield();
>   	return 0;
>   }
> diff --git a/mm/task_shared.c b/mm/task_shared.c
> index cea45d913b91..575b335d6879 100644
> --- a/mm/task_shared.c
> +++ b/mm/task_shared.c
> @@ -268,6 +268,56 @@ static int task_ushared_alloc(void)
>   	return ret;
>   }
>   
> +bool taskshrd_delay_resched(void)
> +{
> +	struct task_struct *t = current;
> +	struct task_ushrd_struct *shrdp = t->task_ushrd;
> +
> +	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
> +		return false;
> +
> +	if(shrdp == NULL || shrdp->kaddr == NULL)
> +		return false;
> +
> +	if (t->taskshrd_sched_delay)
> +		return false;
> +
> +	if (!(shrdp->kaddr->ts.sched_delay))
> +		return false;
> +
> +	shrdp->kaddr->ts.sched_delay = 0;
> +	t->taskshrd_sched_delay = 1;
> +
> +	return true;

Perhaps this needs to also check
"rq->nr_running == rq->cfs.h_nr_running" since I believe it only makes
sense for fair tasks to request that extra slice?

-- 
Thanks and Regards,
Prateek

> +}
> +
> +void taskshrd_delay_resched_fini(void)
> +{
> +#ifdef CONFIG_SCHED_HRTICK
> +	struct task_struct *t = current;
> +	/*
> +	* IRQs off, guaranteed to return to userspace, start timer on this CPU
> +	* to limit the resched-overdraft.
> +	*
> +	* If your critical section is longer than 50 us you get to keep the
> +	* pieces.
> +	*/
> +	if (t->taskshrd_sched_delay)
> +		hrtick_local_start(50 * NSEC_PER_USEC);
> +#endif
> +}
> +
> +void taskshrd_delay_resched_tick(void)
> +{
> +#ifdef CONFIG_SCHED_HRTICK
> +	struct task_struct *t = current;
> +
> +	if (t->taskshrd_sched_delay) {
> +		set_tsk_need_resched(t);
> +	}
> +#endif
> +}
> +
>   
>   /*
>    * Get Task Shared structure, allocate if needed and return mapped user address.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] Scheduler time extention
  2024-11-13  3:57   ` K Prateek Nayak
@ 2024-11-13 17:40     ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13 17:40 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	peterz@infradead.org, tglx@linutronix.de, Daniel Jordan



> On Nov 12, 2024, at 7:57 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Prakash,
> 
> Full disclaimer: I haven't looked closely at the complete series but ...
> 
> On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
>> [..snip..]
>> @@ -99,8 +100,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>     local_irq_enable_exit_to_user(ti_work);
>>  - if (ti_work & _TIF_NEED_RESCHED)
>> - schedule();
>> + if (ti_work & _TIF_NEED_RESCHED) {
>> + if (irq && taskshrd_delay_resched())
>> + clear_tsk_need_resched(current);
> 
> Suppose the current task had requested for a delayed resched but an RT
> task's wakeup sets the TIF_NEED_RESCHED flag via an IPI, doesn't this
> clear the flag indiscriminately and allow the task to run for an
> extended amount of time? Am I missing something?

If the scheduler extension delay has already been granted when the IPI from wakeup occurs, then it would not clear the TIF_NEED_RESCHED flag and so would preempt.

[...]

> 
>>  }
>> @@ -830,6 +831,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>>     WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>>  + taskshrd_delay_resched_tick();
>> +
>>   rq_lock(rq, &rf);
>>   update_rq_clock(rq);
>>   rq->curr->sched_class->task_tick(rq, rq->curr, 1);
>> @@ -903,6 +906,16 @@ void hrtick_start(struct rq *rq, u64 delay)
>>    #endif /* CONFIG_SMP */
>>  +void hrtick_local_start(u64 delay)
>> +{
>> + struct rq *rq = this_rq();
>> + struct rq_flags rf;
>> +
>> + rq_lock(rq, &rf);
> 
> You can use guard(rq_lock)(rq) and avoid declaring rf.

Will take a look and address it.

[...]

> 
>>  +bool taskshrd_delay_resched(void)
>> +{
>> + struct task_struct *t = current;
>> + struct task_ushrd_struct *shrdp = t->task_ushrd;
>> +
>> + if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
>> + return false;
>> +
>> + if(shrdp == NULL || shrdp->kaddr == NULL)
>> + return false;
>> +
>> + if (t->taskshrd_sched_delay)
>> + return false;
>> +
>> + if (!(shrdp->kaddr->ts.sched_delay))
>> + return false;
>> +
>> + shrdp->kaddr->ts.sched_delay = 0;
>> + t->taskshrd_sched_delay = 1;
>> +
>> + return true;
> 
> Perhaps this needs to also check
> "rq->nr_running == rq->cfs.h_nr_running" since I believe it only makes
> sense for fair tasks to request that extra slice?

From the discussion in 
https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/

It was ok to have this behavior for all tasks. It could be changed to work only for fair tasks, if there is agreement.


Thanks,
-Prakash

> 
> -- 
> Thanks and Regards,
> Prateek
> 
>> +}
>> +
>> +void taskshrd_delay_resched_fini(void)
>> +{
>> +#ifdef CONFIG_SCHED_HRTICK
>> + struct task_struct *t = current;
>> + /*
>> + * IRQs off, guaranteed to return to userspace, start timer on this CPU
>> + * to limit the resched-overdraft.
>> + *
>> + * If your critical section is longer than 50 us you get to keep the
>> + * pieces.
>> + */
>> + if (t->taskshrd_sched_delay)
>> + hrtick_local_start(50 * NSEC_PER_USEC);
>> +#endif
>> +}
>> +
>> +void taskshrd_delay_resched_tick(void)
>> +{
>> +#ifdef CONFIG_SCHED_HRTICK
>> + struct task_struct *t = current;
>> +
>> + if (t->taskshrd_sched_delay) {
>> + set_tsk_need_resched(t);
>> + }
>> +#endif
>> +}
>> +
>>    /*
>>   * Get Task Shared structure, allocate if needed and return mapped user address.
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/4] Indicate if schedular preemption delay request is granted
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 1/4] Introduce per thread user-kernel shared structure Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 2/4] Scheduler time extention Prakash Sangappa
@ 2024-11-13  0:01 ` Prakash Sangappa
  2024-11-13  0:01 ` [RFC PATCH 4/4] Add scheduler preemption delay granted stats Prakash Sangappa
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13  0:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan, prakash.sangappa

Indicate to user space if the preemption delay request was granted
or denied.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/uapi/linux/task_shared.h | 11 +++++++++++
 mm/task_shared.c                 | 14 +++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/task_shared.h b/include/uapi/linux/task_shared.h
index 6e4c664eea60..a0f7ef0c69d0 100644
--- a/include/uapi/linux/task_shared.h
+++ b/include/uapi/linux/task_shared.h
@@ -15,4 +15,15 @@
 struct task_sharedinfo {
 		volatile unsigned short sched_delay;
 };
+
+/*
+ * 'sched_delay'  values:
+ * TASK_PREEMPT_DELAY_REQ - application sets to request preemption delay.
+ * TASK_PREEMPT_DELAY_GRANTED - set by kernel if granted extended time on cpu.
+ * TASK_PREEMPT_DELAY_DENIED- set by kernel if not granted because the
+ *     application requested preemption delay again within the extended time.
+ */
+#define TASK_PREEMPT_DELAY_REQ		1
+#define TASK_PREEMPT_DELAY_GRANTED	2
+#define TASK_PREEMPT_DELAY_DENIED	3
 #endif
diff --git a/mm/task_shared.c b/mm/task_shared.c
index 575b335d6879..5b8a068a6b44 100644
--- a/mm/task_shared.c
+++ b/mm/task_shared.c
@@ -279,13 +279,21 @@ bool taskshrd_delay_resched(void)
 	if(shrdp == NULL || shrdp->kaddr == NULL)
 		return false;
 
-	if (t->taskshrd_sched_delay)
+	if (t->taskshrd_sched_delay) {
+		if (shrdp->kaddr->ts.sched_delay
+				== TASK_PREEMPT_DELAY_REQ) {
+			/* not granted */
+			shrdp->kaddr->ts.sched_delay
+				= TASK_PREEMPT_DELAY_DENIED;
+		}
 		return false;
+	}
 
-	if (!(shrdp->kaddr->ts.sched_delay))
+	if (shrdp->kaddr->ts.sched_delay != TASK_PREEMPT_DELAY_REQ)
 		return false;
 
-	shrdp->kaddr->ts.sched_delay = 0;
+	/* granted */
+	shrdp->kaddr->ts.sched_delay = TASK_PREEMPT_DELAY_GRANTED;
 	t->taskshrd_sched_delay = 1;
 
 	return true;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 4/4] Add scheduler preemption delay granted stats
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
                   ` (2 preceding siblings ...)
  2024-11-13  0:01 ` [RFC PATCH 3/4] Indicate if schedular preemption delay request is granted Prakash Sangappa
@ 2024-11-13  0:01 ` Prakash Sangappa
  2024-11-13  5:43 ` [RFC PATCH 0/4] Scheduler time slice extension K Prateek Nayak
  2024-11-13 18:50 ` Peter Zijlstra
  5 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13  0:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan, prakash.sangappa

Add scheduler stats to record number of times preemption delay was granted
or denied.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h |  8 ++++++++
 kernel/sched/core.c   | 12 ++++++++++++
 kernel/sched/debug.c  |  4 ++++
 mm/task_shared.c      |  2 ++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b53e7a878a01..e3f5760632f4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -327,6 +327,10 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 extern void hrtick_local_start(u64 delay);
+#ifdef CONFIG_TASKSHARED
+extern void update_stat_preempt_delayed(struct task_struct *t);
+extern void update_stat_preempt_denied(struct task_struct *t);
+#endif
 
 /**
  * struct prev_cputime - snapshot of system and user cputime
@@ -532,6 +536,10 @@ struct sched_statistics {
 	u64				nr_wakeups_affine_attempts;
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
+#ifdef CONFIG_TASKSHARED
+	u64				nr_preempt_delay_granted;
+	u64				nr_preempt_delay_denied;
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 713c43491403..54fa4b68adaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -916,6 +916,18 @@ void hrtick_local_start(u64 delay)
 	rq_unlock(rq, &rf);
 }
 
+#ifdef CONFIG_TASKSHARED
+void update_stat_preempt_delayed(struct task_struct *t)
+{
+	schedstat_inc(t->stats.nr_preempt_delay_granted);
+}
+
+void update_stat_preempt_denied(struct task_struct *t)
+{
+	schedstat_inc(t->stats.nr_preempt_delay_denied);
+}
+#endif
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4a9fbbe843c0..ace7856f13c3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1215,6 +1215,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
 		P_SCHEDSTAT(nr_wakeups_passive);
 		P_SCHEDSTAT(nr_wakeups_idle);
+#ifdef CONFIG_TASKSHARED
+		P_SCHEDSTAT(nr_preempt_delay_granted);
+		P_SCHEDSTAT(nr_preempt_delay_denied);
+#endif
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
diff --git a/mm/task_shared.c b/mm/task_shared.c
index 5b8a068a6b44..35aecc718c8e 100644
--- a/mm/task_shared.c
+++ b/mm/task_shared.c
@@ -285,6 +285,7 @@ bool taskshrd_delay_resched(void)
 			/* not granted */
 			shrdp->kaddr->ts.sched_delay
 				= TASK_PREEMPT_DELAY_DENIED;
+			update_stat_preempt_denied(t);
 		}
 		return false;
 	}
@@ -295,6 +296,7 @@ bool taskshrd_delay_resched(void)
 	/* granted */
 	shrdp->kaddr->ts.sched_delay = TASK_PREEMPT_DELAY_GRANTED;
 	t->taskshrd_sched_delay = 1;
+	update_stat_preempt_delayed(t);
 
 	return true;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
                   ` (3 preceding siblings ...)
  2024-11-13  0:01 ` [RFC PATCH 4/4] Add scheduler preemption delay granted stats Prakash Sangappa
@ 2024-11-13  5:43 ` K Prateek Nayak
  2024-11-13 19:56   ` Prakash Sangappa
  2024-11-13 18:50 ` Peter Zijlstra
  5 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2024-11-13  5:43 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel; +Cc: rostedt, peterz, tglx, daniel.m.jordan

Hello Prakash,

Few questions around the benchmarks.

On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
> [..snip..] 
> 
> Test results:
> =============
> 	Test system 2 socket AMD Genoa
> 
> 	Lock table test:- a simple database test to grab table lock(spin lock).
> 	   Simulates sql query executions.
> 	   300 clients + 400 cpu hog tasks to generate load.

Have you tried running the 300 clients with a nice value of -20 and 400
CPU hogs with the default nice value / nice 19? Does that help this
particular case?

> 
> 		Without extension : 182K SQL exec/sec
> 		With extension    : 262K SQL exec/sec
> 	   44% improvement.
> 
> 	Swingbench - standard database benchmark
> 	   Cached(database files on tmpfs) run, with 1000 clients.

In this case, how does the performance fare when running the clients
under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
"TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

I'm trying to understand what the performance looks like when using
existing features that inhibit preemption vs putting forward the
preemption when the userspace is holding a lock. Feel free to quote
the latency comparisons too if using the existing features lead to
unacceptable avg/tail latencies.

> 
> 		Without extension : 99K SQL exec/sec
> 		with extension    : 153K SQL exec/sec
> 	   55% improvement in throughput.
> 
> [..snip..]
> 

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13  5:43 ` [RFC PATCH 0/4] Scheduler time slice extension K Prateek Nayak
@ 2024-11-13 19:56   ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13 19:56 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	peterz@infradead.org, tglx@linutronix.de, Daniel Jordan



> On Nov 12, 2024, at 9:43 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Prakash,
> 
> Few questions around the benchmarks.
> 
> On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
>> [..snip..] Test results:
>> =============
>> Test system 2 socket AMD Genoa
>> Lock table test:- a simple database test to grab table lock(spin lock).
>>   Simulates sql query executions.
>>   300 clients + 400 cpu hog tasks to generate load.
> 
> Have you tried running the 300 clients with a nice value of -20 and 400
> CPU hogs with the default nice value / nice 19? Does that help this
> particular case?

Have not tried this with the database. Will have to try it.


> 
>> Without extension : 182K SQL exec/sec
>> With extension    : 262K SQL exec/sec
>>   44% improvement.
>> Swingbench - standard database benchmark
>>   Cached(database files on tmpfs) run, with 1000 clients.
> 
> In this case, how does the performance fare when running the clients
> under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
> "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

Not tried SCHED_BATCH. 

With this run, there were about avg ‘166' TASK_PREEMPT_DELAY_GRANTED grants per task, collected from the scheduler stats captured at the end of the run. Test runs for about 5min.  Don't have the count of how many times preempt delay was requested. If the task completes the critical section, it clears the  TASK_PREEMPT_DELAY_REQ flag, so kernel would not see it many cases as this may not be near the end of the time slice. We would have to capture the count in the application. 


> 
> I'm trying to understand what the performance looks like when using
> existing features that inhibit preemption vs putting forward the
> preemption when the userspace is holding a lock. Feel free to quote
> the latency comparisons too if using the existing features lead to
> unacceptable avg/tail latencies.
> 
>> Without extension : 99K SQL exec/sec
>> with extension    : 153K SQL exec/sec
>>   55% improvement in throughput.
>> [..snip..]
> 
> -- 
> Thanks and Regards,
> Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
                   ` (4 preceding siblings ...)
  2024-11-13  5:43 ` [RFC PATCH 0/4] Scheduler time slice extension K Prateek Nayak
@ 2024-11-13 18:50 ` Peter Zijlstra
  2024-11-13 19:36   ` Mathieu Desnoyers
  2024-11-13 19:50   ` Prakash Sangappa
  5 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2024-11-13 18:50 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: linux-kernel, rostedt, tglx, daniel.m.jordan, Mathieu Desnoyers

On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:

> This patch set implements the above mentioned 50us extension time as posted
> by Peter. But instead of using restartable sequences as API to set the flag
> to request the extension, this patch proposes a new API with use of a per
> thread shared structure implementation described below. This shared structure
> is accessible in both users pace and kernel. The user thread will set the
> flag in this shared structure to request execution time extension.

But why -- we already have rseq, glibc uses it by default. Why add yet
another thing?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 18:50 ` Peter Zijlstra
@ 2024-11-13 19:36   ` Mathieu Desnoyers
  2024-11-13 20:10     ` Prakash Sangappa
  2024-11-14 10:14     ` Peter Zijlstra
  2024-11-13 19:50   ` Prakash Sangappa
  1 sibling, 2 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-11-13 19:36 UTC (permalink / raw)
  To: Peter Zijlstra, Prakash Sangappa
  Cc: linux-kernel, rostedt, tglx, daniel.m.jordan

On 2024-11-13 13:50, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> 
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
> 
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Indeed, what I'm not seeing in this RFC patch series cover letter is an
explanation that justifies adding yet another per-thread memory area
shared between kernel and userspace when we have extensible rseq
already.

Peter, was there anything fundamentally wrong with your approach based
on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

The main thing I wonder is whether loading the rseq delay resched flag
on return to userspace is too late in your patch. Also, I'm not sure it is
realistic to require that no system calls should be done within time extension
slice. If we have this scenario:

A) userspace grabs lock
    - set rseq delay resched flag
B) syscall
    - reschedule
     [...]
    - return to userspace, load rseq delay-resched flag from userspace (too late)

I would have thought loading the delay resched flag should be attempted much
earlier in the scheduler code. Perhaps we could do this from a page fault
disable critical section, and accept that this hint may be a no-op if the
rseq page happens to be swapped out (which is really unlikely). This is
similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
which needed to be accessed from the scheduler:

   https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
   https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

And we'd leave the delay-resched load in place on return to userspace, so
in the unlikely scenario where it is swapped out, at least it gets paged
back at that point.

Feel free to let me know if I'm missing an important point and/or saying
nonsense here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 19:36   ` Mathieu Desnoyers
@ 2024-11-13 20:10     ` Prakash Sangappa
  2024-11-13 20:57       ` Mathieu Desnoyers
  2024-11-14 10:28       ` Peter Zijlstra
  2024-11-14 10:14     ` Peter Zijlstra
  1 sibling, 2 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13 20:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-13 13:50, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>> This patch set implements the above mentioned 50us extension time as posted
>>> by Peter. But instead of using restartable sequences as API to set the flag
>>> to request the extension, this patch proposes a new API with use of a per
>>> thread shared structure implementation described below. This shared structure
>>> is accessible in both users pace and kernel. The user thread will set the
>>> flag in this shared structure to request execution time extension.
>> But why -- we already have rseq, glibc uses it by default. Why add yet
>> another thing?
> 
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.

It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.

> 
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> 
> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch. Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

I am also not sure if we need to prevent system calls in this scenario. 
Was that restriction mainly because of restartable sequence API implements it?

-Prakash

> 
> A) userspace grabs lock
>   - set rseq delay resched flag
> B) syscall
>   - reschedule
>    [...]
>   - return to userspace, load rseq delay-resched flag from userspace (too late)
> 
> I would have thought loading the delay resched flag should be attempted much
> earlier in the scheduler code. Perhaps we could do this from a page fault
> disable critical section, and accept that this hint may be a no-op if the
> rseq page happens to be swapped out (which is really unlikely). This is
> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
> which needed to be accessed from the scheduler:
> 
>  https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>  https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
> 
> And we'd leave the delay-resched load in place on return to userspace, so
> in the unlikely scenario where it is swapped out, at least it gets paged
> back at that point.
> 
> Feel free to let me know if I'm missing an important point and/or saying
> nonsense here.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 20:10     ` Prakash Sangappa
@ 2024-11-13 20:57       ` Mathieu Desnoyers
  2024-11-13 23:24         ` Prakash Sangappa
  2024-11-14 10:28       ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-11-13 20:57 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan

On 2024-11-13 15:10, Prakash Sangappa wrote:
> 
> 
>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
> 
> It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.

Does the targeted use-case (scheduler time slice extension) require
pinned memory, or just future use-cases ?

Does having a missing time slice extension hint for a short while in
case of high memory pressure (rseq page swapped out) have any measurable
impact compared to the overhead of the page faults which will be
happening in case of the high memory pressure required to trigger this
scenario ?

> 
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch. Also, I'm not sure it is
>> realistic to require that no system calls should be done within time extension
>> slice. If we have this scenario:
> 
> I am also not sure if we need to prevent system calls in this scenario.

I suspect that if we prohibit system calls from being issued from a
delay-resched userspace critical section, then loading the delay-resched
rseq flag on return to userspace is always fine, because the kernel only
reschedules on return from interrupt or trap.

But I see this no-syscall restriction as being very cumbersome for
userspace.

> Was that restriction mainly because of restartable sequence API implements it?

I suspect that this restriction is just to avoid loading the
delay-resched flag from the scheduler when reschedule is called
from an interrupt handler nested over a system call for preemptible
kernels, but Peter could tell us more.

One open question here is whether we want to pin memory for
each thread in the system to hold this shared data between
userspace and the scheduler. AFAIU, this is a trade-off between
slice extension accuracy in high memory usage scenarios and
pinned memory footprint impact. If the tradeoff goes towards
making this memory pinned, then we may want to consider pinning
the per-thread rseq area on rseq registration.

Another option to consider is to use rseq to index a userspace
per-cpu data structure, which will be used as shared memory
between kernel and userspace. Userspace could store this
delay-resched flag into the current CPU's shared data, and
the scheduler could load it from there. If pinning per-cpu
data is more acceptable than pinning per-thread data, then
it could be an improvement.

This could be a new ABI between kernel and userspace, e.g.:

struct rseq_percpu_area {
     __u32 sched_state;	/* includes time slice extension flag. */
     char end[];
};

Registered to the kernel with the following parameters:

- Address of rseq_percpu_area for CPU 0,
- The stride of the per-cpu indexing (see librseq mempool per-cpu
   allocator [1]),
- offsetof(struct rseq_percpu_area, end) to have the feature size
   for extensibility.

Thanks,

Mathieu

[1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf

> 
> -Prakash
> 
>>
>> A) userspace grabs lock
>>    - set rseq delay resched flag
>> B) syscall
>>    - reschedule
>>     [...]
>>    - return to userspace, load rseq delay-resched flag from userspace (too late)
>>
>> I would have thought loading the delay resched flag should be attempted much
>> earlier in the scheduler code. Perhaps we could do this from a page fault
>> disable critical section, and accept that this hint may be a no-op if the
>> rseq page happens to be swapped out (which is really unlikely). This is
>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>> which needed to be accessed from the scheduler:
>>
>>   https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>   https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>
>> And we'd leave the delay-resched load in place on return to userspace, so
>> in the unlikely scenario where it is swapped out, at least it gets paged
>> back at that point.
>>
>> Feel free to let me know if I'm missing an important point and/or saying
>> nonsense here.
>>
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 20:57       ` Mathieu Desnoyers
@ 2024-11-13 23:24         ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13 23:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Nov 13, 2024, at 12:57 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-13 15:10, Prakash Sangappa wrote:
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.
> 
> Does the targeted use-case (scheduler time slice extension) require
> pinned memory, or just future use-cases ?

Probably not for sched time slice extension. Although, we have not run the database workload using restartable sequence. We can try that and get back.

What about publishing thread state in the shared structure? It would require updating user memory in context switching code path when thread is going off the cpu. Here having pinned memory would help. Thread state is the other requirement we are interested in.

> 
> Does having a missing time slice extension hint for a short while in
> case of high memory pressure (rseq page swapped out) have any measurable

Yes, but the restartable sequence based implementation does copy_from_user(), which would go thru the pagefault. It may be ok here. Perpaps it would require disabling page faults in this case? If the page is not present, then the hint would most likely not matter and kernel can skip.

> impact compared to the overhead of the page faults which will be
> happening in case of the high memory pressure required to trigger this
> scenario ?

We would have to test that. 

> 
>>> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>> 
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> I am also not sure if we need to prevent system calls in this scenario.
> 
> I suspect that if we prohibit system calls from being issued from a
> delay-resched userspace critical section, then loading the delay-resched
> rseq flag on return to userspace is always fine, because the kernel only
> reschedules on return from interrupt or trap.
> 
> But I see this no-syscall restriction as being very cumbersome for
> userspace.
> 
>> Was that restriction mainly because of restartable sequence API implements it?
> 
> I suspect that this restriction is just to avoid loading the
> delay-resched flag from the scheduler when reschedule is called
> from an interrupt handler nested over a system call for preemptible
> kernels, but Peter could tell us more.

Hoping Peter can comment on this.

> One open question here is whether we want to pin memory for
> each thread in the system to hold this shared data between
> userspace and the scheduler. AFAIU, this is a trade-off between
> slice extension accuracy in high memory usage scenarios and
> pinned memory footprint impact. If the tradeoff goes towards
> making this memory pinned, then we may want to consider pinning
> the per-thread rseq area on rseq registration.
> 
> Another option to consider is to use rseq to index a userspace
> per-cpu data structure, which will be used as shared memory
> between kernel and userspace. Userspace could store this
> delay-resched flag into the current CPU's shared data, and
> the scheduler could load it from there. If pinning per-cpu
> data is more acceptable than pinning per-thread data, then
> it could be an improvement.

Interesting. Having pre cpu shared memory may not work for all use cases.

Thanks,
-Prakash

> 
> This could be a new ABI between kernel and userspace, e.g.:
> 
> struct rseq_percpu_area {
>    __u32 sched_state; /* includes time slice extension flag. */
>    char end[];
> };
> 
> Registered to the kernel with the following parameters:
> 
> - Address of rseq_percpu_area for CPU 0,
> - The stride of the per-cpu indexing (see librseq mempool per-cpu
>  allocator [1]),
> - offsetof(struct rseq_percpu_area, end) to have the feature size
>  for extensibility.
> 
> Thanks,
> 
> Mathieu
> 
> [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf
> 
>> -Prakash
>>> 
>>> A) userspace grabs lock
>>>   - set rseq delay resched flag
>>> B) syscall
>>>   - reschedule
>>>    [...]
>>>   - return to userspace, load rseq delay-resched flag from userspace (too late)
>>> 
>>> I would have thought loading the delay resched flag should be attempted much
>>> earlier in the scheduler code. Perhaps we could do this from a page fault
>>> disable critical section, and accept that this hint may be a no-op if the
>>> rseq page happens to be swapped out (which is really unlikely). This is
>>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>>> which needed to be accessed from the scheduler:
>>> 
>>>  https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>>  https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>> 
>>> And we'd leave the delay-resched load in place on return to userspace, so
>>> in the unlikely scenario where it is swapped out, at least it gets paged
>>> back at that point.
>>> 
>>> Feel free to let me know if I'm missing an important point and/or saying
>>> nonsense here.
>>> 
>>> Thanks,
>>> 
>>> Mathieu
>>> 
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 20:10     ` Prakash Sangappa
  2024-11-13 20:57       ` Mathieu Desnoyers
@ 2024-11-14 10:28       ` Peter Zijlstra
  2024-11-14 19:42         ` Prakash Sangappa
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2024-11-14 10:28 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Mathieu Desnoyers, linux-kernel@vger.kernel.org,
	rostedt@goodmis.org, tglx@linutronix.de, Daniel Jordan

On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
> 
> 
> > On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > On 2024-11-13 13:50, Peter Zijlstra wrote:
> >> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >>> This patch set implements the above mentioned 50us extension time as posted
> >>> by Peter. But instead of using restartable sequences as API to set the flag
> >>> to request the extension, this patch proposes a new API with use of a per
> >>> thread shared structure implementation described below. This shared structure
> >>> is accessible in both users pace and kernel. The user thread will set the
> >>> flag in this shared structure to request execution time extension.
> >> But why -- we already have rseq, glibc uses it by default. Why add yet
> >> another thing?
> > 
> > Indeed, what I'm not seeing in this RFC patch series cover letter is an
> > explanation that justifies adding yet another per-thread memory area
> > shared between kernel and userspace when we have extensible rseq
> > already.
> 
> It mainly provides pinned memory, can be useful for  future use cases
> where updating user memory in kernel context can be fast or needs to
> avoid pagefaults.

'might be useful' it not good enough a justification. Also, I don't
think you actually need this.

See:

  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com

for a more elaborate scheme.

> > Peter, was there anything fundamentally wrong with your approach based
> > on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> > 
> > The main thing I wonder is whether loading the rseq delay resched flag
> > on return to userspace is too late in your patch. Also, I'm not sure it is
> > realistic to require that no system calls should be done within time extension
> > slice. If we have this scenario:
> 
> I am also not sure if we need to prevent system calls in this scenario. 
> Was that restriction mainly because of restartable sequence API implements it?

No, the whole premise of delaying resched was because people think that
syscalls are too slow. If you do not think this, then you shouldn't be
using this.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-14 10:28       ` Peter Zijlstra
@ 2024-11-14 19:42         ` Prakash Sangappa
  2024-11-15 14:13           ` Mathieu Desnoyers
  2024-12-09 20:36           ` Prakash Sangappa
  0 siblings, 2 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-14 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel@vger.kernel.org,
	rostedt@goodmis.org, tglx@linutronix.de, Daniel Jordan



> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>> 
>> 
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> 
>> It mainly provides pinned memory, can be useful for  future use cases
>> where updating user memory in kernel context can be fast or needs to
>> avoid pagefaults.
> 
> 'might be useful' it not good enough a justification. Also, I don't
> think you actually need this.

Will get back with database benchmark results using rseq API for scheduler time extension.

> 
> See:
> 
>  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
> 
> for a more elaborate scheme.
> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>> 
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> 
>> I am also not sure if we need to prevent system calls in this scenario. 
>> Was that restriction mainly because of restartable sequence API implements it?
> 
> No, the whole premise of delaying resched was because people think that
> syscalls are too slow. If you do not think this, then you shouldn't be
> using this.

Agree.

Thanks,
-Prakash


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-14 19:42         ` Prakash Sangappa
@ 2024-11-15 14:13           ` Mathieu Desnoyers
  2024-11-15 17:20             ` Prakash Sangappa
  2024-12-09 20:36           ` Prakash Sangappa
  1 sibling, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-11-15 14:13 UTC (permalink / raw)
  To: Prakash Sangappa, Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan

On 2024-11-14 14:42, Prakash Sangappa wrote:
> 
> 
>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

>>
>> See:
>>
>>   https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>
>> for a more elaborate scheme.
>>
>>>> Peter, was there anything fundamentally wrong with your approach based
>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>
>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>> realistic to require that no system calls should be done within time extension
>>>> slice. If we have this scenario:
>>>
>>> I am also not sure if we need to prevent system calls in this scenario.
>>> Was that restriction mainly because of restartable sequence API implements it?
>>
>> No, the whole premise of delaying resched was because people think that
>> syscalls are too slow. If you do not think this, then you shouldn't be
>> using this.
> 
> Agree.

I only partially agree with Peter here. I agree that we don't want to
add system calls on the delay-resched critical section fast path,
because this would have a significant performance hit.

But there are scenarios where issuing system calls from within that
critical section would be needed, even though those would not belong
to the fast path:

1) If a signal handler nests over a delay-resched critical section.
    That signal handler is allowed to issue system calls.

2) If the critical section fast-path is calling GNU C library API and/or
    a vDSO, which is typically fast, but can end up calling a system call
    as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
    system call by killing the application punches a hole in the
    abstractions meant to be provided by GNU libc and vDSO.

I would recommend that we allow issuing system calls while the
delay-resched bit is set. However, we may not strictly need to honor
the delay-resched hint from a system call context, as those would
be expected to be either infrequent or a portability fallback,
which means the enhanced performance provided by delay-resched
really won't matter.

Another scenario to keep in mind are page faults happening within a
delay-resched critical section. This is a scenario where page fault
handling can explicitly reschedule. If this happens, I suspect we
really don't care about the delay-resched hint, but we should consider
whether this hint should be left as-is or cleared.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-15 14:13           ` Mathieu Desnoyers
@ 2024-11-15 17:20             ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-15 17:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Nov 15, 2024, at 6:13 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-14 14:42, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [...]
> 
>>> 
>>> See:
>>> 
>>>  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>> 
>>> for a more elaborate scheme.
>>> 
>>>>> Peter, was there anything fundamentally wrong with your approach based
>>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>> 
>>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>>> realistic to require that no system calls should be done within time extension
>>>>> slice. If we have this scenario:
>>>> 
>>>> I am also not sure if we need to prevent system calls in this scenario.
>>>> Was that restriction mainly because of restartable sequence API implements it?
>>> 
>>> No, the whole premise of delaying resched was because people think that
>>> syscalls are too slow. If you do not think this, then you shouldn't be
>>> using this.
>> Agree.
> 
> I only partially agree with Peter here. I agree that we don't want to
> add system calls on the delay-resched critical section fast path,
> because this would have a significant performance hit.
> 
> But there are scenarios where issuing system calls from within that
> critical section would be needed, even though those would not belong
> to the fast path:
> 
> 1) If a signal handler nests over a delay-resched critical section.
>   That signal handler is allowed to issue system calls.
> 
> 2) If the critical section fast-path is calling GNU C library API and/or
>   a vDSO, which is typically fast, but can end up calling a system call
>   as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
>   system call by killing the application punches a hole in the
>   abstractions meant to be provided by GNU libc and vDSO.
> 
> I would recommend that we allow issuing system calls while the
> delay-resched bit is set. However, we may not strictly need to honor
> the delay-resched hint from a system call context, as those would
> be expected to be either infrequent or a portability fallback,
> which means the enhanced performance provided by delay-resched
> really won't matter.
> 
> Another scenario to keep in mind are page faults happening within a
> delay-resched critical section. This is a scenario where page fault
> handling can explicitly reschedule. If this happens, I suspect we
> really don't care about the delay-resched hint, but we should consider
> whether this hint should be left as-is or cleared.
> 

There are no explicit checks right now that a system call occurred in the critical section, 
except under DEBUG_RSEQ, which probably will not apply for the delay-resched case.
 
But the question was should there be checks implemented, probably not 
due to scenarios you mentioned above.


> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-14 19:42         ` Prakash Sangappa
  2024-11-15 14:13           ` Mathieu Desnoyers
@ 2024-12-09 20:36           ` Prakash Sangappa
  2024-12-09 21:17             ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Prakash Sangappa @ 2024-12-09 20:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel@vger.kernel.org,
	rostedt@goodmis.org, tglx@linutronix.de, Daniel Jordan



> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> 
> 
>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>> 
>>> 
>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>> 
>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>> thread shared structure implementation described below. This shared structure
>>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>>> flag in this shared structure to request execution time extension.
>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>> another thing?
>>>> 
>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>> explanation that justifies adding yet another per-thread memory area
>>>> shared between kernel and userspace when we have extensible rseq
>>>> already.
>>> 
>>> It mainly provides pinned memory, can be useful for  future use cases
>>> where updating user memory in kernel context can be fast or needs to
>>> avoid pagefaults.
>> 
>> 'might be useful' it not good enough a justification. Also, I don't
>> think you actually need this.
> 
> Will get back with database benchmark results using rseq API for scheduler time extension.

Sorry about the delay in response.

Here are the database swingbench numbers   - includes results with use of rseq API.

Test results:
=========
Test system 2 socket AMD Genoa

 	Swingbench - standard database benchmark 
		Cached(database files on tmpfs) run, with 1000 clients.

		Baseline(Without Sched time extension):  99K SQL exec/sec

		With Sched time extension: 
				Shared structure API use: 		153K SQL exec/sec  (Previously reported)
			55% improvement in throughput.

				Restartable sequences API use:	147K SQL exec/sec
 			48% improvement in throughput

While both show good performance benefit with scheduler time extension,
there is a 7% difference in throughput between Shared structure & Restartable sequences API. 
Use of shared structure is faster.



> 
>> 
>> See:
>> 
>> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>> 
>> for a more elaborate scheme.
>> 
>>>> Peter, was there anything fundamentally wrong with your approach based
>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>> 
>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>> realistic to require that no system calls should be done within time extension
>>>> slice. If we have this scenario:
>>> 
>>> I am also not sure if we need to prevent system calls in this scenario. 
>>> Was that restriction mainly because of restartable sequence API implements it?
>> 
>> No, the whole premise of delaying resched was because people think that
>> syscalls are too slow. If you do not think this, then you shouldn't be
>> using this.
> 
> Agree.
> 
> Thanks,
> -Prakash



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-12-09 20:36           ` Prakash Sangappa
@ 2024-12-09 21:17             ` Mathieu Desnoyers
  2024-12-16 18:59               ` Prakash Sangappa
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-12-09 21:17 UTC (permalink / raw)
  To: Prakash Sangappa, Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan

On 2024-12-09 15:36, Prakash Sangappa wrote:
> 
> 
>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>
>>
>>
>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>
>>>>
>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>>>> flag in this shared structure to request execution time extension.
>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>> another thing?
>>>>>
>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>> explanation that justifies adding yet another per-thread memory area
>>>>> shared between kernel and userspace when we have extensible rseq
>>>>> already.
>>>>
>>>> It mainly provides pinned memory, can be useful for  future use cases
>>>> where updating user memory in kernel context can be fast or needs to
>>>> avoid pagefaults.
>>>
>>> 'might be useful' it not good enough a justification. Also, I don't
>>> think you actually need this.
>>
>> Will get back with database benchmark results using rseq API for scheduler time extension.
> 
> Sorry about the delay in response.
> 
> Here are the database swingbench numbers   - includes results with use of rseq API.
> 
> Test results:
> =========
> Test system 2 socket AMD Genoa
> 
>   	Swingbench - standard database benchmark
> 		Cached(database files on tmpfs) run, with 1000 clients.
> 
> 		Baseline(Without Sched time extension):  99K SQL exec/sec
> 
> 		With Sched time extension:
> 				Shared structure API use: 		153K SQL exec/sec  (Previously reported)
> 			55% improvement in throughput.
> 
> 				Restartable sequences API use:	147K SQL exec/sec
>   			48% improvement in throughput
> 
> While both show good performance benefit with scheduler time extension,
> there is a 7% difference in throughput between Shared structure & Restartable sequences API.
> Use of shared structure is faster.

Can you share the code for both test cases ? And do you have relevant
perf profile showing where time is spent ?

Thanks,

Mathieu

> 
> 
> 
>>
>>>
>>> See:
>>>
>>> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>>
>>> for a more elaborate scheme.
>>>
>>>>> Peter, was there anything fundamentally wrong with your approach based
>>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>>
>>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>>> realistic to require that no system calls should be done within time extension
>>>>> slice. If we have this scenario:
>>>>
>>>> I am also not sure if we need to prevent system calls in this scenario.
>>>> Was that restriction mainly because of restartable sequence API implements it?
>>>
>>> No, the whole premise of delaying resched was because people think that
>>> syscalls are too slow. If you do not think this, then you shouldn't be
>>> using this.
>>
>> Agree.
>>
>> Thanks,
>> -Prakash
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-12-09 21:17             ` Mathieu Desnoyers
@ 2024-12-16 18:59               ` Prakash Sangappa
  2025-02-04  3:04                 ` Prakash Sangappa
  0 siblings, 1 reply; 26+ messages in thread
From: Prakash Sangappa @ 2024-12-16 18:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Dec 9, 2024, at 1:17 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-12-09 15:36, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>> 
>>>>> 
>>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>>>> 
>>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>>>>> flag in this shared structure to request execution time extension.
>>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>>> another thing?
>>>>>> 
>>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>>> explanation that justifies adding yet another per-thread memory area
>>>>>> shared between kernel and userspace when we have extensible rseq
>>>>>> already.
>>>>> 
>>>>> It mainly provides pinned memory, can be useful for  future use cases
>>>>> where updating user memory in kernel context can be fast or needs to
>>>>> avoid pagefaults.
>>>> 
>>>> 'might be useful' it not good enough a justification. Also, I don't
>>>> think you actually need this.
>>> 
>>> Will get back with database benchmark results using rseq API for scheduler time extension.
>> Sorry about the delay in response.
>> Here are the database swingbench numbers   - includes results with use of rseq API.
>> Test results:
>> =========
>> Test system 2 socket AMD Genoa
>>   Swingbench - standard database benchmark
>> Cached(database files on tmpfs) run, with 1000 clients.
>> Baseline(Without Sched time extension):  99K SQL exec/sec
>> With Sched time extension:
>> Shared structure API use: 153K SQL exec/sec  (Previously reported)
>> 55% improvement in throughput.
>> Restartable sequences API use: 147K SQL exec/sec
>>   48% improvement in throughput
>> While both show good performance benefit with scheduler time extension,
>> there is a 7% difference in throughput between Shared structure & Restartable sequences API.
>> Use of shared structure is faster.
> 
> Can you share the code for both test cases ? And do you have relevant
> perf profile showing where time is spent ?
> 
> Thanks,
> 
> Mathieu
> 

The changes are in the database(Oracle DB). 
The test is swingbench. https://www.dominicgiles.com/downloads/

Our database team is running the benchmark. I have requested them to repeat the test and 
capture perf profile.

With restartable sequences API, once a thread registers the 'structure rseq', 
kernel will need to 'copy_from_user' to check if the thread is requesting an extension in 
exit_to_user_mode_loop(), this potentially adds overhead? Were as with shared structure
It is just a memory access. 

I was trying to reproduce the performance difference using a microbenchmark. 
Used a modified version of the test(extend-sched.c) the Steven Rosted’s posted here 
https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/

Modified test to include use of the Shared structure API and Restartable sequences API to request
scheduler time extension.  Increased number of data objects that the threads contend on 
and grab_lock(). Added some delay inside critical section to increase lock hold time. 
Test runs for 180secs. 
(Modified test included below)

The test spawns number of threads equal to number of cpus. They update a shared data object. 
Simultaneously ran a cpu hog program with # threads equal to number of cpus.

The test takes an argument  
0 - No scheduler time extension
1 - Use Shared structure API to request extension 
2 - Use Restartable sequences API to request extension.
Option 1 & 2 would require relevant kernel running. 

The test increments a count in the shared data object indicating number of times it could complete
the critical section. Higher the number is better.

Running this test on a 50 core VM shows lot of variance in the results. Generally we see
performance improvement of 4 to 6% with use of either APIs in this test.
Unfortunately It does not show  consistent difference between the two APIs.

Approximately :-

No extension:

#./extend-sched2 0 |egrep -e "^Ran|^Total|^avg”
Ran for 402660 times <---
Total wait time: 16132.358985
avg # grants 0 avg reqs 4026
avg # blocks 14440


With extension: using shared structure API

#./extend-sched2 1 |egrep -e "^Ran|^Total|^avg”
Ran for 425210 times <---
Total wait time: 16043.160130
avg # grants 2582 avg reqs 4252
avg # blocks 14920

About 5.6% improvement.

WIth extension: Using restartable sequences API

#./extend-sched2 2 |egrep -e "^Ran|^Total|^avg”
Ran for 423765 times <---
Total wait time: 16045.406387
avg # grants 2580 avg reqs 4237
avg # blocks 14851

About 5.3 % improvement. 

The perf profile of the test are similar between the two APIs. With restartable sequences, 
we see ‘_copy_from_user’ appear.

# Total Lost Samples: 0
#
# Samples: 890K of event 'cycles:P'
# Event count (approx.): 32581540690314
#
# Overhead       Samples  Command       Shared Object       Symbol
# ........  ............  ............  ..................  .............................................
#
   98.67%        877109  lckshrsq2scl  lckshrsq2scl        [.] grab_lock
    0.84%          7496  lckshrsq2scl  [kernel.kallsyms]   [k] asm_sysvec_apic_timer_interrupt
    0.03%           285  lckshrsq2scl  [kernel.kallsyms]   [k] native_write_msr
    0.02%           200  lckshrsq2scl  [kernel.kallsyms]   [k] native_read_msr
    0.02%           175  lckshrsq2scl  [kernel.kallsyms]   [k] pvclock_clocksource_read_nowd
    0.01%           128  lckshrsq2scl  [kernel.kallsyms]   [k] sync_regs
    0.01%           120  lckshrsq2scl  [kernel.kallsyms]   [k] get_jiffies_update
    0.01%           116  lckshrsq2scl  [kernel.kallsyms]   [k] __update_load_avg_cfs_rq
    0.01%           113  lckshrsq2scl  [kernel.kallsyms]   [k] update_curr
    0.01%           137  lckshrsq2scl  [kernel.kallsyms]   [k] srso_alias_safe_ret
    0.01%            96  lckshrsq2scl  [kernel.kallsyms]   [k] __update_load_avg_se
    0.01%            85  lckshrsq2scl  [kernel.kallsyms]   [k] update_process_times
    0.01%            76  lckshrsq2scl  [kernel.kallsyms]   [k] update_rq_clock_task
    0.01%           101  lckshrsq2scl  [kernel.kallsyms]   [k] __raw_callee_save___pv_queued_spin_unlock
    0.01%            73  lckshrsq2scl  [kernel.kallsyms]   [k] update_irq_load_avg
    0.01%            72  lckshrsq2scl  lckshrsq2scl          [.] run_thread
    0.01%            66  lckshrsq2scl  [kernel.kallsyms]   [k] rcu_pending
    0.01%            65  lckshrsq2scl  [kernel.kallsyms]   [k] perf_adjust_freq_unthr_context
    0.01%            62  lckshrsq2scl  [kernel.kallsyms]   [k] sched_tick
    0.01%            62  lckshrsq2scl  [kernel.kallsyms]   [k] update_load_avg
    0.01%            62  lckshrsq2scl  [kernel.kallsyms]   [k] hrtimer_interrupt
..
    0.00%             14  lckshrsq2scl  [kernel.kallsyms]   [k] _copy_from_user
..
    0.00%             1  lckshrsq2scl  [kernel.kallsyms]   [k] _copy_to_user



Test program:
/*— extend-sched2.c — */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/time.h>
#include <linux/rseq.h>
#include <sys/syscall.h>

#define barrier() asm volatile ("" ::: "memory")
#define rmb() asm volatile ("lfence" ::: "memory")
#define wmb() asm volatile ("sfence" ::: "memory")


/*----- shared struct ------ */
/*
* Shared page -  shared structure API/ABI
*/
#define __NR_task_getshared 463
#define TASK_SHAREDINFO 1


/* a per thread shared structure */
struct task_sharedinfo {
       volatile unsigned short sched_delay;
};

/*
* Returns address of the task_ushared structure in the shared page
* mapped between userspace and kernel, for the calling thread.
*/
struct task_sharedinfo *task_getshared()
{
       struct task_sharedinfo *ts;
       if (!syscall(__NR_task_getshared, TASK_SHAREDINFO, 0, &ts))
               return ts;
       return NULL;
}

/*------- restartable seq ------- */
#define RSEQ_CS_SCHED_DELAY 3
#define RSEQ_SCHED_DELAY ((1U << RSEQ_CS_SCHED_DELAY))

static __thread struct rseq myrseq;
static __thread struct rseq *extend_map_rseq;


int init_rseql()
{
       syscall(__NR_rseq, &myrseq, 32, 0, 0x31456);
       extend_map_rseq = &myrseq;
}

/*--------------------------------*/

static pthread_barrier_t pbarrier;
static __thread struct task_sharedinfo *extend_map;
int enable_extend;

static void init_extend_map(void)
{
       if (!enable_extend)
               return;
       if (enable_extend == 1) {
               extend_map = task_getshared();
               if (extend_map == NULL) {
                       printf("Failed to allocate shared struct\n");
               }
       }
       if (enable_extend == 2) {
               init_rseql();
               if (extend_map_rseq == NULL) {
                       printf("Failed to init startable seq\n");
               }
       }

}

/*called from main only */
static void test_extend(void)
{
       int x;

       init_extend_map();
       if (extend_map == NULL &&  extend_map_rseq == NULL)
               return;
       printf("Spinning...");
       if (enable_extend == 1) {
               extend_map->sched_delay = 1;
               while(extend_map->sched_delay != 2) {
                       x++;
                       barrier();
               }
       }

       if (enable_extend == 2) {
               extend_map_rseq->flags = RSEQ_SCHED_DELAY;
               while (extend_map_rseq->flags & RSEQ_SCHED_DELAY) {
                       x++;
                       barrier();
               }
       }
       printf("Done\n");

}

struct thread_data {
       unsigned long long                      start_wait;
       unsigned long long                      x_count;
       unsigned long long                      total;
       unsigned long long                      max;
       unsigned long long                      min;
       unsigned long long                      total_wait;
       unsigned long long                      max_wait;
       unsigned long long                      min_wait;
       unsigned long long                      blocks;
       unsigned long long                      dlygrant;
       struct data                            *data;
};

struct data {
       unsigned long long              x;
       unsigned long                   lock;
       bool                            done;
};

int nobj;
struct data *data;
struct thread_data *thrdata;

static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
       unsigned long prev;

       asm volatile("lock; cmpxchg %b1,%2"
                    : "=a"(prev)
                    : "q"(new), "m"(*(ptr)), "0"(old)
                    : "memory");
       return prev;
}

static inline unsigned long
xchg(volatile unsigned long *ptr, unsigned long new)
{
       unsigned long ret = new;

       asm volatile("xchg %b0,%1"
                    : "+r"(ret), "+m"(*(ptr))
                    : : "memory");
       return ret;
}

static void extend(struct thread_data *tdata)
{
       if (extend_map == NULL &&  extend_map_rseq == NULL)
               return;

       if (enable_extend == 1)
               extend_map->sched_delay = 1;
       if (enable_extend == 2)
               extend_map_rseq->flags = RSEQ_SCHED_DELAY;
}

static void unextend(struct thread_data *tdata)
{
       unsigned long prev;

       if (extend_map == NULL &&  extend_map_rseq == NULL)
               return;

       if (enable_extend == 1) {
               prev = extend_map->sched_delay;
               extend_map->sched_delay = 0;
       //      prev = xchg(&extend_map->sched_delay, 0);
               if (prev == 2) {
                       tdata->dlygrant++;
                       //printf("Yield!\n");
                       sched_yield();
               }
       }

       if (enable_extend == 2) {
               prev = extend_map_rseq->flags;
               extend_map_rseq->flags = 0;
               if (!(prev & RSEQ_SCHED_DELAY)) {
                       tdata->dlygrant++;
                       //printf("Yield!\n");
                       sched_yield();
               }
       }
}

#define sec2usec(sec) (sec * 1000000ULL)
#define usec2sec(usec) (usec / 1000000ULL)

static unsigned long long get_time(void)
{
       struct timeval tv;
       unsigned long long time;

       gettimeofday(&tv, NULL);

       time = sec2usec(tv.tv_sec);
       time += tv.tv_usec;

       return time;
}

static void grab_lock(struct thread_data *tdata, struct data *data)
{
       unsigned long long start, end, delta;
       unsigned long long end_wait;
       unsigned long long last;
       unsigned long prev;

       if (!tdata->start_wait)
               tdata->start_wait = get_time();

       while (data->lock && !data->done)
               rmb();

       extend(tdata);
       start = get_time();
       prev = cmpxchg(&data->lock, 0, 1);
       if (prev) {
               unextend(tdata);
               tdata->blocks++;
               return;
       }
       end_wait = get_time();
       //printf("Have lock!\n");

       delta = end_wait - tdata->start_wait;
       tdata->start_wait = 0;
       if (!tdata->total_wait || tdata->max_wait < delta)
               tdata->max_wait = delta;
       if (!tdata->total_wait || tdata->min_wait > delta)
               tdata->min_wait = delta;
       tdata->total_wait += delta;

       data->x++;
       last = data->x;

       if (data->lock != 1) {
               printf("Failed locking\n");
               exit(-1);
       }

       // extend hold time
       for(int i = 0; i < 3000000; i++);

       prev = cmpxchg(&data->lock, 1, 0);
       end = get_time();
       if (prev != 1) {
               printf("Failed unlocking\n");
               exit(-1);
       }
       //printf("released lock!\n");
       unextend(tdata);

       delta = end - start;
       if (!tdata->total || tdata->max < delta)
               tdata->max = delta;

       if (!tdata->total || tdata->min > delta)
               tdata->min = delta;

       tdata->total += delta;
       tdata->x_count++;

       /* Let someone else have a turn */
       while (data->x == last && !data->done)
               rmb();
}

static void *run_thread(void *d)
{
       struct thread_data *tdata = d;
       struct data *data = tdata->data;

       init_extend_map();

       pthread_barrier_wait(&pbarrier);

       while (!data->done) {
               grab_lock(tdata, data);
       }
       return NULL;
}

/* arg1 == 1 use shared struct, arg1 ==2 use rseq */
int main (int argc, char **argv)
{
       unsigned long long total_wait = 0;
       unsigned long long total_blocks = 0;
       unsigned long long total_grants = 0;
       unsigned long long total_runs = 0;
       unsigned long long secs;
       pthread_t *threads;
       int cpus;

       enable_extend = 0;
       if (argc < 2) {
               printf("Usage:%s <0 - no extend, 1 = shared page, 2 = rseq>\n",
                                                        argv[0]);
               exit(1);
       }
       enable_extend = atoi(argv[1]);
       printf("enable extend %d\n", enable_extend);

       test_extend();

       cpus = sysconf(_SC_NPROCESSORS_CONF);
       nobj = cpus / 10;


       threads = calloc(cpus + 1, sizeof(*threads));
       if (!threads) {
               perror("threads");
               exit(-1);
       }

       thrdata = calloc(cpus + 1, sizeof(struct thread_data));
       if (!thrdata) {
               perror("Allocating tdata");
               exit(-1);
       }

       data = calloc(nobj + 1, sizeof(struct data));
       if (!data) {
               perror("Allocating tdata");
               exit(-1);
       }

       pthread_barrier_init(&pbarrier, NULL, cpus + 1);

       for (int i = 0; i < cpus; i++) {
               int ret;

               thrdata[i].data = &data[ i % nobj];
               ret = pthread_create(&threads[i], NULL, run_thread, &thrdata[i]);
               if (ret < 0) {
                       perror("creating threads");
                       exit(-1);
               }
       }

       pthread_barrier_wait(&pbarrier);
       sleep(180);

       printf("Finish up\n");
       for (int i = 0; i < nobj; i++)
               data[i].done = true;
       wmb();

       for (int i = 0; i < cpus; i++) {
               pthread_join(threads[i], NULL);
               printf("thread %i:\n", i);
               printf("   count:\t%lld\n", thrdata[i].x_count);
               printf("   total:\t%lld\n", thrdata[i].total);
               printf("     max:\t%lld\n", thrdata[i].max);
               printf("     min:\t%lld\n", thrdata[i].min);
               printf("   total wait:\t%lld\n", thrdata[i].total_wait);
               printf("     max wait:\t%lld\n", thrdata[i].max_wait);
               printf("     min wait:\t%lld\n", thrdata[i].min_wait);
               printf("     #blocks :\t%lld\n", thrdata[i].blocks);
               printf("     #dlygrnt:\t%lld\n", thrdata[i].dlygrant);
               total_wait += thrdata[i].total_wait;
               total_blocks += thrdata[i].blocks;
               total_grants += thrdata[i].dlygrant;
       }

       for(int i = 0; i < nobj; i++)
               total_runs += data[i].x;

       secs = usec2sec(total_wait);

       printf("Ran for %lld times\n", total_runs);
       printf("Total wait time: %lld.%06lld\n", secs, total_wait - sec2usec(secs));
       printf("avg # grants %lld avg reqs %lld \n", total_grants/cpus, total_runs/cpus);
       printf("avg # blocks %lld\n", total_blocks/cpus);
       return 0;
}

/*———————————*/


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-12-16 18:59               ` Prakash Sangappa
@ 2025-02-04  3:04                 ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2025-02-04  3:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Dec 16, 2024, at 10:59 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> 
> 
>> On Dec 9, 2024, at 1:17 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> 
>> On 2024-12-09 15:36, Prakash Sangappa wrote:
>>>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> 
>>>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>>>>> 
>>>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>>>>>> flag in this shared structure to request execution time extension.
>>>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>>>> another thing?
>>>>>>> 
>>>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>>>> explanation that justifies adding yet another per-thread memory area
>>>>>>> shared between kernel and userspace when we have extensible rseq
>>>>>>> already.
>>>>>> 
>>>>>> It mainly provides pinned memory, can be useful for  future use cases
>>>>>> where updating user memory in kernel context can be fast or needs to
>>>>>> avoid pagefaults.
>>>>> 
>>>>> 'might be useful' it not good enough a justification. Also, I don't
>>>>> think you actually need this.
>>>> 
>>>> Will get back with database benchmark results using rseq API for scheduler time extension.
>>> Sorry about the delay in response.
>>> Here are the database swingbench numbers   - includes results with use of rseq API.
>>> Test results:
>>> =========
>>> Test system 2 socket AMD Genoa
>>>  Swingbench - standard database benchmark
>>> Cached(database files on tmpfs) run, with 1000 clients.
>>> Baseline(Without Sched time extension):  99K SQL exec/sec
>>> With Sched time extension:
>>> Shared structure API use: 153K SQL exec/sec  (Previously reported)
>>> 55% improvement in throughput.
>>> Restartable sequences API use: 147K SQL exec/sec
>>>  48% improvement in throughput
>>> While both show good performance benefit with scheduler time extension,
>>> there is a 7% difference in throughput between Shared structure & Restartable sequences API.
>>> Use of shared structure is faster.
>> 
>> Can you share the code for both test cases ? And do you have relevant
>> perf profile showing where time is spent ?
>> 
>> Thanks,
>> 
>> Mathieu
>> 
> 
> The changes are in the database(Oracle DB). 
> The test is swingbench. https://www.dominicgiles.com/downloads/
> 
> Our database team is running the benchmark. I have requested them to repeat the test and 
> capture perf profile.
> 

Update about the database testing comparing performance difference between the two APIs
Shared structure and Rseq. 

The workload shows variations in the runs. Repeating the tests number of times shows the
Performance on average is similar with use of either APIs to request Scheduler time extension.

He shared following results from the Lock table database workload, which simulates SQL query 
execution. This shows min, max, median and average of 5 runs with  scheduler time extension 
enabled and disabled.

Locktable test(# of query executions per sec).
Run					Min			Max			Median		Average
====================================================================
Disabled				162,911.10	187,543.40	177,035.30	175,675.24(Baseline)
Enabled(Shared struct)	167,896.20	237,709.50	191,269.50	195,113.80(+11%)
Enabled(Rseq)		166,410.40	251,349.30	191,917.60	199,740.40(+13%)

In this case Rseq shows better performance.

The conclusion seems to be both APIs perform similarly. 

How do we proceed?
We/our Database team wants the scheduler time extension feature integrated.
Not particular about the Shared structure API for this. 

I can submit the Rseq based patch PeterZ proposed. The  copy_from/to_user() calls
can be modified to use the nofault version.


> I was trying to reproduce the performance difference using a microbenchmark. 
> Used a modified version of the test(extend-sched.c) the Steven Rosted’s posted here 
> https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
...
> 
> The test increments a count in the shared data object indicating number of times it could complete
> the critical section. Higher the number is better.
> 
> Running this test on a 50 core VM shows lot of variance in the results. Generally we see
> performance improvement of 4 to 6% with use of either APIs in this test.
> Unfortunately It does not show  consistent difference between the two APIs.

...

-Prakash

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 19:36   ` Mathieu Desnoyers
  2024-11-13 20:10     ` Prakash Sangappa
@ 2024-11-14 10:14     ` Peter Zijlstra
  2024-11-15 14:41       ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2024-11-14 10:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Prakash Sangappa, linux-kernel, rostedt, tglx, daniel.m.jordan

On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
> On 2024-11-13 13:50, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> > 
> > > This patch set implements the above mentioned 50us extension time as posted
> > > by Peter. But instead of using restartable sequences as API to set the flag
> > > to request the extension, this patch proposes a new API with use of a per
> > > thread shared structure implementation described below. This shared structure
> > > is accessible in both users pace and kernel. The user thread will set the
> > > flag in this shared structure to request execution time extension.
> > 
> > But why -- we already have rseq, glibc uses it by default. Why add yet
> > another thing?
> 
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.
> 
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

Not that I can remember, but it's a long time ago :-)

> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch.

Too late how? It only loads it at the point we would've called
schedule() -- no point in looking at it otherwise, right?

> Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

Well, the whole premise is that syscalls are too expensive. If they are
not, then you shouldn't be using this in the first place.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-14 10:14     ` Peter Zijlstra
@ 2024-11-15 14:41       ` Mathieu Desnoyers
  2024-11-15 17:49         ` Prakash Sangappa
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-11-15 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Prakash Sangappa, linux-kernel, rostedt, tglx, daniel.m.jordan

On 2024-11-14 05:14, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>>
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> 
> Not that I can remember, but it's a long time ago :-)
> 
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch.
> 
> Too late how? It only loads it at the point we would've called
> schedule() -- no point in looking at it otherwise, right?

[...]

For the specific return-to-userspace path, I think where you've placed
the delay-resched flag check is fine.

I'm concerned about other code paths that invoke schedule() besides
return-to-userspace. For instance:

raw_irqentry_exit_cond_resched():

         if (!preempt_count()) {
[...]
                 if (need_resched())
                         preempt_schedule_irq();
         }

AFAIU, this could be triggered by an interrupt handler exit when nested
over a page fault handler, exception handler, or system call.

We may decide that we cannot care less about those scenarios, and just
ignore the delay-resched flag, but it's relevant to take those into
consideration and clearly document the rationale behind our decision.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-15 14:41       ` Mathieu Desnoyers
@ 2024-11-15 17:49         ` Prakash Sangappa
  0 siblings, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-15 17:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan



> On Nov 15, 2024, at 6:41 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-14 05:14, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>> 
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> 
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>>> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>> Not that I can remember, but it's a long time ago :-)
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch.
>> Too late how? It only loads it at the point we would've called
>> schedule() -- no point in looking at it otherwise, right?
> 
> [...]
> 
> For the specific return-to-userspace path, I think where you've placed
> the delay-resched flag check is fine.
> 
> I'm concerned about other code paths that invoke schedule() besides
> return-to-userspace. For instance:
> 
> raw_irqentry_exit_cond_resched():
> 
>        if (!preempt_count()) {
> [...]
>                if (need_resched())
>                        preempt_schedule_irq();
>        }
> 
> AFAIU, this could be triggered by an interrupt handler exit when nested
> over a page fault handler, exception handler, or system call.
> 
> We may decide that we cannot care less about those scenarios, and just
> ignore the delay-resched flag, but it's relevant to take those into
> consideration and clearly document the rationale behind our decision.

Don’t think the delay-resched will address all scenarios where preemption can 
occur when in critical section.  We could aim to address frequent paths where
a task can get preempted.  Initially the intent was to prevent preemption mainly
at the end of time slice, if the thread is in a critical section in the user space and 
has requested delaying reschedule.. 

Another path to consider is the wakeups occurring on a different cpu which could 
enqueue  a thread and attempt to preempt this thread when it is running in the
critical section. Should it check if the thread running has been granted extra time 
i.e the ‘taskshrd_sched_delay’ has been set for the running thread, avoid setting 
TIF_NEED_RESCHED in resched_curr() and sending IPI, ie like lazy preemption? 
If ’taskshrd_sched_delay’ has been set we know it will get preempted  due to the 
timer soon, or the task would sched_yield(if it is a well behaving application).

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] Scheduler time slice extension
  2024-11-13 18:50 ` Peter Zijlstra
  2024-11-13 19:36   ` Mathieu Desnoyers
@ 2024-11-13 19:50   ` Prakash Sangappa
  1 sibling, 0 replies; 26+ messages in thread
From: Prakash Sangappa @ 2024-11-13 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	tglx@linutronix.de, Daniel Jordan, Mathieu Desnoyers



> On Nov 13, 2024, at 10:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> 
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
> 
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Mainly this provides pinned memory, where kernel can access without the concern of taking a page fault.There could be other use cases in future, where updating user memory in kernel context cannot afford to take pagefaults.  For example implementing  use case like providing the thread state, needs to be done in the context switch path when the thread is going off cpu. 

-Prakash

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-02-04  3:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13  0:01 [RFC PATCH 0/4] Scheduler time slice extension Prakash Sangappa
2024-11-13  0:01 ` [RFC PATCH 1/4] Introduce per thread user-kernel shared structure Prakash Sangappa
2024-11-13  0:01 ` [RFC PATCH 2/4] Scheduler time extention Prakash Sangappa
2024-11-13  3:57   ` K Prateek Nayak
2024-11-13 17:40     ` Prakash Sangappa
2024-11-13  0:01 ` [RFC PATCH 3/4] Indicate if schedular preemption delay request is granted Prakash Sangappa
2024-11-13  0:01 ` [RFC PATCH 4/4] Add scheduler preemption delay granted stats Prakash Sangappa
2024-11-13  5:43 ` [RFC PATCH 0/4] Scheduler time slice extension K Prateek Nayak
2024-11-13 19:56   ` Prakash Sangappa
2024-11-13 18:50 ` Peter Zijlstra
2024-11-13 19:36   ` Mathieu Desnoyers
2024-11-13 20:10     ` Prakash Sangappa
2024-11-13 20:57       ` Mathieu Desnoyers
2024-11-13 23:24         ` Prakash Sangappa
2024-11-14 10:28       ` Peter Zijlstra
2024-11-14 19:42         ` Prakash Sangappa
2024-11-15 14:13           ` Mathieu Desnoyers
2024-11-15 17:20             ` Prakash Sangappa
2024-12-09 20:36           ` Prakash Sangappa
2024-12-09 21:17             ` Mathieu Desnoyers
2024-12-16 18:59               ` Prakash Sangappa
2025-02-04  3:04                 ` Prakash Sangappa
2024-11-14 10:14     ` Peter Zijlstra
2024-11-15 14:41       ` Mathieu Desnoyers
2024-11-15 17:49         ` Prakash Sangappa
2024-11-13 19:50   ` Prakash Sangappa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox