* [RFC PATCH 0/3] Provide fast access to thread-specific data
@ 2021-08-27 23:42 Prakash Sangappa
From: Prakash Sangappa @ 2021-08-27 23:42 UTC
  To: linux-kernel; +Cc: prakash.sangappa

Sending this as RFC, looking for feedback.

Some applications, such as databases, need to read thread-specific stats
from the kernel frequently in latency-sensitive code paths. The overhead
of reading these stats through system calls affects performance.
One use case is reading a thread's scheduler stats from the /proc
schedstat file (/proc/pid/schedstat) to collect the time the thread spent
executing on the cpu (sum_exec_runtime) and the time it was blocked
waiting on the runq (run_delay). These scheduler stats, read several
times per transaction in latency-sensitive code paths, are used to
measure the time taken by DB operations.

This patch series proposes a mechanism for the kernel to share thread
stats through a per-thread structure shared between user space and the
kernel. The per-thread shared structure is allocated on a page mapped
into both user space and the kernel, which provides a fast way to
communicate between the two. The kernel publishes stats in this
shared structure, and the application thread can read them in user space
without requiring system calls.

There can be other use cases for such a shared structure mechanism as
well.

Introduce 'off cpu' time:

The time a thread spends executing on a cpu (sum_exec_runtime),
currently available through the thread's schedstat file, can be shared
through the shared structure mentioned above. However, while a thread is
running on the cpu, this time is updated only periodically as part of
scheduler tick processing, which can take up to 1ms or more. If the
application has to measure cpu time consumed across some DB operations,
using 'sum_exec_runtime' will not be accurate. To address this, the
proposal is to introduce a thread's 'off cpu' time, which is measured at
context switch, similar to how time on the runq (i.e. run_delay in the
schedstat file) is measured, and should be more accurate. With that, the
application can determine cpu time consumed by taking the elapsed time
and subtracting the off cpu time. The off cpu time will be made available
through the shared structure along with the other schedstats from the
/proc/pid/schedstat file.

The elapsed time itself can be measured using clock_gettime, which is
vdso optimized and would be fast. The schedstats (runq time & off cpu
time) published in the shared structure will be accumulated time, same as
what is available through the schedstat file, all in units of
nanoseconds. The application would take the difference of the values from
before and after the operation for measurement.

Preliminary results from a simple cached-read database workload show a
performance benefit when the database uses the shared struct for reading
stats vs reading from /proc directly.

Implementation:

A new system call is added for a user thread to request use of the
shared structure. The kernel will allocate page(s), mapped shared with
user space, in which the per-thread shared structures will be allocated.
These structures are padded to 128 bytes. Each will contain struct
members or nested structures corresponding to supported stats, like the
thread's schedstats, published by the kernel for user space consumption.
More struct members can be added as new feature support is implemented.
Multiple such shared structures will be allocated from a page (up to 32
per 4k page), which avoids having to allocate one page per thread of a
process, although this will need optimizing for locality. Additional
pages will be allocated as needed to accommodate more threads requesting
use of shared structures. The aim is to not expose the layout of the
shared structure itself to the application, which will allow future
enhancements/changes without affecting the API.

The system call will return a pointer (user space mapped address) to the
per-thread shared structure members. The application would save this
per-thread pointer in a TLS variable and reference it.

The system call is of the form:
int task_getshared(int option, int flags, void __user *uaddr);

// Currently only TASK_SCHEDSTAT option is supported - returns pointer
// to struct task_schedstat. The struct task_schedstat is nested within
// the shared structure.

struct task_schedstat {
        volatile u64    sum_exec_runtime;
        volatile u64    run_delay;
        volatile u64    pcount;
        volatile u64    off_cpu;
};

Usage:

__thread struct task_schedstat *ts;
task_getshared(TASK_SCHEDSTAT, 0, &ts);

Subsequently the stats are accessed by the thread using the 'ts' pointer.

Prakash Sangappa (3):
  Introduce per thread user-kernel shared structure
  Publish tasks's scheduler stats thru the shared structure
  Introduce task's 'off cpu' time

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/mm_types.h               |   2 +
 include/linux/sched.h                  |   9 +
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  92 ++++++++++
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/task_shared.h       |  23 +++
 kernel/fork.c                          |   7 +
 kernel/sched/deadline.c                |   1 +
 kernel/sched/fair.c                    |   1 +
 kernel/sched/rt.c                      |   1 +
 kernel/sched/sched.h                   |   1 +
 kernel/sched/stats.h                   |  55 ++++--
 kernel/sched/stop_task.c               |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Makefile                            |   2 +-
 mm/task_shared.c                       | 314 +++++++++++++++++++++++++++++++++
 18 files changed, 501 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.7.4


* Re: [RFC PATCH 1/3] Introduce per thread user-kernel shared structure
@ 2021-08-28 10:18 kernel test robot
From: kernel test robot @ 2021-08-28 10:18 UTC
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 7004 bytes --]

CC: kbuild-all(a)lists.01.org
In-Reply-To: <1630107736-18269-2-git-send-email-prakash.sangappa@oracle.com>
References: <1630107736-18269-2-git-send-email-prakash.sangappa@oracle.com>
TO: Prakash Sangappa <prakash.sangappa@oracle.com>

Hi Prakash,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on linus/master]
[also build test WARNING on v5.14-rc7]
[cannot apply to tip/sched/core hnaz-linux-mm/master tip/x86/asm next-20210827]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Prakash-Sangappa/Provide-fast-access-to-thread-specific-data/20210828-073533
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8f9d0349841a2871624bb1e85309e03e9867c16e
:::::: branch date: 11 hours ago
:::::: commit date: 11 hours ago
config: x86_64-randconfig-s022-20210827 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-348-gf0e6938b-dirty
        # https://github.com/0day-ci/linux/commit/4afb2fb1653308287e0f2347dfff5c499acedee7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Prakash-Sangappa/Provide-fast-access-to-thread-specific-data/20210828-073533
        git checkout 4afb2fb1653308287e0f2347dfff5c499acedee7
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> mm/task_shared.c:264:1: sparse: sparse: unused label 'out'

vim +/out +264 mm/task_shared.c

4afb2fb1653308 Prakash Sangappa 2021-08-27  200  
4afb2fb1653308 Prakash Sangappa 2021-08-27  201  
4afb2fb1653308 Prakash Sangappa 2021-08-27  202  /*
4afb2fb1653308 Prakash Sangappa 2021-08-27  203   * Allocate task_ushared struct for calling thread.
4afb2fb1653308 Prakash Sangappa 2021-08-27  204   */
4afb2fb1653308 Prakash Sangappa 2021-08-27  205  static int task_ushared_alloc(void)
4afb2fb1653308 Prakash Sangappa 2021-08-27  206  {
4afb2fb1653308 Prakash Sangappa 2021-08-27  207  	struct mm_struct *mm = current->mm;
4afb2fb1653308 Prakash Sangappa 2021-08-27  208  	struct ushared_pg *ent = NULL;
4afb2fb1653308 Prakash Sangappa 2021-08-27  209  	struct task_ushrd_struct *ushrd;
4afb2fb1653308 Prakash Sangappa 2021-08-27  210  	struct ushared_pages *usharedpg;
4afb2fb1653308 Prakash Sangappa 2021-08-27  211  	int tryalloc = 0;
4afb2fb1653308 Prakash Sangappa 2021-08-27  212  	int slot = -1;
4afb2fb1653308 Prakash Sangappa 2021-08-27  213  	int ret = -ENOMEM;
4afb2fb1653308 Prakash Sangappa 2021-08-27  214  
4afb2fb1653308 Prakash Sangappa 2021-08-27  215  	if (mm->usharedpg == NULL && init_mm_ushared(mm))
4afb2fb1653308 Prakash Sangappa 2021-08-27  216  		return ret;
4afb2fb1653308 Prakash Sangappa 2021-08-27  217  
4afb2fb1653308 Prakash Sangappa 2021-08-27  218  	if (current->task_ushrd == NULL && init_task_ushrd(current))
4afb2fb1653308 Prakash Sangappa 2021-08-27  219  		return ret;
4afb2fb1653308 Prakash Sangappa 2021-08-27  220  
4afb2fb1653308 Prakash Sangappa 2021-08-27  221  	usharedpg = mm->usharedpg;
4afb2fb1653308 Prakash Sangappa 2021-08-27  222  	ushrd = current->task_ushrd;
4afb2fb1653308 Prakash Sangappa 2021-08-27  223  repeat:
4afb2fb1653308 Prakash Sangappa 2021-08-27  224  	if (mmap_write_lock_killable(mm))
4afb2fb1653308 Prakash Sangappa 2021-08-27  225  		return -EINTR;
4afb2fb1653308 Prakash Sangappa 2021-08-27  226  
4afb2fb1653308 Prakash Sangappa 2021-08-27  227  	ent = list_empty(&usharedpg->frlist) ? NULL :
4afb2fb1653308 Prakash Sangappa 2021-08-27  228  		list_entry(usharedpg->frlist.next,
4afb2fb1653308 Prakash Sangappa 2021-08-27  229  		struct ushared_pg, fr_list);
4afb2fb1653308 Prakash Sangappa 2021-08-27  230  
4afb2fb1653308 Prakash Sangappa 2021-08-27  231  	if (ent == NULL || ent->slot_count == 0) {
4afb2fb1653308 Prakash Sangappa 2021-08-27  232  		if (tryalloc == 0) {
4afb2fb1653308 Prakash Sangappa 2021-08-27  233  			mmap_write_unlock(mm);
4afb2fb1653308 Prakash Sangappa 2021-08-27  234  			(void)ushared_allocpg();
4afb2fb1653308 Prakash Sangappa 2021-08-27  235  			tryalloc = 1;
4afb2fb1653308 Prakash Sangappa 2021-08-27  236  			goto repeat;
4afb2fb1653308 Prakash Sangappa 2021-08-27  237  		} else {
4afb2fb1653308 Prakash Sangappa 2021-08-27  238  			ent = NULL;
4afb2fb1653308 Prakash Sangappa 2021-08-27  239  		}
4afb2fb1653308 Prakash Sangappa 2021-08-27  240  	}
4afb2fb1653308 Prakash Sangappa 2021-08-27  241  
4afb2fb1653308 Prakash Sangappa 2021-08-27  242  	if (ent) {
4afb2fb1653308 Prakash Sangappa 2021-08-27  243  		slot = find_first_zero_bit((unsigned long *)(&ent->bitmap),
4afb2fb1653308 Prakash Sangappa 2021-08-27  244  		  TASK_USHARED_SLOTS);
4afb2fb1653308 Prakash Sangappa 2021-08-27  245  		BUG_ON(slot >=  TASK_USHARED_SLOTS);
4afb2fb1653308 Prakash Sangappa 2021-08-27  246  
4afb2fb1653308 Prakash Sangappa 2021-08-27  247  		set_bit(slot, (unsigned long *)(&ent->bitmap));
4afb2fb1653308 Prakash Sangappa 2021-08-27  248  
4afb2fb1653308 Prakash Sangappa 2021-08-27  249  		ushrd->uaddr = (struct task_ushared *)(ent->vaddr +
4afb2fb1653308 Prakash Sangappa 2021-08-27  250  		  (slot * sizeof(union task_shared)));
4afb2fb1653308 Prakash Sangappa 2021-08-27  251  		ushrd->kaddr = (struct task_ushared *)(ent->kaddr +
4afb2fb1653308 Prakash Sangappa 2021-08-27  252  		  (slot * sizeof(union task_shared)));
4afb2fb1653308 Prakash Sangappa 2021-08-27  253  		ushrd->upg = ent;
4afb2fb1653308 Prakash Sangappa 2021-08-27  254  		ent->slot_count--;
4afb2fb1653308 Prakash Sangappa 2021-08-27  255  		/* move it to tail */
4afb2fb1653308 Prakash Sangappa 2021-08-27  256  		if (ent->slot_count == 0) {
4afb2fb1653308 Prakash Sangappa 2021-08-27  257  			list_del(&ent->fr_list);
4afb2fb1653308 Prakash Sangappa 2021-08-27  258  			list_add_tail(&ent->fr_list, &usharedpg->frlist);
4afb2fb1653308 Prakash Sangappa 2021-08-27  259  		}
4afb2fb1653308 Prakash Sangappa 2021-08-27  260  
4afb2fb1653308 Prakash Sangappa 2021-08-27  261  	       ret = 0;
4afb2fb1653308 Prakash Sangappa 2021-08-27  262  	}
4afb2fb1653308 Prakash Sangappa 2021-08-27  263  
4afb2fb1653308 Prakash Sangappa 2021-08-27 @264  out:
4afb2fb1653308 Prakash Sangappa 2021-08-27  265  	mmap_write_unlock(mm);
4afb2fb1653308 Prakash Sangappa 2021-08-27  266  	return ret;
4afb2fb1653308 Prakash Sangappa 2021-08-27  267  }
4afb2fb1653308 Prakash Sangappa 2021-08-27  268  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 33997 bytes --]


Thread overview: 8+ messages
2021-08-27 23:42 [RFC PATCH 0/3] Provide fast access to thread-specific data Prakash Sangappa
2021-08-27 23:42 ` [RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
2021-08-28  7:36   ` kernel test robot
2021-08-28  8:21   ` kernel test robot
2021-08-27 23:42 ` [RFC PATCH 2/3] Publish tasks's scheduler stats thru the " Prakash Sangappa
2021-08-28  6:35   ` kernel test robot
2021-08-27 23:42 ` [RFC PATCH 3/3] Introduce task's 'off cpu' time Prakash Sangappa
  -- strict thread matches above, loose matches on Subject: below --
2021-08-28 10:18 [RFC PATCH 1/3] Introduce per thread user-kernel shared structure kernel test robot
