[RFC v2 00/35] optimize cost of inter-process communication

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC v2 00/35] optimize cost of inter-process communication
@ 2025-05-30  9:27 Bo Li
  2025-05-30  9:27 ` [RFC v2 01/35] Kbuild: rpal support Bo Li
                   ` (40 more replies)
  0 siblings, 41 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Changelog:

v2:
- Port the RPAL functions to the latest v6.15 kernel.
- Add a supplementary introduction to the application scenarios and
  security considerations of RPAL.

link to v1:
https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/

--------------------------------------------------------------------------

# Introduction

We mainly apply RPAL to the service mesh architecture widely adopted in
modern cloud-native data centers. Before the rise of the service mesh
architecture, network functions were usually integrated into monolithic
applications as libraries, and the main business programs invoked them
through function calls. However, to facilitate the independent development
and operation and maintenance of the main business programs and network
functions, the service mesh removed the network functions from the main
business programs and made them independent processes (called sidecars).
Inter-process communication (IPC) is used for interaction between the main
business program and the sidecar, and the introduced inter-process
communication has led to a sharp increase in resource consumption in
cloud-native data centers, and may even occupy more than 10% of the CPU of
the entire microservice cluster.

To achieve the efficient function call mechanism of the monolithic
architecture under the service mesh architecture, we introduced the RPAL
(Running Process As Library) architecture, which implements the sharing of
the virtual address space of processes and the switching threads in user
mode. Through the analysis of the service mesh architecture, we found that
the process memory isolation between the main business program and the
sidecar is not particularly important because they are split from one
application and were an integral part of the original monolithic
application. It is more important for the two processes to be independent
of each other because they need to be independently developed and
maintained to ensure the architectural advantages of the service mesh.
Therefore, RPAL breaks the isolation between processes while preserving the
independence between them.  We think that RPAL can also be applied to other
scenarios featuring sidecar-like architectures, such as distributed file
storage systems in LLM infra.

In RPAL architecture, multiple processes share a virtual address space, so
this architecture can be regarded as an advanced version of the Linux
shared memory mechanism:

1. Traditional shared memory requires two processes to negotiate to ensure
the mapping of the same piece of memory. In RPAL architecture, two RPAL
processes still need to reach a consensus before they can successfully
invoke the relevant system calls of RPAL to share the virtual address
space.
2. Traditional shared memory only shares part of the data. However, in RPAL
architecture, processes that have established an RPAL communication
relationship share a virtual address space, and all user memory (such as
data segments and code segments) of each RPAL process is shared among these
processes. However, a process cannot access the memory of other processes
at any time. We use the MPK mechanism to ensure that the memory of other
processes can only be accessed when special RPAL functions are called.
Otherwise, a page fault will be triggered.
3. In RPAL architecture, to ensure the consistency of the execution context
of the shared code (such as the stack and thread local storage), we further
implement the thread context switching in user mode based on the ability to
share the virtual address space of different processes, enabling the
threads of different processes to directly perform fast switching in user
mode without falling into kernel mode for slow switching.

# Background

In traditional inter-process communication (IPC) scenarios, Unix domain
sockets are commonly used in conjunction with the epoll() family for event
multiplexing. IPC operations involve system calls on both the data and
control planes, thereby imposing a non-trivial overhead on the interacting
processes. Even when shared memory is employed to optimize the data plane,
two data copies still remain. Specifically, data is initially copied from
a process's private memory space into the shared memory area, and then it
is copied from the shared memory into the private memory of another
process.

This poses a question: Is it possible to reduce the overhead of IPC with
only minimal modifications at the application level? To address this, we
observed that the functionality of IPC, which encompasses data transfer
and invocation of the target thread, is similar to a function call, where
arguments are passed and the callee function is invoked to process them.
Inspired by this analogy, we introduce RPAL (Run Process As Library), a
framework designed to enable one process to invoke another as if making
a local function call, all without going through the kernel.

# Design

First, let’s formalize RPAL’s core objectives:

1. Data-plane efficiency: Reduce the number of data copies from two (in the
   shared memory solution) to one.
2. Control-plane optimization: Eliminate the overhead of system calls and
   kernel's thread switches.
3. Application compatibility: Minimize the modifications to existing
   applications that utilize Unix domain sockets and the epoll() family.

To attain the first objective, processes that use RPAL share the same
virtual address space. So one process can access another's data directly
via a data pointer. This means data can be transferred from one process to
another with just one copy operation. 

To meet the second goal, RPAL relies on the shared address space to do
lightweight context switching in user space, which we call an "RPAL call".
This allows one process to execute another process's code just like a
local function call.

To achieve the third target, RPAL stays compatible with the epoll family
of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
application uses epoll for IPC, developers can switch to RPAL with just a
few small changes. For instance, you can just replace epoll_wait() with
rpal_epoll_wait(). The basic epoll procedure, where a process waits for
another to write to a monitored descriptor using an epoll file descriptor,
still works fine with RPAL.

## Address space sharing

For address space sharing, RPAL partitions the entire userspace virtual
address space and allocates non-overlapping memory ranges to each process.
On x86_64 architectures, RPAL uses a memory range size covered by a
single PUD (Page Upper Directory) entry, which is 512GB. This restricts
each process’s virtual address space to 512GB on x86_64, sufficient for
most applications in our scenario. The rationale is straightforward: 
address space sharing can be simply achieved by copying the PUD from one
process’s page table to another’s. So one process can directly use the
data pointer to access another's memory.

 |------------| <- 0
 |------------| <- 512 GB
 |  Process A |
 |------------| <- 2*512 GB
 |------------| <- n*512 GB
 |  Process B |
 |------------| <- (n+1)*512 GB
 |------------| <- STACK_TOP
 |  Kernel    |
 |------------|

## RPAL call

We refer to the lightweight userspace context switching mechanism as RPAL
call. It enables the caller (or sender) thread of one process to directly
switch to the callee (or receiver) thread of another process. 

When Process A’s caller thread initiates an RPAL call to Process B’s
callee thread, the CPU saves the caller’s context and loads the callee’s
context. This enables direct userspace control flow transfer from the
caller to the callee. After the callee finishes data processing, the CPU
saves Process B’s callee context and switches back to Process A’s caller
context, completing a full IPC cycle.

 |------------|                |---------------------|  
 |  Process A |                |  Process B          |
 | |-------|  |                | |-------|           |     
 | | caller| --- RPAL call --> | | callee|    handle |
 | | thread| <------------------ | thread| -> event  |
 | |-------|  |                | |-------|           |
 |------------|                |---------------------|

# Security and compatibility with kernel subsystems

## Memory protection between processes

Since processes using RPAL share the address space, unintended
cross-process memory access may occur and corrupt the data of another
process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
architectures.

MPK assigns 4 bits in each page table entry to a "protection key", which
is paired with a userspace register (PKRU). The PKRU register defines
access permissions for memory regions protected by specific keys (for
detailed implementation, refer to the kernel documentation "Memory
Protection Keys"). With MPK, even though the address space is shared
among processes, cross-process access is restricted: a process can only
access the memory protected by a key if its PKRU register is configured
with the corresponding permission. This ensures that processes cannot
access each other’s memory unless an explicit PKRU configuration is set.

## Page fault handling and TLB flushing

Due to the shared address space architecture, both page fault handling and
TLB flushing require careful consideration. For instance, when Process A
accesses Process B’s memory, a page fault may occur in Process A's
context, but the faulting address belongs to Process B. In this case, we
must pass Process B's mm_struct to the page fault handler.

TLB flushing is more complex. When a thread flushes the TLB, since the
address space is shared, not only other threads in the current process but
also other processes that share the address space may access the
corresponding memory (related to the TLB flush). Therefore, the cpuset used
for TLB flushing should be the union of the mm_cpumasks of all processes
that share the address space.

## Lazy switch of kernel context

In RPAL, a mismatch may arise between the user context and the kernel
context. The RPAL call is designed solely to switch the user context,
leaving the kernel context unchanged. For instance, when a RPAL call takes
place, transitioning from caller thread to callee thread, and subsequently
a system call is initiated within callee thread, the kernel will
incorrectly utilize the caller's kernel context (such as the kernel stack)
to process the system call.

To resolve context mismatch issues, a kernel context switch is triggered at
the kernel entry point when the callee initiates a syscall or an
exception/interrupt occurs. This mechanism ensures context consistency
before processing system calls, interrupts, or exceptions. We refer to this
kernel context switch as a "lazy switch" because it defers the switching
operation from the traditional thread switch point to the next kernel entry
point.

Lazy switch should be minimized as much as possible, as it significantly
degrades performance. We currently utilize RPAL in an RPC framework, in
which the RPC sender thread relies on the RPAL call to invoke the RPC
receiver thread entirely in user space. In most cases, the receiver
thread is free of system calls and the code execution time is relatively
short. This characteristic effectively reduces the probability of a lazy
switch occurring.

## Time slice correction

After an RPAL call, the callee's user mode code executes. However, the
kernel incorrectly attributes this CPU time to the caller due to the
unchanged kernel context.

To resolve this, we use the Time Stamp Counter (TSC) register to measure
CPU time consumed by the callee thread in user space. The kernel then uses
this user-reported timing data to adjust the CPU accounting for both the
caller and callee thread, similar to how CPU steal time is implemented.

## Process recovery

Since processes can access each other’s memory, there is a risk that the
target process’s memory may become invalid at the access time (e.g., if
the target process has exited unexpectedly). The kernel must handle such
cases; otherwise, the accessing process could be terminated due to
failures originating from another process.

To address this issue, each thread of the process should pre-establish a
recovery point when accessing the memory of other processes. When such an
invalid access occurs, the thread traps into the kernel. Inside the page
fault handler, the kernel restores the user context of the thread to the
recovery point. This mechanism ensures that processes maintain mutual
independence, preventing cascading failures caused by cross-process memory
issues.

# Performance

To quantify the performance improvements driven by RPAL, we measured
latency both before and after its deployment. Experiments were conducted on
a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
and 1 TB of memory. Latency was defined as the duration from when the
client thread initiates a message to when the server thread is invoked and
receives it.

During testing, the client transmitted 1 million 32-byte messages, and we
computed the per-message average latency. The results are as follows:

*****************
Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
 Message count: 1000000, Average latency: 19616 cycles
With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
 Message count: 1000000, Average latency: 1703 cycles
*****************

These results confirm that RPAL delivers substantial latency improvements
over the current epoll implementation—achieving a 17,913-cycle reduction
(an ~91.3% improvement) for 32-byte messages.

We have applied RPAL to an RPC framework that is widely used in our data
center. With RPAL, we have successfully achieved up to 15.5% reduction in
the CPU utilization of processes in real-world microservice scenario. The
gains primarily stem from minimizing control plane overhead through the
utilization of userspace context switches. Additionally, by leveraging
address space sharing, the number of memory copies is significantly
reduced.

# Future Work

Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
which is supported by a range of Intel CPUs. For AMD architectures, MPK is
supported only on the latest processor, specifically, 3th Generation AMD
EPYC™ Processors and subsequent generations. Patch sets that extend RPAL
support to systems lacking MPK hardware will be provided later.

Accompanying test programs are also provided in the samples/rpal/
directory. And the user-mode RPAL library, which realizes user-space RPAL
call, is in the samples/rpal/librpal directory.

We hope to get some community discussions and feedback on RPAL's
optimization approaches and architecture.

Look forward to your comments.

Bo Li (35):
  Kbuild: rpal support
  RPAL: add struct rpal_service
  RPAL: add service registration interface
  RPAL: add member to task_struct and mm_struct
  RPAL: enable virtual address space partitions
  RPAL: add user interface
  RPAL: enable shared page mmap
  RPAL: enable sender/receiver registration
  RPAL: enable address space sharing
  RPAL: allow service enable/disable
  RPAL: add service request/release
  RPAL: enable service disable notification
  RPAL: add tlb flushing support
  RPAL: enable page fault handling
  RPAL: add sender/receiver state
  RPAL: add cpu lock interface
  RPAL: add a mapping between fsbase and tasks
  sched: pick a specified task
  RPAL: add lazy switch main logic
  RPAL: add rpal_ret_from_lazy_switch
  RPAL: add kernel entry handling for lazy switch
  RPAL: rebuild receiver state
  RPAL: resume cpumask when fork
  RPAL: critical section optimization
  RPAL: add MPK initialization and interface
  RPAL: enable MPK support
  RPAL: add epoll support
  RPAL: add rpal_uds_fdmap() support
  RPAL: fix race condition in pkru update
  RPAL: fix pkru setup when fork
  RPAL: add receiver waker
  RPAL: fix unknown nmi on AMD CPU
  RPAL: enable time slice correction
  RPAL: enable fast epoll wait
  samples/rpal: add RPAL samples

 arch/x86/Kbuild                               |    2 +
 arch/x86/Kconfig                              |    2 +
 arch/x86/entry/entry_64.S                     |  160 ++
 arch/x86/events/amd/core.c                    |   14 +
 arch/x86/include/asm/pgtable.h                |   25 +
 arch/x86/include/asm/pgtable_types.h          |   11 +
 arch/x86/include/asm/tlbflush.h               |   10 +
 arch/x86/kernel/asm-offsets.c                 |    3 +
 arch/x86/kernel/cpu/common.c                  |    8 +-
 arch/x86/kernel/fpu/core.c                    |    8 +-
 arch/x86/kernel/nmi.c                         |   20 +
 arch/x86/kernel/process.c                     |   25 +-
 arch/x86/kernel/process_64.c                  |  118 +
 arch/x86/mm/fault.c                           |  271 ++
 arch/x86/mm/mmap.c                            |   10 +
 arch/x86/mm/tlb.c                             |  172 ++
 arch/x86/rpal/Kconfig                         |   21 +
 arch/x86/rpal/Makefile                        |    6 +
 arch/x86/rpal/core.c                          |  477 ++++
 arch/x86/rpal/internal.h                      |   69 +
 arch/x86/rpal/mm.c                            |  426 +++
 arch/x86/rpal/pku.c                           |  196 ++
 arch/x86/rpal/proc.c                          |  279 ++
 arch/x86/rpal/service.c                       |  776 ++++++
 arch/x86/rpal/thread.c                        |  313 +++
 fs/binfmt_elf.c                               |   98 +-
 fs/eventpoll.c                                |  320 +++
 fs/exec.c                                     |   11 +
 include/linux/mm_types.h                      |    3 +
 include/linux/rpal.h                          |  633 +++++
 include/linux/sched.h                         |   21 +
 init/init_task.c                              |    6 +
 kernel/exit.c                                 |    5 +
 kernel/fork.c                                 |   32 +
 kernel/sched/core.c                           |  676 +++++
 kernel/sched/fair.c                           |  109 +
 kernel/sched/sched.h                          |    8 +
 mm/mmap.c                                     |   16 +
 mm/mprotect.c                                 |  106 +
 mm/rmap.c                                     |    4 +
 mm/vma.c                                      |   18 +
 samples/rpal/Makefile                         |   17 +
 samples/rpal/asm_define.c                     |   14 +
 samples/rpal/client.c                         |  178 ++
 samples/rpal/librpal/asm_define.h             |    6 +
 samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
 samples/rpal/librpal/debug.h                  |   12 +
 samples/rpal/librpal/fiber.c                  |  119 +
 samples/rpal/librpal/fiber.h                  |   64 +
 .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
 .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
 .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
 samples/rpal/librpal/private.h                |  341 +++
 samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
 samples/rpal/librpal/rpal.h                   |  149 ++
 samples/rpal/librpal/rpal_pkru.h              |   78 +
 samples/rpal/librpal/rpal_queue.c             |  239 ++
 samples/rpal/librpal/rpal_queue.h             |   55 +
 samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
 samples/rpal/offset.sh                        |    5 +
 samples/rpal/server.c                         |  249 ++
 61 files changed, 9710 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/rpal/Kconfig
 create mode 100644 arch/x86/rpal/Makefile
 create mode 100644 arch/x86/rpal/core.c
 create mode 100644 arch/x86/rpal/internal.h
 create mode 100644 arch/x86/rpal/mm.c
 create mode 100644 arch/x86/rpal/pku.c
 create mode 100644 arch/x86/rpal/proc.c
 create mode 100644 arch/x86/rpal/service.c
 create mode 100644 arch/x86/rpal/thread.c
 create mode 100644 include/linux/rpal.h
 create mode 100644 samples/rpal/Makefile
 create mode 100644 samples/rpal/asm_define.c
 create mode 100644 samples/rpal/client.c
 create mode 100644 samples/rpal/librpal/asm_define.h
 create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
 create mode 100644 samples/rpal/librpal/debug.h
 create mode 100644 samples/rpal/librpal/fiber.c
 create mode 100644 samples/rpal/librpal/fiber.h
 create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/private.h
 create mode 100644 samples/rpal/librpal/rpal.c
 create mode 100644 samples/rpal/librpal/rpal.h
 create mode 100644 samples/rpal/librpal/rpal_pkru.h
 create mode 100644 samples/rpal/librpal/rpal_queue.c
 create mode 100644 samples/rpal/librpal/rpal_queue.h
 create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
 create mode 100755 samples/rpal/offset.sh
 create mode 100644 samples/rpal/server.c

-- 
2.20.1

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v2 01/35] Kbuild: rpal support
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 02/35] RPAL: add struct rpal_service Bo Li
                   ` (39 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Add kbuild support for RPAL, including new folder arch/x86/kernel/rpal and
new config CONFIG_RPAL.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/Kbuild        |  2 ++
 arch/x86/Kconfig       |  2 ++
 arch/x86/rpal/Kconfig  | 11 +++++++++++
 arch/x86/rpal/Makefile |  0
 4 files changed, 15 insertions(+)
 create mode 100644 arch/x86/rpal/Kconfig
 create mode 100644 arch/x86/rpal/Makefile

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index f7fb3d88c57b..26c406442d79 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -34,5 +34,7 @@ obj-$(CONFIG_KEXEC_FILE) += purgatory/
 
 obj-y += virt/
 
+obj-y += rpal/
+
 # for cleaning
 subdir- += boot tools
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 121f9f03bd5c..3f53b6fc943f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2359,6 +2359,8 @@ config X86_BUS_LOCK_DETECT
 	  Enable Split Lock Detect and Bus Lock Detect functionalities.
 	  See <file:Documentation/arch/x86/buslock.rst> for more information.
 
+source "arch/x86/rpal/Kconfig"
+
 endmenu
 
 config CC_HAS_NAMED_AS
diff --git a/arch/x86/rpal/Kconfig b/arch/x86/rpal/Kconfig
new file mode 100644
index 000000000000..e5e6996553ea
--- /dev/null
+++ b/arch/x86/rpal/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Kconfig describes RPAL options
+#
+
+config RPAL
+	def_bool y
+	depends on X86_64
+	help
+		This option enables system support for Run Process As
+		library (RPAL).
\ No newline at end of file
diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
new file mode 100644
index 000000000000..e69de29bb2d1
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 02/35] RPAL: add struct rpal_service
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
  2025-05-30  9:27 ` [RFC v2 01/35] Kbuild: rpal support Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 03/35] RPAL: add service registration interface Bo Li
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Each process that uses RPAL features is called an RPAL service.

This patch adds the RPAL header file rpal.h and defines the rpal_service
structure. The struct rpal_service uses a dedicated kmem_cache for
allocation and deallocation, and atomic variables to maintain references
to the struct rpal_service. Additionally, the patch introduces the
rpal_get_service() and rpal_put_service() interfaces to manage reference
counts.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/Makefile   |  5 ++++
 arch/x86/rpal/core.c     | 32 +++++++++++++++++++++++
 arch/x86/rpal/internal.h | 13 ++++++++++
 arch/x86/rpal/service.c  | 56 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h     | 43 ++++++++++++++++++++++++++++++
 5 files changed, 149 insertions(+)
 create mode 100644 arch/x86/rpal/core.c
 create mode 100644 arch/x86/rpal/internal.h
 create mode 100644 arch/x86/rpal/service.c
 create mode 100644 include/linux/rpal.h

diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index e69de29bb2d1..ee3698b5a9b3 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_RPAL)		+= rpal.o
+
+rpal-y := service.o core.o
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
new file mode 100644
index 000000000000..495dbc1b1536
--- /dev/null
+++ b/arch/x86/rpal/core.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+
+#include "internal.h"
+
+int __init rpal_init(void);
+
+bool rpal_inited;
+
+int __init rpal_init(void)
+{
+	int ret = 0;
+
+	ret = rpal_service_init();
+	if (ret)
+		goto fail;
+
+	rpal_inited = true;
+	return 0;
+
+fail:
+	rpal_err("rpal init fail\n");
+	return -1;
+}
+subsys_initcall(rpal_init);
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
new file mode 100644
index 000000000000..e44e6fc79677
--- /dev/null
+++ b/arch/x86/rpal/internal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+extern bool rpal_inited;
+
+/* service.c */
+int __init rpal_service_init(void);
+void __init rpal_service_exit(void);
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
new file mode 100644
index 000000000000..c8e609798d4f
--- /dev/null
+++ b/arch/x86/rpal/service.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+
+#include "internal.h"
+
+static struct kmem_cache *service_cache;
+
+static void __rpal_put_service(struct rpal_service *rs)
+{
+	kmem_cache_free(service_cache, rs);
+}
+
+struct rpal_service *rpal_get_service(struct rpal_service *rs)
+{
+	if (!rs)
+		return NULL;
+	atomic_inc(&rs->refcnt);
+	return rs;
+}
+
+void rpal_put_service(struct rpal_service *rs)
+{
+	if (!rs)
+		return;
+
+	if (atomic_dec_and_test(&rs->refcnt))
+		__rpal_put_service(rs);
+}
+
+int __init rpal_service_init(void)
+{
+	service_cache = kmem_cache_create("rpal_service_cache",
+					  sizeof(struct rpal_service), 0,
+					  SLAB_PANIC, NULL);
+	if (!service_cache) {
+		rpal_err("service init fail\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+void __init rpal_service_exit(void)
+{
+	kmem_cache_destroy(service_cache);
+}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
new file mode 100644
index 000000000000..73468884cc5d
--- /dev/null
+++ b/include/linux/rpal.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#ifndef _LINUX_RPAL_H
+#define _LINUX_RPAL_H
+
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define RPAL_ERROR_MSG "rpal error: "
+#define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
+#define rpal_err_ratelimited(x...) pr_err_ratelimited(RPAL_ERROR_MSG x)
+
+struct rpal_service {
+	/* reference count of this struct */
+	atomic_t refcnt;
+};
+
+/**
+ * @brief get new reference to a rpal service, a corresponding
+ *  rpal_put_service() should be called later by the caller.
+ *
+ * @param rs The struct rpal_service to get.
+ *
+ * @return new reference of struct rpal_service.
+ */
+struct rpal_service *rpal_get_service(struct rpal_service *rs);
+
+/**
+ * @brief put a reference to a rpal service. If the reference count of
+ *  the service turns to be 0, then release its struct rpal_service.
+ *  rpal_put_service() may be used in an atomic context.
+ *
+ * @param rs The struct rpal_service to put.
+ */
+void rpal_put_service(struct rpal_service *rs);
+#endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 03/35] RPAL: add service registration interface
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
  2025-05-30  9:27 ` [RFC v2 01/35] Kbuild: rpal support Bo Li
  2025-05-30  9:27 ` [RFC v2 02/35] RPAL: add struct rpal_service Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 04/35] RPAL: add member to task_struct and mm_struct Bo Li
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Every rpal service should be registered and managed. Each RPAL service has
a 64-bit key as its unique identifier, the key should never repeat before
kernel reboot. Each RPAL service has an ID to indicate which 512GB virtual
address space it can use. Any alive RPAL service has its unique ID, which
will never be reused until the service dead.

This patch adds a registration interface for RPAL services. Newly
registered rpal_service instances are assigned a key that increments
starting from 1. The 64-bit length of the key ensures that keys are almost
impossible to exhaust before system reboot. Meanwhile, a bitmap is used to
allocate IDs, ensuring no duplicate IDs are assigned. RPAL services are
managed via a hash list, which facilitates quick lookup of the
corresponding rpal_service by key.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/service.c | 130 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h    |  31 ++++++++++
 2 files changed, 161 insertions(+)

diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index c8e609798d4f..609c9550540d 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -13,13 +13,56 @@
 
 #include "internal.h"
 
+static DECLARE_BITMAP(rpal_id_bitmap, RPAL_NR_ID);
+static atomic64_t service_key_counter;
+static DEFINE_HASHTABLE(service_hash_table, ilog2(RPAL_NR_ID));
+DEFINE_SPINLOCK(hash_table_lock);
 static struct kmem_cache *service_cache;
 
+static inline void rpal_free_service_id(int id)
+{
+	clear_bit(id, rpal_id_bitmap);
+}
+
 static void __rpal_put_service(struct rpal_service *rs)
 {
 	kmem_cache_free(service_cache, rs);
 }
 
+static int rpal_alloc_service_id(void)
+{
+	int id;
+
+	do {
+		id = find_first_zero_bit(rpal_id_bitmap, RPAL_NR_ID);
+		if (id == RPAL_NR_ID) {
+			id = RPAL_INVALID_ID;
+			break;
+		}
+	} while (test_and_set_bit(id, rpal_id_bitmap));
+
+	return id;
+}
+
+static bool is_valid_id(int id)
+{
+	return id >= 0 && id < RPAL_NR_ID;
+}
+
+static u64 rpal_alloc_service_key(void)
+{
+	u64 key;
+
+	/* confirm we do not run out keys */
+	if (unlikely(atomic64_read(&service_key_counter) == _AC(-1, UL))) {
+		rpal_err("key is exhausted\n");
+		return RPAL_INVALID_KEY;
+	}
+
+	key = atomic64_fetch_inc(&service_key_counter);
+	return key;
+}
+
 struct rpal_service *rpal_get_service(struct rpal_service *rs)
 {
 	if (!rs)
@@ -37,6 +80,90 @@ void rpal_put_service(struct rpal_service *rs)
 		__rpal_put_service(rs);
 }
 
+static u32 get_hash_key(u64 key)
+{
+	return key % RPAL_NR_ID;
+}
+
+struct rpal_service *rpal_get_service_by_key(u64 key)
+{
+	struct rpal_service *rs, *rsp;
+	u32 hash_key = get_hash_key(key);
+
+	rs = NULL;
+	hash_for_each_possible(service_hash_table, rsp, hlist, hash_key) {
+		if (rsp->key == key) {
+			rs = rsp;
+			break;
+		}
+	}
+	return rpal_get_service(rs);
+}
+
+static void insert_service(struct rpal_service *rs)
+{
+	unsigned long flags;
+	int hash_key;
+
+	hash_key = get_hash_key(rs->key);
+
+	spin_lock_irqsave(&hash_table_lock, flags);
+	hash_add(service_hash_table, &rs->hlist, hash_key);
+	spin_unlock_irqrestore(&hash_table_lock, flags);
+}
+
+static void delete_service(struct rpal_service *rs)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&hash_table_lock, flags);
+	hash_del(&rs->hlist);
+	spin_unlock_irqrestore(&hash_table_lock, flags);
+}
+
+struct rpal_service *rpal_register_service(void)
+{
+	struct rpal_service *rs;
+
+	if (!rpal_inited)
+		return NULL;
+
+	rs = kmem_cache_zalloc(service_cache, GFP_KERNEL);
+	if (!rs)
+		goto alloc_fail;
+
+	rs->id = rpal_alloc_service_id();
+	if (!is_valid_id(rs->id))
+		goto id_fail;
+
+	rs->key = rpal_alloc_service_key();
+	if (unlikely(rs->key == RPAL_INVALID_KEY))
+		goto key_fail;
+
+	atomic_set(&rs->refcnt, 1);
+
+	insert_service(rs);
+
+	return rs;
+
+key_fail:
+	kmem_cache_free(service_cache, rs);
+id_fail:
+	rpal_free_service_id(rs->id);
+alloc_fail:
+	return NULL;
+}
+
+void rpal_unregister_service(struct rpal_service *rs)
+{
+	if (!rs)
+		return;
+
+	delete_service(rs);
+
+	rpal_put_service(rs);
+}
+
 int __init rpal_service_init(void)
 {
 	service_cache = kmem_cache_create("rpal_service_cache",
@@ -47,6 +174,9 @@ int __init rpal_service_init(void)
 		return -1;
 	}
 
+	bitmap_zero(rpal_id_bitmap, RPAL_NR_ID);
+	atomic64_set(&service_key_counter, RPAL_FIRST_KEY);
+
 	return 0;
 }
 
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 73468884cc5d..75c5acf33844 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -11,13 +11,40 @@
 
 #include <linux/sched.h>
 #include <linux/types.h>
+#include <linux/hashtable.h>
 #include <linux/atomic.h>
 
 #define RPAL_ERROR_MSG "rpal error: "
 #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
 #define rpal_err_ratelimited(x...) pr_err_ratelimited(RPAL_ERROR_MSG x)
 
+/*
+ * The first 512GB is reserved due to mmap_min_addr.
+ * The last 512GB is dropped since stack will be initially
+ * allocated at TASK_SIZE_MAX.
+ */
+#define RPAL_NR_ID 254
+#define RPAL_INVALID_ID -1
+#define RPAL_FIRST_KEY _AC(1, UL)
+#define RPAL_INVALID_KEY _AC(0, UL)
+
+/*
+ * Each RPAL service has a 64-bit key as its unique identifier, and
+ * the 64-bit length ensures that the key will never repeat before
+ * the kernel reboot.
+ *
+ * Each RPAL service has an ID to indicate which 512GB virtual address
+ * space it can use. All alive RPAL processes have unique IDs, ensuring
+ * their address spaces do not overlap. When a process exits, its ID
+ * is released, allowing newly started RPAL services to reuse the ID.
+ */
 struct rpal_service {
+	/* Unique identifier for RPAL service */
+	u64 key;
+	/* virtual address space id */
+	int id;
+	/* Hashtable list for this struct */
+	struct hlist_node hlist;
 	/* reference count of this struct */
 	atomic_t refcnt;
 };
@@ -40,4 +67,8 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
  * @param rs The struct rpal_service to put.
  */
 void rpal_put_service(struct rpal_service *rs);
+
+void rpal_unregister_service(struct rpal_service *rs);
+struct rpal_service *rpal_register_service(void);
+struct rpal_service *rpal_get_service_by_key(u64 key);
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 04/35] RPAL: add member to task_struct and mm_struct
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (2 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 03/35] RPAL: add service registration interface Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 05/35] RPAL: enable virtual address space partitions Bo Li
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

In lazy switch and memory-related operations, there is a need to quickly
locate the corresponding rpal_service structure. Therefore, rpal_service
members are added to these two data structures.

This patch adds an rpal_service member to both task_struct and mm_struct,
and introduces initialization operations. Meanwhile, rpal_service is also
augmented with references to the task_struct and mm_struct of the
group_leader. For threads created via fork, the kernel acquires a reference
to rpal_service and assigns it to the new task_struct. References to
rpal_service are released when threads exit.

Regarding the deallocation of rpal_struct, since rpal_put_service may be
called in an atomic context (where mmdrop() cannot be invoked), this patch
uses delayed work for deallocation. The work delay is set to 30 seconds,
which ensures that IDs are not recycled and reused in the short term,
preventing other processes from confusing the reallocated ID with the
previous one due to race conditions.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/service.c  | 77 +++++++++++++++++++++++++++++++++++++---
 fs/exec.c                | 11 ++++++
 include/linux/mm_types.h |  3 ++
 include/linux/rpal.h     | 29 +++++++++++++++
 include/linux/sched.h    |  5 +++
 init/init_task.c         |  3 ++
 kernel/exit.c            |  5 +++
 kernel/fork.c            | 16 +++++++++
 8 files changed, 145 insertions(+), 4 deletions(-)

diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 609c9550540d..55ecb7e0ef8c 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -26,9 +26,24 @@ static inline void rpal_free_service_id(int id)
 
 static void __rpal_put_service(struct rpal_service *rs)
 {
+	pr_debug("rpal: free service %d, tgid: %d\n", rs->id,
+		 rs->group_leader->pid);
+
+	rs->mm->rpal_rs = NULL;
+	mmdrop(rs->mm);
+	put_task_struct(rs->group_leader);
+	rpal_free_service_id(rs->id);
 	kmem_cache_free(service_cache, rs);
 }
 
+static void rpal_put_service_async_fn(struct work_struct *work)
+{
+	struct rpal_service *rs =
+		container_of(work, struct rpal_service, delayed_put_work.work);
+
+	__rpal_put_service(rs);
+}
+
 static int rpal_alloc_service_id(void)
 {
 	int id;
@@ -75,9 +90,16 @@ void rpal_put_service(struct rpal_service *rs)
 {
 	if (!rs)
 		return;
-
-	if (atomic_dec_and_test(&rs->refcnt))
-		__rpal_put_service(rs);
+    /*
+     * Since __rpal_put_service() calls mmdrop() (which
+     * cannot be invoked in atomic context), we use
+     * delayed work to release rpal_service.
+     */
+	if (atomic_dec_and_test(&rs->refcnt)) {
+		INIT_DELAYED_WORK(&rs->delayed_put_work,
+				  rpal_put_service_async_fn);
+		schedule_delayed_work(&rs->delayed_put_work, HZ * 30);
+	}
 }
 
 static u32 get_hash_key(u64 key)
@@ -128,6 +150,12 @@ struct rpal_service *rpal_register_service(void)
 	if (!rpal_inited)
 		return NULL;
 
+	if (!thread_group_leader(current)) {
+		rpal_err("task %d is not group leader %d\n", current->pid,
+			 current->tgid);
+		goto alloc_fail;
+	}
+
 	rs = kmem_cache_zalloc(service_cache, GFP_KERNEL);
 	if (!rs)
 		goto alloc_fail;
@@ -140,10 +168,27 @@ struct rpal_service *rpal_register_service(void)
 	if (unlikely(rs->key == RPAL_INVALID_KEY))
 		goto key_fail;
 
-	atomic_set(&rs->refcnt, 1);
+	current->rpal_rs = rs;
+
+	rs->group_leader = get_task_struct(current);
+	mmgrab(current->mm);
+	current->mm->rpal_rs = rs;
+	rs->mm = current->mm;
+
+	/*
+	 * The reference comes from:
+	 * 1. registered service always has one reference
+	 * 2. leader_thread also has one reference
+	 * 3. mm also hold one reference
+	 */
+	atomic_set(&rs->refcnt, 3);
 
 	insert_service(rs);
 
+	pr_debug(
+		"rpal: register service, key: %llx, id: %d, command: %s, tgid: %d\n",
+		rs->key, rs->id, current->comm, current->tgid);
+
 	return rs;
 
 key_fail:
@@ -161,7 +206,31 @@ void rpal_unregister_service(struct rpal_service *rs)
 
 	delete_service(rs);
 
+	pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id,
+		 rs->group_leader->tgid);
+
+	rpal_put_service(rs);
+}
+
+void copy_rpal(struct task_struct *p)
+{
+	struct rpal_service *cur = rpal_current_service();
+
+	p->rpal_rs = rpal_get_service(cur);
+}
+
+void exit_rpal(bool group_dead)
+{
+	struct rpal_service *rs = rpal_current_service();
+
+	if (!rs)
+		return;
+
+	current->rpal_rs = NULL;
 	rpal_put_service(rs);
+
+	if (group_dead)
+		rpal_unregister_service(rs);
 }
 
 int __init rpal_service_init(void)
diff --git a/fs/exec.c b/fs/exec.c
index cfbb2b9ee3c9..922728aebebe 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include <linux/user_events.h>
 #include <linux/rseq.h>
 #include <linux/ksm.h>
+#include <linux/rpal.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1076,6 +1077,16 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
+#if IS_ENABLED(CONFIG_RPAL)
+	/*
+	 * The rpal process is going to load another binary, we
+	 * need to unregister rpal since it is going to be another
+	 * process. Other threads have already exited by the time
+	 * we come here, we need to set group_dead as true.
+	 */
+	exit_rpal(true);
+#endif
+
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 32ba5126e221..b29adef082c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1172,6 +1172,9 @@ struct mm_struct {
 #ifdef CONFIG_MM_ID
 		mm_id_t mm_id;
 #endif /* CONFIG_MM_ID */
+#ifdef CONFIG_RPAL
+		struct rpal_service *rpal_rs;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 75c5acf33844..7b9d90b62b3f 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -11,6 +11,8 @@
 
 #include <linux/sched.h>
 #include <linux/types.h>
+#include <linux/sched/mm.h>
+#include <linux/workqueue.h>
 #include <linux/hashtable.h>
 #include <linux/atomic.h>
 
@@ -29,6 +31,9 @@
 #define RPAL_INVALID_KEY _AC(0, UL)
 
 /*
+ * Each RPAL process (a.k.a RPAL service) should have a pointer to
+ * struct rpal_service in all its tasks' task_struct.
+ *
  * Each RPAL service has a 64-bit key as its unique identifier, and
  * the 64-bit length ensures that the key will never repeat before
  * the kernel reboot.
@@ -39,10 +44,23 @@
  * is released, allowing newly started RPAL services to reuse the ID.
  */
 struct rpal_service {
+	/* The task_struct of thread group leader. */
+	struct task_struct *group_leader;
+	/* mm_struct of thread group */
+	struct mm_struct *mm;
 	/* Unique identifier for RPAL service */
 	u64 key;
 	/* virtual address space id */
 	int id;
+
+    /*
+     * Fields above should never change after initialization.
+     * Fields below may change after initialization.
+     */
+
+	/* delayed service put work */
+	struct delayed_work delayed_put_work;
+
 	/* Hashtable list for this struct */
 	struct hlist_node hlist;
 	/* reference count of this struct */
@@ -68,7 +86,18 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
  */
 void rpal_put_service(struct rpal_service *rs);
 
+#ifdef CONFIG_RPAL
+static inline struct rpal_service *rpal_current_service(void)
+{
+	return current->rpal_rs;
+}
+#else
+static inline struct rpal_service *rpal_current_service(void) { return NULL; }
+#endif
+
 void rpal_unregister_service(struct rpal_service *rs);
 struct rpal_service *rpal_register_service(void);
 struct rpal_service *rpal_get_service_by_key(u64 key);
+void copy_rpal(struct task_struct *p);
+void exit_rpal(bool group_dead);
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e5953b8f32..ad35b197543c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -72,6 +72,7 @@ struct rcu_node;
 struct reclaim_state;
 struct robust_list_head;
 struct root_domain;
+struct rpal_service;
 struct rq;
 struct sched_attr;
 struct sched_dl_entity;
@@ -1645,6 +1646,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_RPAL
+	struct rpal_service			*rpal_rs;
+#endif
+
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
 
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0c5b1927da41 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -220,6 +220,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_RPAL
+	.rpal_rs = NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 38645039dd8f..0c8387da59da 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,7 @@
 #include <linux/user_events.h>
 #include <linux/uaccess.h>
 #include <linux/pidfs.h>
+#include <linux/rpal.h>
 
 #include <uapi/linux/wait.h>
 
@@ -944,6 +945,10 @@ void __noreturn do_exit(long code)
 	taskstats_exit(tsk, group_dead);
 	trace_sched_process_exit(tsk, group_dead);
 
+#if IS_ENABLED(CONFIG_RPAL)
+	exit_rpal(group_dead);
+#endif
+
 	exit_mm();
 
 	if (group_dead)
diff --git a/kernel/fork.c b/kernel/fork.c
index 85afccfdf3b1..1d1c8484a8f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/rpal.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1216,6 +1217,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->mm_cid_active = 0;
 	tsk->migrate_from_cpu = -1;
 #endif
+
+#ifdef CONFIG_RPAL
+	tsk->rpal_rs = NULL;
+#endif
 	return tsk;
 
 free_stack:
@@ -1312,6 +1317,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
+#ifdef CONFIG_RPAL
+	mm->rpal_rs = NULL;
+#endif
 
 	if (current->mm) {
 		mm->flags = mmf_init_flags(current->mm->flags);
@@ -2651,6 +2659,14 @@ __latent_entropy struct task_struct *copy_process(
 			current->signal->nr_threads++;
 			current->signal->quick_threads++;
 			atomic_inc(&current->signal->live);
+#if IS_ENABLED(CONFIG_RPAL)
+			/*
+			 * For rpal process, the child thread needs to
+			 * inherit p->rpal_rs. Therefore, we can get the
+			 * struct rpal_service for any thread of rpal process.
+			 */
+			copy_rpal(p);
+#endif
 			refcount_inc(&current->signal->sigcnt);
 			task_join_group_stop(p);
 			list_add_tail_rcu(&p->thread_node,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 05/35] RPAL: enable virtual address space partitions
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (3 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 04/35] RPAL: add member to task_struct and mm_struct Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 06/35] RPAL: add user interface Bo Li
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Each RPAL service occupies a contiguous 512GB virtual address space, with
its base address determined by the id assigned during initialization. For
userspace virtual address space beyond this 512GB range, we employ memory
ballooning to occupy these regions, ensuring that processes do not utilize
these virtual addresses.

Since the address space layout is determined when the process is loaded,
RPAL sets the unused fields in the header of the ELF binary to the "RPAL"
characters to alter the loading method of RPAL processes, enabling the
process to be located within the correct 512GB address space upon loading.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/mm/mmap.c      | 10 +++++
 arch/x86/rpal/Makefile  |  2 +-
 arch/x86/rpal/mm.c      | 70 +++++++++++++++++++++++++++++
 arch/x86/rpal/service.c |  8 ++++
 fs/binfmt_elf.c         | 98 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/rpal.h    | 65 +++++++++++++++++++++++++++
 6 files changed, 251 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/rpal/mm.c

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 5ed2109211da..504f2b9a0e8e 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -19,6 +19,7 @@
 #include <linux/sched/mm.h>
 #include <linux/compat.h>
 #include <linux/elf-randomize.h>
+#include <linux/rpal.h>
 #include <asm/elf.h>
 #include <asm/io.h>
 
@@ -119,6 +120,15 @@ static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base,
 		*base = mmap_base(random_factor, task_size, rlim_stack);
 }
 
+#ifdef CONFIG_RPAL
+void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack)
+{
+	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
+			arch_rnd(RPAL_MAX_RAND_BITS), rpal_get_top(mm->rpal_rs),
+			rlim_stack);
+}
+#endif
+
 void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 {
 	if (mmap_is_legacy())
diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index ee3698b5a9b3..2c858a8d7b9e 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -2,4 +2,4 @@
 
 obj-$(CONFIG_RPAL)		+= rpal.o
 
-rpal-y := service.o core.o
+rpal-y := service.o core.o mm.o
diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c
new file mode 100644
index 000000000000..f469bcf57b66
--- /dev/null
+++ b/arch/x86/rpal/mm.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+#include <linux/security.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+
+static inline int rpal_balloon_mapping(unsigned long base, unsigned long size)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr, populate;
+	int is_fail = 0;
+
+	if (size == 0)
+		return 0;
+
+	addr = do_mmap(NULL, base, size, PROT_NONE,
+		       MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE,
+		       VM_DONTEXPAND | VM_PFNMAP | VM_DONTDUMP, 0, &populate,
+		       NULL);
+
+	is_fail = base != addr;
+
+	if (is_fail) {
+		pr_info("rpal: Balloon mapping 0x%016lx - 0x%016lx, %s, addr: 0x%016lx\n",
+			base, base + size, is_fail ? "Fail" : "Success", addr);
+	}
+	vma = find_vma(current->mm, addr);
+	if (vma->vm_start != addr || vma->vm_end != addr + size) {
+		is_fail = 1;
+		rpal_err("rpal: find vma 0x%016lx - 0x%016lx fail\n", addr,
+			 addr + size);
+	}
+
+	return is_fail;
+}
+
+#define RPAL_USER_TOP TASK_SIZE
+
+int rpal_balloon_init(unsigned long base)
+{
+	unsigned long top;
+	struct mm_struct *mm = current->mm;
+	int ret;
+
+	top = base + RPAL_ADDR_SPACE_SIZE;
+
+	mmap_write_lock(mm);
+
+	if (base > mmap_min_addr) {
+		ret = rpal_balloon_mapping(mmap_min_addr, base - mmap_min_addr);
+		if (ret)
+			goto out;
+	}
+
+	ret = rpal_balloon_mapping(top, RPAL_USER_TOP - top);
+	if (ret && base > mmap_min_addr)
+		do_munmap(mm, mmap_min_addr, base - mmap_min_addr, NULL);
+
+out:
+	mmap_write_unlock(mm);
+
+	return ret;
+}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 55ecb7e0ef8c..caa4afa5a2c6 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -143,6 +143,11 @@ static void delete_service(struct rpal_service *rs)
 	spin_unlock_irqrestore(&hash_table_lock, flags);
 }
 
+static inline unsigned long calculate_base_address(int id)
+{
+	return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id;
+}
+
 struct rpal_service *rpal_register_service(void)
 {
 	struct rpal_service *rs;
@@ -168,6 +173,9 @@ struct rpal_service *rpal_register_service(void)
 	if (unlikely(rs->key == RPAL_INVALID_KEY))
 		goto key_fail;
 
+	rs->bad_service = false;
+	rs->base = calculate_base_address(rs->id);
+
 	current->rpal_rs = rs;
 
 	rs->group_leader = get_task_struct(current);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43363d593e5..9d27d9922de4 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
 #include <linux/dax.h>
 #include <linux/uaccess.h>
 #include <linux/rseq.h>
+#include <linux/rpal.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
@@ -814,6 +815,61 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
 	return ret == -ENOENT ? 0 : ret;
 }
 
+#if IS_ENABLED(CONFIG_RPAL)
+static int rpal_create_service(char *e_ident, struct rpal_service **rs,
+			unsigned long *rpal_base, int *retval,
+			struct linux_binprm *bprm, int executable_stack)
+{
+	/*
+	 * The first 16 bytes of the elf binary is magic number, and the last
+	 * 7 bytes of that is reserved and ignored. We use the last 4 bytes
+	 * to indicate a rpal binary. If the last 4 bytes is "RPAL", then this
+	 * is a rpal binary and we need to do register routinue.
+	 */
+	if (memcmp(e_ident + RPAL_MAGIC_OFFSET, RPAL_MAGIC, RPAL_MAGIC_LEN) ==
+	    0) {
+		unsigned long rpal_stack_top = STACK_TOP;
+
+		*rs = rpal_register_service();
+		if (*rs != NULL) {
+			*rpal_base = rpal_get_base(*rs);
+			rpal_stack_top = *rpal_base + RPAL_ADDR_SPACE_SIZE;
+			/*
+			 * We need to recalculate the mmap_base, otherwise the address space
+			 * layout randomization will not make any difference.
+			 */
+			rpal_pick_mmap_base(current->mm, &bprm->rlim_stack);
+		}
+		/*
+		 * RPAL process only has a contiguous 512GB address space, Whose base
+		 * address is given by its struct rpal_service. We need to rearrange
+		 * the user stack in this 512GB address space.
+		 */
+		*retval = setup_arg_pages(bprm,
+					  randomize_stack_top(rpal_stack_top),
+					  executable_stack);
+		/*
+		 * We use memory ballon to avoid kernel allocating vma other than
+		 * the process's 512GB memory.
+		 */
+		if (unlikely(*rs != NULL && rpal_balloon_init(*rpal_base))) {
+			rpal_err("pid: %d, comm: %s: rpal balloon init fail\n",
+					 current->pid, current->comm);
+			rpal_unregister_service(*rs);
+			*rs = NULL;
+			*retval = -EINVAL;
+			goto out;
+		}
+	} else {
+		*retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
+					  executable_stack);
+	}
+
+out:
+	return 0;
+}
+#endif
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
 	struct file *interpreter = NULL; /* to shut gcc up */
@@ -836,6 +892,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
 	struct mm_struct *mm;
 	struct pt_regs *regs;
+#ifdef CONFIG_RPAL
+	struct rpal_service *rs = NULL;
+	unsigned long rpal_base;
+#endif
 
 	retval = -ENOEXEC;
 	/* First of all, some simple consistency checks */
@@ -1008,10 +1068,19 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 	setup_new_exec(bprm);
 
+#ifdef CONFIG_RPAL
+	/* call original function if fails */
+	if (rpal_create_service((char *)&elf_ex->e_ident, &rs, &rpal_base,
+				&retval, bprm, executable_stack))
+		retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
+					 executable_stack);
+#else
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
 				 executable_stack);
+#endif
+
 	if (retval < 0)
 		goto out_free_dentry;
 
@@ -1055,6 +1124,22 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			 * is needed.
 			 */
 			elf_flags |= MAP_FIXED_NOREPLACE;
+#ifdef CONFIG_RPAL
+			/*
+			 * If We load MAP_FIXED binary, it will either fail when
+			 * doing mmap, as we have done the memory balloon before,
+			 * or work well, where we are so lucky to have fixed address
+			 * in it's RPAL address space. A MAP_FIXED binary should
+			 * by no means be a RPAL service. Here we only print
+			 * an error. Maybe we will handle it in the future.
+			 */
+			if (unlikely(rs != NULL)) {
+				rpal_err(
+					"pid: %d, common: %s, load a binary with MAP_FIXED segment\n",
+					current->pid, current->comm);
+				rs->bad_service = true;
+			}
+#endif
 		} else if (elf_ex->e_type == ET_DYN) {
 			/*
 			 * This logic is run once for the first LOAD Program
@@ -1128,6 +1213,12 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				/* Adjust alignment as requested. */
 				if (alignment)
 					load_bias &= ~(alignment - 1);
+#ifdef CONFIG_RPAL
+				if (rs != NULL) {
+					load_bias &= RPAL_RAND_ADDR_SPACE_MASK;
+					load_bias += rpal_base;
+				}
+#endif
 				elf_flags |= MAP_FIXED_NOREPLACE;
 			} else {
 				/*
@@ -1306,7 +1397,12 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (!IS_ENABLED(CONFIG_COMPAT_BRK) &&
 	    IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
 	    elf_ex->e_type == ET_DYN && !interpreter) {
-		elf_brk = ELF_ET_DYN_BASE;
+#ifdef CONFIG_RPAL
+		if (rs && !rs->bad_service)
+			elf_brk = rpal_base;
+		else
+#endif
+			elf_brk = ELF_ET_DYN_BASE;
 		/* This counts as moving the brk, so let brk(2) know. */
 		brk_moved = true;
 	}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 7b9d90b62b3f..f7c0de747f55 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -15,11 +15,17 @@
 #include <linux/workqueue.h>
 #include <linux/hashtable.h>
 #include <linux/atomic.h>
+#include <linux/sizes.h>
 
 #define RPAL_ERROR_MSG "rpal error: "
 #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
 #define rpal_err_ratelimited(x...) pr_err_ratelimited(RPAL_ERROR_MSG x)
 
+/* RPAL magic macros in binary elf header */
+#define RPAL_MAGIC "RPAL"
+#define RPAL_MAGIC_OFFSET 12
+#define RPAL_MAGIC_LEN 4
+
 /*
  * The first 512GB is reserved due to mmap_min_addr.
  * The last 512GB is dropped since stack will be initially
@@ -30,6 +36,47 @@
 #define RPAL_FIRST_KEY _AC(1, UL)
 #define RPAL_INVALID_KEY _AC(0, UL)
 
+/*
+ * Process Virtual Address Space Layout (For 4-level Paging)
+ *  |-------------|
+ *  |  No Mapping |
+ *  |-------------| <-- 64 KB (mmap_min_addr)
+ *  |     ...     |
+ *  |-------------| <-- 1 * 512GB
+ *  |  service 0  |
+ *  |-------------| <-- 2 * 512 GB
+ *  |  Service 1  |
+ *  |-------------| <-- 3 * 512 GB
+ *  |  Service 2  |
+ *  |-------------| <-- 4 * 512 GB
+ *  |     ...     |
+ *  |-------------| <-- 255 * 512 GB
+ *  | Service 254 |
+ *  |-------------| <-- 128 TB
+ *  |             |
+ *  |     ...     |
+ *  |-------------| <-- PAGE_OFFSET
+ *  |             |
+ *  |    Kernel   |
+ *  |_____________|
+ *
+ */
+#define RPAL_ADDR_SPACE_SIZE (_AC(512, UL) * SZ_1G)
+/*
+ * Since RPAL restricts the virtual address space used by a single
+ * process to 512GB, the number of bits for address randomization
+ * must be correspondingly reduced; otherwise, issues such as overlaps
+ * in randomized addresses could occur. RPAL employs 20-bit (page number)
+ * address randomization to balance security and usability.
+ */
+#define RPAL_RAND_ADDR_SPACE_MASK _AC(0xfffffff0, UL)
+#define RPAL_MAX_RAND_BITS 20
+
+#define RPAL_NR_ADDR_SPACE 256
+
+#define RPAL_ADDRESS_SPACE_LOW  ((0UL) + RPAL_ADDR_SPACE_SIZE)
+#define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SPACE_SIZE)
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -52,6 +99,10 @@ struct rpal_service {
 	u64 key;
 	/* virtual address space id */
 	int id;
+	/* virtual address space base address of this service */
+	unsigned long base;
+	/* bad rpal binary */
+	bool bad_service;
 
     /*
      * Fields above should never change after initialization.
@@ -86,6 +137,16 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
  */
 void rpal_put_service(struct rpal_service *rs);
 
+static inline unsigned long rpal_get_base(struct rpal_service *rs)
+{
+	return rs->base;
+}
+
+static inline unsigned long rpal_get_top(struct rpal_service *rs)
+{
+	return rs->base + RPAL_ADDR_SPACE_SIZE;
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
@@ -100,4 +161,8 @@ struct rpal_service *rpal_register_service(void);
 struct rpal_service *rpal_get_service_by_key(u64 key);
 void copy_rpal(struct task_struct *p);
 void exit_rpal(bool group_dead);
+int rpal_balloon_init(unsigned long base);
+
+extern void rpal_pick_mmap_base(struct mm_struct *mm,
+	struct rlimit *rlim_stack);
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 06/35] RPAL: add user interface
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (4 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 05/35] RPAL: enable virtual address space partitions Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 07/35] RPAL: enable shared page mmap Bo Li
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Add the userspace interface of RPAL. The interface makes use of /proc
files. Compared with adding syscalls, /proc files provide more interfaces,
such as mmap, poll, etc. These interfaces can facilitate RPAL to implement
more complex kernel-space/user-space interaction functions in the future.

This patch implements the ioctl interface, The interfaces initially
implemented include obtaining the RPAL version, and retrieving the key and
ID of the RPAL service.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/Makefile   |  2 +-
 arch/x86/rpal/core.c     |  3 ++
 arch/x86/rpal/internal.h |  3 ++
 arch/x86/rpal/proc.c     | 71 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h     | 34 +++++++++++++++++++
 5 files changed, 112 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/rpal/proc.c

diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index 2c858a8d7b9e..a5926fc19334 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -2,4 +2,4 @@
 
 obj-$(CONFIG_RPAL)		+= rpal.o
 
-rpal-y := service.o core.o mm.o
+rpal-y := service.o core.o mm.o proc.o
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 495dbc1b1536..61f5d40b0157 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -13,11 +13,14 @@
 int __init rpal_init(void);
 
 bool rpal_inited;
+unsigned long rpal_cap;
 
 int __init rpal_init(void)
 {
 	int ret = 0;
 
+	rpal_cap = 0;
+
 	ret = rpal_service_init();
 	if (ret)
 		goto fail;
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index e44e6fc79677..c102a4c50515 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -6,6 +6,9 @@
  *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
  */
 
+#define RPAL_COMPAT_VERSION 1
+#define RPAL_API_VERSION 1
+
 extern bool rpal_inited;
 
 /* service.c */
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
new file mode 100644
index 000000000000..1ced30e25c15
--- /dev/null
+++ b/arch/x86/rpal/proc.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+#include <linux/proc_fs.h>
+
+#include "internal.h"
+
+static int rpal_open(struct inode *inode,
+			     struct file *file)
+{
+	return 0;
+}
+
+static int rpal_get_api_version_and_cap(void __user *p)
+{
+	struct rpal_version_info rvi;
+	int ret;
+
+	rvi.compat_version = RPAL_COMPAT_VERSION;
+	rvi.api_version = RPAL_API_VERSION;
+	rvi.cap = rpal_cap;
+
+	ret = copy_to_user(p, &rvi, sizeof(rvi));
+	if (ret)
+		return -EFAULT;
+
+	return 0;
+}
+
+static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct rpal_service *cur = rpal_current_service();
+	int ret = 0;
+
+	if (!cur)
+		return -EINVAL;
+
+	switch (cmd) {
+	case RPAL_IOCTL_GET_API_VERSION_AND_CAP:
+		ret = rpal_get_api_version_and_cap((void __user *)arg);
+		break;
+	case RPAL_IOCTL_GET_SERVICE_KEY:
+		ret = put_user(cur->key, (u64 __user *)arg);
+		break;
+	case RPAL_IOCTL_GET_SERVICE_ID:
+		ret = put_user(cur->id, (int __user *)arg);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+const struct proc_ops proc_rpal_operations = {
+	.proc_open = rpal_open,
+	.proc_ioctl = rpal_ioctl,
+};
+
+static int __init proc_rpal_init(void)
+{
+	proc_create("rpal", 0644, NULL, &proc_rpal_operations);
+	return 0;
+}
+fs_initcall(proc_rpal_init);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index f7c0de747f55..3bc2a2a44265 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -77,6 +77,8 @@
 #define RPAL_ADDRESS_SPACE_LOW  ((0UL) + RPAL_ADDR_SPACE_SIZE)
 #define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SPACE_SIZE)
 
+extern unsigned long rpal_cap;
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -118,6 +120,38 @@ struct rpal_service {
 	atomic_t refcnt;
 };
 
+/*
+ * Following structures should have the same memory layout with user.
+ * It seems nothing being different between kernel and user structure
+ * padding by different C compilers on x86_64, so we need to do nothing
+ * special here.
+ */
+/* Begin */
+struct rpal_version_info {
+	int compat_version;
+	int api_version;
+	unsigned long cap;
+};
+
+/* End */
+
+enum rpal_command_type {
+	RPAL_CMD_GET_API_VERSION_AND_CAP,
+	RPAL_CMD_GET_SERVICE_KEY,
+	RPAL_CMD_GET_SERVICE_ID,
+	RPAL_NR_CMD,
+};
+
+/* RPAL ioctl macro */
+#define RPAL_IOCTL_MAGIC 0x33
+#define RPAL_IOCTL_GET_API_VERSION_AND_CAP                        \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_API_VERSION_AND_CAP, \
+	      struct rpal_version_info *)
+#define RPAL_IOCTL_GET_SERVICE_KEY \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, u64 *)
+#define RPAL_IOCTL_GET_SERVICE_ID \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *)
+
 /**
  * @brief get new reference to a rpal service, a corresponding
  *  rpal_put_service() should be called later by the caller.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 07/35] RPAL: enable shared page mmap
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (5 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 06/35] RPAL: add user interface Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 08/35] RPAL: enable sender/receiver registration Bo Li
                   ` (33 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL needs to create shared memory between the kernel and user space for
the transfer of states and data.

This patch implements the rpal_mmap() interface. User processes can create
shared memory by calling mmap() on /proc/rpal. To prevent users from
creating excessive memory, rpal_mmap() limits the total size of the shared
memory that can be created. The shared memory is maintained through
reference counting, and rpal_munmap() is implemented for the release of
the shared memory.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/internal.h |  20 ++++++
 arch/x86/rpal/mm.c       | 147 +++++++++++++++++++++++++++++++++++++++
 arch/x86/rpal/proc.c     |   1 +
 arch/x86/rpal/service.c  |   4 ++
 include/linux/rpal.h     |  15 ++++
 mm/mmap.c                |   4 ++
 6 files changed, 191 insertions(+)

diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index c102a4c50515..65fd14a26f0e 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -9,8 +9,28 @@
 #define RPAL_COMPAT_VERSION 1
 #define RPAL_API_VERSION 1
 
+#include <linux/mm.h>
+#include <linux/file.h>
+
 extern bool rpal_inited;
 
 /* service.c */
 int __init rpal_service_init(void);
 void __init rpal_service_exit(void);
+
+/* mm.c */
+static inline struct rpal_shared_page *
+rpal_get_shared_page(struct rpal_shared_page *rsp)
+{
+	atomic_inc(&rsp->refcnt);
+	return rsp;
+}
+
+static inline void rpal_put_shared_page(struct rpal_shared_page *rsp)
+{
+	atomic_dec(&rsp->refcnt);
+}
+
+int rpal_mmap(struct file *filp, struct vm_area_struct *vma);
+struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs,
+					       unsigned long addr);
diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c
index f469bcf57b66..8a738c502d1d 100644
--- a/arch/x86/rpal/mm.c
+++ b/arch/x86/rpal/mm.c
@@ -11,6 +11,8 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 
+#include "internal.h"
+
 static inline int rpal_balloon_mapping(unsigned long base, unsigned long size)
 {
 	struct vm_area_struct *vma;
@@ -68,3 +70,148 @@ int rpal_balloon_init(unsigned long base)
 
 	return ret;
 }
+
+static void rpal_munmap(struct vm_area_struct *area)
+{
+	struct mm_struct *mm = area->vm_mm;
+	struct rpal_service *rs = mm->rpal_rs;
+	struct rpal_shared_page *rsp = area->vm_private_data;
+
+	if (!rs) {
+		rpal_err(
+			"free shared page after exit_mmap or fork a child process\n");
+		return;
+	}
+
+	mutex_lock(&rs->mutex);
+	if (unlikely(!atomic_dec_and_test(&rsp->refcnt))) {
+		rpal_err("refcnt(%d) of shared page is not 0\n", atomic_read(&rsp->refcnt));
+		send_sig_info(SIGKILL, SEND_SIG_PRIV, rs->group_leader);
+	}
+
+	list_del(&rsp->list);
+	rs->nr_shared_pages -= rsp->npage;
+	__free_pages(virt_to_page(rsp->kernel_start), get_order(rsp->npage));
+	kfree(rsp);
+	mutex_unlock(&rs->mutex);
+}
+
+const struct vm_operations_struct rpal_vm_ops = { .close = rpal_munmap };
+
+#define RPAL_MAX_SHARED_PAGES 8192
+
+int rpal_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_shared_page *rsp;
+	struct page *page = NULL;
+	unsigned long size = (unsigned long)(vma->vm_end - vma->vm_start);
+	int npage;
+	int order = -1;
+	int ret = 0;
+
+	if (!cur) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Check whether the vma is aligned and whether the page number
+	 * is power of 2. This makes shared pages easy to manage.
+	 */
+	if (!IS_ALIGNED(size, PAGE_SIZE) ||
+	    !IS_ALIGNED(vma->vm_start, PAGE_SIZE)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	npage = size >> PAGE_SHIFT;
+	if (!is_power_of_2(npage)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	order = get_order(size);
+
+	mutex_lock(&cur->mutex);
+
+	/* make sure user does not alloc too much pages */
+	if (cur->nr_shared_pages + npage > RPAL_MAX_SHARED_PAGES) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
+	if (!rsp) {
+		ret = -EAGAIN;
+		goto unlock;
+	}
+
+	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
+	if (!page) {
+		ret = -ENOMEM;
+		goto free_rsp;
+	}
+
+	rsp->user_start = vma->vm_start;
+	rsp->kernel_start = (unsigned long)page_address(page);
+	rsp->npage = npage;
+	atomic_set(&rsp->refcnt, 1);
+	INIT_LIST_HEAD(&rsp->list);
+	list_add(&rsp->list, &cur->shared_pages);
+
+	vma->vm_ops = &rpal_vm_ops;
+	vma->vm_private_data = rsp;
+
+	/* map to shared pages userspace */
+	ret = remap_pfn_range(vma, vma->vm_start, page_to_pfn(page), size,
+			      vma->vm_page_prot);
+	if (ret)
+		goto free_page;
+
+	cur->nr_shared_pages += npage;
+	mutex_unlock(&cur->mutex);
+
+	return 0;
+
+free_page:
+	__free_pages(page, order);
+	list_del(&rsp->list);
+free_rsp:
+	kfree(rsp);
+unlock:
+	mutex_unlock(&cur->mutex);
+out:
+	return ret;
+}
+
+struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs,
+					       unsigned long addr)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_shared_page *rsp, *ret = NULL;
+
+	mutex_lock(&cur->mutex);
+	list_for_each_entry(rsp, &rs->shared_pages, list) {
+		if (rsp->user_start <= addr &&
+		    addr < rsp->user_start + rsp->npage * PAGE_SIZE) {
+			ret = rpal_get_shared_page(rsp);
+			break;
+		}
+	}
+	mutex_unlock(&cur->mutex);
+
+	return ret;
+}
+
+void rpal_exit_mmap(struct mm_struct *mm)
+{
+	struct rpal_service *rs = mm->rpal_rs;
+
+	if (rs) {
+		mm->rpal_rs = NULL;
+		/* all shared pages should be freed at this time */
+		WARN_ON_ONCE(rs->nr_shared_pages != 0);
+		rpal_put_service(rs);
+	}
+}
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index 1ced30e25c15..86947dc233d0 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -61,6 +61,7 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 const struct proc_ops proc_rpal_operations = {
 	.proc_open = rpal_open,
 	.proc_ioctl = rpal_ioctl,
+	.proc_mmap = rpal_mmap,
 };
 
 static int __init proc_rpal_init(void)
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index caa4afa5a2c6..f29a046fc22f 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -173,6 +173,10 @@ struct rpal_service *rpal_register_service(void)
 	if (unlikely(rs->key == RPAL_INVALID_KEY))
 		goto key_fail;
 
+	mutex_init(&rs->mutex);
+	rs->nr_shared_pages = 0;
+	INIT_LIST_HEAD(&rs->shared_pages);
+
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
 
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 3bc2a2a44265..986dfbd16fc9 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -110,6 +110,12 @@ struct rpal_service {
      * Fields above should never change after initialization.
      * Fields below may change after initialization.
      */
+	/* Mutex for time consuming operations */
+	struct mutex mutex;
+
+	/* pinned pages */
+	int nr_shared_pages;
+	struct list_head shared_pages;
 
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
@@ -135,6 +141,14 @@ struct rpal_version_info {
 
 /* End */
 
+struct rpal_shared_page {
+	unsigned long user_start;
+	unsigned long kernel_start;
+	int npage;
+	atomic_t refcnt;
+	struct list_head list;
+};
+
 enum rpal_command_type {
 	RPAL_CMD_GET_API_VERSION_AND_CAP,
 	RPAL_CMD_GET_SERVICE_KEY,
@@ -196,6 +210,7 @@ struct rpal_service *rpal_get_service_by_key(u64 key);
 void copy_rpal(struct task_struct *p);
 void exit_rpal(bool group_dead);
 int rpal_balloon_init(unsigned long base);
+void rpal_exit_mmap(struct mm_struct *mm);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
diff --git a/mm/mmap.c b/mm/mmap.c
index bd210aaf7ebd..98bb33d2091e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -48,6 +48,7 @@
 #include <linux/sched/mm.h>
 #include <linux/ksm.h>
 #include <linux/memfd.h>
+#include <linux/rpal.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1319,6 +1320,9 @@ void exit_mmap(struct mm_struct *mm)
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
+#if IS_ENABLED(CONFIG_RPAL)
+	rpal_exit_mmap(mm);
+#endif
 }
 
 /* Insert vm structure into process list sorted by address
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 08/35] RPAL: enable sender/receiver registration
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (6 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 07/35] RPAL: enable shared page mmap Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 09/35] RPAL: enable address space sharing Bo Li
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

In RPAL, there are two roles: the sender (caller) and the receiver (
callee). This patch provides an interface for threads to register as
a sender or a receiver with the kernel. Each sender and receiver has
its own data structure, along with a block of memory shared between the
user space and the kernel space, which is allocated through rpal_mmap().

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/Makefile   |   2 +-
 arch/x86/rpal/internal.h |   7 ++
 arch/x86/rpal/proc.c     |  12 +++
 arch/x86/rpal/service.c  |   6 ++
 arch/x86/rpal/thread.c   | 165 +++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h     |  79 +++++++++++++++++++
 include/linux/sched.h    |  15 ++++
 init/init_task.c         |   2 +
 kernel/fork.c            |   2 +
 9 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/rpal/thread.c

diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index a5926fc19334..89f745382c51 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -2,4 +2,4 @@
 
 obj-$(CONFIG_RPAL)		+= rpal.o
 
-rpal-y := service.o core.o mm.o proc.o
+rpal-y := service.o core.o mm.o proc.o thread.o
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 65fd14a26f0e..3559c9c6e868 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -34,3 +34,10 @@ static inline void rpal_put_shared_page(struct rpal_shared_page *rsp)
 int rpal_mmap(struct file *filp, struct vm_area_struct *vma);
 struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs,
 					       unsigned long addr);
+
+/* thread.c */
+int rpal_register_sender(unsigned long addr);
+int rpal_unregister_sender(void);
+int rpal_register_receiver(unsigned long addr);
+int rpal_unregister_receiver(void);
+void exit_rpal_thread(void);
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index 86947dc233d0..8a1e4a8a2271 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -51,6 +51,18 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case RPAL_IOCTL_GET_SERVICE_ID:
 		ret = put_user(cur->id, (int __user *)arg);
 		break;
+	case RPAL_IOCTL_REGISTER_SENDER:
+		ret = rpal_register_sender(arg);
+		break;
+	case RPAL_IOCTL_UNREGISTER_SENDER:
+		ret = rpal_unregister_sender();
+		break;
+	case RPAL_IOCTL_REGISTER_RECEIVER:
+		ret = rpal_register_receiver(arg);
+		break;
+	case RPAL_IOCTL_UNREGISTER_RECEIVER:
+		ret = rpal_unregister_receiver();
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index f29a046fc22f..42fb719dbb2a 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -176,6 +176,7 @@ struct rpal_service *rpal_register_service(void)
 	mutex_init(&rs->mutex);
 	rs->nr_shared_pages = 0;
 	INIT_LIST_HEAD(&rs->shared_pages);
+	atomic_set(&rs->thread_cnt, 0);
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -216,6 +217,9 @@ void rpal_unregister_service(struct rpal_service *rs)
 	if (!rs)
 		return;
 
+	while (atomic_read(&rs->thread_cnt) != 0)
+		schedule();
+
 	delete_service(rs);
 
 	pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id,
@@ -238,6 +242,8 @@ void exit_rpal(bool group_dead)
 	if (!rs)
 		return;
 
+	exit_rpal_thread();
+
 	current->rpal_rs = NULL;
 	rpal_put_service(rs);
 
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
new file mode 100644
index 000000000000..7550ad94b63f
--- /dev/null
+++ b/arch/x86/rpal/thread.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+
+#include "internal.h"
+
+static void rpal_common_data_init(struct rpal_common_data *rcd)
+{
+	rcd->bp_task = current;
+	rcd->service_id = rpal_current_service()->id;
+}
+
+int rpal_register_sender(unsigned long addr)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_shared_page *rsp;
+	struct rpal_sender_data *rsd;
+	long ret = 0;
+
+	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rsp = rpal_find_shared_page(cur, addr);
+	if (!rsp) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (addr + sizeof(struct rpal_sender_call_context) >
+	    rsp->user_start + rsp->npage * PAGE_SIZE) {
+		ret = -EINVAL;
+		goto put_shared_page;
+	}
+
+	rsd = kzalloc(sizeof(*rsd), GFP_KERNEL);
+	if (rsd == NULL) {
+		ret = -ENOMEM;
+		goto put_shared_page;
+	}
+
+	rpal_common_data_init(&rsd->rcd);
+	rsd->rsp = rsp;
+	rsd->scc = (struct rpal_sender_call_context *)(addr - rsp->user_start +
+						       rsp->kernel_start);
+
+	current->rpal_sd = rsd;
+	rpal_set_current_thread_flag(RPAL_SENDER_BIT);
+
+	atomic_inc(&cur->thread_cnt);
+
+	return 0;
+
+put_shared_page:
+	rpal_put_shared_page(rsp);
+out:
+	return ret;
+}
+
+int rpal_unregister_sender(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_sender_data *rsd = current->rpal_sd;
+	long ret = 0;
+
+	if (!rpal_test_current_thread_flag(RPAL_SENDER_BIT)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rpal_put_shared_page(rsd->rsp);
+	rpal_clear_current_thread_flag(RPAL_SENDER_BIT);
+	kfree(rsd);
+
+	atomic_dec(&cur->thread_cnt);
+
+out:
+	return ret;
+}
+
+int rpal_register_receiver(unsigned long addr)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_receiver_data *rrd;
+	struct rpal_shared_page *rsp;
+	long ret = 0;
+
+	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rsp = rpal_find_shared_page(cur, addr);
+	if (!rsp) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (addr + sizeof(struct rpal_receiver_call_context) >
+	    rsp->user_start + rsp->npage * PAGE_SIZE) {
+		ret = -EINVAL;
+		goto put_shared_page;
+	}
+
+	rrd = kzalloc(sizeof(*rrd), GFP_KERNEL);
+	if (rrd == NULL) {
+		ret = -ENOMEM;
+		goto put_shared_page;
+	}
+
+	rpal_common_data_init(&rrd->rcd);
+	rrd->rsp = rsp;
+	rrd->rcc =
+		(struct rpal_receiver_call_context *)(addr - rsp->user_start +
+						      rsp->kernel_start);
+
+	current->rpal_rd = rrd;
+	rpal_set_current_thread_flag(RPAL_RECEIVER_BIT);
+
+	atomic_inc(&cur->thread_cnt);
+
+	return 0;
+
+put_shared_page:
+	rpal_put_shared_page(rsp);
+out:
+	return ret;
+}
+
+int rpal_unregister_receiver(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_receiver_data *rrd = current->rpal_rd;
+	long ret = 0;
+
+	if (!rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rpal_put_shared_page(rrd->rsp);
+	rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT);
+	kfree(rrd);
+
+	atomic_dec(&cur->thread_cnt);
+
+out:
+	return ret;
+}
+
+void exit_rpal_thread(void)
+{
+	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT))
+		rpal_unregister_sender();
+
+	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT))
+		rpal_unregister_receiver();
+}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 986dfbd16fc9..c33425e896af 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -79,6 +79,11 @@
 
 extern unsigned long rpal_cap;
 
+enum rpal_task_flag_bits {
+	RPAL_SENDER_BIT,
+	RPAL_RECEIVER_BIT,
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -117,6 +122,9 @@ struct rpal_service {
 	int nr_shared_pages;
 	struct list_head shared_pages;
 
+	/* sender/receiver thread count */
+	atomic_t thread_cnt;
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -149,10 +157,55 @@ struct rpal_shared_page {
 	struct list_head list;
 };
 
+struct rpal_common_data {
+	/* back pointer to task_struct */
+	struct task_struct *bp_task;
+	/* service id of rpal_service */
+	int service_id;
+};
+
+/* User registers state */
+struct rpal_task_context {
+	u64 r15;
+	u64 r14;
+	u64 r13;
+	u64 r12;
+	u64 rbx;
+	u64 rbp;
+	u64 rip;
+	u64 rsp;
+};
+
+struct rpal_receiver_call_context {
+	struct rpal_task_context rtc;
+	int receiver_id;
+};
+
+struct rpal_receiver_data {
+	struct rpal_common_data rcd;
+	struct rpal_shared_page *rsp;
+	struct rpal_receiver_call_context *rcc;
+};
+
+struct rpal_sender_call_context {
+	struct rpal_task_context rtc;
+	int sender_id;
+};
+
+struct rpal_sender_data {
+	struct rpal_common_data rcd;
+	struct rpal_shared_page *rsp;
+	struct rpal_sender_call_context *scc;
+};
+
 enum rpal_command_type {
 	RPAL_CMD_GET_API_VERSION_AND_CAP,
 	RPAL_CMD_GET_SERVICE_KEY,
 	RPAL_CMD_GET_SERVICE_ID,
+	RPAL_CMD_REGISTER_SENDER,
+	RPAL_CMD_UNREGISTER_SENDER,
+	RPAL_CMD_REGISTER_RECEIVER,
+	RPAL_CMD_UNREGISTER_RECEIVER,
 	RPAL_NR_CMD,
 };
 
@@ -165,6 +218,14 @@ enum rpal_command_type {
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, u64 *)
 #define RPAL_IOCTL_GET_SERVICE_ID \
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *)
+#define RPAL_IOCTL_REGISTER_SENDER \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_SENDER, unsigned long)
+#define RPAL_IOCTL_UNREGISTER_SENDER \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_SENDER)
+#define RPAL_IOCTL_REGISTER_RECEIVER \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long)
+#define RPAL_IOCTL_UNREGISTER_RECEIVER \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER)
 
 /**
  * @brief get new reference to a rpal service, a corresponding
@@ -200,8 +261,26 @@ static inline struct rpal_service *rpal_current_service(void)
 {
 	return current->rpal_rs;
 }
+
+static inline void rpal_set_current_thread_flag(unsigned long bit)
+{
+	set_bit(bit, &current->rpal_flag);
+}
+
+static inline void rpal_clear_current_thread_flag(unsigned long bit)
+{
+	clear_bit(bit, &current->rpal_flag);
+}
+
+static inline bool rpal_test_current_thread_flag(unsigned long bit)
+{
+	return test_bit(bit, &current->rpal_flag);
+}
 #else
 static inline struct rpal_service *rpal_current_service(void) { return NULL; }
+static inline void rpal_set_current_thread_flag(unsigned long bit) { }
+static inline void rpal_clear_current_thread_flag(unsigned long bit) { }
+static inline bool rpal_test_current_thread_flag(unsigned long bit) { return false; }
 #endif
 
 void rpal_unregister_service(struct rpal_service *rs);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad35b197543c..5f25cc09fb71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -72,6 +72,9 @@ struct rcu_node;
 struct reclaim_state;
 struct robust_list_head;
 struct root_domain;
+struct rpal_common_data;
+struct rpal_receiver_data;
+struct rpal_sender_data;
 struct rpal_service;
 struct rq;
 struct sched_attr;
@@ -1648,6 +1651,18 @@ struct task_struct {
 
 #ifdef CONFIG_RPAL
 	struct rpal_service			*rpal_rs;
+	unsigned long				rpal_flag;
+	/*
+	 * The first member of both rpal_sd and rpal_rd has a type
+	 * of struct rpal_common_data. So if we do not care whether
+	 * it is a struct rpal_sender_data or a struct rpal_receiver_data,
+	 * use rpal_cd instead of rpal_sd or rpal_rd.
+	 */
+	union {
+		struct rpal_common_data *rpal_cd;
+		struct rpal_sender_data *rpal_sd;
+		struct rpal_receiver_data *rpal_rd;
+	};
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/init/init_task.c b/init/init_task.c
index 0c5b1927da41..2eb08b96e66b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -222,6 +222,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #endif
 #ifdef CONFIG_RPAL
 	.rpal_rs = NULL,
+	.rpal_flag = 0,
+	.rpal_cd = NULL,
 #endif
 };
 EXPORT_SYMBOL(init_task);
diff --git a/kernel/fork.c b/kernel/fork.c
index 1d1c8484a8f2..01cd48eadf68 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1220,6 +1220,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_RPAL
 	tsk->rpal_rs = NULL;
+	tsk->rpal_flag = 0;
+	tsk->rpal_cd = NULL;
 #endif
 	return tsk;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 09/35] RPAL: enable address space sharing
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (7 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 08/35] RPAL: enable sender/receiver registration Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 10/35] RPAL: allow service enable/disable Bo Li
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL's memory sharing is implemented by copying p4d entries, which requires
implementing corresponding interfaces. Meanwhile, copying p4d entries can
cause the process's page table to contain p4d entries that do not belong to
it, and RPAL needs to resolve compatibility issues with other subsystems
caused by this.

This patch implements the rpal_map_service() interface to complete the
mutual copying of p4d entries between two RPAL services. For the copied p4d
entries, RPAL adds a _PAGE_RPAL_IGN flag to them. This flag makes
p4d_none() return true and p4d_present() return false, ensuring that these
p4d entries are invisible to other kernel subsystems. The protection of p4d
entries is guaranteed by the memory balloon, which ensures that the address
space corresponding to the p4d entries is not used by the current service.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/include/asm/pgtable.h       |  25 ++++
 arch/x86/include/asm/pgtable_types.h |  11 ++
 arch/x86/rpal/internal.h             |   2 +
 arch/x86/rpal/mm.c                   | 175 +++++++++++++++++++++++++++
 4 files changed, 213 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5ddba366d3b4..54351bfe4e47 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1137,12 +1137,37 @@ static inline int pud_bad(pud_t pud)
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline int p4d_none(p4d_t p4d)
 {
+#if IS_ENABLED(CONFIG_RPAL)
+	p4dval_t p4dv = native_p4d_val(p4d);
+
+	/*
+	 * Since RPAL copy p4d entry to share address space,
+	 * it is important that other process will not manipulate
+	 * this copied p4d. Thus, make p4d_none() always return
+	 * 0 to bypass kernel page table logic on copied p4d.
+	 */
+	return (p4dv & _PAGE_RPAL_IGN) ||
+	       ((p4dv & ~(_PAGE_KNL_ERRATUM_MASK)) == 0);
+#else
 	return (native_p4d_val(p4d) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
+#endif
 }
 
 static inline int p4d_present(p4d_t p4d)
 {
+#if IS_ENABLED(CONFIG_RPAL)
+	p4dval_t p4df = p4d_flags(p4d);
+
+	/*
+	 * Since RPAL copy p4d entry to share address space,
+	 * it is important that other process will not manipulate
+	 * this copied p4d. Thus, make p4d_present() always return
+	 * 0 to bypass kernel page table logic on copied p4d.
+	 */
+	return ((p4df & (_PAGE_PRESENT | _PAGE_RPAL_IGN)) == _PAGE_PRESENT);
+#else
 	return p4d_flags(p4d) & _PAGE_PRESENT;
+#endif
 }
 
 static inline pud_t *p4d_pgtable(p4d_t p4d)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74ec5c3643b..781b0f5bc359 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -35,6 +35,13 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_KERNEL_4K	_PAGE_BIT_SOFTW3 /* page must not be converted to large */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
+/*
+ * _PAGE_BIT_SOFTW1 is used by _PAGE_BIT_SPECIAL.
+ * but we are not conflicted with _PAGE_BIT_SPECIAL
+ * as we use it only on p4d/pud level and _PAGE_BIT_SPECIAL
+ * is only used on pte level.
+ */
+#define _PAGE_BIT_RPAL_IGN	_PAGE_BIT_SOFTW1
 
 #ifdef CONFIG_X86_64
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
@@ -95,6 +102,10 @@
 #define _PAGE_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#if IS_ENABLED(CONFIG_RPAL)
+#define _PAGE_RPAL_IGN	(_AT(pteval_t, 1) << _PAGE_BIT_RPAL_IGN)
+#endif
+
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 3559c9c6e868..65f2cf4baf8f 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -34,6 +34,8 @@ static inline void rpal_put_shared_page(struct rpal_shared_page *rsp)
 int rpal_mmap(struct file *filp, struct vm_area_struct *vma);
 struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs,
 					       unsigned long addr);
+int rpal_map_service(struct rpal_service *tgt);
+void rpal_unmap_service(struct rpal_service *tgt);
 
 /* thread.c */
 int rpal_register_sender(unsigned long addr);
diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c
index 8a738c502d1d..f1003baae001 100644
--- a/arch/x86/rpal/mm.c
+++ b/arch/x86/rpal/mm.c
@@ -215,3 +215,178 @@ void rpal_exit_mmap(struct mm_struct *mm)
 		rpal_put_service(rs);
 	}
 }
+
+/*
+ * Since the user address space size of rpal process is 512G, which
+ * is the size of one p4d, we assume p4d entry will never change after
+ * rpal process is created.
+ */
+static int mm_link_p4d(struct mm_struct *dst_mm, p4d_t src_p4d,
+		       unsigned long addr)
+{
+	spinlock_t *dst_ptl = &dst_mm->page_table_lock;
+	unsigned long flags;
+	pgd_t *dst_pgdp;
+	p4d_t p4d, *dst_p4dp;
+	p4dval_t p4dv;
+	int ret = 0;
+
+	BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4);
+
+	mmap_write_lock(dst_mm);
+	spin_lock_irqsave(dst_ptl, flags);
+	dst_pgdp = pgd_offset(dst_mm, addr);
+	/*
+	 * dst_pgd must exists, otherwise we need to alloc pgd entry. When
+	 * src_p4d is freed, we also need to free the pgd entry. This should
+	 * be supported in the future.
+	 */
+	if (unlikely(pgd_none_or_clear_bad(dst_pgdp))) {
+		rpal_err("cannot find pgd entry for addr 0x%016lx\n", addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	dst_p4dp = p4d_offset(dst_pgdp, addr);
+	if (unlikely(!p4d_none_or_clear_bad(dst_p4dp))) {
+		rpal_err("p4d is previously mapped\n");
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	p4dv = p4d_val(src_p4d);
+
+	/*
+	 * Since RPAL copy p4d entry to share address space,
+	 * it is important that other process will not manipulate
+	 * this copied p4d. We need mark the copied p4d and make
+	 * p4d_present() and p4d_none() ignore such p4d.
+	 */
+	p4dv |= _PAGE_RPAL_IGN;
+
+	if (boot_cpu_has(X86_FEATURE_PTI))
+		p4d = native_make_p4d((~_PAGE_NX) & p4dv);
+	else
+		p4d = native_make_p4d(p4dv);
+
+	set_p4d(dst_p4dp, p4d);
+	spin_unlock_irqrestore(dst_ptl, flags);
+	mmap_write_unlock(dst_mm);
+
+	return 0;
+unlock:
+	spin_unlock_irqrestore(dst_ptl, flags);
+	mmap_write_unlock(dst_mm);
+	return ret;
+}
+
+static void mm_unlink_p4d(struct mm_struct *mm, unsigned long addr)
+{
+	spinlock_t *ptl = &mm->page_table_lock;
+	unsigned long flags;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+
+	mmap_write_lock(mm);
+	spin_lock_irqsave(ptl, flags);
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_offset(pgdp, addr);
+	p4d_clear(p4dp);
+	spin_unlock_irqrestore(ptl, flags);
+	mmap_write_unlock(mm);
+
+	flush_tlb_mm(mm);
+}
+
+static int get_mm_p4d(struct mm_struct *mm, unsigned long addr, p4d_t *srcp)
+{
+	spinlock_t *ptl;
+	unsigned long flags;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	int ret = 0;
+
+	ptl = &mm->page_table_lock;
+	spin_lock_irqsave(ptl, flags);
+	pgdp = pgd_offset(mm, addr);
+	if (pgd_none(*pgdp)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	p4dp = p4d_offset(pgdp, addr);
+	if (p4d_none(*p4dp) || p4d_bad(*p4dp)) {
+		ret = -EINVAL;
+		goto out;
+	}
+	*srcp = *p4dp;
+
+out:
+	spin_unlock_irqrestore(ptl, flags);
+
+	return ret;
+}
+
+int rpal_map_service(struct rpal_service *tgt)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct mm_struct *cur_mm, *tgt_mm;
+	unsigned long cur_addr, tgt_addr;
+	p4d_t cur_p4d, tgt_p4d;
+	int ret = 0;
+
+	cur_mm = current->mm;
+	tgt_mm = tgt->mm;
+	if (!mmget_not_zero(tgt_mm)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	cur_addr = rpal_get_base(cur);
+	tgt_addr = rpal_get_base(tgt);
+
+	ret = get_mm_p4d(tgt_mm, tgt_addr, &tgt_p4d);
+	if (ret)
+		goto put_tgt;
+
+	ret = get_mm_p4d(cur_mm, cur_addr, &cur_p4d);
+	if (ret)
+		goto put_tgt;
+
+	ret = mm_link_p4d(cur_mm, tgt_p4d, tgt_addr);
+	if (ret)
+		goto put_tgt;
+
+	ret = mm_link_p4d(tgt_mm, cur_p4d, cur_addr);
+	if (ret) {
+		mm_unlink_p4d(cur_mm, tgt_addr);
+		goto put_tgt;
+	}
+
+put_tgt:
+	mmput(tgt_mm);
+out:
+	return ret;
+}
+
+void rpal_unmap_service(struct rpal_service *tgt)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct mm_struct *cur_mm, *tgt_mm;
+	unsigned long cur_addr, tgt_addr;
+
+	cur_mm = current->mm;
+	tgt_mm = tgt->mm;
+
+	cur_addr = rpal_get_base(cur);
+	tgt_addr = rpal_get_base(tgt);
+
+	if (mmget_not_zero(tgt_mm)) {
+		mm_unlink_p4d(tgt_mm, cur_addr);
+		mmput(tgt_mm);
+	} else {
+		/* If tgt has exited, then we get a NULL tgt_mm */
+		pr_debug("rpal: [%d] cannot find target mm\n", current->pid);
+	}
+	mm_unlink_p4d(cur_mm, tgt->base);
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 10/35] RPAL: allow service enable/disable
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (8 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 09/35] RPAL: enable address space sharing Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 11/35] RPAL: add service request/release Bo Li
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Since RPAL involves communication between services, and services require
certain preparations (e.g., registering senders/receivers) before
communication, the kernel needs to sense whether a service is ready to
perform RPAL call-related operations.

This patch adds two interfaces: rpal_enable_service() and
rpal_disable_service(). rpal_enable_service() passes necessary information
to the kernel and marks the service as enabled. RPAL only permits
communication between services in the enabled state. rpal_disable_service()
clears the service's enabled state, thereby prohibiting communication
between the service and others via RPAL.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/internal.h |  2 ++
 arch/x86/rpal/proc.c     |  6 +++++
 arch/x86/rpal/service.c  | 50 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h     | 18 +++++++++++++++
 4 files changed, 76 insertions(+)

diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 65f2cf4baf8f..769d3bbe5a6b 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -17,6 +17,8 @@ extern bool rpal_inited;
 /* service.c */
 int __init rpal_service_init(void);
 void __init rpal_service_exit(void);
+int rpal_enable_service(unsigned long arg);
+int rpal_disable_service(void);
 
 /* mm.c */
 static inline struct rpal_shared_page *
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index 8a1e4a8a2271..acd814f31649 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -63,6 +63,12 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case RPAL_IOCTL_UNREGISTER_RECEIVER:
 		ret = rpal_unregister_receiver();
 		break;
+	case RPAL_IOCTL_ENABLE_SERVICE:
+		ret = rpal_enable_service(arg);
+		break;
+	case RPAL_IOCTL_DISABLE_SERVICE:
+		ret = rpal_disable_service();
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 42fb719dbb2a..8a7b679bc28b 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -177,6 +177,7 @@ struct rpal_service *rpal_register_service(void)
 	rs->nr_shared_pages = 0;
 	INIT_LIST_HEAD(&rs->shared_pages);
 	atomic_set(&rs->thread_cnt, 0);
+	rs->enabled = false;
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -228,6 +229,52 @@ void rpal_unregister_service(struct rpal_service *rs)
 	rpal_put_service(rs);
 }
 
+int rpal_enable_service(unsigned long arg)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_service_metadata rsm;
+	int ret = 0;
+
+	if (cur->bad_service) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = copy_from_user(&rsm, (void __user *)arg, sizeof(rsm));
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	mutex_lock(&cur->mutex);
+	if (!cur->enabled) {
+		cur->rsm = rsm;
+		cur->enabled = true;
+	}
+	mutex_unlock(&cur->mutex);
+
+out:
+	return ret;
+}
+
+int rpal_disable_service(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	int ret = 0;
+
+	mutex_lock(&cur->mutex);
+	if (cur->enabled) {
+		cur->enabled = false;
+	} else {
+		ret = -EINVAL;
+		goto unlock_mutex;
+	}
+
+unlock_mutex:
+	mutex_unlock(&cur->mutex);
+	return ret;
+}
+
 void copy_rpal(struct task_struct *p)
 {
 	struct rpal_service *cur = rpal_current_service();
@@ -244,6 +291,9 @@ void exit_rpal(bool group_dead)
 
 	exit_rpal_thread();
 
+	if (group_dead)
+		rpal_disable_service();
+
 	current->rpal_rs = NULL;
 	rpal_put_service(rs);
 
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index c33425e896af..2e5010602177 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -84,6 +84,14 @@ enum rpal_task_flag_bits {
 	RPAL_RECEIVER_BIT,
 };
 
+/*
+ * user_meta will be sent to other service when requested.
+ */
+struct rpal_service_metadata {
+	unsigned long version;
+	void __user *user_meta;
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -125,6 +133,10 @@ struct rpal_service {
 	/* sender/receiver thread count */
 	atomic_t thread_cnt;
 
+	/* service metadata */
+	bool enabled;
+	struct rpal_service_metadata rsm;
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -206,6 +218,8 @@ enum rpal_command_type {
 	RPAL_CMD_UNREGISTER_SENDER,
 	RPAL_CMD_REGISTER_RECEIVER,
 	RPAL_CMD_UNREGISTER_RECEIVER,
+	RPAL_CMD_ENABLE_SERVICE,
+	RPAL_CMD_DISABLE_SERVICE,
 	RPAL_NR_CMD,
 };
 
@@ -226,6 +240,10 @@ enum rpal_command_type {
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long)
 #define RPAL_IOCTL_UNREGISTER_RECEIVER \
 	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER)
+#define RPAL_IOCTL_ENABLE_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long)
+#define RPAL_IOCTL_DISABLE_SERVICE \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE)
 
 /**
  * @brief get new reference to a rpal service, a corresponding
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 11/35] RPAL: add service request/release
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (9 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 10/35] RPAL: allow service enable/disable Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 12/35] RPAL: enable service disable notification Bo Li
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Services communicating via RPAL require a series of operations to perform
RPAL calls, such as mapping each other's memory and obtaining each other's
metadata.

This patch adds the rpal_request_service() and rpal_release_service()
interfaces. Before communication, services must first complete a handshake
process by mutually requesting each other. Only after both parties have
completed their requests will RPAL copy each other's p4d entries into the
other party's page tables, thereby achieving address space sharing. The
patch defines RPAL_REQUEST_MAP and RPAL_REVERSE_MAP to indicate whether a
service has requested another service or has been requested by another
service.

rpal_release_service() can release previously requested services, which
triggers the removal of mutual p4d entries and terminates address space
sharing. When a service exits the enabled state, the kernel will release
all services it has ever requested, thereby terminating all address space
sharing involving this service.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/internal.h |   5 +
 arch/x86/rpal/proc.c     |   6 +
 arch/x86/rpal/service.c  | 265 ++++++++++++++++++++++++++++++++++++++-
 include/linux/rpal.h     |  42 +++++++
 4 files changed, 316 insertions(+), 2 deletions(-)

diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 769d3bbe5a6b..c504b6efff64 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -12,6 +12,9 @@
 #include <linux/mm.h>
 #include <linux/file.h>
 
+#define RPAL_REQUEST_MAP 0x1
+#define RPAL_REVERSE_MAP 0x2
+
 extern bool rpal_inited;
 
 /* service.c */
@@ -19,6 +22,8 @@ int __init rpal_service_init(void);
 void __init rpal_service_exit(void);
 int rpal_enable_service(unsigned long arg);
 int rpal_disable_service(void);
+int rpal_request_service(unsigned long arg);
+int rpal_release_service(u64 key);
 
 /* mm.c */
 static inline struct rpal_shared_page *
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index acd814f31649..f001afd40562 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -69,6 +69,12 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case RPAL_IOCTL_DISABLE_SERVICE:
 		ret = rpal_disable_service();
 		break;
+	case RPAL_IOCTL_REQUEST_SERVICE:
+		ret = rpal_request_service(arg);
+		break;
+	case RPAL_IOCTL_RELEASE_SERVICE:
+		ret = rpal_release_service(arg);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 8a7b679bc28b..16a2155873a1 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -178,6 +178,9 @@ struct rpal_service *rpal_register_service(void)
 	INIT_LIST_HEAD(&rs->shared_pages);
 	atomic_set(&rs->thread_cnt, 0);
 	rs->enabled = false;
+	atomic_set(&rs->req_avail_cnt, MAX_REQUEST_SERVICE);
+	bitmap_zero(rs->requested_service_bitmap, RPAL_NR_ID);
+	spin_lock_init(&rs->lock);
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -229,6 +232,262 @@ void rpal_unregister_service(struct rpal_service *rs)
 	rpal_put_service(rs);
 }
 
+static inline void set_requested_service_bitmap(struct rpal_service *rs, int id)
+{
+	set_bit(id, rs->requested_service_bitmap);
+}
+
+static inline void clear_requested_service_bitmap(struct rpal_service *rs, int id)
+{
+	clear_bit(id, rs->requested_service_bitmap);
+}
+
+static int add_mapped_service(struct rpal_service *rs, struct rpal_service *tgt,
+			      int type_bit)
+{
+	struct rpal_mapped_service *node;
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&rs->lock, flags);
+	node = rpal_get_mapped_node(rs, tgt->id);
+	if (type_bit == RPAL_REQUEST_MAP) {
+		if (atomic_read(&rs->req_avail_cnt) == 0) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+	}
+
+	if (node->rs == NULL) {
+		node->rs = rpal_get_service(tgt);
+		set_bit(type_bit, &node->type);
+	} else {
+		if (node->rs != tgt) {
+			ret = -EINVAL;
+			goto unlock;
+		} else {
+			if (test_and_set_bit(type_bit, &node->type)) {
+				ret = -EINVAL;
+				goto unlock;
+			}
+		}
+	}
+
+	if (type_bit == RPAL_REQUEST_MAP) {
+		set_requested_service_bitmap(rs, tgt->id);
+		atomic_dec(&rs->req_avail_cnt);
+	}
+
+unlock:
+	spin_unlock_irqrestore(&rs->lock, flags);
+	return ret;
+}
+
+static void remove_mapped_service(struct rpal_service *rs, int id, int type_bit)
+{
+	struct rpal_mapped_service *node;
+	struct rpal_service *t;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rs->lock, flags);
+	node = rpal_get_mapped_node(rs, id);
+	if (node->rs == NULL)
+		goto unlock;
+
+	clear_bit(type_bit, &node->type);
+	if (type_bit == RPAL_REQUEST_MAP) {
+		clear_requested_service_bitmap(rs, id);
+		atomic_inc(&rs->req_avail_cnt);
+	}
+
+	if (node->type == 0) {
+		t = node->rs;
+		node->rs = NULL;
+		rpal_put_service(t);
+	}
+
+unlock:
+	spin_unlock_irqrestore(&rs->lock, flags);
+}
+
+static bool ready_to_map(struct rpal_service *cur, int tgt_id)
+{
+	struct rpal_mapped_service *node;
+	unsigned long flags;
+	bool need_map = false;
+
+	spin_lock_irqsave(&cur->lock, flags);
+	node = rpal_get_mapped_node(cur, tgt_id);
+	if (test_bit(RPAL_REQUEST_MAP, &node->type) &&
+	    test_bit(RPAL_REVERSE_MAP, &node->type)) {
+		need_map = true;
+	}
+	spin_unlock_irqrestore(&cur->lock, flags);
+
+	return need_map;
+}
+
+int rpal_request_service(unsigned long arg)
+{
+	struct rpal_service *cur, *tgt;
+	struct rpal_request_arg rra;
+	long ret = 0;
+	int id;
+
+	cur = rpal_current_service();
+
+	if (copy_from_user(&rra, (void __user *)arg, sizeof(rra))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (cur->key == rra.key) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (atomic_read(&cur->req_avail_cnt) == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mutex_lock(&cur->mutex);
+
+	if (!cur->enabled) {
+		ret = -EINVAL;
+		goto unlock_mutex;
+	}
+
+	tgt = rpal_get_service_by_key(rra.key);
+	if (tgt == NULL) {
+		ret = -EINVAL;
+		goto unlock_mutex;
+	}
+
+	if (!tgt->enabled) {
+		ret = -EPERM;
+		goto put_service;
+	}
+
+	ret = put_user((unsigned long)(tgt->rsm.user_meta), rra.user_metap);
+	if (ret) {
+		ret = -EFAULT;
+		goto put_service;
+	}
+
+	ret = put_user(tgt->id, rra.id);
+	if (ret) {
+		ret = -EFAULT;
+		goto put_service;
+	}
+
+	id = tgt->id;
+	ret = add_mapped_service(cur, tgt, RPAL_REQUEST_MAP);
+	if (ret < 0)
+		goto put_service;
+
+	ret = add_mapped_service(tgt, cur, RPAL_REVERSE_MAP);
+	if (ret < 0)
+		goto remove_request;
+
+	/* only map shared address space when both process request each other */
+	if (ready_to_map(cur, id)) {
+		ret = rpal_map_service(tgt);
+		if (ret < 0)
+			goto remove_reverse;
+	}
+
+	mutex_unlock(&cur->mutex);
+
+	rpal_put_service(tgt);
+
+	return 0;
+
+remove_reverse:
+	remove_mapped_service(tgt, cur->id, RPAL_REVERSE_MAP);
+remove_request:
+	remove_mapped_service(cur, tgt->id, RPAL_REQUEST_MAP);
+put_service:
+	rpal_put_service(tgt);
+unlock_mutex:
+	mutex_unlock(&cur->mutex);
+out:
+	return ret;
+}
+
+static int release_service(struct rpal_service *cur, struct rpal_service *tgt)
+{
+	remove_mapped_service(tgt, cur->id, RPAL_REVERSE_MAP);
+	remove_mapped_service(cur, tgt->id, RPAL_REQUEST_MAP);
+	rpal_unmap_service(tgt);
+
+	return 0;
+}
+
+static void rpal_release_service_all(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_service *tgt;
+	int ret, i;
+
+	rpal_for_each_requested_service(cur, i) {
+		struct rpal_mapped_service *node;
+
+		if (i == cur->id)
+			continue;
+		node = rpal_get_mapped_node(cur, i);
+		tgt = rpal_get_service(node->rs);
+		if (!tgt)
+			continue;
+
+		if (test_bit(RPAL_REQUEST_MAP, &node->type)) {
+			ret = release_service(cur, tgt);
+			if (unlikely(ret)) {
+				rpal_err("service %d release service %d fail\n",
+					 cur->id, tgt->id);
+			}
+		}
+		rpal_put_service(tgt);
+	}
+}
+
+int rpal_release_service(u64 key)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_service *tgt = NULL;
+	struct rpal_mapped_service *node;
+	int ret = 0;
+	int i;
+
+	mutex_lock(&cur->mutex);
+
+	if (cur->key == key) {
+		ret = -EINVAL;
+		goto unlock_mutex;
+	}
+
+	rpal_for_each_requested_service(cur, i) {
+		node = rpal_get_mapped_node(cur, i);
+		if (node->rs->key == key) {
+			tgt = rpal_get_service(node->rs);
+			break;
+		}
+	}
+
+	if (!tgt) {
+		ret = -EINVAL;
+		goto unlock_mutex;
+	}
+
+	ret = release_service(cur, tgt);
+
+	rpal_put_service(tgt);
+
+unlock_mutex:
+	mutex_unlock(&cur->mutex);
+	return ret;
+}
+
 int rpal_enable_service(unsigned long arg)
 {
 	struct rpal_service *cur = rpal_current_service();
@@ -270,6 +529,8 @@ int rpal_disable_service(void)
 		goto unlock_mutex;
 	}
 
+	rpal_release_service_all();
+
 unlock_mutex:
 	mutex_unlock(&cur->mutex);
 	return ret;
@@ -289,11 +550,11 @@ void exit_rpal(bool group_dead)
 	if (!rs)
 		return;
 
-	exit_rpal_thread();
-
 	if (group_dead)
 		rpal_disable_service();
 
+	exit_rpal_thread();
+
 	current->rpal_rs = NULL;
 	rpal_put_service(rs);
 
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 2e5010602177..1fe177523a36 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -77,6 +77,9 @@
 #define RPAL_ADDRESS_SPACE_LOW  ((0UL) + RPAL_ADDR_SPACE_SIZE)
 #define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SPACE_SIZE)
 
+/* No more than 15 services can be requested due to limitation of MPK. */
+#define MAX_REQUEST_SERVICE 15
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -92,6 +95,18 @@ struct rpal_service_metadata {
 	void __user *user_meta;
 };
 
+struct rpal_request_arg {
+	unsigned long version;
+	u64 key;
+	unsigned long __user *user_metap;
+	int __user *id;
+};
+
+struct rpal_mapped_service {
+	unsigned long type;
+	struct rpal_service *rs;
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -125,6 +140,8 @@ struct rpal_service {
      */
 	/* Mutex for time consuming operations */
 	struct mutex mutex;
+	/* spinlock for short operations */
+	spinlock_t lock;
 
 	/* pinned pages */
 	int nr_shared_pages;
@@ -137,6 +154,13 @@ struct rpal_service {
 	bool enabled;
 	struct rpal_service_metadata rsm;
 
+	/* the number of services allow to be requested */
+	atomic_t req_avail_cnt;
+
+	/* map for services required, being required and mapped  */
+	struct rpal_mapped_service service_map[RPAL_NR_ID];
+	DECLARE_BITMAP(requested_service_bitmap, RPAL_NR_ID);
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -220,6 +244,8 @@ enum rpal_command_type {
 	RPAL_CMD_UNREGISTER_RECEIVER,
 	RPAL_CMD_ENABLE_SERVICE,
 	RPAL_CMD_DISABLE_SERVICE,
+	RPAL_CMD_REQUEST_SERVICE,
+	RPAL_CMD_RELEASE_SERVICE,
 	RPAL_NR_CMD,
 };
 
@@ -244,6 +270,16 @@ enum rpal_command_type {
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long)
 #define RPAL_IOCTL_DISABLE_SERVICE \
 	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE)
+#define RPAL_IOCTL_REQUEST_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long)
+#define RPAL_IOCTL_RELEASE_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long)
+
+#define rpal_for_each_requested_service(rs, idx)                             \
+	for (idx = find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \
+	     idx < RPAL_NR_ID;                                               \
+	     idx = find_next_bit(rs->requested_service_bitmap, RPAL_NR_ID,   \
+				 idx + 1))
 
 /**
  * @brief get new reference to a rpal service, a corresponding
@@ -274,6 +310,12 @@ static inline unsigned long rpal_get_top(struct rpal_service *rs)
 	return rs->base + RPAL_ADDR_SPACE_SIZE;
 }
 
+static inline struct rpal_mapped_service *
+rpal_get_mapped_node(struct rpal_service *rs, int id)
+{
+	return &rs->service_map[id];
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 12/35] RPAL: enable service disable notification
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (10 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 11/35] RPAL: add service request/release Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 13/35] RPAL: add tlb flushing support Bo Li
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When a service is disabled, all services that request this service need
to be notified.

This patch use poll functions of file to implement such notification.
When a service is disabled, it will notify all services that request it
by set bit in others services' dead_key_bitmap. And the poll function
will then issue a poll epoll event, other services can aware the service
has been disabled.

The key of disabled service can be read from the proc file.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/proc.c    | 61 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/rpal/service.c | 37 +++++++++++++++++++++++--
 include/linux/rpal.h    | 10 +++++++
 3 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index f001afd40562..16ac9612bfc5 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -8,6 +8,7 @@
 
 #include <linux/rpal.h>
 #include <linux/proc_fs.h>
+#include <linux/poll.h>
 
 #include "internal.h"
 
@@ -82,10 +83,70 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return ret;
 }
 
+static ssize_t rpal_read(struct file *file, char __user *buf, size_t count,
+		  loff_t *pos)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_poll_data *rpd;
+	u64 released_keys[MAX_REQUEST_SERVICE];
+	unsigned long flags;
+	int nr_key = 0;
+	int nr_byte = 0;
+	int idx;
+
+	if (!cur)
+		return -EINVAL;
+
+	rpd = &cur->rpd;
+
+	spin_lock_irqsave(&rpd->poll_lock, flags);
+	idx = find_first_bit(rpd->dead_key_bitmap, RPAL_NR_ID);
+	while (idx < RPAL_NR_ID) {
+		released_keys[nr_key++] = rpd->dead_keys[idx];
+		idx = find_next_bit(rpd->dead_key_bitmap, RPAL_NR_ID, idx + 1);
+	}
+	spin_unlock_irqrestore(&rpd->poll_lock, flags);
+	nr_byte = nr_key * sizeof(u64);
+
+	if (copy_to_user(buf, released_keys, nr_byte)) {
+		nr_byte = -EAGAIN;
+		goto out;
+	}
+out:
+	return nr_byte;
+}
+
+static __poll_t rpal_poll(struct file *filep, struct poll_table_struct *wait)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_poll_data *rpd;
+	unsigned long flags;
+	__poll_t mask = 0;
+
+	if (unlikely(!cur)) {
+		rpal_err("Not a rpal service\n");
+		goto out;
+	}
+
+	rpd = &cur->rpd;
+
+	poll_wait(filep, &rpd->rpal_waitqueue, wait);
+
+	spin_lock_irqsave(&rpd->poll_lock, flags);
+	if (find_first_bit(rpd->dead_key_bitmap, RPAL_NR_ID) < RPAL_NR_ID)
+		mask |= EPOLLIN | EPOLLRDNORM;
+	spin_unlock_irqrestore(&rpd->poll_lock, flags);
+
+out:
+	return mask;
+}
+
 const struct proc_ops proc_rpal_operations = {
 	.proc_open = rpal_open,
+	.proc_read = rpal_read,
 	.proc_ioctl = rpal_ioctl,
 	.proc_mmap = rpal_mmap,
+	.proc_poll = rpal_poll,
 };
 
 static int __init proc_rpal_init(void)
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 16a2155873a1..f490ab07301d 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -181,6 +181,9 @@ struct rpal_service *rpal_register_service(void)
 	atomic_set(&rs->req_avail_cnt, MAX_REQUEST_SERVICE);
 	bitmap_zero(rs->requested_service_bitmap, RPAL_NR_ID);
 	spin_lock_init(&rs->lock);
+	spin_lock_init(&rs->rpd.poll_lock);
+	bitmap_zero(rs->rpd.dead_key_bitmap, RPAL_NR_ID);
+	init_waitqueue_head(&rs->rpd.rpal_waitqueue);
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -296,6 +299,7 @@ static void remove_mapped_service(struct rpal_service *rs, int id, int type_bit)
 
 	clear_bit(type_bit, &node->type);
 	if (type_bit == RPAL_REQUEST_MAP) {
+		clear_bit(id, rs->rpd.dead_key_bitmap);
 		clear_requested_service_bitmap(rs, id);
 		atomic_inc(&rs->req_avail_cnt);
 	}
@@ -424,15 +428,30 @@ static int release_service(struct rpal_service *cur, struct rpal_service *tgt)
 	return 0;
 }
 
+static void rpal_notify_disable(struct rpal_poll_data *rpd, u64 key, int id)
+{
+	unsigned long flags;
+	bool need_wake = false;
+
+	spin_lock_irqsave(&rpd->poll_lock, flags);
+	if (!test_bit(id, rpd->dead_key_bitmap)) {
+		need_wake = true;
+		rpd->dead_keys[id] = key;
+		set_bit(id, rpd->dead_key_bitmap);
+	}
+	spin_unlock_irqrestore(&rpd->poll_lock, flags);
+	if (need_wake)
+		wake_up_interruptible(&rpd->rpal_waitqueue);
+}
+
 static void rpal_release_service_all(void)
 {
 	struct rpal_service *cur = rpal_current_service();
 	struct rpal_service *tgt;
+	struct rpal_mapped_service *node;
 	int ret, i;
 
 	rpal_for_each_requested_service(cur, i) {
-		struct rpal_mapped_service *node;
-
 		if (i == cur->id)
 			continue;
 		node = rpal_get_mapped_node(cur, i);
@@ -449,6 +468,20 @@ static void rpal_release_service_all(void)
 		}
 		rpal_put_service(tgt);
 	}
+
+	for (i = 0; i < RPAL_NR_ID; i++) {
+		if (i == cur->id)
+			continue;
+
+		node = rpal_get_mapped_node(cur, i);
+		tgt = rpal_get_service(node->rs);
+		if (!tgt)
+			continue;
+
+		if (test_bit(RPAL_REVERSE_MAP, &node->type))
+			rpal_notify_disable(&tgt->rpd, cur->key, cur->id);
+		rpal_put_service(tgt);
+	}
 }
 
 int rpal_release_service(u64 key)
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 1fe177523a36..b9622f0235bf 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -107,6 +107,13 @@ struct rpal_mapped_service {
 	struct rpal_service *rs;
 };
 
+struct rpal_poll_data {
+	spinlock_t poll_lock;
+	u64 dead_keys[RPAL_NR_ID];
+	DECLARE_BITMAP(dead_key_bitmap, RPAL_NR_ID);
+	wait_queue_head_t rpal_waitqueue;
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -161,6 +168,9 @@ struct rpal_service {
 	struct rpal_mapped_service service_map[RPAL_NR_ID];
 	DECLARE_BITMAP(requested_service_bitmap, RPAL_NR_ID);
 
+	/* Notify service is released by others */
+	struct rpal_poll_data rpd;
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 13/35] RPAL: add tlb flushing support
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (11 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 12/35] RPAL: enable service disable notification Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 14/35] RPAL: enable page fault handling Bo Li
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When a thread flushes the TLB, since the address space is shared,
not only other threads in the current process but also other processes
that share the address space may access the corresponding memory (related
to the TLB flush). Therefore, the cpuset used for TLB flushing should be
the union of the mm_cpumasks of all processes that share the address
space.

This patch extend flush_tlb_info to store other process's mm_struct,
and when a CPU in the union of the mm_cpumasks if invoked to handle
tlb flushing, it will check whether cpu_tlbstate.loaded_mm matches any
of mm_structs stored in flush_tlb_info. If match, the CPU will do local
tlb flushing for that mm_struct.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/include/asm/tlbflush.h |  10 ++
 arch/x86/mm/tlb.c               | 172 ++++++++++++++++++++++++++++++++
 arch/x86/rpal/internal.h        |   3 -
 include/linux/rpal.h            |  12 +++
 mm/rmap.c                       |   4 +
 5 files changed, 198 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..f57b745af75c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -227,6 +227,11 @@ struct flush_tlb_info {
 	u8			stride_shift;
 	u8			freed_tables;
 	u8			trim_cpumask;
+#ifdef CONFIG_RPAL
+	struct mm_struct **mm_list;
+	u64 *tlb_gen_list;
+	int nr_mm;
+#endif
 };
 
 void flush_tlb_local(void);
@@ -356,6 +361,11 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
+#ifdef CONFIG_RPAL
+void rpal_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+	struct mm_struct *mm);
+#endif
+
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..a0fe17b13887 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/rpal.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1361,6 +1362,169 @@ void flush_tlb_multi(const struct cpumask *cpumask,
 	__flush_tlb_multi(cpumask, info);
 }
 
+#ifdef CONFIG_RPAL
+static void rpal_flush_tlb_func_remote(void *info)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	struct flush_tlb_info *f = info;
+	struct flush_tlb_info tf = *f;
+	int i;
+
+	/* As it comes from RPAL path, f->mm cannot be NULL */
+	if (f->mm == loaded_mm) {
+		flush_tlb_func(f);
+		return;
+	}
+
+	for (i = 0; i < f->nr_mm; i++) {
+		/* We always have f->mm_list[i] != NULL */
+		if (f->mm_list[i] == loaded_mm) {
+			tf.mm = f->mm_list[i];
+			tf.new_tlb_gen = f->tlb_gen_list[i];
+			flush_tlb_func(&tf);
+			return;
+		}
+	}
+}
+
+static void rpal_flush_tlb_func_multi(const struct cpumask *cpumask,
+			       const struct flush_tlb_info *info)
+{
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	if (info->end == TLB_FLUSH_ALL)
+		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
+	else
+		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
+				(info->end - info->start) >> PAGE_SHIFT);
+
+	if (info->freed_tables || mm_in_asid_transition(info->mm))
+		on_each_cpu_mask(cpumask, rpal_flush_tlb_func_remote,
+				 (void *)info, true);
+	else
+		on_each_cpu_cond_mask(should_flush_tlb,
+				      rpal_flush_tlb_func_remote, (void *)info,
+				      1, cpumask);
+}
+
+static void rpal_flush_tlb_func_local(struct mm_struct *mm, int cpu,
+				      struct flush_tlb_info *info,
+				      u64 new_tlb_gen)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+	if (loaded_mm == info->mm) {
+		lockdep_assert_irqs_enabled();
+		local_irq_disable();
+		flush_tlb_func(info);
+		local_irq_enable();
+	} else {
+		int i;
+
+		for (i = 0; i < info->nr_mm; i++) {
+			if (info->mm_list[i] == loaded_mm) {
+				lockdep_assert_irqs_enabled();
+				local_irq_disable();
+				info->mm = info->mm_list[i];
+				info->new_tlb_gen = info->tlb_gen_list[i];
+				flush_tlb_func(info);
+				info->mm = mm;
+				info->new_tlb_gen = new_tlb_gen;
+				local_irq_enable();
+			}
+		}
+	}
+}
+
+static void rpal_flush_tlb_mm_range(struct mm_struct *mm, int cpu,
+			     struct flush_tlb_info *info, u64 new_tlb_gen)
+{
+	struct rpal_service *cur = mm->rpal_rs;
+	cpumask_t merged_mask;
+	struct mm_struct *mm_list[MAX_REQUEST_SERVICE];
+	u64 tlb_gen_list[MAX_REQUEST_SERVICE];
+	int nr_mm = 0;
+	int i;
+
+	cpumask_copy(&merged_mask, mm_cpumask(mm));
+	if (cur) {
+		struct rpal_service *tgt;
+		struct mm_struct *tgt_mm;
+
+		rpal_for_each_requested_service(cur, i) {
+			struct rpal_mapped_service *node;
+
+			if (i == cur->id)
+				continue;
+			node = rpal_get_mapped_node(cur, i);
+			if (!rpal_service_mapped(node))
+				continue;
+
+			tgt = rpal_get_service(node->rs);
+			if (!tgt)
+				continue;
+			tgt_mm = tgt->mm;
+			if (!mmget_not_zero(tgt_mm)) {
+				rpal_put_service(tgt);
+				continue;
+			}
+			mm_list[nr_mm] = tgt_mm;
+			tlb_gen_list[nr_mm] = inc_mm_tlb_gen(tgt_mm);
+
+			nr_mm++;
+			cpumask_or(&merged_mask, &merged_mask,
+				   mm_cpumask(tgt_mm));
+			rpal_put_service(tgt);
+		}
+		info->mm_list = mm_list;
+		info->tlb_gen_list = tlb_gen_list;
+		info->nr_mm = nr_mm;
+	}
+
+	if (cpumask_any_but(&merged_mask, cpu) < nr_cpu_ids)
+		rpal_flush_tlb_func_multi(&merged_mask, info);
+	else
+		rpal_flush_tlb_func_local(mm, cpu, info, new_tlb_gen);
+
+	for (i = 0; i < nr_mm; i++)
+		mmput_async(mm_list[i]);
+}
+
+void rpal_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+			      struct mm_struct *mm)
+{
+	struct rpal_service *cur = mm->rpal_rs;
+	struct rpal_service *tgt;
+	struct mm_struct *tgt_mm;
+	int i;
+
+	rpal_for_each_requested_service(cur, i) {
+		struct rpal_mapped_service *node;
+
+		if (i == cur->id)
+			continue;
+
+		node = rpal_get_mapped_node(cur, i);
+		if (!rpal_service_mapped(node))
+			continue;
+
+		tgt = rpal_get_service(node->rs);
+		if (!tgt)
+			continue;
+		tgt_mm = tgt->mm;
+		if (!mmget_not_zero(tgt_mm)) {
+			rpal_put_service(tgt);
+			continue;
+		}
+		inc_mm_tlb_gen(tgt_mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask,
+			   mm_cpumask(tgt_mm));
+		mmu_notifier_arch_invalidate_secondary_tlbs(tgt_mm, 0, -1UL);
+		rpal_put_service(tgt);
+		mmput_async(tgt_mm);
+	}
+}
+#endif
+
 /*
  * See Documentation/arch/x86/tlb.rst for details.  We choose 33
  * because it is large enough to cover the vast majority (at
@@ -1439,6 +1603,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
+#if IS_ENABLED(CONFIG_RPAL)
+	if (mm->rpal_rs)
+		rpal_flush_tlb_mm_range(mm, cpu, info, new_tlb_gen);
+	else {
+#endif
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
 	 * a local TLB flush is needed. Optimize this use-case by calling
@@ -1456,6 +1625,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		flush_tlb_func(info);
 		local_irq_enable();
 	}
+#if IS_ENABLED(CONFIG_RPAL)
+	}
+#endif
 
 	put_flush_tlb_info();
 	put_cpu();
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index c504b6efff64..cf6d608a994a 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -12,9 +12,6 @@
 #include <linux/mm.h>
 #include <linux/file.h>
 
-#define RPAL_REQUEST_MAP 0x1
-#define RPAL_REVERSE_MAP 0x2
-
 extern bool rpal_inited;
 
 /* service.c */
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index b9622f0235bf..36be1ab6a9f3 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -80,6 +80,11 @@
 /* No more than 15 services can be requested due to limitation of MPK. */
 #define MAX_REQUEST_SERVICE 15
 
+enum {
+	RPAL_REQUEST_MAP,
+	RPAL_REVERSE_MAP,
+};
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -326,6 +331,13 @@ rpal_get_mapped_node(struct rpal_service *rs, int id)
 	return &rs->service_map[id];
 }
 
+static inline bool rpal_service_mapped(struct rpal_mapped_service *node)
+{
+	unsigned long type = (1 << RPAL_REQUEST_MAP) | (1 << RPAL_REVERSE_MAP);
+
+	return (node->type & type) == type;
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index 67bb273dfb80..e68384f97ab9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -682,6 +682,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 		return;
 
 	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
+#ifdef CONFIG_RPAL
+	if (mm->rpal_rs)
+		rpal_tlbbatch_add_pending(&tlb_ubc->arch, mm);
+#endif
 	tlb_ubc->flush_required = true;
 
 	/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 14/35] RPAL: enable page fault handling
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (12 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 13/35] RPAL: add tlb flushing support Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30 13:59   ` Dave Hansen
  2025-05-30  9:27 ` [RFC v2 15/35] RPAL: add sender/receiver state Bo Li
                   ` (26 subsequent siblings)
  40 siblings, 1 reply; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL's address space sharing allows one process to access the memory of
another process, which may trigger page faults. To ensure programs can run
normally, RPAL needs to handle page faults occurring in the address space
of other processes. Additionally, to prevent processes from generating
coredumps due to invalid memory in other processes, RPAL must also restore
the current thread state to a pre-saved state under specific circumstances.

For handling page faults, by passing the correct vm_area_struct to
handle_page_fault(), RPAL locates the process corresponding to the address
where the page fault occurred and uses its mm_struct to handle the page
fault. Regarding thread state restoration, RPAL restores the thread's
state to a predefined state in userspace when it cannot locate the
mm_struct of the corresponding process (i.e., when the process has already
exited).

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/mm/fault.c     | 271 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/rpal/mm.c      |  34 +++++
 arch/x86/rpal/service.c |  24 ++++
 arch/x86/rpal/thread.c  |  23 ++++
 include/linux/rpal.h    |  81 ++++++++----
 5 files changed, 412 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 998bd807fc7b..35f7c60a5e4f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
 #include <linux/mm_types.h>
 #include <linux/mm.h>			/* find_and_lock_vma() */
 #include <linux/vmalloc.h>
+#include <linux/rpal.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1460,6 +1461,268 @@ trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
 		trace_page_fault_kernel(address, regs, error_code);
 }
 
+#if IS_ENABLED(CONFIG_RPAL)
+static void rpal_do_user_addr_fault(struct pt_regs *regs, unsigned long error_code,
+			     unsigned long address, struct mm_struct *real_mm)
+{
+	struct vm_area_struct *vma;
+	vm_fault_t fault;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+	if (unlikely(error_code & X86_PF_RSVD))
+		pgtable_bad(regs, error_code, address);
+
+	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
+		     !(error_code & X86_PF_USER) &&
+		     !(regs->flags & X86_EFLAGS_AC))) {
+		page_fault_oops(regs, error_code, address);
+		return;
+	}
+
+	if (unlikely(faulthandler_disabled())) {
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
+	if (WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF))) {
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
+	local_irq_enable();
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
+	if (error_code & X86_PF_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+	if (error_code & X86_PF_INSTR)
+		flags |= FAULT_FLAG_INSTRUCTION;
+
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+
+#ifdef CONFIG_X86_64
+	if (is_vsyscall_vaddr(address)) {
+		if (emulate_vsyscall(error_code, regs, address))
+			return;
+	}
+#endif
+
+	if (!(flags & FAULT_FLAG_USER))
+		goto lock_mmap;
+
+	vma = lock_vma_under_rcu(real_mm, address);
+	if (!vma)
+		goto lock_mmap;
+
+	if (unlikely(access_error(error_code, vma))) {
+		bad_area_access_error(regs, error_code, address, NULL, vma);
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		return;
+	}
+
+	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+	if (fault & VM_FAULT_MAJOR)
+		flags |= FAULT_FLAG_TRIED;
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR,
+						 ARCH_DEFAULT_PKEY);
+		return;
+	}
+lock_mmap:
+
+retry:
+	/*
+	 * Here we don't need to lock current->mm since no vma in
+	 * current->mm is used to handle this page fault. However,
+	 * we do need to lock real_mm, as the address belongs to
+	 * real_mm's vma.
+	 */
+	vma = lock_mm_and_find_vma(real_mm, address, regs);
+	if (unlikely(!vma)) {
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
+	if (unlikely(access_error(error_code, vma))) {
+		bad_area_access_error(regs, error_code, address, real_mm, vma);
+		return;
+	}
+
+	fault = handle_mm_fault(vma, address, flags, regs);
+
+	if (fault_signal_pending(fault, regs)) {
+		/*
+		 * Quick path to respond to signals.  The core mm code
+		 * has unlocked the mm for us if we get here.
+		 */
+		if (!user_mode(regs))
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR,
+						 ARCH_DEFAULT_PKEY);
+		return;
+	}
+
+	/* The fault is fully completed (including releasing mmap lock) */
+	if (fault & VM_FAULT_COMPLETED)
+		return;
+
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		flags |= FAULT_FLAG_TRIED;
+		goto retry;
+	}
+
+	mmap_read_unlock(real_mm);
+done:
+	if (likely(!(fault & VM_FAULT_ERROR)))
+		return;
+
+	if (fatal_signal_pending(current) && !user_mode(regs)) {
+		kernelmode_fixup_or_oops(regs, error_code, address, 0, 0,
+					 ARCH_DEFAULT_PKEY);
+		return;
+	}
+
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!user_mode(regs)) {
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGSEGV, SEGV_MAPERR,
+						 ARCH_DEFAULT_PKEY);
+			return;
+		}
+
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area_nosemaphore(regs, error_code, address);
+		else
+			BUG();
+	}
+}
+NOKPROBE_SYMBOL(rpal_do_user_addr_fault);
+
+static inline void rpal_try_to_rebuild_context(struct pt_regs *regs,
+					       unsigned long address,
+					       int error_code)
+{
+	int handle_more = 0;
+
+	/*
+	 * We only rebuild sender's context, as other threads are not supposed
+	 * to access other process's memory, thus they will not trigger a page
+	 * fault.
+	 */
+	handle_more = rpal_rebuild_sender_context_on_fault(regs, address, -1);
+	/*
+	 * If we are not able to rebuild sender's context, just
+	 * send a signal to let it coredump.
+	 */
+	if (handle_more)
+		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
+}
+
+/*
+ * Most logic of this function is copied from do_user_addr_fault().
+ * RPAL logic is added to handle special cases, such as find another
+ * process's mm and rebuild sender's context if such page table is
+ * not able to be handled.
+ */
+static bool rpal_try_user_addr_fault(struct pt_regs *regs, unsigned long error_code,
+			      unsigned long address)
+{
+	struct mm_struct *real_mm;
+	int rebuild = 0;
+
+	/* fast path: avoid mmget and mmput */
+	if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) ==
+		     X86_PF_INSTR)) {
+		/*
+		 * Whoops, this is kernel mode code trying to execute from
+		 * user memory.  Unless this is AMD erratum #93, which
+		 * corrupts RIP such that it looks like a user address,
+		 * this is unrecoverable.  Don't even try to look up the
+		 * VMA or look for extable entries.
+		 */
+		if (is_errata93(regs, address))
+			return true;
+
+		page_fault_oops(regs, error_code, address);
+		return true;
+	}
+
+	/* kprobes don't want to hook the spurious faults: */
+	if (WARN_ON_ONCE(kprobe_page_fault(regs, X86_TRAP_PF)))
+		return true;
+
+	real_mm = rpal_pf_get_real_mm(address, &rebuild);
+
+	if (real_mm) {
+#ifdef CONFIG_MEMCG
+		struct mem_cgroup *memcg = NULL;
+
+		prefetchw(&real_mm->mmap_lock);
+		/* try to charge page alloc to real_mm's memcg */
+		if (!current->active_memcg) {
+			memcg = get_mem_cgroup_from_mm(real_mm);
+			if (memcg)
+				set_active_memcg(memcg);
+		}
+		rpal_do_user_addr_fault(regs, error_code, address, real_mm);
+		if (memcg) {
+			set_active_memcg(NULL);
+			mem_cgroup_put(memcg);
+		}
+#else
+		prefetchw(&real_mm->mmap_lock);
+		rpal_do_user_addr_fault(regs, error_code, address, real_mm);
+#endif
+		mmput_async(real_mm);
+		return true;
+	} else if (user_mode(regs) && rebuild) {
+		rpal_try_to_rebuild_context(regs, address, -1);
+		return true;
+	}
+
+	return false;
+}
+
+static bool rpal_handle_page_fault(struct pt_regs *regs, unsigned long error_code,
+			   unsigned long address)
+{
+	struct rpal_service *cur = rpal_current_service();
+
+	/*
+	 * For RPAL process, it may access another process's memory and
+	 * there may be page fault. We handle this case with our own routine.
+	 * If we cannot handle this page fault, just let it go and handle
+	 * it as a normal page fault.
+	 */
+	if (cur && !rpal_is_correct_address(cur, address)) {
+		if (rpal_try_user_addr_fault(regs, error_code, address))
+			return true;
+	}
+	return false;
+}
+#endif
+
 static __always_inline void
 handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 			      unsigned long address)
@@ -1473,7 +1736,15 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (unlikely(fault_in_kernel_space(address))) {
 		do_kern_addr_fault(regs, error_code, address);
 	} else {
+#ifdef CONFIG_RPAL
+		if (rpal_handle_page_fault(regs, error_code, address)) {
+			local_irq_disable();
+			return;
+		}
+		do_user_addr_fault(regs, error_code, address);
+#else /* !CONFIG_RPAL */
 		do_user_addr_fault(regs, error_code, address);
+#endif
 		/*
 		 * User address page fault handling might have reenabled
 		 * interrupts. Fixing up all potential exit points of
diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c
index f1003baae001..be7714ede2bf 100644
--- a/arch/x86/rpal/mm.c
+++ b/arch/x86/rpal/mm.c
@@ -390,3 +390,37 @@ void rpal_unmap_service(struct rpal_service *tgt)
 	}
 	mm_unlink_p4d(cur_mm, tgt->base);
 }
+
+static inline bool check_service_mapped(struct rpal_service *cur, int tgt_id)
+{
+	struct rpal_mapped_service *node;
+	bool is_mapped = true;
+	unsigned long type = (1 << RPAL_REVERSE_MAP) | (1 << RPAL_REQUEST_MAP);
+
+	node = rpal_get_mapped_node(cur, tgt_id);
+	if (unlikely((node->type & type) != type))
+		is_mapped = false;
+
+	return is_mapped;
+}
+
+struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild)
+{
+	struct rpal_service *cur, *tgt;
+	struct mm_struct *mm = NULL;
+
+	cur = rpal_current_service();
+
+	tgt = rpal_get_mapped_service_by_addr(cur, address);
+	if (tgt == NULL)
+		goto out;
+
+	mm = tgt->mm;
+	if (unlikely(!check_service_mapped(cur, tgt->id) ||
+		     !mmget_not_zero(mm)))
+		mm = NULL;
+	*rebuild = 1;
+	rpal_put_service(tgt);
+out:
+	return mm;
+}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index f490ab07301d..49458321e7dc 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -148,6 +148,30 @@ static inline unsigned long calculate_base_address(int id)
 	return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id;
 }
 
+struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs,
+						   int id)
+{
+	struct rpal_service *ret;
+
+	if (!is_valid_id(id))
+		return NULL;
+
+	ret = rpal_get_service(rs->service_map[id].rs);
+
+	return ret;
+}
+
+/* This function must be called after rpal_is_correct_address () */
+struct rpal_service *rpal_get_mapped_service_by_addr(struct rpal_service *rs,
+						     unsigned long addr)
+{
+	int id;
+
+	id = (addr - RPAL_ADDRESS_SPACE_LOW) / RPAL_ADDR_SPACE_SIZE;
+
+	return rpal_get_mapped_service_by_id(rs, id);
+}
+
 struct rpal_service *rpal_register_service(void)
 {
 	struct rpal_service *rs;
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index 7550ad94b63f..e50a4c865ff8 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -155,6 +155,29 @@ int rpal_unregister_receiver(void)
 	return ret;
 }
 
+int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
+					 unsigned long addr, int error_code)
+{
+	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) {
+		struct rpal_sender_call_context *scc = current->rpal_sd->scc;
+		unsigned long erip, ersp;
+		int magic;
+
+		erip = scc->ec.erip;
+		ersp = scc->ec.ersp;
+		magic = scc->ec.magic;
+		if (magic == RPAL_ERROR_MAGIC) {
+			regs->ax = error_code;
+			regs->ip = erip;
+			regs->sp = ersp;
+			/* avoid rebuild again */
+			scc->ec.magic = 0;
+			return 0;
+		}
+	}
+	return -EINVAL;
+}
+
 void exit_rpal_thread(void)
 {
 	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT))
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 36be1ab6a9f3..3310d222739e 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -85,6 +85,8 @@ enum {
 	RPAL_REVERSE_MAP,
 };
 
+#define RPAL_ERROR_MAGIC 0x98CC98CC
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -198,23 +200,6 @@ struct rpal_version_info {
 	unsigned long cap;
 };
 
-/* End */
-
-struct rpal_shared_page {
-	unsigned long user_start;
-	unsigned long kernel_start;
-	int npage;
-	atomic_t refcnt;
-	struct list_head list;
-};
-
-struct rpal_common_data {
-	/* back pointer to task_struct */
-	struct task_struct *bp_task;
-	/* service id of rpal_service */
-	int service_id;
-};
-
 /* User registers state */
 struct rpal_task_context {
 	u64 r15;
@@ -232,17 +217,44 @@ struct rpal_receiver_call_context {
 	int receiver_id;
 };
 
-struct rpal_receiver_data {
-	struct rpal_common_data rcd;
-	struct rpal_shared_page *rsp;
-	struct rpal_receiver_call_context *rcc;
+/* recovery point for sender */
+struct rpal_error_context {
+	unsigned long fsbase;
+	u64 erip;
+	u64 ersp;
+	int state;
+	int magic;
 };
 
 struct rpal_sender_call_context {
 	struct rpal_task_context rtc;
+	struct rpal_error_context ec;
 	int sender_id;
 };
 
+/* End */
+
+struct rpal_shared_page {
+	unsigned long user_start;
+	unsigned long kernel_start;
+	int npage;
+	atomic_t refcnt;
+	struct list_head list;
+};
+
+struct rpal_common_data {
+	/* back pointer to task_struct */
+	struct task_struct *bp_task;
+	/* service id of rpal_service */
+	int service_id;
+};
+
+struct rpal_receiver_data {
+	struct rpal_common_data rcd;
+	struct rpal_shared_page *rsp;
+	struct rpal_receiver_call_context *rcc;
+};
+
 struct rpal_sender_data {
 	struct rpal_common_data rcd;
 	struct rpal_shared_page *rsp;
@@ -338,6 +350,26 @@ static inline bool rpal_service_mapped(struct rpal_mapped_service *node)
 	return (node->type & type) == type;
 }
 
+static inline bool rpal_is_correct_address(struct rpal_service *rs, unsigned long address)
+{
+	if (likely(rs->base <= address &&
+		   address < rs->base + RPAL_ADDR_SPACE_SIZE))
+		return true;
+
+	/*
+	 * [rs->base, rs->base + RPAL_ADDR_SPACE_SIZE) is always a
+	 * sub range of [RPAL_ADDRESS_SPACE_LOW, RPAL_ADDRESS_SPACE_HIGH).
+	 * Therefore, we can only check whether the address is in
+	 * [RPAL_ADDRESS_SPACE_LOW, RPAL_ADDRESS_SPACE_HIGH) to determine
+	 * whether the address may belong to another RPAL service.
+	 */
+	if (address >= RPAL_ADDRESS_SPACE_LOW &&
+	    address < RPAL_ADDRESS_SPACE_HIGH)
+		return false;
+
+	return true;
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
@@ -372,6 +404,13 @@ void copy_rpal(struct task_struct *p);
 void exit_rpal(bool group_dead);
 int rpal_balloon_init(unsigned long base);
 void rpal_exit_mmap(struct mm_struct *mm);
+struct rpal_service *rpal_get_mapped_service_by_addr(struct rpal_service *rs,
+						     unsigned long addr);
+struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs,
+						   int id);
+int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
+					 unsigned long addr, int error_code);
+struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC v2 14/35] RPAL: enable page fault handling
  2025-05-30  9:27 ` [RFC v2 14/35] RPAL: enable page fault handling Bo Li
@ 2025-05-30 13:59   ` Dave Hansen
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Hansen @ 2025-05-30 13:59 UTC (permalink / raw)
  To: Bo Li, tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

On 5/30/25 02:27, Bo Li wrote:
>  arch/x86/mm/fault.c     | 271 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/rpal/mm.c      |  34 +++++
>  arch/x86/rpal/service.c |  24 ++++
>  arch/x86/rpal/thread.c  |  23 ++++
>  include/linux/rpal.h    |  81 ++++++++----
>  5 files changed, 412 insertions(+), 21 deletions(-)

I'm actually impressed again that you've managed to get this ported over
to a newer kernel _and_ broken it up.

But just taking a quick peek at _one_ patch, this is far, far below the
standards by which we do kernel development. This appears to have simply
copied chunks of existing code, hacked it to work with "RPAL" and then
#ifdef'd.

This is, unfortunately, copying and pasting at its worst. It creates
dual paths that inevitably bit rot in some way and are hard to maintain.

So, just to be clear: it's a full an unequivocal NAK from me on this
series. This introduces massive change, massive security risk, can't
possibly be backward compatible, has no users (well, maybe one) and the
series is not put together in anything remotely resembling how we like
to do kernel development.

I'd appreciate if you could not cc me on future versions if you choose
to go forward with this. But I urge you to stop now.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v2 15/35] RPAL: add sender/receiver state
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (13 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 14/35] RPAL: enable page fault handling Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 16/35] RPAL: add cpu lock interface Bo Li
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

The lazy switch defines six receiver states, and their state transitions
are as follows:

   |<--->READY<----> WAIT <----> CALL ----> LAZY_SWITCH ---> KERNEL_RET
   |                                             |               |
RUNNING <----------------------------------------|---------------|

The receiver thread initially starts in the RUNNING state and can
transition to the WAIT state voluntarily. The READY state is a temporary
state before entering WAIT state.  For a receiver in the WAIT state, it
must be in the TASK_INTERRUPTIBLE state. If the receiver thread is woken
up, the WAIT state can transition to the RUNNING state.

Once the receiver is in the WAIT state, the sender thread can
initiate an RPAL call, causing the receiver to enter the CALL state. A
receiver thread in the CALL state cannot be awakened until a lazy switch
occurs or its state changes. The call state carries additional service_id
and sender_id information.

If the sender completes executing the receiver's code without entering the
kernel after issuing the RPAL call, the receiver transitions back from the
CALL state to the WAIT state. Conversely, if the sender enters the kernel
during the RPAL call, the receiver's state changes to LAZY_SWITCH.

From the LAZY_SWITCH state, the receiver thread has two possible state
transitions: When the receiver thread finishes execution and switches back
to the sender via a lazy switch, it first enters the KERNEL_RET state and
then transitions to the RUNNING state. If the receiver thread runs for too
long and the scheduler resumes the sender, the receiver directly
transitions to the RUNNING state. Transitions to the RUNNING state can be
done in userspace.

The lazy switch mechanism defines three states for the sender thread:

 - RUNNING: The sender starts in this state. When the sender initiates
   an RPAL call to switch from user mode to the receiver, it transitions
   to the CALL state.

 - CALL: The sender remains in this state while the receiver is executing
   the code triggered by the RPAL call. When the receiver switches back to
   the sender from user mode, the sender returns to the RUNNING state.

 - KERNEL_RET: If the receiver takes an extended period to switch back to
   the sender after a lazy switch, the scheduler may preempt the sender to
   run other tasks. In this case, the sender enters the KERNEL_RET state
   while in the kernel. Once the sender resumes execution in user mode, it
   transitions back to the RUNNING state.

This patch implements the handling and transition of the receiver's state.
When a receiver leaves the run queue in the READY state, its state
transitions to the WAIT state; otherwise, it transitions to the RUNNING
state. The patch also modifies try_to_wake_up() to handling different
states: for the READY and WAIT states, try_to_wake_up() causes the state
to change to the RUNNING state. For the CALL state, try_to_wake_up() cannot
wake up the task. The patch provides a special interface,
rpal_try_to_wake_up(), to wake up tasks in the CALL state, which can be
used for lazy switches.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/kernel/process_64.c |  43 ++++++++++++
 arch/x86/rpal/internal.h     |   7 ++
 include/linux/rpal.h         |  50 ++++++++++++++
 kernel/sched/core.c          | 130 +++++++++++++++++++++++++++++++++++
 4 files changed, 230 insertions(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f39ff02e498d..4830e9215de7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -40,6 +40,7 @@
 #include <linux/ftrace.h>
 #include <linux/syscalls.h>
 #include <linux/iommu.h>
+#include <linux/rpal.h>
 
 #include <asm/processor.h>
 #include <asm/pkru.h>
@@ -596,6 +597,36 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp, bool x32)
 }
 #endif
 
+#ifdef CONFIG_RPAL
+static void rpal_receiver_enter_wait(struct task_struct *prev_p)
+{
+	if (READ_ONCE(prev_p->__state) == TASK_INTERRUPTIBLE) {
+		atomic_cmpxchg(&prev_p->rpal_rd->rcc->receiver_state,
+			       RPAL_RECEIVER_STATE_READY,
+			       RPAL_RECEIVER_STATE_WAIT);
+	} else {
+		/*
+		 * Simply check RPAL_RECEIVER_STATE_READY is not enough. It is
+		 * possible task's state is TASK_RUNNING. Consider following case:
+		 *
+		 * CPU 0(prev_p)            CPU 1(waker)
+		 * set TASK_INTERRUPTIBLE
+		 * set RPAL_RECEIVER_STATE_READY
+		 *                          check TASK_INTERRUPTIBLE
+		 * clear RPAL_RECEIVER_STATE_READY
+		 * clear TASK_INTERRUPTIBLE
+		 * set TASK_INTERRUPTIBLE
+		 * set RPAL_RECEIVER_STATE_READY
+		 *                          ttwu_runnable()
+		 * schedule()
+		 */
+		atomic_cmpxchg(&prev_p->rpal_rd->rcc->receiver_state,
+			       RPAL_RECEIVER_STATE_READY,
+			       RPAL_RECEIVER_STATE_RUNNING);
+	}
+}
+#endif
+
 /*
  *	switch_to(x,y) should switch tasks from x to y.
  *
@@ -704,6 +735,18 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			loadsegment(ss, __KERNEL_DS);
 	}
 
+#ifdef CONFIG_RPAL
+	/*
+	 * When we come to here, the stack switching is finished. Therefore,
+	 * the receiver thread is prepared for a lazy switch. We then change
+	 * the receiver_state from RPAL_RECEIVER_STATE_REAY to
+	 * RPAL_RECEIVER_STATE_WAIT and other thread is able to call it with
+	 * RPAL call.
+	 */
+	if (rpal_test_task_thread_flag(prev_p, RPAL_RECEIVER_BIT))
+		rpal_receiver_enter_wait(prev_p);
+#endif
+
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in(next_p);
 
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index cf6d608a994a..6256172bb79e 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -47,3 +47,10 @@ int rpal_unregister_sender(void);
 int rpal_register_receiver(unsigned long addr);
 int rpal_unregister_receiver(void);
 void exit_rpal_thread(void);
+
+static inline unsigned long
+rpal_build_call_state(const struct rpal_sender_data *rsd)
+{
+	return ((rsd->rcd.service_id << RPAL_SID_SHIFT) |
+		(rsd->scc->sender_id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CALL);
+}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 3310d222739e..4f4719bb7eae 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -87,6 +87,13 @@ enum {
 
 #define RPAL_ERROR_MAGIC 0x98CC98CC
 
+#define RPAL_SID_SHIFT 24
+#define RPAL_ID_SHIFT 8
+#define RPAL_RECEIVER_STATE_MASK ((1 << RPAL_ID_SHIFT) - 1)
+#define RPAL_SID_MASK (~((1 << RPAL_SID_SHIFT) - 1))
+#define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK))
+#define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1)
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -94,6 +101,22 @@ enum rpal_task_flag_bits {
 	RPAL_RECEIVER_BIT,
 };
 
+enum rpal_receiver_state {
+	RPAL_RECEIVER_STATE_RUNNING,
+	RPAL_RECEIVER_STATE_KERNEL_RET,
+	RPAL_RECEIVER_STATE_READY,
+	RPAL_RECEIVER_STATE_WAIT,
+	RPAL_RECEIVER_STATE_CALL,
+	RPAL_RECEIVER_STATE_LAZY_SWITCH,
+	RPAL_RECEIVER_STATE_MAX,
+};
+
+enum rpal_sender_state {
+	RPAL_SENDER_STATE_RUNNING,
+	RPAL_SENDER_STATE_CALL,
+	RPAL_SENDER_STATE_KERNEL_RET,
+};
+
 /*
  * user_meta will be sent to other service when requested.
  */
@@ -215,6 +238,8 @@ struct rpal_task_context {
 struct rpal_receiver_call_context {
 	struct rpal_task_context rtc;
 	int receiver_id;
+	atomic_t receiver_state;
+	atomic_t sender_state;
 };
 
 /* recovery point for sender */
@@ -390,11 +415,35 @@ static inline bool rpal_test_current_thread_flag(unsigned long bit)
 {
 	return test_bit(bit, &current->rpal_flag);
 }
+
+static inline bool rpal_test_task_thread_flag(struct task_struct *tsk,
+					      unsigned long bit)
+{
+	return test_bit(bit, &tsk->rpal_flag);
+}
+
+static inline void rpal_set_task_thread_flag(struct task_struct *tsk,
+					     unsigned long bit)
+{
+	set_bit(bit, &tsk->rpal_flag);
+}
+
+static inline void rpal_clear_task_thread_flag(struct task_struct *tsk,
+					       unsigned long bit)
+{
+	clear_bit(bit, &tsk->rpal_flag);
+}
 #else
 static inline struct rpal_service *rpal_current_service(void) { return NULL; }
 static inline void rpal_set_current_thread_flag(unsigned long bit) { }
 static inline void rpal_clear_current_thread_flag(unsigned long bit) { }
 static inline bool rpal_test_current_thread_flag(unsigned long bit) { return false; }
+static inline bool rpal_test_task_thread_flag(struct task_struct *tsk,
+	unsigned long bit) { return false; }
+static inline void rpal_set_task_thread_flag(struct task_struct *tsk,
+					     unsigned long bit) { }
+static inline void rpal_clear_task_thread_flag(struct task_struct *tsk,
+					       unsigned long bit) { }
 #endif
 
 void rpal_unregister_service(struct rpal_service *rs);
@@ -414,4 +463,5 @@ struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
+int rpal_try_to_wake_up(struct task_struct *p);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62b3416f5e43..045e92ee2e3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -67,6 +67,7 @@
 #include <linux/wait_api.h>
 #include <linux/workqueue_api.h>
 #include <linux/livepatch_sched.h>
+#include <linux/rpal.h>
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 # ifdef CONFIG_GENERIC_ENTRY
@@ -3820,6 +3821,40 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 	return ret;
 }
 
+#ifdef CONFIG_RPAL
+static bool rpal_check_state(struct task_struct *p)
+{
+	bool ret = true;
+
+	if (rpal_test_task_thread_flag(p, RPAL_RECEIVER_BIT)) {
+		struct rpal_receiver_call_context *rcc = p->rpal_rd->rcc;
+		int state;
+
+retry:
+		state = atomic_read(&rcc->receiver_state) & RPAL_RECEIVER_STATE_MASK;
+		switch (state) {
+		case RPAL_RECEIVER_STATE_READY:
+		case RPAL_RECEIVER_STATE_WAIT:
+			if (state != atomic_cmpxchg(&rcc->receiver_state, state,
+						     RPAL_RECEIVER_STATE_RUNNING))
+				goto retry;
+			break;
+		case RPAL_RECEIVER_STATE_KERNEL_RET:
+		case RPAL_RECEIVER_STATE_LAZY_SWITCH:
+		case RPAL_RECEIVER_STATE_RUNNING:
+			break;
+		case RPAL_RECEIVER_STATE_CALL:
+			ret = false;
+			break;
+		default:
+			rpal_err("%s: invalid state: %d\n", __func__, state);
+			break;
+		}
+	}
+	return ret;
+}
+#endif
+
 #ifdef CONFIG_SMP
 void sched_ttwu_pending(void *arg)
 {
@@ -3841,6 +3876,11 @@ void sched_ttwu_pending(void *arg)
 		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
 			set_task_cpu(p, cpu_of(rq));
 
+#ifdef CONFIG_RPAL
+		if (!rpal_check_state(p))
+			continue;
+#endif
+
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 	}
 
@@ -4208,6 +4248,17 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
+#ifdef CONFIG_RPAL
+		/*
+		 * For rpal thread, we need to check if it can be woken up. If not,
+		 * we do not wake it up here but wake it up later by kernel worker.
+		 *
+		 * For normal thread, nothing happens.
+		 */
+		if (!rpal_check_state(p))
+			goto out;
+#endif
+
 		trace_sched_waking(p);
 		ttwu_do_wakeup(p);
 		goto out;
@@ -4224,6 +4275,11 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		if (!ttwu_state_match(p, state, &success))
 			break;
 
+#ifdef CONFIG_RPAL
+		if (!rpal_check_state(p))
+			break;
+#endif
+
 		trace_sched_waking(p);
 
 		/*
@@ -4344,6 +4400,56 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+#ifdef CONFIG_RPAL
+int rpal_try_to_wake_up(struct task_struct *p)
+{
+	guard(preempt)();
+	int cpu, success = 0;
+	int wake_flags = WF_TTWU;
+
+	BUG_ON(READ_ONCE(p->__state) == TASK_RUNNING);
+
+	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
+		smp_mb__after_spinlock();
+		if (!ttwu_state_match(p, TASK_NORMAL, &success))
+			break;
+
+		trace_sched_waking(p);
+		/* see try_to_wake_up() */
+		smp_rmb();
+
+#ifdef CONFIG_SMP
+		smp_acquire__after_ctrl_dep();
+		WRITE_ONCE(p->__state, TASK_WAKING);
+		/* see try_to_wake_up() */
+		if (smp_load_acquire(&p->on_cpu) &&
+		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
+			break;
+		smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+		if (task_cpu(p) != cpu) {
+			if (p->in_iowait) {
+				delayacct_blkio_end(p);
+				atomic_dec(&task_rq(p)->nr_iowait);
+			}
+
+			wake_flags |= WF_MIGRATED;
+			psi_ttwu_dequeue(p);
+			set_task_cpu(p, cpu);
+		}
+#else
+		cpu = task_cpu(p);
+#endif
+	}
+	ttwu_queue(p, cpu, wake_flags);
+	if (success)
+		ttwu_stat(p, task_cpu(p), wake_flags);
+
+	return success;
+}
+#endif
+
 static bool __task_needs_rq_lock(struct task_struct *p)
 {
 	unsigned int state = READ_ONCE(p->__state);
@@ -6574,6 +6680,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 #define SM_PREEMPT		1
 #define SM_RTLOCK_WAIT		2
 
+#ifdef CONFIG_RPAL
+static inline void rpal_check_ready_state(struct task_struct *tsk, int state)
+{
+	if (rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) {
+		struct rpal_receiver_call_context *rcc = tsk->rpal_rd->rcc;
+
+		atomic_cmpxchg(&rcc->receiver_state, state,
+			       RPAL_RECEIVER_STATE_RUNNING);
+	}
+}
+#endif
+
 /*
  * Helper function for __schedule()
  *
@@ -6727,7 +6845,19 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
+#ifdef CONFIG_RPAL
+		if (!try_to_block_task(rq, prev, &prev_state)) {
+			/*
+			 * As the task enter TASK_RUNNING state, we should clean up
+			 * RPAL_RECEIVER_STATE_READY status. Therefore, the receiver's
+			 * state will not be change to RPAL_RECEIVER_STATE_WAIT. Thus,
+			 * there is no RPAL call when a receiver is at TASK_RUNNING state.
+			 */
+			rpal_check_ready_state(prev, RPAL_RECEIVER_STATE_READY);
+		}
+#else
 		try_to_block_task(rq, prev, &prev_state);
+#endif
 		switch_count = &prev->nvcsw;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 16/35] RPAL: add cpu lock interface
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (14 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 15/35] RPAL: add sender/receiver state Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 17/35] RPAL: add a mapping between fsbase and tasks Bo Li
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Lazy switch enables the kernel to switch from one task to another to keep
the kernel context and user context matched. For the scheduler, both tasks
involved in the context switch must reside in the same run queue (rq).
Therefore, before a lazy switch occurs, the kernel must first bind both
tasks to the same CPU to facilitate the subsequent context switch.

This patch introduces the rpal_lock_cpu() interface, which binds two tasks
to the same CPU while bypassing cpumask restrictions. The rpal_unlock_cpu()
function serves as the inverse operation to release this binding. To ensure
consistency, the kernel must prevent other threads from modifying the CPU
affinity of tasks locked by rpal_lock_cpu(). Therefore, when using
set_cpus_allowed_ptr() to change a task's CPU affinity, other threads must
wait until the binding established by rpal_lock_cpu() is released before
proceeding with modifications.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c   |  18 +++++++
 arch/x86/rpal/thread.c |  14 ++++++
 include/linux/rpal.h   |   8 +++
 kernel/sched/core.c    | 109 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 149 insertions(+)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 61f5d40b0157..c185a453c1b2 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -15,6 +15,24 @@ int __init rpal_init(void);
 bool rpal_inited;
 unsigned long rpal_cap;
 
+static inline void rpal_lock_cpu(struct task_struct *tsk)
+{
+	rpal_set_cpus_allowed_ptr(tsk, true);
+	if (unlikely(!irqs_disabled())) {
+		local_irq_disable();
+		rpal_err("%s: irq is enabled\n", __func__);
+	}
+}
+
+static inline void rpal_unlock_cpu(struct task_struct *tsk)
+{
+	rpal_set_cpus_allowed_ptr(tsk, false);
+	if (unlikely(!irqs_disabled())) {
+		local_irq_disable();
+		rpal_err("%s: irq is enabled\n", __func__);
+	}
+}
+
 int __init rpal_init(void)
 {
 	int ret = 0;
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index e50a4c865ff8..bc203e9c6e5e 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -47,6 +47,10 @@ int rpal_register_sender(unsigned long addr)
 	}
 
 	rpal_common_data_init(&rsd->rcd);
+	if (rpal_init_thread_pending(&rsd->rcd)) {
+		ret = -ENOMEM;
+		goto free_rsd;
+	}
 	rsd->rsp = rsp;
 	rsd->scc = (struct rpal_sender_call_context *)(addr - rsp->user_start +
 						       rsp->kernel_start);
@@ -58,6 +62,8 @@ int rpal_register_sender(unsigned long addr)
 
 	return 0;
 
+free_rsd:
+	kfree(rsd);
 put_shared_page:
 	rpal_put_shared_page(rsp);
 out:
@@ -77,6 +83,7 @@ int rpal_unregister_sender(void)
 
 	rpal_put_shared_page(rsd->rsp);
 	rpal_clear_current_thread_flag(RPAL_SENDER_BIT);
+	rpal_free_thread_pending(&rsd->rcd);
 	kfree(rsd);
 
 	atomic_dec(&cur->thread_cnt);
@@ -116,6 +123,10 @@ int rpal_register_receiver(unsigned long addr)
 	}
 
 	rpal_common_data_init(&rrd->rcd);
+	if (rpal_init_thread_pending(&rrd->rcd)) {
+		ret = -ENOMEM;
+		goto free_rrd;
+	}
 	rrd->rsp = rsp;
 	rrd->rcc =
 		(struct rpal_receiver_call_context *)(addr - rsp->user_start +
@@ -128,6 +139,8 @@ int rpal_register_receiver(unsigned long addr)
 
 	return 0;
 
+free_rrd:
+	kfree(rrd);
 put_shared_page:
 	rpal_put_shared_page(rsp);
 out:
@@ -147,6 +160,7 @@ int rpal_unregister_receiver(void)
 
 	rpal_put_shared_page(rrd->rsp);
 	rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT);
+	rpal_free_thread_pending(&rrd->rcd);
 	kfree(rrd);
 
 	atomic_dec(&cur->thread_cnt);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 4f4719bb7eae..5b115be14a55 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -99,6 +99,7 @@ extern unsigned long rpal_cap;
 enum rpal_task_flag_bits {
 	RPAL_SENDER_BIT,
 	RPAL_RECEIVER_BIT,
+	RPAL_CPU_LOCKED_BIT,
 };
 
 enum rpal_receiver_state {
@@ -270,8 +271,12 @@ struct rpal_shared_page {
 struct rpal_common_data {
 	/* back pointer to task_struct */
 	struct task_struct *bp_task;
+	/* pending struct for cpu locking */
+	void *pending;
 	/* service id of rpal_service */
 	int service_id;
+	/* cpumask before locked */
+	cpumask_t old_mask;
 };
 
 struct rpal_receiver_data {
@@ -464,4 +469,7 @@ struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
 int rpal_try_to_wake_up(struct task_struct *p);
+int rpal_init_thread_pending(struct rpal_common_data *rcd);
+void rpal_free_thread_pending(struct rpal_common_data *rcd);
+int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045e92ee2e3b..a862bf4a0161 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3155,6 +3155,104 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	return ret;
 }
 
+#ifdef CONFIG_RPAL
+int rpal_init_thread_pending(struct rpal_common_data *rcd)
+{
+	struct set_affinity_pending *pending;
+
+	pending = kzalloc(sizeof(*pending), GFP_KERNEL);
+	if (!pending)
+		return -ENOMEM;
+	pending->stop_pending = 0;
+	pending->arg = (struct migration_arg){
+		.task = current,
+		.pending = NULL,
+	};
+	rcd->pending = pending;
+	return 0;
+}
+
+void rpal_free_thread_pending(struct rpal_common_data *rcd)
+{
+	if (rcd->pending != NULL)
+		kfree(rcd->pending);
+}
+
+/*
+ * CPU lock is forced and all cpumask will be ignored by RPAL temporary.
+ */
+int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock)
+{
+	const struct cpumask *cpu_valid_mask = cpu_active_mask;
+	struct set_affinity_pending *pending = p->rpal_cd->pending;
+	struct cpumask mask;
+	unsigned int dest_cpu;
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0;
+	struct affinity_context ac = {
+		.new_mask = &mask,
+		.flags = 0,
+	};
+
+	if (unlikely(p->flags & PF_KTHREAD))
+		rpal_err("p: %d, p->flags & PF_KTHREAD\n", p->pid);
+
+	rq = task_rq_lock(p, &rf);
+
+	if (is_lock) {
+		cpumask_copy(&p->rpal_cd->old_mask, &p->cpus_mask);
+		cpumask_clear(&mask);
+		cpumask_set_cpu(smp_processor_id(), &mask);
+		rpal_set_task_thread_flag(p, RPAL_CPU_LOCKED_BIT);
+	} else {
+		cpumask_copy(&mask, &p->rpal_cd->old_mask);
+		rpal_clear_task_thread_flag(p, RPAL_CPU_LOCKED_BIT);
+	}
+
+	update_rq_clock(rq);
+
+	if (cpumask_equal(&p->cpus_mask, ac.new_mask))
+		goto out;
+	/*
+	 * Picking a ~random cpu helps in cases where we are changing affinity
+	 * for groups of tasks (ie. cpuset), so that load balancing is not
+	 * immediately required to distribute the tasks within their new mask.
+	 */
+	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ac.new_mask);
+	if (dest_cpu >= nr_cpu_ids) {
+		ret = -EINVAL;
+		goto out;
+	}
+	__do_set_cpus_allowed(p, &ac);
+	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+		preempt_disable();
+		task_rq_unlock(rq, p, &rf);
+		preempt_enable();
+	} else {
+		pending->arg.dest_cpu = dest_cpu;
+
+		if (task_on_cpu(rq, p) ||
+		    READ_ONCE(p->__state) == TASK_WAKING) {
+			preempt_disable();
+			task_rq_unlock(rq, p, &rf);
+			stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
+					    &pending->arg, &pending->stop_work);
+		} else {
+			if (task_on_rq_queued(p))
+				rq = move_queued_task(rq, &rf, p, dest_cpu);
+			task_rq_unlock(rq, p, &rf);
+		}
+	}
+
+	return 0;
+
+out:
+	task_rq_unlock(rq, p, &rf);
+	return ret;
+}
+#endif
+
 /*
  * Change a given task's CPU affinity. Migrate the thread to a
  * proper CPU and schedule it away if the CPU it's executing on
@@ -3169,7 +3267,18 @@ int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx)
 	struct rq_flags rf;
 	struct rq *rq;
 
+#ifdef CONFIG_RPAL
+retry:
+	rq = task_rq_lock(p, &rf);
+	if (rpal_test_task_thread_flag(p, RPAL_CPU_LOCKED_BIT)) {
+		update_rq_clock(rq);
+		task_rq_unlock(rq, p, &rf);
+		schedule();
+		goto retry;
+	}
+#else
 	rq = task_rq_lock(p, &rf);
+#endif
 	/*
 	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
 	 * flags are set.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 17/35] RPAL: add a mapping between fsbase and tasks
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (15 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 16/35] RPAL: add cpu lock interface Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 18/35] sched: pick a specified task Bo Li
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL relies on the value of the fsbase register to determine whether a lazy
switch is necessary. Therefore, a mapping between fsbase and tasks must be
established.

This patch allows a thread to register its fsbase value when it is
registered as a receiver. The rpal_find_next_task() interface is used to
locate the receiver corresponding to a given fsbase value. Additionally, a
new rpal_misidentify() interface has been added to check if the current
fsbase value matches the current task. If they do not match, the task
corresponding to the fsbase is identified, the RPAL_LAZY_SWITCHED_BIT flag
is set, and the current task is recorded. The kernel can later use this
flag and the recorded task to backtrack to the task before the lazy switch.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c   | 85 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/rpal/thread.c | 57 +++++++++++++++++++++++++++-
 include/linux/rpal.h   | 15 ++++++++
 3 files changed, 156 insertions(+), 1 deletion(-)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index c185a453c1b2..19c4ef38bca3 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/rpal.h>
+#include <asm/fsgsbase.h>
 
 #include "internal.h"
 
@@ -33,12 +34,96 @@ static inline void rpal_unlock_cpu(struct task_struct *tsk)
 	}
 }
 
+
+static inline struct task_struct *rpal_get_sender_task(void)
+{
+	struct task_struct *next;
+
+	next = current->rpal_rd->sender;
+	current->rpal_rd->sender = NULL;
+
+	return next;
+}
+
+/*
+ * RPAL uses the value of fsbase (which libc uses as the base
+ * address for thread-local storage) to determine whether a
+ * lazy switch should be performed.
+ */
+static inline struct task_struct *rpal_misidentify(void)
+{
+	struct task_struct *next = NULL;
+	struct rpal_service *cur = rpal_current_service();
+	unsigned long fsbase;
+
+	fsbase = rdfsbase();
+	if (unlikely(!rpal_is_correct_address(cur, fsbase))) {
+		if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT)) {
+			/* current is receiver, next is sender */
+			next = rpal_get_sender_task();
+			if (unlikely(next == NULL)) {
+				rpal_err("cannot find sender task\n");
+				goto out;
+			}
+		} else {
+			/* current is sender, next is receiver */
+			next = rpal_find_next_task(fsbase);
+			if (unlikely(next == NULL)) {
+				rpal_err(
+					"cannot find receiver task, fsbase: 0x%016lx\n",
+					fsbase);
+				goto out;
+			}
+			rpal_set_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT);
+			next->rpal_rd->sender = current;
+		}
+	}
+out:
+	return next;
+}
+
+struct task_struct *rpal_find_next_task(unsigned long fsbase)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_service *tgt;
+	struct task_struct *tsk = NULL;
+	int i;
+
+	tgt = rpal_get_mapped_service_by_addr(cur, fsbase);
+	if (unlikely(!tgt)) {
+		pr_debug("rpal debug: cannot find legal rs, fsbase: 0x%016lx\n",
+			 fsbase);
+		return NULL;
+	}
+	for (i = 0; i < RPAL_MAX_RECEIVER_NUM; ++i) {
+		if (tgt->fs_tsk_map[i].fsbase == fsbase) {
+			tsk = tgt->fs_tsk_map[i].tsk;
+			break;
+		}
+	}
+	rpal_put_service(tgt);
+
+	return tsk;
+}
+
+static bool check_hardware_features(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_FSGSBASE)) {
+		rpal_err("no fsgsbase feature\n");
+		return false;
+	}
+	return true;
+}
+
 int __init rpal_init(void)
 {
 	int ret = 0;
 
 	rpal_cap = 0;
 
+	if (!check_hardware_features())
+		goto fail;
+
 	ret = rpal_service_init();
 	if (ret)
 		goto fail;
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index bc203e9c6e5e..db3b13ff82be 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -7,9 +7,53 @@
  */
 
 #include <linux/rpal.h>
+#include <asm/fsgsbase.h>
 
 #include "internal.h"
 
+static bool set_fs_tsk_map(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_fsbase_tsk_map *ftm;
+	unsigned long fsbase = rdfsbase();
+	bool success = false;
+	int i = 0;
+
+	for (i = 0; i < RPAL_MAX_RECEIVER_NUM; ++i) {
+		ftm = &cur->fs_tsk_map[i];
+		if (ftm->fsbase == 0 &&
+		    cmpxchg64(&ftm->fsbase, 0, fsbase) == 0) {
+			ftm->tsk = current;
+			success = true;
+			break;
+		}
+	}
+
+	return success;
+}
+
+static bool clear_fs_tsk_map(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct rpal_fsbase_tsk_map *ftm;
+	unsigned long fsbase = rdfsbase();
+	bool success = false;
+	int i = 0;
+
+	for (i = 0; i < RPAL_MAX_RECEIVER_NUM; ++i) {
+		ftm = &cur->fs_tsk_map[i];
+		if (ftm->fsbase == fsbase) {
+			ftm->tsk = NULL;
+			barrier();
+			ftm->fsbase = 0;
+			success = true;
+			break;
+		}
+	}
+
+	return success;
+}
+
 static void rpal_common_data_init(struct rpal_common_data *rcd)
 {
 	rcd->bp_task = current;
@@ -54,6 +98,7 @@ int rpal_register_sender(unsigned long addr)
 	rsd->rsp = rsp;
 	rsd->scc = (struct rpal_sender_call_context *)(addr - rsp->user_start +
 						       rsp->kernel_start);
+	rsd->receiver = NULL;
 
 	current->rpal_sd = rsd;
 	rpal_set_current_thread_flag(RPAL_SENDER_BIT);
@@ -122,15 +167,21 @@ int rpal_register_receiver(unsigned long addr)
 		goto put_shared_page;
 	}
 
+	if (!set_fs_tsk_map()) {
+		ret = -EAGAIN;
+		goto free_rrd;
+	}
+
 	rpal_common_data_init(&rrd->rcd);
 	if (rpal_init_thread_pending(&rrd->rcd)) {
 		ret = -ENOMEM;
-		goto free_rrd;
+		goto clear_fs;
 	}
 	rrd->rsp = rsp;
 	rrd->rcc =
 		(struct rpal_receiver_call_context *)(addr - rsp->user_start +
 						      rsp->kernel_start);
+	rrd->sender = NULL;
 
 	current->rpal_rd = rrd;
 	rpal_set_current_thread_flag(RPAL_RECEIVER_BIT);
@@ -139,6 +190,8 @@ int rpal_register_receiver(unsigned long addr)
 
 	return 0;
 
+clear_fs:
+	clear_fs_tsk_map();
 free_rrd:
 	kfree(rrd);
 put_shared_page:
@@ -158,6 +211,8 @@ int rpal_unregister_receiver(void)
 		goto out;
 	}
 
+	clear_fs_tsk_map();
+
 	rpal_put_shared_page(rrd->rsp);
 	rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT);
 	rpal_free_thread_pending(&rrd->rcd);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 5b115be14a55..45137770fac6 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -80,6 +80,9 @@
 /* No more than 15 services can be requested due to limitation of MPK. */
 #define MAX_REQUEST_SERVICE 15
 
+/* We allow at most 16 receiver thread in one process */
+#define RPAL_MAX_RECEIVER_NUM  16
+
 enum {
 	RPAL_REQUEST_MAP,
 	RPAL_REVERSE_MAP,
@@ -100,6 +103,7 @@ enum rpal_task_flag_bits {
 	RPAL_SENDER_BIT,
 	RPAL_RECEIVER_BIT,
 	RPAL_CPU_LOCKED_BIT,
+	RPAL_LAZY_SWITCHED_BIT,
 };
 
 enum rpal_receiver_state {
@@ -145,6 +149,11 @@ struct rpal_poll_data {
 	wait_queue_head_t rpal_waitqueue;
 };
 
+struct rpal_fsbase_tsk_map {
+	unsigned long fsbase;
+	struct task_struct *tsk;
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -202,6 +211,9 @@ struct rpal_service {
 	/* Notify service is released by others */
 	struct rpal_poll_data rpd;
 
+	/* fsbase / pid map */
+	struct rpal_fsbase_tsk_map fs_tsk_map[RPAL_MAX_RECEIVER_NUM];
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -283,12 +295,14 @@ struct rpal_receiver_data {
 	struct rpal_common_data rcd;
 	struct rpal_shared_page *rsp;
 	struct rpal_receiver_call_context *rcc;
+	struct task_struct *sender;
 };
 
 struct rpal_sender_data {
 	struct rpal_common_data rcd;
 	struct rpal_shared_page *rsp;
 	struct rpal_sender_call_context *scc;
+	struct task_struct *receiver;
 };
 
 enum rpal_command_type {
@@ -465,6 +479,7 @@ struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs,
 int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 					 unsigned long addr, int error_code);
 struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
+struct task_struct *rpal_find_next_task(unsigned long fsbase);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 18/35] sched: pick a specified task
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (16 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 17/35] RPAL: add a mapping between fsbase and tasks Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 19/35] RPAL: add lazy switch main logic Bo Li
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When a lazy switch occurs, the kernel already gets the task_struct of the
next task to switch to. However, the CFS does not provide an interface to
explicitly specify the next task. Therefore, RPAL must implement its own
mechanism to pick a specified task.

This patch introduces two interfaces, rpal_pick_next_task_fair() and
rpal_pick_task_fair(), to achieve this functionality. These interfaces
leverage the sched_entity of the target task to modify the CFS data
structures directly. Additionally, the patch adapts to the SCHED_CORE
feature by temporarily setting the highest weight for the specified task,
ensuring that the core will select this task preferentially during
scheduling decisions.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 kernel/sched/core.c  | 212 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 109 ++++++++++++++++++++++
 kernel/sched/sched.h |   8 ++
 3 files changed, 329 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a862bf4a0161..2e76376c5172 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11003,3 +11003,215 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
 		set_next_task(rq, ctx->p);
 }
 #endif	/* CONFIG_SCHED_CLASS_EXT */
+
+#ifdef CONFIG_RPAL
+#ifdef CONFIG_SCHED_CORE
+static inline struct task_struct *
+__rpal_pick_next_task(struct rq *rq, struct task_struct *prev,
+		      struct task_struct *next, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	if (likely(prev->sched_class == &fair_sched_class &&
+		   next->sched_class == &fair_sched_class)) {
+		p = rpal_pick_next_task_fair(prev, next, rq, rf);
+		return p;
+	}
+
+	BUG();
+}
+
+static struct task_struct *rpal_pick_next_task(struct rq *rq,
+					       struct task_struct *prev,
+					       struct task_struct *next,
+					       struct rq_flags *rf)
+{
+	struct task_struct *p;
+	const struct cpumask *smt_mask;
+	bool fi_before = false;
+	bool core_clock_updated = (rq == rq->core);
+	unsigned long cookie;
+	int i, cpu, occ = 0;
+	struct rq *rq_i;
+	bool need_sync;
+
+	if (!sched_core_enabled(rq))
+		return __rpal_pick_next_task(rq, prev, next, rf);
+
+	cpu = cpu_of(rq);
+
+	/* Stopper task is switching into idle, no need core-wide selection. */
+	if (cpu_is_offline(cpu)) {
+		rq->core_pick = NULL;
+		return __rpal_pick_next_task(rq, prev, next, rf);
+	}
+
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+		rq->core_pick) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		/* ignore rq->core_pick, always pick next */
+		if (rq->core_pick == next) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+
+			rq->core_pick = NULL;
+			goto out;
+		}
+	}
+
+	put_prev_task_balance(rq, prev, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle_count) {
+		if (!core_clock_updated) {
+			update_rq_clock(rq->core);
+			core_clock_updated = true;
+		}
+		sched_core_account_forceidle(rq);
+		/* reset after accounting force idle */
+		rq->core->core_forceidle_start = 0;
+		rq->core->core_forceidle_count = 0;
+		rq->core->core_forceidle_occupation = 0;
+		need_sync = true;
+		fi_before = true;
+	}
+
+	rq->core->core_task_seq++;
+
+	if (!need_sync) {
+		next = rpal_pick_task_fair(rq, next);
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			/*
+			 * For robustness, update the min_vruntime_fi for
+			 * unconstrained picks as well.
+			 */
+			WARN_ON_ONCE(fi_before);
+			task_vruntime_update(rq, next, false);
+			goto out_set_next;
+		}
+	}
+
+	for_each_cpu_wrap(i, smt_mask, cpu) {
+		rq_i = cpu_rq(i);
+
+		if (i != cpu && (rq_i != rq->core || !core_clock_updated))
+			update_rq_clock(rq_i);
+
+		/* ignore prio, always pick next */
+		if (i == cpu)
+			rq_i->core_pick = rpal_pick_task_fair(rq, next);
+		else
+			rq_i->core_pick = pick_task(rq_i);
+	}
+
+	cookie = rq->core->core_cookie = next->core_cookie;
+
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+		p = rq_i->core_pick;
+
+		if (!cookie_equals(p, cookie)) {
+			p = NULL;
+			if (cookie)
+				p = sched_core_find(rq_i, cookie);
+			if (!p)
+				p = idle_sched_class.pick_task(rq_i);
+		}
+
+		rq_i->core_pick = p;
+
+		if (p == rq_i->idle) {
+			if (rq_i->nr_running) {
+				rq->core->core_forceidle_count++;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
+		} else {
+			occ++;
+		}
+	}
+
+	if (schedstat_enabled() && rq->core->core_forceidle_count) {
+		rq->core->core_forceidle_start = rq_clock(rq->core);
+		rq->core->core_forceidle_occupation = occ;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	WARN_ON_ONCE(next != rq->core_pick);
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+
+		/*
+		 * An online sibling might have gone offline before a task
+		 * could be picked for it, or it might be offline but later
+		 * happen to come online, but its too late and nothing was
+		 * picked for it.  That's Ok - it will pick tasks for itself,
+		 * so ignore it.
+		 */
+		if (!rq_i->core_pick)
+			continue;
+
+		/*
+		 * Update for new !FI->FI transitions, or if continuing to be in !FI:
+		 * fi_before     fi      update?
+		 *  0            0       1
+		 *  0            1       1
+		 *  1            0       1
+		 *  1            1       0
+		 */
+		if (!(fi_before && rq->core->core_forceidle_count))
+			task_vruntime_update(rq_i, rq_i->core_pick,
+					     !!rq->core->core_forceidle_count);
+
+		rq_i->core_pick->core_occupation = occ;
+
+		if (i == cpu) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+		if (rq_i->curr == rq_i->core_pick) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		resched_curr(rq_i);
+	}
+
+out_set_next:
+	set_next_task(rq, next);
+out:
+	if (rq->core->core_forceidle_count && next == rq->idle)
+		queue_core_balance(rq);
+
+	return next;
+}
+#else
+static inline struct task_struct *
+rpal_pick_next_task(struct rq *rq, struct task_struct *prev,
+		      struct task_struct *next, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	if (likely(prev->sched_class == &fair_sched_class &&
+		   next->sched_class == &fair_sched_class)) {
+		p = rpal_pick_next_task_fair(prev, next, rq, rf);
+		return p;
+	}
+
+	BUG();
+}
+#endif
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 125912c0e9dd..d9c16d974a47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8983,6 +8983,115 @@ void fair_server_init(struct rq *rq)
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
 }
 
+#ifdef CONFIG_RPAL
+/* if the next is throttled, unthrottle it */
+static void rpal_unthrottle(struct rq *rq, struct task_struct *next)
+{
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+
+	se = &next->se;
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+
+		if (cfs_rq == &rq->cfs)
+			break;
+	}
+}
+
+struct task_struct *rpal_pick_task_fair(struct rq *rq, struct task_struct *next)
+{
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+
+	rpal_unthrottle(rq, next);
+
+	se = &next->se;
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq->curr && cfs_rq->curr->on_rq)
+			update_curr(cfs_rq);
+
+		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+			continue;
+
+		clear_buddies(cfs_rq, se);
+	}
+
+	return next;
+}
+
+struct task_struct *rpal_pick_next_task_fair(struct task_struct *prev,
+					     struct task_struct *next,
+					     struct rq *rq, struct rq_flags *rf)
+{
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se;
+	struct task_struct *p;
+
+	rpal_unthrottle(rq, next);
+
+	p = rpal_pick_task_fair(rq, next);
+
+	if (!sched_fair_runnable(rq))
+		panic("rpal error: !sched_fair_runnable\n");
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	__put_prev_set_next_dl_server(rq, prev, next);
+
+	se = &next->se;
+	p = task_of(se);
+
+	/*
+	 * Since we haven't yet done put_prev_entity and if the selected task
+	 * is a different task than we started out with, try and touch the
+	 * least amount of cfs_rqs.
+	 */
+	if (prev != p) {
+		struct sched_entity *pse = &prev->se;
+
+		while (!(cfs_rq = is_same_group(se, pse))) {
+			int se_depth = se->depth;
+			int pse_depth = pse->depth;
+
+			if (se_depth <= pse_depth) {
+				put_prev_entity(cfs_rq_of(pse), pse);
+				pse = parent_entity(pse);
+			}
+			if (se_depth >= pse_depth) {
+				set_next_entity(cfs_rq_of(se), se);
+				se = parent_entity(se);
+			}
+		}
+
+		put_prev_entity(cfs_rq, pse);
+		set_next_entity(cfs_rq, se);
+	}
+#endif
+#ifdef CONFIG_SMP
+	/*
+	 * Move the next running task to the front of
+	 * the list, so our cfs_tasks list becomes MRU
+	 * one.
+	 */
+	list_move(&p->se.group_node, &rq->cfs_tasks);
+#endif
+
+	WARN_ON_ONCE(se->sched_delayed);
+
+	if (hrtick_enabled_fair(rq))
+		hrtick_start_fair(rq, p);
+
+	update_misfit_status(p, rq);
+	sched_fair_update_stop_tick(rq, p);
+
+	return p;
+}
+#endif
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6d..f8fd26b584c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2575,6 +2575,14 @@ static inline bool sched_fair_runnable(struct rq *rq)
 
 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 extern struct task_struct *pick_task_idle(struct rq *rq);
+#ifdef CONFIG_RPAL
+extern struct task_struct *rpal_pick_task_fair(struct rq *rq,
+					       struct task_struct *next);
+extern struct task_struct *rpal_pick_next_task_fair(struct task_struct *prev,
+						    struct task_struct *next,
+						    struct rq *rq,
+						    struct rq_flags *rf);
+#endif
 
 #define SCA_CHECK		0x01
 #define SCA_MIGRATE_DISABLE	0x02
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 19/35] RPAL: add lazy switch main logic
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (17 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 18/35] sched: pick a specified task Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 20/35] RPAL: add rpal_ret_from_lazy_switch Bo Li
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

The implementation of lazy switch differs from a regular schedule() in
three key aspects:

1. It occurs at the kernel entry with irq disabled.
2. The next task is explicitly pre-determined rather than selected by
   the scheduler.
3. User-space context (excluding general-purpose registers) remains
   unchanged across the switch.

This patch introduces the rpal_schedule() interface to address these
requirements. Firstly, the rpal_schedule() skips irq enabling in
finish_lock_switch(), preserving the irq-disabled state required
during kernel entry. Secondly, the rpal_pick_next_task() interface is
used to explicitly specify the target task, bypassing the default
scheduler's decision-making process. Thirdly, non-general-purpose
registers (e.g., FPU, vector units) are not restored during the switch,
ensuring user space context remains intact. Handling of general-purpose
registers will be addressed in a subsequent patch by RPAL before invoking
rpal_schedule().

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/kernel/process_64.c |  75 +++++++++++++++++++++
 include/linux/rpal.h         |   3 +
 kernel/sched/core.c          | 126 +++++++++++++++++++++++++++++++++++
 3 files changed, 204 insertions(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4830e9215de7..efc3f238c486 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -753,6 +753,81 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	return prev_p;
 }
 
+#ifdef CONFIG_RPAL
+__no_kmsan_checks
+__visible __notrace_funcgraph struct task_struct *
+__rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p)
+{
+	struct thread_struct *prev = &prev_p->thread;
+	struct thread_struct *next = &next_p->thread;
+	int cpu = smp_processor_id();
+
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
+		     this_cpu_read(hardirq_stack_inuse));
+
+	/* no need to switch fpu */
+	/* __fpu_invalidate_fpregs_state() */
+	x86_task_fpu(prev_p)->last_cpu = -1;
+	/* fpregs_activate() */
+	__this_cpu_write(fpu_fpregs_owner_ctx, x86_task_fpu(next_p));
+	trace_x86_fpu_regs_activated(x86_task_fpu(next_p));
+	x86_task_fpu(next_p)->last_cpu = cpu;
+	set_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD);
+	clear_tsk_thread_flag(next_p, TIF_NEED_FPU_LOAD);
+
+	/* no need to save fs */
+	savesegment(gs, prev_p->thread.gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE))
+		prev_p->thread.gsbase = __rdgsbase_inactive();
+	else
+		save_base_legacy(prev_p, prev_p->thread.gsindex, GS);
+
+	load_TLS(next, cpu);
+
+	arch_end_context_switch(next_p);
+
+	savesegment(es, prev->es);
+	if (unlikely(next->es | prev->es))
+		loadsegment(es, next->es);
+
+	savesegment(ds, prev->ds);
+	if (unlikely(next->ds | prev->ds))
+		loadsegment(ds, next->ds);
+
+	/* no need to load fs */
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		if (unlikely(prev->gsindex || next->gsindex))
+			loadseg(GS, next->gsindex);
+
+		__wrgsbase_inactive(next->gsbase);
+	} else {
+		load_seg_legacy(prev->gsindex, prev->gsbase, next->gsindex,
+				next->gsbase, GS);
+	}
+
+	/* skip pkru load as we will use pkru in RPAL */
+
+	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
+
+	/* no need to load fpu */
+
+	update_task_stack(next_p);
+	switch_to_extra(prev_p, next_p);
+
+	if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
+		unsigned short ss_sel;
+
+		savesegment(ss, ss_sel);
+		if (ss_sel != __KERNEL_DS)
+			loadsegment(ss, __KERNEL_DS);
+	}
+	resctrl_sched_in(next_p);
+
+	return prev_p;
+}
+#endif
+
 void set_personality_64bit(void)
 {
 	/* inherit personality from parent */
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 45137770fac6..0813db4552c0 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -487,4 +487,7 @@ int rpal_try_to_wake_up(struct task_struct *p);
 int rpal_init_thread_pending(struct rpal_common_data *rcd);
 void rpal_free_thread_pending(struct rpal_common_data *rcd);
 int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock);
+void rpal_schedule(struct task_struct *next);
+asmlinkage struct task_struct *
+__rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e76376c5172..760d88458b39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6827,6 +6827,12 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (unlikely(is_special_task_state(task_state)))
 		flags |= DEQUEUE_SPECIAL;
 
+#ifdef CONFIG_RPAL
+	/* DELAY_DEQUEUE will cause CPU stalls after lazy switch, skip it */
+	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT))
+		flags |= DEQUEUE_SPECIAL;
+#endif
+
 	/*
 	 * __schedule()			ttwu()
 	 *   prev_state = prev->state;    if (p->on_rq && ...)
@@ -11005,6 +11011,62 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
 #endif	/* CONFIG_SCHED_CLASS_EXT */
 
 #ifdef CONFIG_RPAL
+static struct rq *rpal_finish_task_switch(struct task_struct *prev)
+	__releases(rq->lock)
+{
+	struct rq *rq = this_rq();
+	struct mm_struct *mm = rq->prev_mm;
+
+	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
+		      "corrupted preempt_count: %s/%d/0x%x\n",
+		      current->comm, current->pid, preempt_count()))
+		preempt_count_set(FORK_PREEMPT_COUNT);
+
+	rq->prev_mm = NULL;
+	vtime_task_switch(prev);
+	perf_event_task_sched_in(prev, current);
+	finish_task(prev);
+	tick_nohz_task_switch();
+
+	/* finish_lock_switch, not enable irq */
+	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	__balance_callbacks(rq);
+	raw_spin_rq_unlock(rq);
+
+	finish_arch_post_lock_switch();
+	kcov_finish_switch(current);
+	kmap_local_sched_in();
+
+	fire_sched_in_preempt_notifiers(current);
+	if (mm) {
+		membarrier_mm_sync_core_before_usermode(mm);
+		mmdrop(mm);
+	}
+
+	return rq;
+}
+
+static __always_inline struct rq *rpal_context_switch(struct rq *rq,
+						      struct task_struct *prev,
+						      struct task_struct *next,
+						      struct rq_flags *rf)
+{
+	/* irq is off */
+	prepare_task_switch(rq, prev, next);
+	arch_start_context_switch(prev);
+
+	membarrier_switch_mm(rq, prev->active_mm, next->mm);
+	switch_mm_irqs_off(prev->active_mm, next->mm, next);
+	lru_gen_use_mm(next->mm);
+
+	switch_mm_cid(rq, prev, next);
+
+	prepare_lock_switch(rq, next, rf);
+	__rpal_switch_to(prev, next);
+	barrier();
+	return rpal_finish_task_switch(prev);
+}
+
 #ifdef CONFIG_SCHED_CORE
 static inline struct task_struct *
 __rpal_pick_next_task(struct rq *rq, struct task_struct *prev,
@@ -11214,4 +11276,68 @@ rpal_pick_next_task(struct rq *rq, struct task_struct *prev,
 	BUG();
 }
 #endif
+
+/* enter and exit with irqs disabled() */
+void __sched notrace rpal_schedule(struct task_struct *next)
+{
+	struct task_struct *prev, *picked;
+	bool preempt = false;
+	unsigned long *switch_count;
+	unsigned long prev_state;
+	struct rq_flags rf;
+	struct rq *rq;
+	int cpu;
+
+	/* sched_mode = SM_NONE */
+
+	preempt_disable();
+
+	trace_sched_entry_tp(preempt, CALLER_ADDR0);
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	prev = rq->curr;
+
+	schedule_debug(prev, preempt);
+
+	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
+		hrtick_clear(rq);
+
+	rcu_note_context_switch(preempt);
+	rq_lock(rq, &rf);
+	smp_mb__after_spinlock();
+
+	rq->clock_update_flags <<= 1;
+	update_rq_clock(rq);
+	rq->clock_update_flags = RQCF_UPDATED;
+
+	switch_count = &prev->nivcsw;
+
+	prev_state = READ_ONCE(prev->__state);
+	if (prev_state) {
+		try_to_block_task(rq, prev, &prev_state);
+		switch_count = &prev->nvcsw;
+	}
+
+	picked = rpal_pick_next_task(rq, prev, next, &rf);
+	rq_set_donor(rq, next);
+	if (unlikely(next != picked))
+		panic("rpal error: next != picked\n");
+
+	clear_tsk_need_resched(prev);
+	clear_preempt_need_resched();
+	rq->last_seen_need_resched_ns = 0;
+
+	rq->nr_switches++;
+	RCU_INIT_POINTER(rq->curr, next);
+	++*switch_count;
+	migrate_disable_switch(rq, prev);
+	psi_account_irqtime(rq, prev, next);
+	psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+					     prev->se.sched_delayed);
+	trace_sched_switch(preempt, prev, next, prev_state);
+	rq = rpal_context_switch(rq, prev, next, &rf);
+	trace_sched_exit_tp(true, CALLER_ADDR0);
+	preempt_enable_no_resched();
+}
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 20/35] RPAL: add rpal_ret_from_lazy_switch
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (18 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 19/35] RPAL: add lazy switch main logic Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 21/35] RPAL: add kernel entry handling for lazy switch Bo Li
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

After lazy switch the task before the lazy switch will lose its user mode
context (which is passed to the task after the lazy switch). Therefore,
RPAL needs to handle the issue of the previous task losing its user mode
context.

After the lazy switch occurs, the sender can resume execution in two ways.
One way is to be scheduled by the scheduler. In this case, RPAL handles
this issue in a manner similar to ret_from_fork. the sender will enter
rpal_ret_from_lazy_switch through the constructed stack frame by lazy
switchto execute the return logic and finally return to the pre-defined
user mode (referred to as "kernel return"). The other way is to be switched
back to by the receiver through another lazy switch. In this case, the
receiver will pass the user mode context to the sender, so there is no need
to construct a user mode context for the sender. And the receiver can
return to the user mode through the kernel return method.

rpal_ret_from_lazy_switch primarily handles scheduler cleanup work, similar
to schedule_tail(), but does not perform set_child_tid-otherwise, it might
cause set_child_tid to be executed repeatedly. It then calls
rpal_kernel_ret(), which is primarily used to set the states of the sender
and receiver and attempt to unlock the CPU. Finally, it performs syscall
cleanup work and returns to user mode.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/entry/entry_64.S | 23 ++++++++++++++++++++
 arch/x86/rpal/core.c      | 45 +++++++++++++++++++++++++++++++++++++--
 include/linux/rpal.h      |  5 ++++-
 kernel/sched/core.c       | 25 +++++++++++++++++++++-
 4 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ed04a968cc7d..13b4d0684575 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -169,6 +169,29 @@ SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
 	int3
 SYM_CODE_END(entry_SYSCALL_64)
 
+#ifdef CONFIG_RPAL
+SYM_CODE_START(rpal_ret_from_lazy_switch)
+	UNWIND_HINT_END_OF_STACK
+	ANNOTATE_NOENDBR
+	movq	%rax, %rdi
+	call	rpal_schedule_tail
+
+	movq	%rsp, %rdi
+	call	rpal_kernel_ret
+
+	movq	%rsp, %rdi
+	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
+
+	UNWIND_HINT_REGS
+#ifdef CONFIG_X86_FRED
+	ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp asm_fred_exit_user", X86_FEATURE_FRED
+#else
+	jmp	swapgs_restore_regs_and_return_to_usermode
+#endif
+SYM_CODE_END(rpal_ret_from_lazy_switch)
+#endif
+
 /*
  * %rdi: prev task
  * %rsi: next task
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 19c4ef38bca3..ed4c11e6838c 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -18,7 +18,7 @@ unsigned long rpal_cap;
 
 static inline void rpal_lock_cpu(struct task_struct *tsk)
 {
-	rpal_set_cpus_allowed_ptr(tsk, true);
+	rpal_set_cpus_allowed_ptr(tsk, true, false);
 	if (unlikely(!irqs_disabled())) {
 		local_irq_disable();
 		rpal_err("%s: irq is enabled\n", __func__);
@@ -27,13 +27,54 @@ static inline void rpal_lock_cpu(struct task_struct *tsk)
 
 static inline void rpal_unlock_cpu(struct task_struct *tsk)
 {
-	rpal_set_cpus_allowed_ptr(tsk, false);
+	rpal_set_cpus_allowed_ptr(tsk, false, false);
 	if (unlikely(!irqs_disabled())) {
 		local_irq_disable();
 		rpal_err("%s: irq is enabled\n", __func__);
 	}
 }
 
+static inline void rpal_unlock_cpu_kernel_ret(struct task_struct *tsk)
+{
+	rpal_set_cpus_allowed_ptr(tsk, false, true);
+}
+
+void rpal_kernel_ret(struct pt_regs *regs)
+{
+	struct task_struct *tsk;
+	struct rpal_receiver_call_context *rcc;
+	int state;
+
+	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
+		rcc = current->rpal_rd->rcc;
+		atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET);
+	} else {
+		tsk = current->rpal_sd->receiver;
+		rcc = tsk->rpal_rd->rcc;
+		rpal_clear_task_thread_flag(tsk, RPAL_LAZY_SWITCHED_BIT);
+		state = atomic_xchg(&rcc->sender_state, RPAL_SENDER_STATE_KERNEL_RET);
+		WARN_ON_ONCE(state != RPAL_SENDER_STATE_CALL);
+		/* make sure kernel return is finished */
+		smp_mb();
+		WRITE_ONCE(tsk->rpal_rd->sender, NULL);
+		/*
+		 * We must unlock receiver first, otherwise we may unlock
+		 * receiver which is already locked by another sender.
+		 *
+		 *  Sender A			Receiver B      Sender C
+		 *	lazy switch (A->B)
+		 *  kernel return
+		 *      unlock cpu A
+		 *                      epoll_wait
+		 *                                         lazy switch(C->B)
+		 *                                         lock cpu B
+		 *		unlock cpu B
+		 *						BUG()			BUG()
+		 */
+		rpal_unlock_cpu_kernel_ret(tsk);
+		rpal_unlock_cpu_kernel_ret(current);
+	}
+}
 
 static inline struct task_struct *rpal_get_sender_task(void)
 {
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 0813db4552c0..01b582fa821e 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -480,14 +480,17 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 					 unsigned long addr, int error_code);
 struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
 struct task_struct *rpal_find_next_task(unsigned long fsbase);
+void rpal_kernel_ret(struct pt_regs *regs);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
 	struct rlimit *rlim_stack);
 int rpal_try_to_wake_up(struct task_struct *p);
 int rpal_init_thread_pending(struct rpal_common_data *rcd);
 void rpal_free_thread_pending(struct rpal_common_data *rcd);
-int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock);
+int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock,
+	bool is_kernel_ret);
 void rpal_schedule(struct task_struct *next);
 asmlinkage struct task_struct *
 __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
+asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 760d88458b39..0f9343698198 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3181,7 +3181,8 @@ void rpal_free_thread_pending(struct rpal_common_data *rcd)
 /*
  * CPU lock is forced and all cpumask will be ignored by RPAL temporary.
  */
-int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock)
+int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock,
+							 bool is_kernel_ret)
 {
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
 	struct set_affinity_pending *pending = p->rpal_cd->pending;
@@ -3210,6 +3211,9 @@ int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock)
 		rpal_clear_task_thread_flag(p, RPAL_CPU_LOCKED_BIT);
 	}
 
+	if (is_kernel_ret)
+		return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf);
+
 	update_rq_clock(rq);
 
 	if (cpumask_equal(&p->cpus_mask, ac.new_mask))
@@ -11011,6 +11015,25 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
 #endif	/* CONFIG_SCHED_CLASS_EXT */
 
 #ifdef CONFIG_RPAL
+asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev)
+	__releases(rq->lock)
+{
+	/*
+	 * New tasks start with FORK_PREEMPT_COUNT, see there and
+	 * finish_task_switch() for details.
+	 *
+	 * finish_task_switch() will drop rq->lock() and lower preempt_count
+	 * and the preempt_enable() will end up enabling preemption (on
+	 * PREEMPT_COUNT kernels).
+	 */
+
+	finish_task_switch(prev);
+	trace_sched_exit_tp(true, CALLER_ADDR0);
+	preempt_enable();
+
+	calculate_sigpending();
+}
+
 static struct rq *rpal_finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 21/35] RPAL: add kernel entry handling for lazy switch
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (19 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 20/35] RPAL: add rpal_ret_from_lazy_switch Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 22/35] RPAL: rebuild receiver state Bo Li
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

At the kernel entry point, RPAL performs a lazy switch. Therefore, it is
necessary to hook all kernel entry points to execute the logic related to
the lazy switch. At the kernel entry, apart from some necessary operations
related to the lazy switch (such as ensuring that the general-purpose
registers remain unchanged before and after the lazy switch), the task
before the lazy switch will lose its user mode context (which is passed to
the task after the lazy switch). Therefore, the kernel entry also needs to
handle the issue of the previous task losing its user mode context.

This patch hooks all locations where the transition from user mode to
kernel mode occurs, including entry_SYSCALL_64, error_entry, and
asm_exc_nmi. When the kernel detects a mismatch between the kernel-mode and
user mode contexts, it executes the logic related to the lazy switch.
Taking the switch from the sender to the receiver as an example, the
receiver thread is first locked to the CPU where the sender is located.
Then, the receiver thread in the CALL state is woken up through
rpal_try_to_wake_up(). The general purpose register state (pt_regs) of the
sender is copied to the receiver, and rpal_schedule() is executed to
complete the lazy switch. Regarding the issue of the sender losing its
context, the kernel loads the pre-saved user mode context of the sender
into the sender's pt_regs and constructs the kernel stack frame of the
sender in a manner similar to the fork operation.

The handling of the switch from the receiver to the sender is similar,
except that the receiver will be unlocked from the current CPU, and the
receiver can only return to the user mode through the kernel return method.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/entry/entry_64.S     | 137 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/asm-offsets.c |   3 +
 arch/x86/rpal/core.c          | 137 ++++++++++++++++++++++++++++++++++
 include/linux/rpal.h          |   6 ++
 4 files changed, 283 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 13b4d0684575..59c38627510d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -118,6 +118,20 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	UNTRAIN_RET
 	CLEAR_BRANCH_HISTORY
 
+#ifdef CONFIG_RPAL
+	/*
+	 * We first check if it is a RPAL sender/receiver with
+	 * current->rpal_cd. For non-RPAL task, we just skip it.
+	 * For rpal task, We may need to check if it needs to do
+	 * lazy switch.
+	 */
+	movq	PER_CPU_VAR(current_task), %r13
+	movq	TASK_rpal_cd(%r13), %rax
+	testq	%rax, %rax
+	jz		_do_syscall
+	jmp 	do_rpal_syscall
+_do_syscall:
+#endif
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 	/*
@@ -190,6 +204,101 @@ SYM_CODE_START(rpal_ret_from_lazy_switch)
 	jmp	swapgs_restore_regs_and_return_to_usermode
 #endif
 SYM_CODE_END(rpal_ret_from_lazy_switch)
+
+/* return address offset of stack frame */
+#define RPAL_FRAME_RET_ADDR_OFFSET -56
+
+SYM_CODE_START(do_rpal_syscall)
+	movq	%rsp, %r14
+	call	rpal_syscall_64_context_switch
+	testq   %rax, %rax
+	jz		1f
+
+	/*
+	 * When we come here, everything but stack switching is finished.
+	 * This makes current task use another task's kernel stack. Thus,
+	 * we need to do stack switching here.
+	 *
+	 * At the meanwhile, the previous task's stack content is corrupted,
+	 * we also need to rebuild its stack frames, so that it will jump to
+	 * rpal_ret_from_lazy_switch when it is scheduled in. This is inspired
+	 * by ret_from_fork.
+	 */
+	movq    TASK_threadsp(%rax), %rsp
+#ifdef CONFIG_STACKPROTECTOR
+	movq	TASK_stack_canary(%rax), %rbx
+	movq	%rbx, PER_CPU_VAR(__stack_chk_guard)
+#endif
+	/* rebuild src's frame */
+	movq	$rpal_ret_from_lazy_switch, -8(%r14)
+	leaq	RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx
+	movq	%rbx, TASK_threadsp(%r13)
+
+	movq	%r13, %rdi
+	/*
+	 * Everything of task switch is done, but we still need to do
+	 * a little extra things for lazy switch.
+	 */
+	call	rpal_lazy_switch_tail
+
+1:
+	movq	ORIG_RAX(%rsp), %rsi
+	movq	%rsp, %rdi
+	jmp		_do_syscall
+SYM_CODE_END(do_rpal_syscall)
+
+SYM_CODE_START(do_rpal_error)
+	popq	%r12
+	movq	%rax, %rsp
+	movq	%rax, %r14
+	movq	%rax, %rdi
+	call	rpal_exception_context_switch
+	testq   %rax, %rax
+	jz		1f
+
+	movq	TASK_threadsp(%rax), %rsp
+	ENCODE_FRAME_POINTER
+#ifdef CONFIG_STACKPROTECTOR
+	movq	TASK_stack_canary(%rax), %rbx
+	movq	%rbx, PER_CPU_VAR(__stack_chk_guard)
+#endif
+	/* rebuild src's frame */
+	movq	$rpal_ret_from_lazy_switch, -8(%r14)
+	leaq	RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx
+	movq	%rbx, TASK_threadsp(%r13)
+
+	movq	%r13, %rdi
+	call	rpal_lazy_switch_tail
+1:
+	movq	%rsp, %rax
+	pushq	%r12
+	jmp		_do_error
+SYM_CODE_END(do_rpal_error)
+
+SYM_CODE_START(do_rpal_nmi)
+	movq	%rsp, %r14
+	movq	%rsp, %rdi
+	call	rpal_nmi_context_switch
+	testq   %rax, %rax
+	jz		1f
+
+	movq    TASK_threadsp(%rax), %rsp
+	ENCODE_FRAME_POINTER
+#ifdef CONFIG_STACKPROTECTOR
+	movq	TASK_stack_canary(%rax), %rbx
+	movq	%rbx, PER_CPU_VAR(__stack_chk_guard)
+#endif
+	/* rebuild src's frame */
+	movq	$rpal_ret_from_lazy_switch, -8(%r14)
+	leaq	RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx
+	movq	%rbx, TASK_threadsp(%r13)
+
+	movq	%r13, %rdi
+	call	rpal_lazy_switch_tail
+
+1:
+	jmp		_do_nmi
+SYM_CODE_END(do_rpal_nmi)
 #endif
 
 /*
@@ -1047,7 +1156,22 @@ SYM_CODE_START(error_entry)
 
 	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
 	/* Put us onto the real thread stack. */
+#ifdef CONFIG_RPAL
+	call sync_regs
+	/*
+	 * Check whether we need to perform lazy switch after we
+	 * switch to the real thread stack.
+	 */
+	movq	PER_CPU_VAR(current_task), %r13
+	movq	TASK_rpal_cd(%r13), %rdi
+	testq	%rdi, %rdi
+	jz		_do_error
+	jmp 	do_rpal_error
+_do_error:
+	RET
+#else
 	jmp	sync_regs
+#endif
 
 	/*
 	 * There are two places in the kernel that can potentially fault with
@@ -1206,6 +1330,19 @@ SYM_CODE_START(asm_exc_nmi)
 	IBRS_ENTER
 	UNTRAIN_RET
 
+#ifdef CONFIG_RPAL
+	/*
+	 * Check whether we need to perform lazy switch only when
+	 * we come from userspace.
+	 */
+	movq	PER_CPU_VAR(current_task), %r13
+	movq	TASK_rpal_cd(%r13), %rax
+	testq	%rax, %rax
+	jz		_do_nmi
+	jmp 	do_rpal_nmi
+_do_nmi:
+#endif
+
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 6259b474073b..010202c31b37 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -46,6 +46,9 @@ static void __used common(void)
 #ifdef CONFIG_STACKPROTECTOR
 	OFFSET(TASK_stack_canary, task_struct, stack_canary);
 #endif
+#ifdef CONFIG_RPAL
+	OFFSET(TASK_rpal_cd, task_struct, rpal_cd);
+#endif
 
 	BLANK();
 	OFFSET(pbe_address, pbe, address);
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index ed4c11e6838c..c48df1ce4324 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/rpal.h>
+#include <linux/sched/task_stack.h>
 #include <asm/fsgsbase.h>
 
 #include "internal.h"
@@ -39,6 +40,20 @@ static inline void rpal_unlock_cpu_kernel_ret(struct task_struct *tsk)
 	rpal_set_cpus_allowed_ptr(tsk, false, true);
 }
 
+void rpal_lazy_switch_tail(struct task_struct *tsk)
+{
+	struct rpal_receiver_call_context *rcc;
+
+	if (rpal_test_task_thread_flag(current, RPAL_LAZY_SWITCHED_BIT)) {
+		rcc = current->rpal_rd->rcc;
+		atomic_cmpxchg(&rcc->receiver_state, rpal_build_call_state(tsk->rpal_sd),
+			       RPAL_RECEIVER_STATE_LAZY_SWITCH);
+	} else {
+		rpal_unlock_cpu(tsk);
+		rpal_unlock_cpu(current);
+	}
+}
+
 void rpal_kernel_ret(struct pt_regs *regs)
 {
 	struct task_struct *tsk;
@@ -76,6 +91,87 @@ void rpal_kernel_ret(struct pt_regs *regs)
 	}
 }
 
+static inline void rebuild_stack(struct rpal_task_context *ctx,
+				 struct pt_regs *regs)
+{
+	regs->r12 = ctx->r12;
+	regs->r13 = ctx->r13;
+	regs->r14 = ctx->r14;
+	regs->r15 = ctx->r15;
+	regs->bx = ctx->rbx;
+	regs->bp = ctx->rbp;
+	regs->ip = ctx->rip;
+	regs->sp = ctx->rsp;
+}
+
+static inline void rebuild_sender_stack(struct rpal_sender_data *rsd,
+				 struct pt_regs *regs)
+{
+	rebuild_stack(&rsd->scc->rtc, regs);
+}
+
+static inline void rebuild_receiver_stack(struct rpal_receiver_data *rrd,
+				   struct pt_regs *regs)
+{
+	rebuild_stack(&rrd->rcc->rtc, regs);
+}
+
+static inline void update_dst_stack(struct task_struct *next,
+				    struct pt_regs *src)
+{
+	struct pt_regs *dst;
+
+	dst = task_pt_regs(next);
+	*dst = *src;
+	next->thread.sp = (unsigned long)dst;
+}
+
+/*
+ * rpal_do_kernel_context_switch - the main routine of RPAL lazy switch
+ * @next: task to switch to
+ * @regs: the user pt_regs saved in kernel entry
+ *
+ * This function performs the lazy switch. When switch from sender to
+ * receiver, we need to lock both task to current CPU to avoid double
+ * control flow when we perform lazy switch and after then.
+ */
+static struct task_struct *
+rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
+{
+	struct task_struct *prev = current;
+
+	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) {
+		current->rpal_sd->receiver = next;
+		rpal_lock_cpu(current);
+		rpal_lock_cpu(next);
+		rpal_try_to_wake_up(next);
+		update_dst_stack(next, regs);
+		/*
+		 * When a lazy switch occurs, we need to set the sender's
+		 * user-mode context to a predefined state by the sender.
+		 * Otherwise, sender's user context will be corrupted.
+		 */
+		rebuild_sender_stack(current->rpal_sd, regs);
+		rpal_schedule(next);
+	} else {
+		update_dst_stack(next, regs);
+		/*
+		 * When a lazy switch occurs, we need to set the receiver's
+		 * user-mode context to a predefined state by the receiver.
+		 * Otherwise, sender's user context will be corrupted.
+		 */
+		rebuild_receiver_stack(current->rpal_rd, regs);
+		rpal_schedule(next);
+		rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT);
+		prev->rpal_rd->sender = NULL;
+	}
+	if (unlikely(!irqs_disabled())) {
+		local_irq_disable();
+		rpal_err("%s: irq is enabled\n", __func__);
+	}
+	return next;
+}
+
 static inline struct task_struct *rpal_get_sender_task(void)
 {
 	struct task_struct *next;
@@ -123,6 +219,18 @@ static inline struct task_struct *rpal_misidentify(void)
 	return next;
 }
 
+static inline struct task_struct *
+rpal_kernel_context_switch(struct pt_regs *regs)
+{
+	struct task_struct *next = NULL;
+
+	next = rpal_misidentify();
+	if (unlikely(next != NULL))
+		next = rpal_do_kernel_context_switch(next, regs);
+
+	return next;
+}
+
 struct task_struct *rpal_find_next_task(unsigned long fsbase)
 {
 	struct rpal_service *cur = rpal_current_service();
@@ -147,6 +255,35 @@ struct task_struct *rpal_find_next_task(unsigned long fsbase)
 	return tsk;
 }
 
+__visible struct task_struct *
+rpal_syscall_64_context_switch(struct pt_regs *regs, unsigned long nr)
+{
+	struct task_struct *next;
+
+	next = rpal_kernel_context_switch(regs);
+
+	return next;
+}
+
+__visible struct task_struct *
+rpal_exception_context_switch(struct pt_regs *regs)
+{
+	struct task_struct *next;
+
+	next = rpal_kernel_context_switch(regs);
+
+	return next;
+}
+
+__visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs)
+{
+	struct task_struct *next;
+
+	next = rpal_kernel_context_switch(regs);
+
+	return next;
+}
+
 static bool check_hardware_features(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_FSGSBASE)) {
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 01b582fa821e..b24176f3f245 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -479,7 +479,13 @@ struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs,
 int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 					 unsigned long addr, int error_code);
 struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild);
+__visible struct task_struct *
+rpal_syscall_64_context_switch(struct pt_regs *regs, unsigned long nr);
+__visible struct task_struct *
+rpal_exception_context_switch(struct pt_regs *regs);
+__visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs);
 struct task_struct *rpal_find_next_task(unsigned long fsbase);
+void rpal_lazy_switch_tail(struct task_struct *tsk);
 void rpal_kernel_ret(struct pt_regs *regs);
 
 extern void rpal_pick_mmap_base(struct mm_struct *mm,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 22/35] RPAL: rebuild receiver state
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (20 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 21/35] RPAL: add kernel entry handling for lazy switch Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 23/35] RPAL: resume cpumask when fork Bo Li
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When an RPAL call occurs, the sender modifies the receiver's state. If the
sender exits abnormally after modifying the state or encounters an
unhandled page fault and returns to a recovery point, the receiver's state
will remain as modified by the sender (e.g., in the CALL state). Since the
sender may have exited, the lazy switch will not occur, leaving the
receiver unrecoverable (unable to be woken up via try_to_wake_up()).
Therefore, the kernel must ensure the receiver's state remains valid in
these cases.

This patch addresses this by rebuild receiver's state during unhandled page
faults or sender exits. The kernel detect the fsbase value recorded by
the sender and use the fsbase value to locate the corresponding receiver.
Then kernel checking if the receiver is in the CALL state set by the
sender (using sender_id and service_id carried in the CALL state). If true,
transitioning the receiver from CALL to WAIT state and notifying the
receiver via sender_state that the RPAL call has completed.

This ensures that even if the sender fails, the receiver can recover and
resume normal operation by resetting its state and avoiding permanent
blocking.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/thread.c | 44 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index db3b13ff82be..02c1a9c22dd7 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -224,6 +224,45 @@ int rpal_unregister_receiver(void)
 	return ret;
 }
 
+/* sender may corrupt receiver's state if unexpectedly exited, rebuild it */
+static void rpal_rebuild_receiver_context_on_exit(void)
+{
+	struct task_struct *receiver = NULL;
+	struct rpal_sender_data *rsd = current->rpal_sd;
+	struct rpal_sender_call_context *scc = rsd->scc;
+	struct rpal_receiver_data *rrd;
+	struct rpal_receiver_call_context *rcc;
+	unsigned long fsbase;
+	int state = rpal_build_call_state(rsd);
+
+	if (scc->ec.magic != RPAL_ERROR_MAGIC)
+		goto out;
+
+	fsbase = scc->ec.fsbase;
+	if (rpal_is_correct_address(rpal_current_service(), fsbase))
+		goto out;
+
+	receiver = rpal_find_next_task(fsbase);
+	if (!receiver)
+		goto out;
+
+	rrd = receiver->rpal_rd;
+	if (!rrd)
+		goto out;
+
+	rcc = rrd->rcc;
+
+	if (atomic_read(&rcc->receiver_state) == state) {
+		atomic_cmpxchg(&rcc->sender_state, RPAL_SENDER_STATE_CALL,
+			       RPAL_SENDER_STATE_KERNEL_RET);
+		atomic_cmpxchg(&rcc->receiver_state, state,
+			       RPAL_RECEIVER_STATE_WAIT);
+	}
+
+out:
+	return;
+}
+
 int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 					 unsigned long addr, int error_code)
 {
@@ -232,6 +271,7 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 		unsigned long erip, ersp;
 		int magic;
 
+		rpal_rebuild_receiver_context_on_exit();
 		erip = scc->ec.erip;
 		ersp = scc->ec.ersp;
 		magic = scc->ec.magic;
@@ -249,8 +289,10 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 
 void exit_rpal_thread(void)
 {
-	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT))
+	if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) {
+		rpal_rebuild_receiver_context_on_exit();
 		rpal_unregister_sender();
+	}
 
 	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT))
 		rpal_unregister_receiver();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 23/35] RPAL: resume cpumask when fork
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (21 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 22/35] RPAL: rebuild receiver state Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 24/35] RPAL: critical section optimization Bo Li
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

After a lazy switch occurs, RPAL locks the receiver to the current CPU by
modifying its cpumask. If the receiver performs a fork operation at this
point, the kernel will copy the modified cpumask to the new task, causing
the new task to be permanently locked on the current CPU.

This patch addresses this issue by detecting whether the original task is
locked to the current CPU by RPAL during fork. If locked, assigning the
cpumask that existed before the lazy switch to the new task. This ensures
the new task will not be locked to the current CPU.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/kernel/process.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1d2dac72b9c..be8845e2ca4d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -29,6 +29,7 @@
 #include <trace/events/power.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/entry-common.h>
+#include <linux/rpal.h>
 #include <asm/cpu.h>
 #include <asm/cpuid/api.h>
 #include <asm/apic.h>
@@ -88,6 +89,19 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+#ifdef CONFIG_RPAL
+static void rpal_fix_task_dump(struct task_struct *dst,
+			      struct task_struct *src)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&src->pi_lock, flags);
+	if (rpal_test_task_thread_flag(src, RPAL_CPU_LOCKED_BIT))
+		cpumask_copy(&dst->cpus_mask, &src->rpal_cd->old_mask);
+	raw_spin_unlock_irqrestore(&src->pi_lock, flags);
+}
+#endif
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -100,6 +114,10 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 #ifdef CONFIG_VM86
 	dst->thread.vm86 = NULL;
 #endif
+#ifdef CONFIG_RPAL
+	if (src->rpal_rs)
+		rpal_fix_task_dump(dst, src);
+#endif
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 24/35] RPAL: critical section optimization
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (22 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 23/35] RPAL: resume cpumask when fork Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 25/35] RPAL: add MPK initialization and interface Bo Li
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

The critical section is defined as the user mode code segment within the
receiver that executes when control returns from the receiver to the
sender. This code segment, located in the receiver, involves operations
such as switching the fsbase register and changing the stack pointer.
Handling the critical section can be categorized into two scenarios:

- First Scenario: If no lazy switch has occurred prior to the return and
  the fsbase switch is incomplete, a lazy switch is triggered to
  transition the kernel context from the sender to the receiver. After
  the fsbase is updated in user mode, another lazy switch occurs to revert
  the kernel context from the receiver back to the sender. This results in
  two unnecessary lazy switches.

- Second Scenario: If a lazy switch has already occurred during execution
  of the critical section, the lazy switch can be preemptively triggered.
  This avoids re-entering the kernel solely to initiate another lazy
  switch.

The implementation of the critical section involves modifying the fsbase
register in kernel mode and setting the sender's user mode context to a
predefined state. These steps minimize redundant user/kernel transitions
and lazy switches.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c    | 88 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/rpal/service.c | 12 ++++++
 include/linux/rpal.h    |  6 +++
 3 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index c48df1ce4324..406d54788bac 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -219,14 +219,98 @@ static inline struct task_struct *rpal_misidentify(void)
 	return next;
 }
 
+static bool in_ret_section(struct rpal_service *rs, unsigned long ip)
+{
+	return ip >= rs->rsm.rcs.ret_begin && ip < rs->rsm.rcs.ret_end;
+}
+
+/*
+ * rpal_update_fsbase - fastpath when RPAL call returns
+ * @regs: pt_regs saved in kernel entry
+ *
+ * If the user is executing rpal call return code and it does
+ * not update fsbase yet, force fsbase update to perform a
+ * lazy switch immediately.
+ */
+static inline void rpal_update_fsbase(struct pt_regs *regs)
+{
+	struct rpal_service *cur = rpal_current_service();
+	struct task_struct *sender = current->rpal_rd->sender;
+
+	if (in_ret_section(cur, regs->ip))
+		wrfsbase(sender->thread.fsbase);
+}
+
+/*
+ * rpal_skip_receiver_code - skip rpal call return code
+ * @next: the next task to be lazy switched to.
+ * @regs: pt_regs saved in kernel entry
+ *
+ * If the user is executing rpal call return code and we are about
+ * to perform a lazy switch, skip the remaining return code to
+ * release receiver's stack. This avoids stack conflict when there
+ * are more than one senders calls the receiver.
+ */
+static inline void rpal_skip_receiver_code(struct task_struct *next,
+					   struct pt_regs *regs)
+{
+	rebuild_sender_stack(next->rpal_sd, regs);
+}
+
+/*
+ * rpal_skip_receiver_code - skip lazy switch when rpal call return
+ * @next: the next task to be lazy switched to.
+ * @regs: pt_regs saved in kernel entry
+ *
+ * If the user is executing rpal call return code and we have not
+ * performed a lazy switch, there is no need to perform lazy switch
+ * now. Update fsbase and other states to avoid lazy switch.
+ */
+static inline struct task_struct *
+rpal_skip_lazy_switch(struct task_struct *next, struct pt_regs *regs)
+{
+	struct rpal_service *tgt;
+
+	tgt = next->rpal_rs;
+	if (in_ret_section(tgt, regs->ip)) {
+		wrfsbase(current->thread.fsbase);
+		rebuild_sender_stack(current->rpal_sd, regs);
+		rpal_clear_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT);
+		next->rpal_rd->sender = NULL;
+		next = NULL;
+	}
+	return next;
+}
+
+static struct task_struct *rpal_fix_critical_section(struct task_struct *next,
+						     struct pt_regs *regs)
+{
+	struct rpal_service *cur = rpal_current_service();
+
+	/* sender->receiver */
+	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT))
+		next = rpal_skip_lazy_switch(next, regs);
+	/* receiver->sender */
+	else if (rpal_is_correct_address(cur, regs->ip))
+		rpal_skip_receiver_code(next, regs);
+
+	return next;
+}
+
 static inline struct task_struct *
 rpal_kernel_context_switch(struct pt_regs *regs)
 {
 	struct task_struct *next = NULL;
 
+	if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT))
+		rpal_update_fsbase(regs);
+
 	next = rpal_misidentify();
-	if (unlikely(next != NULL))
-		next = rpal_do_kernel_context_switch(next, regs);
+	if (unlikely(next != NULL)) {
+		next = rpal_fix_critical_section(next, regs);
+		if (next)
+			next = rpal_do_kernel_context_switch(next, regs);
+	}
 
 	return next;
 }
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 49458321e7dc..16e94d710445 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -545,6 +545,13 @@ int rpal_release_service(u64 key)
 	return ret;
 }
 
+static bool rpal_check_critical_section(struct rpal_service *rs,
+			     struct rpal_critical_section *rcs)
+{
+	return rpal_is_correct_address(rs, rcs->ret_begin) &&
+	       rpal_is_correct_address(rs, rcs->ret_end);
+}
+
 int rpal_enable_service(unsigned long arg)
 {
 	struct rpal_service *cur = rpal_current_service();
@@ -562,6 +569,11 @@ int rpal_enable_service(unsigned long arg)
 		goto out;
 	}
 
+	if (!rpal_check_critical_section(cur, &rsm.rcs)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	mutex_lock(&cur->mutex);
 	if (!cur->enabled) {
 		cur->rsm = rsm;
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index b24176f3f245..4f1d92053818 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -122,12 +122,18 @@ enum rpal_sender_state {
 	RPAL_SENDER_STATE_KERNEL_RET,
 };
 
+struct rpal_critical_section {
+	unsigned long ret_begin;
+	unsigned long ret_end;
+};
+
 /*
  * user_meta will be sent to other service when requested.
  */
 struct rpal_service_metadata {
 	unsigned long version;
 	void __user *user_meta;
+	struct rpal_critical_section rcs;
 };
 
 struct rpal_request_arg {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 25/35] RPAL: add MPK initialization and interface
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (23 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 24/35] RPAL: critical section optimization Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 26/35] RPAL: enable MPK support Bo Li
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL uses MPK (Memory Protection Keys) to protect memory. Therefore, RPAL
needs to perform MPK initialization, allocation, and other related tasks,
while providing corresponding user-mode interfaces.

This patch executes MPK initialization operations, including feature
detection, implementation of user mode interfaces for setting and
retrieving pkeys, and development of utility functions. For pkey
allocation, RPAL prioritizes using pkeys provided by user mode, with user
mode responsible for preventing pkey collisions between different services.
If user mode does not provide a valid pkey, RPAL generates a pkey via
id % arch_max_pkey() to maximize the avoidance of pkey collisions.
Additionally, RPAL does not permit services to manipulate pkeys
independently; thus, all pkeys are marked as allocated, and services are
prohibited from releasing pkeys.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/Kconfig    | 12 +++++++-
 arch/x86/rpal/Makefile   |  1 +
 arch/x86/rpal/core.c     | 13 ++++++++
 arch/x86/rpal/internal.h |  5 +++
 arch/x86/rpal/pku.c      | 47 ++++++++++++++++++++++++++++
 arch/x86/rpal/proc.c     |  5 +++
 arch/x86/rpal/service.c  | 24 +++++++++++++++
 include/linux/rpal.h     | 66 ++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c            |  9 ++++++
 9 files changed, 181 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/rpal/pku.c

diff --git a/arch/x86/rpal/Kconfig b/arch/x86/rpal/Kconfig
index e5e6996553ea..5434fdb2940d 100644
--- a/arch/x86/rpal/Kconfig
+++ b/arch/x86/rpal/Kconfig
@@ -8,4 +8,14 @@ config RPAL
 	depends on X86_64
 	help
 		This option enables system support for Run Process As
-		library (RPAL).
\ No newline at end of file
+		library (RPAL).
+
+config RPAL_PKU
+	bool "mpk protection for RPAL"
+	default y
+	depends on RPAL
+	help
+		Memory protection key (MPK) can achieve intra-process
+		memory separation which is broken by RPAL, Always keep
+		it on when use RPAL. CPU feature will be detected at
+		boot time as some CPUs do not support it.
\ No newline at end of file
diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index 89f745382c51..42a42b0393be 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -3,3 +3,4 @@
 obj-$(CONFIG_RPAL)		+= rpal.o
 
 rpal-y := service.o core.o mm.o proc.o thread.o
+rpal-$(CONFIG_RPAL_PKU) += pku.o
\ No newline at end of file
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 406d54788bac..41111d693994 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -8,6 +8,7 @@
 
 #include <linux/rpal.h>
 #include <linux/sched/task_stack.h>
+#include <linux/pkeys.h>
 #include <asm/fsgsbase.h>
 
 #include "internal.h"
@@ -374,6 +375,14 @@ static bool check_hardware_features(void)
 		rpal_err("no fsgsbase feature\n");
 		return false;
 	}
+
+#ifdef CONFIG_RPAL_PKU
+	if (!arch_pkeys_enabled()) {
+		rpal_err("MPK is not enabled\n");
+		return false;
+	}
+#endif
+
 	return true;
 }
 
@@ -390,6 +399,10 @@ int __init rpal_init(void)
 	if (ret)
 		goto fail;
 
+#ifdef CONFIG_RPAL_PKU
+	rpal_set_cap(RPAL_CAP_PKU);
+#endif
+
 	rpal_inited = true;
 	return 0;
 
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 6256172bb79e..71afa8225450 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -54,3 +54,8 @@ rpal_build_call_state(const struct rpal_sender_data *rsd)
 	return ((rsd->rcd.service_id << RPAL_SID_SHIFT) |
 		(rsd->scc->sender_id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CALL);
 }
+
+/* pkey.c */
+int rpal_alloc_pkey(struct rpal_service *rs, int pkey);
+int rpal_pkey_setup(struct rpal_service *rs, int pkey);
+void rpal_service_pku_init(void);
diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c
new file mode 100644
index 000000000000..4c5151ca5b8b
--- /dev/null
+++ b/arch/x86/rpal/pku.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ *     Author: Jiadong Sun <sunjiadong.lff@bytedance.com>
+ */
+
+#include <linux/rpal.h>
+#include <linux/pkeys.h>
+
+#include "internal.h"
+
+void rpal_service_pku_init(void)
+{
+	u16 all_pkeys_mask = ((1U << arch_max_pkey()) - 1);
+	struct mm_struct *mm = current->mm;
+
+	/* We consume all pkeys so that no pkeys will be allocated by others */
+	mmap_write_lock(mm);
+	if (mm->context.pkey_allocation_map != 0x1)
+		rpal_err("pkey has been allocated: %u\n",
+			 mm->context.pkey_allocation_map);
+	mm->context.pkey_allocation_map = all_pkeys_mask;
+	mmap_write_unlock(mm);
+}
+
+int rpal_pkey_setup(struct rpal_service *rs, int pkey)
+{
+	int val;
+
+	val = rpal_pkey_to_pkru(pkey);
+	rs->pkey = pkey;
+	return 0;
+}
+
+int rpal_alloc_pkey(struct rpal_service *rs, int pkey)
+{
+	int ret;
+
+	if (pkey >= 0 && pkey < arch_max_pkey())
+		return pkey;
+
+	ret = rs->id % arch_max_pkey();
+
+	return ret;
+}
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index 16ac9612bfc5..2f9cceec4992 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -76,6 +76,11 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case RPAL_IOCTL_RELEASE_SERVICE:
 		ret = rpal_release_service(arg);
 		break;
+#ifdef CONFIG_RPAL_PKU
+	case RPAL_IOCTL_GET_SERVICE_PKEY:
+		ret = put_user(cur->pkey, (int __user *)arg);
+		break;
+#endif
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 16e94d710445..ca795dacc90d 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -208,6 +208,10 @@ struct rpal_service *rpal_register_service(void)
 	spin_lock_init(&rs->rpd.poll_lock);
 	bitmap_zero(rs->rpd.dead_key_bitmap, RPAL_NR_ID);
 	init_waitqueue_head(&rs->rpd.rpal_waitqueue);
+#ifdef CONFIG_RPAL_PKU
+	rs->pkey = -1;
+	rpal_service_pku_init();
+#endif
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -288,6 +292,9 @@ static int add_mapped_service(struct rpal_service *rs, struct rpal_service *tgt,
 	if (node->rs == NULL) {
 		node->rs = rpal_get_service(tgt);
 		set_bit(type_bit, &node->type);
+#ifdef CONFIG_RPAL_PKU
+		node->pkey = tgt->pkey;
+#endif
 	} else {
 		if (node->rs != tgt) {
 			ret = -EINVAL;
@@ -397,6 +404,19 @@ int rpal_request_service(unsigned long arg)
 		goto put_service;
 	}
 
+#ifdef CONFIG_RPAL_PKU
+	if (cur->pkey == tgt->pkey) {
+		ret = -EINVAL;
+		goto put_service;
+	}
+
+	ret = put_user(tgt->pkey, rra.pkey);
+	if (ret) {
+		ret = -EFAULT;
+		goto put_service;
+	}
+#endif
+
 	ret = put_user((unsigned long)(tgt->rsm.user_meta), rra.user_metap);
 	if (ret) {
 		ret = -EFAULT;
@@ -577,6 +597,10 @@ int rpal_enable_service(unsigned long arg)
 	mutex_lock(&cur->mutex);
 	if (!cur->enabled) {
 		cur->rsm = rsm;
+#ifdef CONFIG_RPAL_PKU
+		rsm.pkey = rpal_alloc_pkey(cur, rsm.pkey);
+		rpal_pkey_setup(cur, rsm.pkey);
+#endif
 		cur->enabled = true;
 	}
 	mutex_unlock(&cur->mutex);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 4f1d92053818..2f2982d281cc 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -97,6 +97,12 @@ enum {
 #define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK))
 #define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1)
 
+#define RPAL_PKRU_BASE_CODE_READ 0xAAAAAAAA
+#define RPAL_PKRU_BASE_CODE 0xFFFFFFFF
+#define RPAL_PKRU_SET 0
+#define RPAL_PKRU_UNION 1
+#define RPAL_PKRU_INTERSECT 2
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -122,6 +128,10 @@ enum rpal_sender_state {
 	RPAL_SENDER_STATE_KERNEL_RET,
 };
 
+enum rpal_capability {
+	RPAL_CAP_PKU
+};
+
 struct rpal_critical_section {
 	unsigned long ret_begin;
 	unsigned long ret_end;
@@ -134,6 +144,7 @@ struct rpal_service_metadata {
 	unsigned long version;
 	void __user *user_meta;
 	struct rpal_critical_section rcs;
+	int pkey;
 };
 
 struct rpal_request_arg {
@@ -141,11 +152,17 @@ struct rpal_request_arg {
 	u64 key;
 	unsigned long __user *user_metap;
 	int __user *id;
+#ifdef CONFIG_RPAL_PKU
+	int __user *pkey;
+#endif
 };
 
 struct rpal_mapped_service {
 	unsigned long type;
 	struct rpal_service *rs;
+#ifdef CONFIG_RPAL_PKU
+	int pkey;
+#endif
 };
 
 struct rpal_poll_data {
@@ -220,6 +237,11 @@ struct rpal_service {
 	/* fsbase / pid map */
 	struct rpal_fsbase_tsk_map fs_tsk_map[RPAL_MAX_RECEIVER_NUM];
 
+#ifdef CONFIG_RPAL_PKU
+	/* pkey */
+	int pkey;
+#endif
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -323,6 +345,7 @@ enum rpal_command_type {
 	RPAL_CMD_DISABLE_SERVICE,
 	RPAL_CMD_REQUEST_SERVICE,
 	RPAL_CMD_RELEASE_SERVICE,
+	RPAL_CMD_GET_SERVICE_PKEY,
 	RPAL_NR_CMD,
 };
 
@@ -351,6 +374,8 @@ enum rpal_command_type {
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long)
 #define RPAL_IOCTL_RELEASE_SERVICE \
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long)
+#define RPAL_IOCTL_GET_SERVICE_PKEY \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *)
 
 #define rpal_for_each_requested_service(rs, idx)                             \
 	for (idx = find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \
@@ -420,6 +445,47 @@ static inline bool rpal_is_correct_address(struct rpal_service *rs, unsigned lon
 	return true;
 }
 
+static inline void rpal_set_cap(unsigned long cap)
+{
+	set_bit(cap, &rpal_cap);
+}
+
+static inline void rpal_clear_cap(unsigned long cap)
+{
+	clear_bit(cap, &rpal_cap);
+}
+
+static inline bool rpal_has_cap(unsigned long cap)
+{
+	return test_bit(cap, &rpal_cap);
+}
+
+static inline u32 rpal_pkey_to_pkru(int pkey)
+{
+	int offset = pkey * 2;
+	u32 mask = 0x3 << offset;
+
+	return RPAL_PKRU_BASE_CODE & ~mask;
+}
+
+static inline u32 rpal_pkey_to_pkru_read(int pkey)
+{
+	int offset = pkey * 2;
+	u32 mask = 0x3 << offset;
+
+	return RPAL_PKRU_BASE_CODE_READ & ~mask;
+}
+
+static inline u32 rpal_pkru_union(u32 pkru0, u32 pkru1)
+{
+	return pkru0 & pkru1;
+}
+
+static inline u32 rpal_pkru_intersect(u32 pkru0, u32 pkru1)
+{
+	return pkru0 | pkru1;
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 62c1f7945741..982f911ffaba 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,6 +33,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
 #include <uapi/linux/mman.h>
+#include <linux/rpal.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -895,6 +896,14 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 {
 	int ret;
 
+#ifdef CONFIG_RPAL_PKU
+	if (rpal_current_service()) {
+		rpal_err("try_to_free pkey: %d %s\n", current->pid,
+			 current->comm);
+		return -EINVAL;
+	}
+#endif
+
 	mmap_write_lock(current->mm);
 	ret = mm_pkey_free(current->mm, pkey);
 	mmap_write_unlock(current->mm);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 26/35] RPAL: enable MPK support
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (24 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 25/35] RPAL: add MPK initialization and interface Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30 17:03   ` Dave Hansen
  2025-05-30  9:27 ` [RFC v2 27/35] RPAL: add epoll support Bo Li
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

RPAL leverages Memory Protection Keys (MPK) to safeguard shared memory
from illegal access and corruption by other processes. MPK-based memory
protection involves two key mechanisms: First, for already allocated
memory, when RPAL is enabled, the protection key fields in all page tables
must be set to the process’s corresponding pkey value. Second, for newly
allocated memory, when the kernel detects that the process is an RPAL
service, it sets the corresponding pkey flag in the relevant memory data
structures. Together, these measures ensure that all memory belonging to
the current process is protected by its own pkey.

For MPK initialization, RPAL needs to set the pkeys of all allocated page
table pages to the pkeys assigned by RPAL to the service. This is
completed in three steps: First, enable permissions for all pkeys of the
service, allowing it to access memory protected by any pkey. Then, update
the pkeys in the page tables. Since permissions for all pkeys are already
enabled at this stage, even if old and new pkeys coexist during the page
table update, the service's memory access remains unaffected. Finally,
after the page table update is complete, set the service's pkey permissions
to the corresponding values, thereby achieving memory protection.

Additionally, RPAL must manage the values of the PKRU register during
lazy switch operations and signal handling. This ensures the process
avoids coredumps causing by MPK.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/kernel/cpu/common.c |   8 +-
 arch/x86/kernel/fpu/core.c   |   8 +-
 arch/x86/kernel/process.c    |   7 +-
 arch/x86/rpal/core.c         |  14 +++-
 arch/x86/rpal/internal.h     |   1 +
 arch/x86/rpal/pku.c          | 139 ++++++++++++++++++++++++++++++++++-
 arch/x86/rpal/service.c      |   1 +
 arch/x86/rpal/thread.c       |   5 ++
 include/linux/rpal.h         |   3 +
 kernel/sched/core.c          |   3 +
 mm/mmap.c                    |  12 +++
 mm/mprotect.c                |  96 ++++++++++++++++++++++++
 mm/vma.c                     |  18 +++++
 13 files changed, 310 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8feb8fd2957a..2678453cdf76 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -26,6 +26,7 @@
 #include <linux/pgtable.h>
 #include <linux/stackprotector.h>
 #include <linux/utsname.h>
+#include <linux/rpal.h>
 
 #include <asm/alternative.h>
 #include <asm/cmdline.h>
@@ -532,7 +533,12 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
 
 	cr4_set_bits(X86_CR4_PKE);
 	/* Load the default PKRU value */
-	pkru_write_default();
+#ifdef CONFIG_RPAL_PKU
+	if (rpal_current_service() && rpal_current_service()->pku_on)
+		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	else
+#endif
+		pkru_write_default();
 }
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ea138583dd92..251b1ddee726 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/pkeys.h>
 #include <linux/vmalloc.h>
+#include <linux/rpal.h>
 
 #include "context.h"
 #include "internal.h"
@@ -746,7 +747,12 @@ static inline void restore_fpregs_from_init_fpstate(u64 features_mask)
 	else
 		frstor(&init_fpstate.regs.fsave);
 
-	pkru_write_default();
+#ifdef CONFIG_RPAL_PKU
+	if (rpal_current_service() && rpal_current_service()->pku_on)
+		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	else
+#endif
+		pkru_write_default();
 }
 
 /*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index be8845e2ca4d..b74de35218f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -285,7 +285,12 @@ static void pkru_flush_thread(void)
 	 * If PKRU is enabled the default PKRU value has to be loaded into
 	 * the hardware right here (similar to context switch).
 	 */
-	pkru_write_default();
+#ifdef CONFIG_RPAL_PKU
+	if (rpal_current_service() && rpal_current_service()->pku_on)
+		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	else
+#endif
+		pkru_write_default();
 }
 
 void flush_thread(void)
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 41111d693994..47c9e551344e 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -275,6 +275,13 @@ rpal_skip_lazy_switch(struct task_struct *next, struct pt_regs *regs)
 	tgt = next->rpal_rs;
 	if (in_ret_section(tgt, regs->ip)) {
 		wrfsbase(current->thread.fsbase);
+#ifdef CONFIG_RPAL_PKU
+		rpal_set_current_pkru(
+			rpal_pkru_union(
+				rpal_pkey_to_pkru(rpal_current_service()->pkey),
+				rpal_pkey_to_pkru(next->rpal_rs->pkey)),
+			RPAL_PKRU_SET);
+#endif
 		rebuild_sender_stack(current->rpal_sd, regs);
 		rpal_clear_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT);
 		next->rpal_rd->sender = NULL;
@@ -292,8 +299,13 @@ static struct task_struct *rpal_fix_critical_section(struct task_struct *next,
 	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT))
 		next = rpal_skip_lazy_switch(next, regs);
 	/* receiver->sender */
-	else if (rpal_is_correct_address(cur, regs->ip))
+	else if (rpal_is_correct_address(cur, regs->ip)) {
 		rpal_skip_receiver_code(next, regs);
+#ifdef CONFIG_RPAL_PKU
+		write_pkru(rpal_pkru_union(
+			rpal_pkey_to_pkru(next->rpal_rs->pkey), rdpkru()));
+#endif
+	}
 
 	return next;
 }
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index 71afa8225450..e49febce8645 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -58,4 +58,5 @@ rpal_build_call_state(const struct rpal_sender_data *rsd)
 /* pkey.c */
 int rpal_alloc_pkey(struct rpal_service *rs, int pkey);
 int rpal_pkey_setup(struct rpal_service *rs, int pkey);
+void rpal_set_current_pkru(u32 val, int mode);
 void rpal_service_pku_init(void);
diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c
index 4c5151ca5b8b..26cef324f41f 100644
--- a/arch/x86/rpal/pku.c
+++ b/arch/x86/rpal/pku.c
@@ -25,12 +25,149 @@ void rpal_service_pku_init(void)
 	mmap_write_unlock(mm);
 }
 
+void rpal_set_pku_schedule_tail(struct task_struct *prev)
+{
+	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
+		struct rpal_service *cur = rpal_current_service();
+		u32 val = rpal_pkey_to_pkru(cur->pkey);
+
+		rpal_set_current_pkru(val, RPAL_PKRU_SET);
+	} else {
+		struct rpal_service *cur = rpal_current_service();
+		u32 val = rpal_pkey_to_pkru(cur->pkey);
+
+		val = rpal_pkru_union(
+			val,
+			rpal_pkey_to_pkru(
+				current->rpal_sd->receiver->rpal_rs->pkey));
+		rpal_set_current_pkru(val, RPAL_PKRU_SET);
+	}
+}
+
+static inline u32 rpal_get_new_val(u32 old_val, u32 new_val, int mode)
+{
+	switch (mode) {
+	case RPAL_PKRU_SET:
+		return new_val;
+	case RPAL_PKRU_UNION:
+		return rpal_pkru_union(old_val, new_val);
+	case RPAL_PKRU_INTERSECT:
+		return rpal_pkru_intersect(old_val, new_val);
+	default:
+		rpal_err("%s: invalid mode: %d\n", __func__, mode);
+		return old_val;
+	}
+}
+
+static int rpal_set_task_fpu_pkru(struct task_struct *task, u32 val, int mode)
+{
+	struct thread_struct *t = &task->thread;
+
+	val = rpal_get_new_val(t->pkru, val, mode);
+	t->pkru = val;
+
+	return 0;
+}
+
+void rpal_set_current_pkru(u32 val, int mode)
+{
+	u32 new_val;
+
+	new_val = rpal_get_new_val(rdpkru(), val, mode);
+	write_pkru(new_val);
+}
+
+struct task_function_data {
+	struct task_struct *task;
+	u32 val;
+	int mode;
+	int ret;
+};
+
+static void rpal_set_remote_pkru(void *data)
+{
+	struct task_function_data *tfd = data;
+	struct task_struct *task = tfd->task;
+
+	if (task) {
+		/* -EAGAIN */
+		if (task_cpu(task) != smp_processor_id())
+			return;
+
+		tfd->ret = -ESRCH;
+		if (task == current) {
+			rpal_set_current_pkru(tfd->val, tfd->mode);
+			tfd->ret = 0;
+		} else {
+			tfd->ret = rpal_set_task_fpu_pkru(task, tfd->val,
+							  tfd->mode);
+		}
+		return;
+	}
+}
+
+static int rpal_task_function_call(struct task_struct *task, u32 val, int mode)
+{
+	struct task_function_data data = {
+		.task = task,
+		.val = val,
+		.mode = mode,
+		.ret = -EAGAIN,
+	};
+	int ret;
+
+	for (;;) {
+		smp_call_function_single(task_cpu(task), rpal_set_remote_pkru,
+					 &data, 1);
+		ret = data.ret;
+
+		if (ret != -EAGAIN)
+			break;
+
+		cond_resched();
+	}
+
+	return ret;
+}
+
+static void rpal_set_task_pkru(struct task_struct *task, u32 val, int mode)
+{
+	if (task == current)
+		rpal_set_current_pkru(val, mode);
+	else
+		rpal_task_function_call(task, val, mode);
+}
+
+static void rpal_set_group_pkru(u32 val, int mode)
+{
+	struct task_struct *p;
+
+	for_each_thread(current, p) {
+		rpal_set_task_pkru(p, val, mode);
+	}
+}
+
 int rpal_pkey_setup(struct rpal_service *rs, int pkey)
 {
-	int val;
+	int err, val;
 
 	val = rpal_pkey_to_pkru(pkey);
+
+	mmap_write_lock(current->mm);
+	if (rs->pku_on) {
+		mmap_write_unlock(current->mm);
+		return 0;
+	}
 	rs->pkey = pkey;
+	/* others must see rs->pkey before rs->pku_on */
+	barrier();
+	rs->pku_on = true;
+	mmap_write_unlock(current->mm);
+	rpal_set_group_pkru(val, RPAL_PKRU_UNION);
+	err = do_rpal_mprotect_pkey(rs->base, RPAL_ADDR_SPACE_SIZE, pkey);
+	if (unlikely(err))
+		rpal_err("do_rpal_mprotect_key error: %d\n", err);
+	rpal_set_group_pkru(val, RPAL_PKRU_SET);
 	return 0;
 }
 
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index ca795dacc90d..7a83e85cf096 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -210,6 +210,7 @@ struct rpal_service *rpal_register_service(void)
 	init_waitqueue_head(&rs->rpd.rpal_waitqueue);
 #ifdef CONFIG_RPAL_PKU
 	rs->pkey = -1;
+	rs->pku_on = false;
 	rpal_service_pku_init();
 #endif
 
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index 02c1a9c22dd7..fcc592baaac0 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -281,6 +281,11 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 			regs->sp = ersp;
 			/* avoid rebuild again */
 			scc->ec.magic = 0;
+#ifdef CONFIG_RPAL_PKU
+			rpal_set_current_pkru(
+				rpal_pkey_to_pkru(rpal_current_service()->pkey),
+				RPAL_PKRU_SET);
+#endif
 			return 0;
 		}
 	}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 2f2982d281cc..f2474cb53abe 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -239,6 +239,7 @@ struct rpal_service {
 
 #ifdef CONFIG_RPAL_PKU
 	/* pkey */
+	bool pku_on;
 	int pkey;
 #endif
 
@@ -571,4 +572,6 @@ void rpal_schedule(struct task_struct *next);
 asmlinkage struct task_struct *
 __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
 asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev);
+int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey);
+void rpal_set_pku_schedule_tail(struct task_struct *prev);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f9343698198..eb5d5bd51597 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11029,6 +11029,9 @@ asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev)
 
 	finish_task_switch(prev);
 	trace_sched_exit_tp(true, CALLER_ADDR0);
+#ifdef CONFIG_RPAL_PKU
+	rpal_set_pku_schedule_tail(prev);
+#endif
 	preempt_enable();
 
 	calculate_sigpending();
diff --git a/mm/mmap.c b/mm/mmap.c
index 98bb33d2091e..d36ea4ea2bd0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -396,6 +396,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (pkey < 0)
 			pkey = 0;
 	}
+#ifdef CONFIG_RPAL_PKU
+	/*
+	 * For RPAL process, if pku is enabled, we always use
+	 * its service pkey for new vma.
+	 */
+	do {
+		struct rpal_service *cur = rpal_current_service();
+
+		if (cur && cur->pku_on)
+			pkey = cur->pkey;
+	} while (0);
+#endif
 
 	/* Do simple checking here so the lower-level routines won't have
 	 * to. we assume access permissions have been handled by the open
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 982f911ffaba..e9ae828e377d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -713,6 +713,18 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	struct mmu_gather tlb;
 	struct vma_iterator vmi;
 
+#ifdef CONFIG_RPAL_PKU
+	if (pkey != -1) {
+		struct rpal_service *cur = rpal_current_service();
+
+		if (unlikely(cur) && cur->pku_on) {
+			rpal_err("%s, pid: %d, try to change pkey\n",
+				 current->comm, current->pid);
+			return -EINVAL;
+		}
+	}
+#endif
+
 	start = untagged_addr(start);
 
 	prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
@@ -848,6 +860,90 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	return error;
 }
 
+#ifdef CONFIG_RPAL_PKU
+int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey)
+{
+	unsigned long nstart, end, tmp;
+	struct vm_area_struct *vma, *prev;
+	struct rpal_service *cur = rpal_current_service();
+	int error = -EINVAL;
+	struct mmu_gather tlb;
+	struct vma_iterator vmi;
+
+	start = untagged_addr(start);
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (!len)
+		return 0;
+	len = PAGE_ALIGN(len);
+	end = start + len;
+	if (end <= start)
+		return -ENOMEM;
+
+	if (mmap_write_lock_killable(current->mm))
+		return -EINTR;
+
+	/*
+	 * If userspace did not allocate the pkey, do not let
+	 * them use it here.
+	 */
+	error = -EINVAL;
+	if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
+		goto out;
+
+	vma_iter_init(&vmi, current->mm, start);
+	vma = vma_find(&vmi, end);
+	error = -ENOMEM;
+	if (!vma)
+		goto out;
+
+	prev = vma_prev(&vmi);
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	tlb_gather_mmu(&tlb, current->mm);
+	nstart = start;
+	tmp = vma->vm_start;
+	for_each_vma_range(vmi, vma, end) {
+		unsigned long vma_pkey_mask;
+		unsigned long newflags;
+
+		tmp = vma->vm_start;
+		nstart = tmp;
+
+		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
+		vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 |
+				VM_PKEY_BIT3;
+		newflags = vma->vm_flags;
+		newflags &= ~vma_pkey_mask;
+		newflags |= ((unsigned long)cur->pkey) << VM_PKEY_SHIFT;
+
+		tmp = vma->vm_end;
+		if (tmp > end)
+			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->mprotect) {
+			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
+			if (error)
+				break;
+		}
+
+		error = mprotect_fixup(&vmi, &tlb, vma, &prev, nstart, tmp, newflags);
+		if (error)
+			break;
+	}
+	tlb_finish_mmu(&tlb);
+
+out:
+	mmap_write_unlock(current->mm);
+	return error;
+}
+#endif
+
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
diff --git a/mm/vma.c b/mm/vma.c
index a468d4c29c0c..fa9d8f694e6e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -4,6 +4,8 @@
  * VMA-specific functions.
  */
 
+#include <linux/rpal.h>
+
 #include "vma_internal.h"
 #include "vma.h"
 
@@ -2622,6 +2624,22 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = current->mm;
 
+#ifdef CONFIG_RPAL_PKU
+	/*
+	 * Any memory need to use RPAL service pkey
+	 * once service is enabled.
+	 */
+	struct rpal_service *cur = rpal_current_service();
+	unsigned long vma_pkey_mask;
+
+	if (cur && cur->pku_on) {
+		vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 |
+				VM_PKEY_BIT3;
+		flags &= ~vma_pkey_mask;
+		flags |= ((unsigned long)cur->pkey) << VM_PKEY_SHIFT;
+	}
+#endif
+
 	/*
 	 * Check against address space limits by the changed size
 	 * Note: This happens *after* clearing old mappings in some code paths.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC v2 26/35] RPAL: enable MPK support
  2025-05-30  9:27 ` [RFC v2 26/35] RPAL: enable MPK support Bo Li
@ 2025-05-30 17:03   ` Dave Hansen
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Hansen @ 2025-05-30 17:03 UTC (permalink / raw)
  To: Bo Li, tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

On 5/30/25 02:27, Bo Li wrote:
> RPAL leverages Memory Protection Keys (MPK) to safeguard shared memory
> from illegal access and corruption by other processes. 

... as long as nobody uses the completely unprivileged WRPKRU or XRSTOR
instructions. Right?

... the instructions that are plainly available in super obscure
libraries like glibc?

This seems like a rather major oversight.

Oh, speaking of major oversights, you stymied pkey_alloc() but forgot
pkey_free(). There's nothing to stop folks from pkey_free()'ing a pkey
that RPAL is using and then letting someone else allocate it.

Is this all for real and serious?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v2 27/35] RPAL: add epoll support
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (25 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 26/35] RPAL: enable MPK support Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 28/35] RPAL: add rpal_uds_fdmap() support Bo Li
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

To support the epoll family, RPAL needs to add new logic for RPAL services
to the existing epoll logic, ensuring that user mode can execute RPAL
service-related logic through identical interfaces.

When the receiver thread calls epoll_wait(), it can set RPAL_EP_POLL_MAGIC
to notify the kernel to invoke RPAL-related logic. The kernel then sets the
receiver's state to RPAL_RECEIVER_STATE_READY and transitions it to
RPAL_RECEIVER_STATE_WAIT when the receiver is actually removed from the
runqueue, allowing the sender to perform RPAL calls on the receiver thread.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c |   4 +
 fs/eventpoll.c       | 200 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h |  21 +++++
 kernel/sched/core.c  |  17 ++++
 4 files changed, 242 insertions(+)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 47c9e551344e..6a22b9faa100 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -9,6 +9,7 @@
 #include <linux/rpal.h>
 #include <linux/sched/task_stack.h>
 #include <linux/pkeys.h>
+#include <linux/file.h>
 #include <asm/fsgsbase.h>
 
 #include "internal.h"
@@ -63,6 +64,7 @@ void rpal_kernel_ret(struct pt_regs *regs)
 
 	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
 		rcc = current->rpal_rd->rcc;
+		regs->ax = rpal_try_send_events(current->rpal_rd->ep, rcc);
 		atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET);
 	} else {
 		tsk = current->rpal_sd->receiver;
@@ -142,6 +144,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 	struct task_struct *prev = current;
 
 	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) {
+		rpal_resume_ep(next);
 		current->rpal_sd->receiver = next;
 		rpal_lock_cpu(current);
 		rpal_lock_cpu(next);
@@ -154,6 +157,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 		 */
 		rebuild_sender_stack(current->rpal_sd, regs);
 		rpal_schedule(next);
+		fdput(next->rpal_rd->f);
 	} else {
 		update_dst_stack(next, regs);
 		/*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d4dbffdedd08..437cd5764c03 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
 #include <linux/compat.h>
 #include <linux/rculist.h>
 #include <linux/capability.h>
+#include <linux/rpal.h>
 #include <net/busy_poll.h>
 
 /*
@@ -2141,6 +2142,187 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	}
 }
 
+#ifdef CONFIG_RPAL
+
+void rpal_resume_ep(struct task_struct *tsk)
+{
+	struct rpal_receiver_data *rrd = tsk->rpal_rd;
+	struct eventpoll *ep = (struct eventpoll *)rrd->ep;
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+
+	if (rcc->timeout > 0) {
+		hrtimer_cancel(&rrd->ep_sleeper.timer);
+		destroy_hrtimer_on_stack(&rrd->ep_sleeper.timer);
+	}
+	if (!list_empty_careful(&rrd->ep_wait.entry)) {
+		write_lock(&ep->lock);
+		__remove_wait_queue(&ep->wq, &rrd->ep_wait);
+		write_unlock(&ep->lock);
+	}
+}
+
+int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc)
+{
+	int eavail;
+	int res = 0;
+
+	res = ep_send_events(ep, rcc->events, rcc->maxevents);
+	if (res > 0)
+		ep_suspend_napi_irqs(ep);
+
+	eavail = ep_events_available(ep);
+	if (!eavail) {
+		atomic_and(~RPAL_KERNEL_PENDING, &rcc->ep_pending);
+		/* check again to avoid data race on RPAL_KERNEL_PENDING */
+		eavail = ep_events_available(ep);
+		if (eavail)
+			atomic_or(RPAL_KERNEL_PENDING, &rcc->ep_pending);
+	}
+	return res;
+}
+
+static int rpal_schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
+					       const enum hrtimer_mode mode,
+					       clockid_t clock_id)
+{
+	struct hrtimer_sleeper *t = &current->rpal_rd->ep_sleeper;
+
+	/*
+	 * Optimize when a zero timeout value is given. It does not
+	 * matter whether this is an absolute or a relative time.
+	 */
+	if (expires && *expires == 0) {
+		__set_current_state(TASK_RUNNING);
+		return 0;
+	}
+
+	/*
+	 * A NULL parameter means "infinite"
+	 */
+	if (!expires) {
+		schedule();
+		return -EINTR;
+	}
+
+	hrtimer_setup_sleeper_on_stack(t, clock_id, mode);
+	hrtimer_set_expires_range_ns(&t->timer, *expires, delta);
+	hrtimer_sleeper_start_expires(t, mode);
+
+	if (likely(t->task))
+		schedule();
+
+	hrtimer_cancel(&t->timer);
+	destroy_hrtimer_on_stack(&t->timer);
+
+	__set_current_state(TASK_RUNNING);
+
+	return !t->task ? 0 : -EINTR;
+}
+
+static int rpal_ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
+			int maxevents, struct timespec64 *timeout)
+{
+	int res = 0, eavail, timed_out = 0;
+	u64 slack = 0;
+	struct rpal_receiver_data *rrd = current->rpal_rd;
+	wait_queue_entry_t *wait = &rrd->ep_wait;
+	ktime_t expires, *to = NULL;
+
+	rrd->ep = ep;
+
+	lockdep_assert_irqs_enabled();
+
+	if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
+		slack = select_estimate_accuracy(timeout);
+		to = &expires;
+		*to = timespec64_to_ktime(*timeout);
+	} else if (timeout) {
+		timed_out = 1;
+	}
+
+	eavail = ep_events_available(ep);
+
+	while (1) {
+		if (eavail) {
+			res = rpal_try_send_events(ep, rrd->rcc);
+			if (res) {
+				atomic_xchg(&rrd->rcc->receiver_state,
+					    RPAL_RECEIVER_STATE_RUNNING);
+				return res;
+			}
+		}
+
+		if (timed_out) {
+			atomic_xchg(&rrd->rcc->receiver_state,
+				    RPAL_RECEIVER_STATE_RUNNING);
+			return 0;
+		}
+
+		eavail = ep_busy_loop(ep);
+		if (eavail)
+			continue;
+
+		if (signal_pending(current)) {
+			atomic_xchg(&rrd->rcc->receiver_state,
+				    RPAL_RECEIVER_STATE_RUNNING);
+			return -EINTR;
+		}
+
+		init_wait(wait);
+		wait->func = rpal_ep_autoremove_wake_function;
+		wait->private = rrd;
+		write_lock_irq(&ep->lock);
+
+		atomic_xchg(&rrd->rcc->receiver_state,
+			    RPAL_RECEIVER_STATE_READY);
+		__set_current_state(TASK_INTERRUPTIBLE);
+
+		eavail = ep_events_available(ep);
+		if (!eavail)
+			__add_wait_queue_exclusive(&ep->wq, wait);
+
+		write_unlock_irq(&ep->lock);
+
+		if (!eavail && ep_schedule_timeout(to)) {
+			if (RPAL_USER_PENDING & atomic_read(&rrd->rcc->ep_pending)) {
+				timed_out = 1;
+			} else {
+				timed_out =
+					!rpal_schedule_hrtimeout_range_clock(
+						to, slack, HRTIMER_MODE_ABS,
+						CLOCK_MONOTONIC);
+			}
+		}
+		atomic_cmpxchg(&rrd->rcc->receiver_state,
+			       RPAL_RECEIVER_STATE_READY,
+			       RPAL_RECEIVER_STATE_RUNNING);
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * We were woken up, thus go and try to harvest some events.
+		 * If timed out and still on the wait queue, recheck eavail
+		 * carefully under lock, below.
+		 */
+		eavail = 1;
+
+		if (!list_empty_careful(&wait->entry)) {
+			write_lock_irq(&ep->lock);
+			/*
+			 * If the thread timed out and is not on the wait queue,
+			 * it means that the thread was woken up after its
+			 * timeout expired before it could reacquire the lock.
+			 * Thus, when wait.entry is empty, it needs to harvest
+			 * events.
+			 */
+			if (timed_out)
+				eavail = list_empty(&wait->entry);
+			__remove_wait_queue(&ep->wq, wait);
+			write_unlock_irq(&ep->lock);
+		}
+	}
+}
+#endif
+
 /**
  * ep_loop_check_proc - verify that adding an epoll file inside another
  *                      epoll structure does not violate the constraints, in
@@ -2529,7 +2711,25 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	ep = fd_file(f)->private_data;
 
 	/* Time to fish for events ... */
+#ifdef CONFIG_RPAL
+	/*
+	 * For RPAL task, if it is a receiver and it set MAGIC in shared memory,
+	 * We think it is prepared for rpal calls. Therefore, we need to handle
+	 * it differently.
+	 *
+	 * In other cases, RPAL task always plays like a normal task.
+	 */
+	if (rpal_current_service() &&
+	    rpal_test_current_thread_flag(RPAL_RECEIVER_BIT) &&
+	    current->rpal_rd->rcc->rpal_ep_poll_magic == RPAL_EP_POLL_MAGIC) {
+		current->rpal_rd->f = f;
+		return rpal_ep_poll(ep, events, maxevents, to);
+	} else {
+		return ep_poll(ep, events, maxevents, to);
+	}
+#else
 	return ep_poll(ep, events, maxevents, to);
+#endif
 }
 
 SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index f2474cb53abe..5912ffec6e28 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -16,6 +16,8 @@
 #include <linux/hashtable.h>
 #include <linux/atomic.h>
 #include <linux/sizes.h>
+#include <linux/file.h>
+#include <linux/hrtimer.h>
 
 #define RPAL_ERROR_MSG "rpal error: "
 #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
@@ -89,6 +91,7 @@ enum {
 };
 
 #define RPAL_ERROR_MAGIC 0x98CC98CC
+#define RPAL_EP_POLL_MAGIC 0xCC98CC98
 
 #define RPAL_SID_SHIFT 24
 #define RPAL_ID_SHIFT 8
@@ -103,6 +106,9 @@ enum {
 #define RPAL_PKRU_UNION 1
 #define RPAL_PKRU_INTERSECT 2
 
+#define RPAL_KERNEL_PENDING 0x1
+#define RPAL_USER_PENDING 0x2
+
 extern unsigned long rpal_cap;
 
 enum rpal_task_flag_bits {
@@ -282,6 +288,12 @@ struct rpal_receiver_call_context {
 	int receiver_id;
 	atomic_t receiver_state;
 	atomic_t sender_state;
+	atomic_t ep_pending;
+	int rpal_ep_poll_magic;
+	int epfd;
+	void __user *events;
+	int maxevents;
+	int timeout;
 };
 
 /* recovery point for sender */
@@ -325,6 +337,10 @@ struct rpal_receiver_data {
 	struct rpal_shared_page *rsp;
 	struct rpal_receiver_call_context *rcc;
 	struct task_struct *sender;
+	void *ep;
+	struct fd f;
+	struct hrtimer_sleeper ep_sleeper;
+	wait_queue_entry_t ep_wait;
 };
 
 struct rpal_sender_data {
@@ -574,4 +590,9 @@ __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
 asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev);
 int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey);
 void rpal_set_pku_schedule_tail(struct task_struct *prev);
+int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
+	unsigned int mode, int wake_flags,
+	void *key);
+void rpal_resume_ep(struct task_struct *tsk);
+int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb5d5bd51597..486d59bdd3fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6794,6 +6794,23 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 #define SM_RTLOCK_WAIT		2
 
 #ifdef CONFIG_RPAL
+int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
+				     unsigned int mode, int wake_flags,
+				     void *key)
+{
+	struct rpal_receiver_data *rrd = curr->private;
+	struct task_struct *tsk = rrd->rcd.bp_task;
+	int ret;
+
+	ret = try_to_wake_up(tsk, mode, wake_flags);
+
+	list_del_init_careful(&curr->entry);
+	if (!ret)
+		atomic_or(RPAL_KERNEL_PENDING, &rrd->rcc->ep_pending);
+
+	return 1;
+}
+
 static inline void rpal_check_ready_state(struct task_struct *tsk, int state)
 {
 	if (rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 28/35] RPAL: add rpal_uds_fdmap() support
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (26 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 27/35] RPAL: add epoll support Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 29/35] RPAL: fix race condition in pkru update Bo Li
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

For a UDS connection between a sender and a receiver, neither side knows
which file descriptor (fd) the other uses to manage the connection. The
sender cannot determine which user space fd's buffer in the receiver to
write data to, necessitating a complex process for both sides to inform
each other of fd mappings. This process incurs significant overhead when
managing a large number of connections, which requires optimization.

This patch introduces the RPAL_IOCTL_UDS_FDMAP interface, which simplifies
the establishment of fd mappings between sender and receiver processes for
files monitored by epoll. This avoids the need for a complex setup process
each time a new connection is created.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/internal.h |   3 +
 arch/x86/rpal/proc.c     | 117 +++++++++++++++++++++++++++++++++++++++
 fs/eventpoll.c           |  19 +++++++
 include/linux/rpal.h     |  11 ++++
 4 files changed, 150 insertions(+)

diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index e49febce8645..e03f8a90619d 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -11,6 +11,7 @@
 
 #include <linux/mm.h>
 #include <linux/file.h>
+#include <net/af_unix.h>
 
 extern bool rpal_inited;
 
@@ -60,3 +61,5 @@ int rpal_alloc_pkey(struct rpal_service *rs, int pkey);
 int rpal_pkey_setup(struct rpal_service *rs, int pkey);
 void rpal_set_current_pkru(u32 val, int mode);
 void rpal_service_pku_init(void);
+
+extern struct sock *unix_peer_get(struct sock *sk);
diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c
index 2f9cceec4992..b60c099c4a92 100644
--- a/arch/x86/rpal/proc.c
+++ b/arch/x86/rpal/proc.c
@@ -9,6 +9,8 @@
 #include <linux/rpal.h>
 #include <linux/proc_fs.h>
 #include <linux/poll.h>
+#include <net/sock.h>
+#include <net/af_unix.h>
 
 #include "internal.h"
 
@@ -34,6 +36,118 @@ static int rpal_get_api_version_and_cap(void __user *p)
 	return 0;
 }
 
+static void *rpal_uds_peer_data(struct sock *psk, int *pfd)
+{
+	void *ep = NULL;
+	unsigned long flags;
+	struct socket_wq *wq;
+	wait_queue_entry_t *entry;
+	wait_queue_head_t *whead;
+
+	rcu_read_lock();
+	wq = rcu_dereference(psk->sk_wq);
+	if (!skwq_has_sleeper(wq))
+		goto unlock_rcu;
+
+	whead = &wq->wait;
+
+	spin_lock_irqsave(&whead->lock, flags);
+	if (list_empty(&whead->head)) {
+		pr_debug("rpal debug: [%d] cannot find epitem entry\n",
+			 current->pid);
+		goto unlock_spin;
+	}
+	entry = list_first_entry(&whead->head, wait_queue_entry_t, entry);
+	*pfd = rpal_get_epitemfd(entry);
+	if (*pfd < 0) {
+		pr_debug("rpal debug: [%d] cannot find epitem fd\n",
+			 current->pid);
+		goto unlock_spin;
+	}
+	ep = rpal_get_epitemep(entry);
+
+unlock_spin:
+	spin_unlock_irqrestore(&whead->lock, flags);
+unlock_rcu:
+	rcu_read_unlock();
+	return ep;
+}
+
+static int rpal_find_receiver_rid(int id, void *ep)
+{
+	struct task_struct *tsk;
+	struct rpal_service *cur, *tgt;
+	int rid = -1;
+
+	cur = rpal_current_service();
+
+	tgt = rpal_get_mapped_service_by_id(cur, id);
+	if (tgt == NULL)
+		goto out;
+
+	for_each_thread(tgt->group_leader, tsk) {
+		if (!rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT))
+			continue;
+		if (tsk->rpal_rd->ep == ep) {
+			rid = tsk->rpal_rd->rcc->receiver_id;
+			break;
+		}
+	}
+
+	rpal_put_service(tgt);
+out:
+	return rid;
+}
+
+static long rpal_uds_fdmap(unsigned long uarg)
+{
+	struct rpal_uds_fdmap_arg arg;
+	struct socket *sock;
+	struct sock *peer_sk;
+	void *ep;
+	int sfd, rid;
+	struct fd f;
+	long res;
+	int ret;
+
+	ret = copy_from_user(&arg, (void __user *)uarg, sizeof(arg));
+	if (ret)
+		return ret;
+
+	f = fdget(arg.cfd);
+	if (!fd_file(f))
+		goto fd_put;
+
+	sock = sock_from_file(fd_file(f));
+	if (!sock)
+		goto fd_put;
+
+	peer_sk = unix_peer_get(sock->sk);
+	if (peer_sk == NULL)
+		goto fd_put;
+	ep = rpal_uds_peer_data(peer_sk, &sfd);
+	if (ep == NULL) {
+		pr_debug("rpal debug: [%d] cannot find epitem ep\n",
+			 current->pid);
+		goto peer_sock_put;
+	}
+	rid = rpal_find_receiver_rid(arg.service_id, ep);
+	if (rid < 0) {
+		pr_debug("rpal debug: [%d] rpal: cannot find epitem rid\n",
+			 current->pid);
+		goto peer_sock_put;
+	}
+	res = (long)rid << 32 | (long)sfd;
+	ret = put_user(res, arg.res);
+
+peer_sock_put:
+	sock_put(peer_sk);
+fd_put:
+	if (fd_file(f))
+		fdput(f);
+	return ret;
+}
+
 static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct rpal_service *cur = rpal_current_service();
@@ -81,6 +195,9 @@ static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		ret = put_user(cur->pkey, (int __user *)arg);
 		break;
 #endif
+	case RPAL_IOCTL_UDS_FDMAP:
+		ret = rpal_uds_fdmap(arg);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 437cd5764c03..791321639561 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2143,6 +2143,25 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 }
 
 #ifdef CONFIG_RPAL
+void *rpal_get_epitemep(wait_queue_entry_t *wait)
+{
+	struct epitem *epi = ep_item_from_wait(wait);
+
+	if (!epi)
+		return NULL;
+
+	return epi->ep;
+}
+
+int rpal_get_epitemfd(wait_queue_entry_t *wait)
+{
+	struct epitem *epi = ep_item_from_wait(wait);
+
+	if (!epi)
+		return -1;
+
+	return epi->ffd.fd;
+}
 
 void rpal_resume_ep(struct task_struct *tsk)
 {
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 5912ffec6e28..7657e6c6393b 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -350,6 +350,12 @@ struct rpal_sender_data {
 	struct task_struct *receiver;
 };
 
+struct rpal_uds_fdmap_arg {
+	int service_id;
+	int cfd;
+	unsigned long *res;
+};
+
 enum rpal_command_type {
 	RPAL_CMD_GET_API_VERSION_AND_CAP,
 	RPAL_CMD_GET_SERVICE_KEY,
@@ -363,6 +369,7 @@ enum rpal_command_type {
 	RPAL_CMD_REQUEST_SERVICE,
 	RPAL_CMD_RELEASE_SERVICE,
 	RPAL_CMD_GET_SERVICE_PKEY,
+	RPAL_CMD_UDS_FDMAP,
 	RPAL_NR_CMD,
 };
 
@@ -393,6 +400,8 @@ enum rpal_command_type {
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long)
 #define RPAL_IOCTL_GET_SERVICE_PKEY \
 	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *)
+#define RPAL_IOCTL_UDS_FDMAP \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_UDS_FDMAP, unsigned long)
 
 #define rpal_for_each_requested_service(rs, idx)                             \
 	for (idx = find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \
@@ -594,5 +603,7 @@ int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
 	unsigned int mode, int wake_flags,
 	void *key);
 void rpal_resume_ep(struct task_struct *tsk);
+void *rpal_get_epitemep(wait_queue_entry_t *wait);
+int rpal_get_epitemfd(wait_queue_entry_t *wait);
 int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc);
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 29/35] RPAL: fix race condition in pkru update
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (27 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 28/35] RPAL: add rpal_uds_fdmap() support Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 30/35] RPAL: fix pkru setup when fork Bo Li
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When setting up MPK, RPAL uses IPIs to notify tasks running on each core
in the thread group to modify their PKRU values and update the PKEY fields
in all VMA page tables. A race condition exists here: when updating PKRU,
the page table updates may not yet be complete. In such cases, writing
PKRU permissions at locations that require calling pkru_write_default()
(e.g., during signal handling) must not be restricted to a single PKEY,
as this would cause PKRU permissions to fail to accommodate both old and
new page table PKEY settings.

This patch introduces a pku_on state with values PKU_ON_FALSE, PKU_ON_INIT,
and PKU_ON_FINISH, representing the states before, during, and after page
table PKEY updates, respectively. For RPAL services, all calls to
pkru_write_default() are replaced with rpal_pkru_write_default().

- Before page table setup (PKU_ON_FALSE), rpal_pkru_write_default()
  directly calls pkru_write_default().
- During page table setup (PKU_ON_INIT), rpal_pkru_write_default() enables
  permissions for all PKEYs, ensuring the task can access both old and new
  page tables simultaneously.
- After page table setup completes (PKU_ON_FINISH),
  rpal_pkru_write_default() tightens permissions to match the updated page
  tables.

For newly allocated page tables, the new PKEY is only used when pku_on is
PKU_ON_FINISH. The mmap lock is used to ensure no race conditions occur
during this process.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/kernel/cpu/common.c |  4 ++--
 arch/x86/kernel/fpu/core.c   |  4 ++--
 arch/x86/kernel/process.c    |  4 ++--
 arch/x86/rpal/pku.c          | 14 +++++++++++++-
 arch/x86/rpal/service.c      |  2 +-
 include/linux/rpal.h         |  9 ++++++++-
 mm/mmap.c                    |  2 +-
 mm/mprotect.c                |  1 +
 mm/vma.c                     |  2 +-
 9 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2678453cdf76..d21f44873b86 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -534,8 +534,8 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
 	cr4_set_bits(X86_CR4_PKE);
 	/* Load the default PKRU value */
 #ifdef CONFIG_RPAL_PKU
-	if (rpal_current_service() && rpal_current_service()->pku_on)
-		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	if (rpal_current_service())
+		rpal_pkru_write_default();
 	else
 #endif
 		pkru_write_default();
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 251b1ddee726..4b413af0b179 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -748,8 +748,8 @@ static inline void restore_fpregs_from_init_fpstate(u64 features_mask)
 		frstor(&init_fpstate.regs.fsave);
 
 #ifdef CONFIG_RPAL_PKU
-	if (rpal_current_service() && rpal_current_service()->pku_on)
-		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	if (rpal_current_service())
+		rpal_pkru_write_default();
 	else
 #endif
 		pkru_write_default();
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b74de35218f9..898a9e0b23e7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -286,8 +286,8 @@ static void pkru_flush_thread(void)
 	 * the hardware right here (similar to context switch).
 	 */
 #ifdef CONFIG_RPAL_PKU
-	if (rpal_current_service() && rpal_current_service()->pku_on)
-		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	if (rpal_current_service())
+		rpal_pkru_write_default();
 	else
 #endif
 		pkru_write_default();
diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c
index 26cef324f41f..8e530931fb23 100644
--- a/arch/x86/rpal/pku.c
+++ b/arch/x86/rpal/pku.c
@@ -161,7 +161,7 @@ int rpal_pkey_setup(struct rpal_service *rs, int pkey)
 	rs->pkey = pkey;
 	/* others must see rs->pkey before rs->pku_on */
 	barrier();
-	rs->pku_on = true;
+	rs->pku_on = PKU_ON_INIT;
 	mmap_write_unlock(current->mm);
 	rpal_set_group_pkru(val, RPAL_PKRU_UNION);
 	err = do_rpal_mprotect_pkey(rs->base, RPAL_ADDR_SPACE_SIZE, pkey);
@@ -182,3 +182,15 @@ int rpal_alloc_pkey(struct rpal_service *rs, int pkey)
 
 	return ret;
 }
+
+void rpal_pkru_write_default(void)
+{
+	struct rpal_service *cur = rpal_current_service();
+
+	if (cur->pku_on == PKU_ON_INIT)
+		write_pkru(0);
+	else if (cur->pku_on == PKU_ON_FINISH)
+		write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey));
+	else
+		pkru_write_default();
+}
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 7a83e85cf096..9fd568fa9a29 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -210,7 +210,7 @@ struct rpal_service *rpal_register_service(void)
 	init_waitqueue_head(&rs->rpd.rpal_waitqueue);
 #ifdef CONFIG_RPAL_PKU
 	rs->pkey = -1;
-	rs->pku_on = false;
+	rs->pku_on = PKU_ON_FALSE;
 	rpal_service_pku_init();
 #endif
 
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 7657e6c6393b..16a3c80383f7 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -138,6 +138,12 @@ enum rpal_capability {
 	RPAL_CAP_PKU
 };
 
+enum {
+	PKU_ON_FALSE,
+	PKU_ON_INIT,
+	PKU_ON_FINISH,
+};
+
 struct rpal_critical_section {
 	unsigned long ret_begin;
 	unsigned long ret_end;
@@ -245,7 +251,7 @@ struct rpal_service {
 
 #ifdef CONFIG_RPAL_PKU
 	/* pkey */
-	bool pku_on;
+	int pku_on;
 	int pkey;
 #endif
 
@@ -599,6 +605,7 @@ __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
 asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev);
 int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey);
 void rpal_set_pku_schedule_tail(struct task_struct *prev);
+void rpal_pkru_write_default(void);
 int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
 	unsigned int mode, int wake_flags,
 	void *key);
diff --git a/mm/mmap.c b/mm/mmap.c
index d36ea4ea2bd0..85a4a33491ab 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -404,7 +404,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	do {
 		struct rpal_service *cur = rpal_current_service();
 
-		if (cur && cur->pku_on)
+		if (cur && cur->pku_on == PKU_ON_FINISH)
 			pkey = cur->pkey;
 	} while (0);
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e9ae828e377d..ac162180553e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -938,6 +938,7 @@ int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey)
 	}
 	tlb_finish_mmu(&tlb);
 
+	rpal_current_service()->pku_on = PKU_ON_FINISH;
 out:
 	mmap_write_unlock(current->mm);
 	return error;
diff --git a/mm/vma.c b/mm/vma.c
index fa9d8f694e6e..57ec99a5969d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2632,7 +2632,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	struct rpal_service *cur = rpal_current_service();
 	unsigned long vma_pkey_mask;
 
-	if (cur && cur->pku_on) {
+	if (cur && cur->pku_on == PKU_ON_FINISH) {
 		vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 |
 				VM_PKEY_BIT3;
 		flags &= ~vma_pkey_mask;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 30/35] RPAL: fix pkru setup when fork
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (28 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 29/35] RPAL: fix race condition in pkru update Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:27 ` [RFC v2 31/35] RPAL: add receiver waker Bo Li
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When a task performs a fork operation, the PKRU value of the newly forked
task is set to the value read from hardware. At this point, if the service
is executing rpal_pkey_setup(), the newly forked task has not yet been
added to the task list, so PKRU settings cannot be synchronized to the new
task. This results in the new task's PKRU not being set to the correct
value when it is woken up.

This patch addresses this issue by:

- After the newly forked task is added to the task list, further updating
  its PKRU value.
- Acquiring a mutex lock to ensure that the PKRU update occurs either
  before or after the invocation of rpal_pkey_setup(). This avoids race
  conditions with rpal_pkey_setup() and guarantees that the re-updated PKRU
  value is always correct.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 kernel/fork.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 01cd48eadf68..11cba74d07c8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2683,6 +2683,19 @@ __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+#ifdef CONFIG_RPAL_PKU
+	do {
+		struct rpal_service *cur = rpal_current_service();
+
+		if (cur) {
+			/* ensure we are not in rpal_enable_service() */
+			mutex_lock(&cur->mutex);
+			p->thread.pkru = rdpkru();
+			mutex_unlock(&cur->mutex);
+		}
+	} while (0);
+#endif
+
 	if (pidfile)
 		fd_install(pidfd, pidfile);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 31/35] RPAL: add receiver waker
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (29 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 30/35] RPAL: fix pkru setup when fork Bo Li
@ 2025-05-30  9:27 ` Bo Li
  2025-05-30  9:28 ` [RFC v2 32/35] RPAL: fix unknown nmi on AMD CPU Bo Li
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

In an RPAL call, the receiver thread is in the TASK_INTERRUPTIBLE state
and cannot be awakened, which may lead to missed wakeups. For example, if
no kernel event occurs during the entire RPAL call, the receiver thread
will remain in the TASK_INTERRUPTIBLE state after the RPAL call completes.

To address this issue, RPAL adds a flag to the receiver whenever it
encounters an unawakened state and introduces a "waker" work. The waker
work runs automatically on every tick to check for receiver threads that
have missed wakeups. If any are found, it wakes them up. For epoll, the
waker also checks for pending user mode events and wakes the receiver
thread if such events exist.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/internal.h |  4 ++
 arch/x86/rpal/service.c  | 98 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/rpal/thread.c   |  3 ++
 include/linux/rpal.h     | 11 +++++
 kernel/sched/core.c      |  3 ++
 5 files changed, 119 insertions(+)

diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
index e03f8a90619d..117357dabdec 100644
--- a/arch/x86/rpal/internal.h
+++ b/arch/x86/rpal/internal.h
@@ -22,6 +22,10 @@ int rpal_enable_service(unsigned long arg);
 int rpal_disable_service(void);
 int rpal_request_service(unsigned long arg);
 int rpal_release_service(u64 key);
+void rpal_insert_wake_list(struct rpal_service *rs,
+			   struct rpal_receiver_data *rrd);
+void rpal_remove_wake_list(struct rpal_service *rs,
+			   struct rpal_receiver_data *rrd);
 
 /* mm.c */
 static inline struct rpal_shared_page *
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 9fd568fa9a29..6fefb7a7729c 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -143,6 +143,99 @@ static void delete_service(struct rpal_service *rs)
 	spin_unlock_irqrestore(&hash_table_lock, flags);
 }
 
+void rpal_insert_wake_list(struct rpal_service *rs,
+			   struct rpal_receiver_data *rrd)
+{
+	unsigned long flags;
+	struct rpal_waker_struct *waker = &rs->waker;
+
+	spin_lock_irqsave(&waker->lock, flags);
+	list_add_tail(&rrd->wake_list, &waker->wake_head);
+	spin_unlock_irqrestore(&waker->lock, flags);
+	pr_debug("rpal debug: [%d] insert wake list\n", current->pid);
+}
+
+void rpal_remove_wake_list(struct rpal_service *rs,
+			   struct rpal_receiver_data *rrd)
+{
+	unsigned long flags;
+	struct rpal_waker_struct *waker = &rs->waker;
+
+	spin_lock_irqsave(&waker->lock, flags);
+	list_del(&rrd->wake_list);
+	spin_unlock_irqrestore(&waker->lock, flags);
+	pr_debug("rpal debug: [%d] remove wake list\n", current->pid);
+}
+
+/* waker->lock must be hold */
+static inline void rpal_wake_all(struct rpal_waker_struct *waker)
+{
+	struct task_struct *wake_list[RPAL_MAX_RECEIVER_NUM];
+	struct list_head *list;
+	unsigned long flags;
+	int i, cnt = 0;
+
+	spin_lock_irqsave(&waker->lock, flags);
+	list_for_each(list, &waker->wake_head) {
+		struct task_struct *task;
+		struct rpal_receiver_call_context *rcc;
+		struct rpal_receiver_data *rrd;
+		int pending;
+
+		rrd = list_entry(list, struct rpal_receiver_data, wake_list);
+		task = rrd->rcd.bp_task;
+		rcc = rrd->rcc;
+
+		pending = atomic_read(&rcc->ep_pending) & RPAL_USER_PENDING;
+
+		if (rpal_test_task_thread_flag(task, RPAL_WAKE_BIT) ||
+		    (pending && atomic_cmpxchg(&rcc->receiver_state,
+					       RPAL_RECEIVER_STATE_WAIT,
+					       RPAL_RECEIVER_STATE_RUNNING) ==
+					RPAL_RECEIVER_STATE_WAIT)) {
+			wake_list[cnt] = task;
+			cnt++;
+		}
+	}
+	spin_unlock_irqrestore(&waker->lock, flags);
+
+	for (i = 0; i < cnt; i++)
+		wake_up_process(wake_list[i]);
+}
+
+static void rpal_wake_callback(struct work_struct *work)
+{
+	struct rpal_waker_struct *waker =
+		container_of(work, struct rpal_waker_struct, waker_work.work);
+
+	rpal_wake_all(waker);
+	/* We check it every ticks */
+	schedule_delayed_work(&waker->waker_work, 1);
+}
+
+static void rpal_enable_waker(struct rpal_waker_struct *waker)
+{
+	INIT_DELAYED_WORK(&waker->waker_work, rpal_wake_callback);
+	schedule_delayed_work(&waker->waker_work, 1);
+	pr_debug("rpal debug: [%d] enable waker\n", current->pid);
+}
+
+static void rpal_disable_waker(struct rpal_waker_struct *waker)
+{
+	unsigned long flags;
+	struct list_head *p, *n;
+
+	cancel_delayed_work_sync(&waker->waker_work);
+	rpal_wake_all(waker);
+	spin_lock_irqsave(&waker->lock, flags);
+	list_for_each_safe(p, n, &waker->wake_head) {
+		list_del_init(p);
+	}
+	INIT_LIST_HEAD(&waker->wake_head);
+	spin_unlock_irqrestore(&waker->lock, flags);
+	pr_debug("rpal debug: [%d] disable waker\n", current->pid);
+}
+
 static inline unsigned long calculate_base_address(int id)
 {
 	return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id;
@@ -213,6 +306,10 @@ struct rpal_service *rpal_register_service(void)
 	rs->pku_on = PKU_ON_FALSE;
 	rpal_service_pku_init();
 #endif
+	spin_lock_init(&rs->waker.lock);
+	INIT_LIST_HEAD(&rs->waker.wake_head);
+	/* receiver may miss wake up if in lazy switch, try to wake it later */
+	rpal_enable_waker(&rs->waker);
 
 	rs->bad_service = false;
 	rs->base = calculate_base_address(rs->id);
@@ -257,6 +354,7 @@ void rpal_unregister_service(struct rpal_service *rs)
 		schedule();
 
 	delete_service(rs);
+	rpal_disable_waker(&rs->waker);
 
 	pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id,
 		 rs->group_leader->tgid);
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index fcc592baaac0..51c9eec639cb 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -186,6 +186,8 @@ int rpal_register_receiver(unsigned long addr)
 	current->rpal_rd = rrd;
 	rpal_set_current_thread_flag(RPAL_RECEIVER_BIT);
 
+	rpal_insert_wake_list(cur, rrd);
+
 	atomic_inc(&cur->thread_cnt);
 
 	return 0;
@@ -214,6 +216,7 @@ int rpal_unregister_receiver(void)
 	clear_fs_tsk_map();
 
 	rpal_put_shared_page(rrd->rsp);
+	rpal_remove_wake_list(cur, rrd);
 	rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT);
 	rpal_free_thread_pending(&rrd->rcd);
 	kfree(rrd);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 16a3c80383f7..1d8c1bdc90f2 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -116,6 +116,7 @@ enum rpal_task_flag_bits {
 	RPAL_RECEIVER_BIT,
 	RPAL_CPU_LOCKED_BIT,
 	RPAL_LAZY_SWITCHED_BIT,
+	RPAL_WAKE_BIT,
 };
 
 enum rpal_receiver_state {
@@ -189,6 +190,12 @@ struct rpal_fsbase_tsk_map {
 	struct task_struct *tsk;
 };
 
+struct rpal_waker_struct {
+	spinlock_t lock;
+	struct list_head wake_head;
+	struct delayed_work waker_work;
+};
+
 /*
  * Each RPAL process (a.k.a RPAL service) should have a pointer to
  * struct rpal_service in all its tasks' task_struct.
@@ -255,6 +262,9 @@ struct rpal_service {
 	int pkey;
 #endif
 
+	/* receiver thread waker */
+	struct rpal_waker_struct waker;
+
 	/* delayed service put work */
 	struct delayed_work delayed_put_work;
 
@@ -347,6 +357,7 @@ struct rpal_receiver_data {
 	struct fd f;
 	struct hrtimer_sleeper ep_sleeper;
 	wait_queue_entry_t ep_wait;
+	struct list_head wake_list;
 };
 
 struct rpal_sender_data {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 486d59bdd3fc..c219ada29d34 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3943,6 +3943,7 @@ static bool rpal_check_state(struct task_struct *p)
 		struct rpal_receiver_call_context *rcc = p->rpal_rd->rcc;
 		int state;
 
+		rpal_clear_task_thread_flag(p, RPAL_WAKE_BIT);
 retry:
 		state = atomic_read(&rcc->receiver_state) & RPAL_RECEIVER_STATE_MASK;
 		switch (state) {
@@ -3957,6 +3958,7 @@ static bool rpal_check_state(struct task_struct *p)
 		case RPAL_RECEIVER_STATE_RUNNING:
 			break;
 		case RPAL_RECEIVER_STATE_CALL:
+			rpal_set_task_thread_flag(p, RPAL_WAKE_BIT);
 			ret = false;
 			break;
 		default:
@@ -4522,6 +4524,7 @@ int rpal_try_to_wake_up(struct task_struct *p)
 
 	BUG_ON(READ_ONCE(p->__state) == TASK_RUNNING);
 
+	rpal_clear_task_thread_flag(p, RPAL_WAKE_BIT);
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
 		if (!ttwu_state_match(p, TASK_NORMAL, &success))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 32/35] RPAL: fix unknown nmi on AMD CPU
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (30 preceding siblings ...)
  2025-05-30  9:27 ` [RFC v2 31/35] RPAL: add receiver waker Bo Li
@ 2025-05-30  9:28 ` Bo Li
  2025-05-30  9:28 ` [RFC v2 33/35] RPAL: enable time slice correction Bo Li
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

In Lazy switch, the function event_sched_out() will be called. This
function deletes the perf event of the task being scheduled out, causing
the active_mask in cpu_hw_events to be cleared. In AMD's NMI handler, if
the bit corresponding to active_mask is not set, the CPU will not handle
the NMI event, ultimately triggering an unknown NMI error. Additionally,
event_sched_out() may call amd_pmu_wait_on_overflow(), leading to a busy
wait of up to 50us during lazy switch.

This patch adds two per_cpu variables. rpal_nmi_handle is set when an NMI
occurs. When encountering an unknown NMI, this NMI is skipped. rpal_nmi is
set before lazy switch and cleared after lazy switch, preventing the busy
wait caused by amd_pmu_wait_on_overflow().

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/events/amd/core.c | 14 ++++++++++++++
 arch/x86/kernel/nmi.c      | 20 ++++++++++++++++++++
 arch/x86/rpal/core.c       | 17 ++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index b20661b8621d..633a9ac4e77c 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -719,6 +719,10 @@ static void amd_pmu_wait_on_overflow(int idx)
 	}
 }
 
+#ifdef CONFIG_RPAL
+DEFINE_PER_CPU(bool, rpal_nmi);
+#endif
+
 static void amd_pmu_check_overflow(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -732,6 +736,11 @@ static void amd_pmu_check_overflow(void)
 	if (in_nmi())
 		return;
 
+#ifdef CONFIG_RPAL
+	if (this_cpu_read(rpal_nmi))
+		return;
+#endif
+
 	/*
 	 * Check each counter for overflow and wait for it to be reset by the
 	 * NMI if it has overflowed. This relies on the fact that all active
@@ -807,6 +816,11 @@ static void amd_pmu_disable_event(struct perf_event *event)
 	if (in_nmi())
 		return;
 
+#ifdef CONFIG_RPAL
+	if (this_cpu_read(rpal_nmi))
+		return;
+#endif
+
 	amd_pmu_wait_on_overflow(event->hw.idx);
 }
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index be93ec7255bf..dd72b6d1c7f9 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -351,12 +351,23 @@ NOKPROBE_SYMBOL(unknown_nmi_error);
 
 static DEFINE_PER_CPU(bool, swallow_nmi);
 static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
+#ifdef CONFIG_RPAL
+DEFINE_PER_CPU(bool, rpal_nmi_handle);
+#endif
 
 static noinstr void default_do_nmi(struct pt_regs *regs)
 {
 	unsigned char reason = 0;
 	int handled;
 	bool b2b = false;
+#ifdef CONFIG_RPAL
+	bool rpal_handle = false;
+
+	if (__this_cpu_read(rpal_nmi_handle)) {
+		__this_cpu_write(rpal_nmi_handle, false);
+		rpal_handle = true;
+	}
+#endif
 
 	/*
 	 * Back-to-back NMIs are detected by comparing the RIP of the
@@ -471,6 +482,15 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
 	 */
 	if (b2b && __this_cpu_read(swallow_nmi))
 		__this_cpu_add(nmi_stats.swallow, 1);
+#ifdef CONFIG_RPAL
+	/*
+	 * Lazy switch may clear the bit in active_mask, causing
+	 * nmi event not handled. This will lead to unknown nmi,
+	 * try to avoid this.
+	 */
+	else if (rpal_handle)
+		goto out;
+#endif
 	else
 		unknown_nmi_error(reason, regs);
 
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 6a22b9faa100..92281b557a6c 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -376,11 +376,26 @@ rpal_exception_context_switch(struct pt_regs *regs)
 	return next;
 }
 
+DECLARE_PER_CPU(bool, rpal_nmi_handle);
+DECLARE_PER_CPU(bool, rpal_nmi);
 __visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs)
 {
 	struct task_struct *next;
 
-	next = rpal_kernel_context_switch(regs);
+	if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT))
+		rpal_update_fsbase(regs);
+
+	next = rpal_misidentify();
+	if (unlikely(next != NULL)) {
+		next = rpal_fix_critical_section(next, regs);
+		if (next) {
+			__this_cpu_write(rpal_nmi_handle, true);
+			/* avoid wait in amd_pmu_check_overflow */
+			__this_cpu_write(rpal_nmi, true);
+			next = rpal_do_kernel_context_switch(next, regs);
+			__this_cpu_write(rpal_nmi, false);
+		}
+	}
 
 	return next;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 33/35] RPAL: enable time slice correction
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (31 preceding siblings ...)
  2025-05-30  9:28 ` [RFC v2 32/35] RPAL: fix unknown nmi on AMD CPU Bo Li
@ 2025-05-30  9:28 ` Bo Li
  2025-05-30  9:28 ` [RFC v2 34/35] RPAL: enable fast epoll wait Bo Li
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

After an RPAL call, the receiver's user mode code executes. However, the
kernel incorrectly attributes this CPU time to the sender due to the
unchanged kernel context. This results in incorrect runtime statistics.

This patch adds a new member total_time to both rpal_sender_call_context
and rpal_receiver_call_context. This member tracks how much runtime (
measured in CPU cycles via rdtsc()) has been incorrectly accounted for.
The kernel measures total_time at the entry of __schedule() and corrects
the delta in the update_rq_clock_task() function.

Additionally, since RPAL calls occur in user space, runtime statistics are
typically calculated by user space. However, when a lazy switch happens,
the kernel takes over. To address this, the patch introduces a start_time
member to record when an RPAL call is initiated, enabling the kernel to
accurately calculate the runtime that needs correction.

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c   |  8 ++++++++
 arch/x86/rpal/thread.c |  6 ++++++
 include/linux/rpal.h   |  3 +++
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/sched/core.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 62 insertions(+)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 92281b557a6c..2ac5d932f69c 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -144,6 +144,13 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 	struct task_struct *prev = current;
 
 	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) {
+		struct rpal_receiver_call_context *rcc = next->rpal_rd->rcc;
+		struct rpal_sender_call_context *scc = current->rpal_sd->scc;
+		u64 slice = rdtsc_ordered() - scc->start_time;
+
+		rcc->total_time += slice;
+		scc->total_time += slice;
+
 		rpal_resume_ep(next);
 		current->rpal_sd->receiver = next;
 		rpal_lock_cpu(current);
@@ -169,6 +176,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 		rpal_schedule(next);
 		rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT);
 		prev->rpal_rd->sender = NULL;
+		next->rpal_sd->scc->start_time = rdtsc_ordered();
 	}
 	if (unlikely(!irqs_disabled())) {
 		local_irq_disable();
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c
index 51c9eec639cb..5cd0be631521 100644
--- a/arch/x86/rpal/thread.c
+++ b/arch/x86/rpal/thread.c
@@ -99,6 +99,8 @@ int rpal_register_sender(unsigned long addr)
 	rsd->scc = (struct rpal_sender_call_context *)(addr - rsp->user_start +
 						       rsp->kernel_start);
 	rsd->receiver = NULL;
+	rsd->scc->start_time = 0;
+	rsd->scc->total_time = 0;
 
 	current->rpal_sd = rsd;
 	rpal_set_current_thread_flag(RPAL_SENDER_BIT);
@@ -182,6 +184,7 @@ int rpal_register_receiver(unsigned long addr)
 		(struct rpal_receiver_call_context *)(addr - rsp->user_start +
 						      rsp->kernel_start);
 	rrd->sender = NULL;
+	rrd->rcc->total_time = 0;
 
 	current->rpal_rd = rrd;
 	rpal_set_current_thread_flag(RPAL_RECEIVER_BIT);
@@ -289,6 +292,9 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs,
 				rpal_pkey_to_pkru(rpal_current_service()->pkey),
 				RPAL_PKRU_SET);
 #endif
+			if (!rpal_is_correct_address(rpal_current_service(), regs->ip))
+				/* receiver has crashed */
+				scc->total_time += rdtsc_ordered() - scc->start_time;
 			return 0;
 		}
 	}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 1d8c1bdc90f2..f5f4da63f28c 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -310,6 +310,7 @@ struct rpal_receiver_call_context {
 	void __user *events;
 	int maxevents;
 	int timeout;
+	int64_t total_time;
 };
 
 /* recovery point for sender */
@@ -325,6 +326,8 @@ struct rpal_sender_call_context {
 	struct rpal_task_context rtc;
 	struct rpal_error_context ec;
 	int sender_id;
+	s64 start_time;
+	s64 total_time;
 };
 
 /* End */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f25cc09fb71..a03113fecdc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1663,6 +1663,7 @@ struct task_struct {
 		struct rpal_sender_data *rpal_sd;
 		struct rpal_receiver_data *rpal_rd;
 	};
+	s64 rpal_steal_time;
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/init/init_task.c b/init/init_task.c
index 2eb08b96e66b..3606cf701dfe 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -224,6 +224,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.rpal_rs = NULL,
 	.rpal_flag = 0,
 	.rpal_cd = NULL,
+	.rpal_steal_time = 0,
 #endif
 };
 EXPORT_SYMBOL(init_task);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11cba74d07c8..ff6331a28987 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1222,6 +1222,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->rpal_rs = NULL;
 	tsk->rpal_flag = 0;
 	tsk->rpal_cd = NULL;
+	tsk->rpal_steal_time = 0;
 #endif
 	return tsk;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c219ada29d34..d6f8e0d76fc0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -789,6 +789,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 		delta -= steal;
 	}
 #endif
+#ifdef CONFIG_RPAL
+	if (unlikely(current->rpal_steal_time != 0)) {
+		delta += current->rpal_steal_time;
+		if (unlikely(delta < 0))
+			delta = 0;
+		current->rpal_steal_time = 0;
+	}
+#endif
 
 	rq->clock_task += delta;
 
@@ -6872,6 +6880,36 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_RPAL
+static void rpal_acct_runtime(void)
+{
+	if (rpal_current_service()) {
+		if (rpal_test_task_thread_flag(current, RPAL_SENDER_BIT) &&
+		    current->rpal_sd->scc->total_time != 0) {
+			struct rpal_sender_call_context *scc =
+				current->rpal_sd->scc;
+
+			u64 slice =
+				native_sched_clock_from_tsc(scc->total_time) -
+				native_sched_clock_from_tsc(0);
+			current->rpal_steal_time -= slice;
+			scc->total_time = 0;
+		} else if (rpal_test_task_thread_flag(current,
+						      RPAL_RECEIVER_BIT) &&
+			   current->rpal_rd->rcc->total_time != 0) {
+			struct rpal_receiver_call_context *rcc =
+				current->rpal_rd->rcc;
+
+			u64 slice =
+				native_sched_clock_from_tsc(rcc->total_time) -
+				native_sched_clock_from_tsc(0);
+			current->rpal_steal_time += slice;
+			rcc->total_time = 0;
+		}
+	}
+}
+#endif
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6926,6 +6964,10 @@ static void __sched notrace __schedule(int sched_mode)
 	struct rq *rq;
 	int cpu;
 
+#ifdef CONFIG_RPAL
+	rpal_acct_runtime();
+#endif
+
 	trace_sched_entry_tp(preempt, CALLER_ADDR0);
 
 	cpu = smp_processor_id();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 34/35] RPAL: enable fast epoll wait
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (32 preceding siblings ...)
  2025-05-30  9:28 ` [RFC v2 33/35] RPAL: enable time slice correction Bo Li
@ 2025-05-30  9:28 ` Bo Li
  2025-05-30  9:28 ` [RFC v2 35/35] samples/rpal: add RPAL samples Bo Li
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

When a kernel event occurs during an RPAL call and triggers a lazy switch,
the kernel context switches from the sender to the receiver. When the
receiver later returns from user space to the sender, a second lazy switch
is required to switch the kernel context back to the sender. In the current
implementation, after the second lazy switch, the receiver returns to user
space via rpal_kernel_ret() and then calls epoll_wait() from user space to
re-enter the kernel. This causes the receiver to be unable to process epoll
events for a long period, degrading performance.

This patch introduces a fast epoll wait feature. During the second lazy
switch, the kernel configures epoll-related data structures so that the
receiver can directly enter the epoll wait state without first returning
to user space and then calling epoll_wait(). The patch adds a new state
RPAL_RECEIVER_STATE_READY_LS, which is used to mark that the receiver can
transition to RPAL_RECEIVER_STATE_WAIT during the second lazy switch. The
kernel then performs this state transition in rpal_lazy_switch_tail().

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c |  29 ++++++++++++-
 fs/eventpoll.c       | 101 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h |   3 ++
 kernel/sched/core.c  |  13 +++++-
 4 files changed, 143 insertions(+), 3 deletions(-)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 2ac5d932f69c..7b6efde23e48 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -51,7 +51,25 @@ void rpal_lazy_switch_tail(struct task_struct *tsk)
 		atomic_cmpxchg(&rcc->receiver_state, rpal_build_call_state(tsk->rpal_sd),
 			       RPAL_RECEIVER_STATE_LAZY_SWITCH);
 	} else {
+		/* tsk is receiver */
+		int state;
+
+		rcc = tsk->rpal_rd->rcc;
+		state = atomic_read(&rcc->receiver_state);
+		/* receiver may be scheduled on another cpu after unlock. */
 		rpal_unlock_cpu(tsk);
+		/*
+		 * We must not use RPAL_RECEIVER_STATE_READY instead of
+		 * RPAL_RECEIVER_STATE_READY_LS. As receiver may at
+		 * TASK_RUNNING state and then call epoll_wait() again,
+		 * the state may become RPAL_RECEIVER_STATE_READY, we should
+		 * not changed its state to RPAL_RECEIVER_STATE_WAIT since
+		 * the state is set by another RPAL call.
+		 */
+		if (state == RPAL_RECEIVER_STATE_READY_LS)
+			atomic_cmpxchg(&rcc->receiver_state,
+				       RPAL_RECEIVER_STATE_READY_LS,
+				       RPAL_RECEIVER_STATE_WAIT);
 		rpal_unlock_cpu(current);
 	}
 }
@@ -63,8 +81,14 @@ void rpal_kernel_ret(struct pt_regs *regs)
 	int state;
 
 	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
-		rcc = current->rpal_rd->rcc;
-		regs->ax = rpal_try_send_events(current->rpal_rd->ep, rcc);
+		struct rpal_receiver_data *rrd = current->rpal_rd;
+
+		rcc = rrd->rcc;
+		if (rcc->timeout > 0)
+			hrtimer_cancel(&rrd->ep_sleeper.timer);
+		rpal_remove_ep_wait_list(rrd);
+		regs->ax = rpal_try_send_events(rrd->ep, rcc);
+		fdput(rrd->f);
 		atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET);
 	} else {
 		tsk = current->rpal_sd->receiver;
@@ -173,6 +197,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 		 * Otherwise, sender's user context will be corrupted.
 		 */
 		rebuild_receiver_stack(current->rpal_rd, regs);
+		rpal_fast_ep_poll(current->rpal_rd, regs);
 		rpal_schedule(next);
 		rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT);
 		prev->rpal_rd->sender = NULL;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 791321639561..b70c1cd82335 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2143,6 +2143,107 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 }
 
 #ifdef CONFIG_RPAL
+static void *rpal_get_eventpoll(struct rpal_receiver_data *rrd, struct pt_regs *regs)
+{
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+	int epfd = rcc->epfd;
+	struct epoll_event __user *events = rcc->events;
+	int maxevents = rcc->maxevents;
+	struct file *file;
+
+	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) {
+		regs->ax = -EINVAL;
+		return NULL;
+	}
+
+	if (!access_ok(events, maxevents * sizeof(struct epoll_event))) {
+		regs->ax = -EFAULT;
+		return NULL;
+	}
+
+	rrd->f = fdget(epfd);
+	file = fd_file(rrd->f);
+	if (!file) {
+		regs->ax = -EBADF;
+		return NULL;
+	}
+
+	if (!is_file_epoll(file)) {
+		regs->ax = -EINVAL;
+		fdput(rrd->f);
+		return NULL;
+	}
+
+	rrd->ep = file->private_data;
+	return rrd->ep;
+}
+
+void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *regs)
+{
+	struct eventpoll *ep;
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+	ktime_t ts = 0;
+	struct hrtimer *ht = &rrd->ep_sleeper.timer;
+	int state;
+	int avail;
+
+	regs->orig_ax = __NR_epoll_wait;
+	ep = rpal_get_eventpoll(rrd, regs);
+
+	if (!ep || signal_pending(current) ||
+	    unlikely(ep_events_available(ep)) ||
+	    atomic_read(&rcc->ep_pending) || unlikely(rcc->timeout == 0)) {
+		INIT_LIST_HEAD(&rrd->ep_wait.entry);
+	} else {
+		/*
+		 * Here we use RPAL_RECEIVER_STATE_READY_LS to avoid conflict with
+		 * RPAL_RECEIVER_STATE_READY. As the RPAL_RECEIVER_STATE_READY_LS
+		 * is convert to RPAL_RECEIVER_STATE_WAIT in rpal_lazy_switch_tail(),
+		 * it is possible the receiver is woken at that time. Thus,
+		 * rpal_lazy_switch_tail() should figure out whether the receiver
+		 * state is set by lazy switch or not. See rpal_lazy_switch_tail()
+		 * for details.
+		 */
+		state = atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_READY_LS);
+		if (unlikely(state != RPAL_RECEIVER_STATE_LAZY_SWITCH))
+			rpal_err("%s: unexpected state: %d\n", __func__, state);
+		init_waitqueue_func_entry(&rrd->ep_wait, rpal_ep_autoremove_wake_function);
+		rrd->ep_wait.private = rrd;
+		INIT_LIST_HEAD(&rrd->ep_wait.entry);
+		write_lock(&ep->lock);
+		set_current_state(TASK_INTERRUPTIBLE);
+		avail = ep_events_available(ep);
+		if (!avail)
+			__add_wait_queue_exclusive(&ep->wq, &rrd->ep_wait);
+		write_unlock(&ep->lock);
+		if (avail) {
+			/* keep state consistent when we enter rpal_kernel_ret() */
+			atomic_set(&rcc->receiver_state,
+				   RPAL_RECEIVER_STATE_LAZY_SWITCH);
+			set_current_state(TASK_RUNNING);
+			return;
+		}
+
+		if (rcc->timeout > 0) {
+			rrd->ep_sleeper.task = rrd->rcd.bp_task;
+			ts = ms_to_ktime(rcc->timeout);
+			hrtimer_start(ht, ts, HRTIMER_MODE_REL);
+		}
+	}
+}
+
+void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd)
+{
+	struct eventpoll *ep = (struct eventpoll *)rrd->ep;
+	wait_queue_entry_t *wait = &rrd->ep_wait;
+
+	if (!list_empty_careful(&wait->entry)) {
+		write_lock_irq(&ep->lock);
+		__remove_wait_queue(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+	}
+}
+
 void *rpal_get_epitemep(wait_queue_entry_t *wait)
 {
 	struct epitem *epi = ep_item_from_wait(wait);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index f5f4da63f28c..676113f0ba1f 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -126,6 +126,7 @@ enum rpal_receiver_state {
 	RPAL_RECEIVER_STATE_WAIT,
 	RPAL_RECEIVER_STATE_CALL,
 	RPAL_RECEIVER_STATE_LAZY_SWITCH,
+	RPAL_RECEIVER_STATE_READY_LS,
 	RPAL_RECEIVER_STATE_MAX,
 };
 
@@ -627,4 +628,6 @@ void rpal_resume_ep(struct task_struct *tsk);
 void *rpal_get_epitemep(wait_queue_entry_t *wait);
 int rpal_get_epitemfd(wait_queue_entry_t *wait);
 int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc);
+void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd);
+void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *regs);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6f8e0d76fc0..1728b04d1387 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3965,6 +3965,11 @@ static bool rpal_check_state(struct task_struct *p)
 		case RPAL_RECEIVER_STATE_LAZY_SWITCH:
 		case RPAL_RECEIVER_STATE_RUNNING:
 			break;
+		/*
+		 * Allow RPAL_RECEIVER_STATE_READY_LS to be woken will cause irq
+		 * being enabled in rpal_unlock_cpu.
+		 */
+		case RPAL_RECEIVER_STATE_READY_LS:
 		case RPAL_RECEIVER_STATE_CALL:
 			rpal_set_task_thread_flag(p, RPAL_WAKE_BIT);
 			ret = false;
@@ -11403,7 +11408,13 @@ void __sched notrace rpal_schedule(struct task_struct *next)
 
 	prev_state = READ_ONCE(prev->__state);
 	if (prev_state) {
-		try_to_block_task(rq, prev, &prev_state);
+		if (!try_to_block_task(rq, prev, &prev_state)) {
+			/*
+			 * As the task enter TASK_RUNNING state, we should clean up
+			 * RPAL_RECEIVER_STATE_READY_LS status.
+			 */
+			rpal_check_ready_state(prev, RPAL_RECEIVER_STATE_READY_LS);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v2 35/35] samples/rpal: add RPAL samples
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (33 preceding siblings ...)
  2025-05-30  9:28 ` [RFC v2 34/35] RPAL: enable fast epoll wait Bo Li
@ 2025-05-30  9:28 ` Bo Li
  2025-05-30  9:33 ` [RFC v2 00/35] optimize cost of inter-process communication Lorenzo Stoakes
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Bo Li @ 2025-05-30  9:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff, Bo Li

Added test samples for RPAL (with librpal included). Compile via:

    cd samples/rpal && make

And run it using the following command:

    ./server & ./client

Example output:

    EPOLL: Message length: 32 bytes, Total TSC cycles: 16439927066,
    Message count: 1000000, Average latency: 16439 cycles
    RPAL: Message length: 32 bytes, Total TSC cycles: 2197479484,
    Message count: 1000000, Average latency: 2197 cycles

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 samples/rpal/Makefile                         |   17 +
 samples/rpal/asm_define.c                     |   14 +
 samples/rpal/client.c                         |  178 ++
 samples/rpal/librpal/asm_define.h             |    6 +
 samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
 samples/rpal/librpal/debug.h                  |   12 +
 samples/rpal/librpal/fiber.c                  |  119 +
 samples/rpal/librpal/fiber.h                  |   64 +
 .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
 .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
 .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
 samples/rpal/librpal/private.h                |  341 +++
 samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
 samples/rpal/librpal/rpal.h                   |  149 ++
 samples/rpal/librpal/rpal_pkru.h              |   78 +
 samples/rpal/librpal/rpal_queue.c             |  239 ++
 samples/rpal/librpal/rpal_queue.h             |   55 +
 samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
 samples/rpal/offset.sh                        |    5 +
 samples/rpal/server.c                         |  249 ++
 20 files changed, 4226 insertions(+)
 create mode 100644 samples/rpal/Makefile
 create mode 100644 samples/rpal/asm_define.c
 create mode 100644 samples/rpal/client.c
 create mode 100644 samples/rpal/librpal/asm_define.h
 create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
 create mode 100644 samples/rpal/librpal/debug.h
 create mode 100644 samples/rpal/librpal/fiber.c
 create mode 100644 samples/rpal/librpal/fiber.h
 create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/private.h
 create mode 100644 samples/rpal/librpal/rpal.c
 create mode 100644 samples/rpal/librpal/rpal.h
 create mode 100644 samples/rpal/librpal/rpal_pkru.h
 create mode 100644 samples/rpal/librpal/rpal_queue.c
 create mode 100644 samples/rpal/librpal/rpal_queue.h
 create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
 create mode 100755 samples/rpal/offset.sh
 create mode 100644 samples/rpal/server.c

diff --git a/samples/rpal/Makefile b/samples/rpal/Makefile
new file mode 100644
index 000000000000..25627a970028
--- /dev/null
+++ b/samples/rpal/Makefile
@@ -0,0 +1,17 @@
+.PHONY: rpal
+
+all: server client offset
+
+offset: asm_define.c
+	$(shell ./offset.sh)
+
+server: server.c librpal/*.c librpal/*.S
+	$(CC) $^ -lpthread -g -o $@
+	@printf "RPAL" | dd of=./server bs=1 count=4 conv=notrunc seek=12
+
+client: client.c librpal/*.c librpal/*.S
+	$(CC) $^ -lpthread -g -o $@
+	@printf "RPAL" | dd of=./client bs=1 count=4 conv=notrunc seek=12
+
+clean:
+	rm server client
diff --git a/samples/rpal/asm_define.c b/samples/rpal/asm_define.c
new file mode 100644
index 000000000000..6f7731ebc870
--- /dev/null
+++ b/samples/rpal/asm_define.c
@@ -0,0 +1,14 @@
+#include <stddef.h>
+#include "librpal/private.h"
+
+#define DEFINE(sym, val) asm volatile("\n-> " #sym " %0 " #val "\n" : : "i" (val))
+
+static void common(void)
+{
+    DEFINE(RCI_SENDER_TLS_BASE, offsetof(rpal_call_info_t, sender_tls_base));
+    DEFINE(RCI_SENDER_FCTX, offsetof(rpal_call_info_t, sender_fctx));
+    DEFINE(RCI_PKRU, offsetof(rpal_call_info_t, pkru));
+    DEFINE(RC_SENDER_STATE, offsetof(receiver_context_t, sender_state));
+    DEFINE(RET_BEGIN, offsetof(critical_section_t, ret_begin));
+    DEFINE(RET_END, offsetof(critical_section_t, ret_end));
+}
diff --git a/samples/rpal/client.c b/samples/rpal/client.c
new file mode 100644
index 000000000000..2c4a9eb6115e
--- /dev/null
+++ b/samples/rpal/client.c
@@ -0,0 +1,178 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <x86intrin.h>
+#include "librpal/rpal.h"
+
+#define SOCKET_PATH "/tmp/rpal_socket"
+#define BUFFER_SIZE 1025
+#define MSG_NUM 1000000
+#define MSG_LEN 32
+
+char hello[BUFFER_SIZE];
+char buffer[BUFFER_SIZE] = { 0 };
+
+int remote_id;
+uint64_t remote_sidfd;
+
+#define INIT_MSG "INIT"
+#define SUCC_MSG "SUCC"
+#define FAIL_MSG "FAIL"
+
+#define handle_error(s)                                                        \
+	do {                                                                   \
+		perror(s);                                                     \
+		exit(EXIT_FAILURE);                                            \
+	} while (0)
+
+int rpal_epoll_add(int epfd, int fd)
+{
+	struct epoll_event ev;
+
+	ev.events = EPOLLRPALIN | EPOLLIN | EPOLLRDHUP | EPOLLET;
+	ev.data.fd = fd;
+
+	return rpal_epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
+}
+
+void rpal_client_init(int fd)
+{
+	struct epoll_event ev;
+	char buffer[BUFFER_SIZE];
+	rpal_error_code_t err;
+	uint64_t remote_key, service_key;
+	int epoll_fd;
+	int proc_fd;
+	int ret;
+
+	proc_fd = rpal_init(1, 0, &err);
+	if (proc_fd < 0)
+		handle_error("rpal init fail");
+	rpal_get_service_key(&service_key);
+
+	strcpy(buffer, INIT_MSG);
+	*(uint64_t *)(buffer + strlen(INIT_MSG)) = service_key;
+	ret = write(fd, buffer, strlen(INIT_MSG) + sizeof(uint64_t));
+	if (ret < 0)
+		handle_error("write key");
+
+	ret = read(fd, buffer, BUFFER_SIZE);
+	if (ret < 0)
+		handle_error("read key");
+
+	memcpy(&remote_key, buffer, sizeof(remote_key));
+	if (remote_key == 0)
+		handle_error("remote down");
+
+	ret = rpal_request_service(remote_key);
+	if (ret) {
+		write(fd, FAIL_MSG, strlen(FAIL_MSG));
+		handle_error("request");
+	}
+
+	ret = write(fd, SUCC_MSG, strlen(SUCC_MSG));
+	if (ret < 0)
+		handle_error("handshake");
+
+	remote_id = rpal_get_request_service_id(remote_key);
+	rpal_sender_init(&err);
+
+	epoll_fd = epoll_create(1024);
+	if (epoll_fd == -1) {
+		perror("epoll_create");
+		exit(EXIT_FAILURE);
+	}
+	rpal_epoll_add(epoll_fd, fd);
+
+	sleep(3); //wait for epoll wait
+	ret = rpal_uds_fdmap(((unsigned long)remote_id << 32) | fd,
+			     &remote_sidfd);
+	if (ret < 0)
+		handle_error("uds fdmap fail");
+}
+
+int run_rpal_client(int msg_len)
+{
+	ssize_t valread;
+	int sock = 0;
+	struct sockaddr_un serv_addr;
+	int count = MSG_NUM;
+	int ret;
+
+	if ((sock = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+		perror("socket creation error");
+		return -1;
+	}
+
+	memset(&serv_addr, 0, sizeof(serv_addr));
+	serv_addr.sun_family = AF_UNIX;
+	strncpy(serv_addr.sun_path, SOCKET_PATH, sizeof(SOCKET_PATH));
+
+	if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) <
+	    0) {
+		perror("Connection Failed");
+		return -1;
+	}
+	rpal_client_init(sock);	
+
+	while (count) {
+		for (int i = 18; i < msg_len; i++)
+			hello[i] = 'a' + i % 26;
+		sprintf(hello, "0x%016lx", __rdtsc());
+		ret = rpal_write_ptrs(remote_id, remote_sidfd, (int64_t *)hello,
+				      msg_len / sizeof(int64_t *));
+		valread = read(sock, buffer, BUFFER_SIZE);
+		if (memcmp(hello, buffer, msg_len) != 0)
+			perror("data error");
+		count--;
+	}
+
+	close(sock);
+}
+
+int run_client(int msg_len)
+{
+	ssize_t valread;
+	int sock = 0;
+	struct sockaddr_un serv_addr;
+	int count = MSG_NUM;
+
+	if ((sock = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+		perror("socket creation error");
+		return -1;
+	}
+
+	memset(&serv_addr, 0, sizeof(serv_addr));
+	serv_addr.sun_family = AF_UNIX;
+	strncpy(serv_addr.sun_path, SOCKET_PATH, sizeof(SOCKET_PATH));
+
+	if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) <
+	    0) {
+		perror("Connection Failed");
+		return -1;
+	}
+
+	while (count) {
+		for (int i = 18; i < msg_len; i++)
+			hello[i] = 'a' + i % 26;
+		sprintf(hello, "0x%016lx", __rdtsc());
+		send(sock, hello, msg_len, 0);
+		valread = read(sock, buffer, BUFFER_SIZE);
+		if (memcmp(hello, buffer, msg_len) != 0)
+			perror("data error");
+		count--;
+	}
+
+	close(sock);
+}
+
+int main()
+{
+	run_client(MSG_LEN);
+	run_rpal_client(MSG_LEN);
+
+	return 0;
+}
diff --git a/samples/rpal/librpal/asm_define.h b/samples/rpal/librpal/asm_define.h
new file mode 100644
index 000000000000..bc57586cda58
--- /dev/null
+++ b/samples/rpal/librpal/asm_define.h
@@ -0,0 +1,6 @@
+#define RCI_SENDER_TLS_BASE 0
+#define RCI_SENDER_FCTX 16
+#define RCI_PKRU 8
+#define RC_SENDER_STATE 72
+#define RET_BEGIN 0
+#define RET_END 8
diff --git a/samples/rpal/librpal/asm_x86_64_rpal_call.S b/samples/rpal/librpal/asm_x86_64_rpal_call.S
new file mode 100644
index 000000000000..538e8ac5f09b
--- /dev/null
+++ b/samples/rpal/librpal/asm_x86_64_rpal_call.S
@@ -0,0 +1,57 @@
+#ifdef __x86_64__
+#define __ASSEMBLY__
+#include "asm_define.h"
+
+.text
+.globl rpal_access_warpper
+.type rpal_access_warpper,@function
+.align 16
+
+rpal_access_warpper:
+    pushq  %r12
+    pushq  %r13
+    pushq  %r14
+    pushq  %r15
+    pushq  %rbx
+    pushq  %rbp
+
+    leaq -0x8(%rsp), %rsp
+    stmxcsr  (%rsp)
+    fnstcw   0x4(%rsp)
+
+    pushq  %rsp         // Save rsp which may be unaligned.
+    pushq  (%rsp)       // Save the original value again
+    andq   $-16, %rsp   // Align stack to 16bytes - SysV AMD64 ABI.
+
+    movq   %rsp, (%rdi)
+    call   rpal_access@plt
+retip:
+    movq   8(%rsp), %rsp	// Restore the potentially unaligned stack
+    ldmxcsr  (%rsp)
+    fldcw    0x4(%rsp)
+    leaq 0x8(%rsp), %rsp
+
+    popq   %rbp
+    popq   %rbx
+    popq   %r15
+    popq   %r14
+    popq   %r13
+    popq   %r12
+    ret
+
+.size rpal_access_warpper,.-rpal_access_warpper
+
+
+
+.globl rpal_get_ret_rip
+.type rpal_get_ret_rip, @function
+.align 16
+rpal_get_ret_rip:
+    leaq retip(%rip), %rax
+    ret
+
+.size rpal_get_ret_rip,.-rpal_get_ret_rip
+
+/* Mark that we don't need executable stack. */
+.section .note.GNU-stack,"",%progbits
+#endif
diff --git a/samples/rpal/librpal/debug.h b/samples/rpal/librpal/debug.h
new file mode 100644
index 000000000000..10d2fef8d69a
--- /dev/null
+++ b/samples/rpal/librpal/debug.h
@@ -0,0 +1,12 @@
+#ifndef RPAL_DEBUG_H
+#define RPAL_DEBUG_H
+
+typedef enum {
+	RPAL_DEBUG_MANAGEMENT = (1 << 0),
+	RPAL_DEBUG_SENDER = (1 << 1),
+	RPAL_DEBUG_RECVER = (1 << 2),
+	RPAL_DEBUG_FIBER = (1 << 3),
+
+	__RPAL_DEBUG_ALL = ~(0ULL),
+} rpal_debug_flag_t;
+#endif
diff --git a/samples/rpal/librpal/fiber.c b/samples/rpal/librpal/fiber.c
new file mode 100644
index 000000000000..2141ad9ab770
--- /dev/null
+++ b/samples/rpal/librpal/fiber.c
@@ -0,0 +1,119 @@
+#ifdef __x86_64__
+#include "debug.h"
+#include "fiber.h"
+#include "private.h"
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/mman.h>
+
+#define RPAL_CHECK_FAIL -1
+#define STACK_DEBUG 1
+
+static task_t *make_fiber_ctx(task_t *fc)
+{
+	fc->fctx = make_fcontext(fc->sp, 0, NULL);
+	return fc;
+}
+
+static task_t *fiber_ctx_create(void (*fn)(void *ud), void *ud, void *stack,
+				size_t size)
+{
+	task_t *fc;
+	int i;
+
+	if (stack == NULL)
+		return NULL;
+
+	fc = (task_t *)stack;
+	fc->fn = fn;
+	fc->ud = ud;
+	fc->size = size;
+	fc->sp = stack + size;
+	for (i = 0; i < NR_PADDING; ++i) {
+		fc->padding[i] = 0xdeadbeef;
+	}
+
+	return make_fiber_ctx(fc);
+}
+
+task_t *fiber_ctx_alloc(void (*fn)(void *ud), void *ud, size_t size)
+{
+	void *stack;
+	size_t stack_size;
+	size_t total_size;
+	void *lower_guard;
+	void *upper_guard;
+
+	if (PAGE_SIZE == 4096 || STACK_DEBUG) {
+		stack_size = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
+
+		dbprint(RPAL_DEBUG_FIBER,
+			"fiber_ctx_alloc: stack size adjusted from %lu to %lu\n",
+			size, stack_size);
+
+		// Allocate a stack using mmap with 2 extra pages, 1 at each end
+		// which will be PROT_NONE to act as guard pages to catch overflow
+		// and underflow. This will result in a SIGSEGV but should make it
+		// easier to catch a stack that is too small (or underflows).
+		//
+		// Notes:
+		//
+		// 1. On ARM64 with 64K pages this would be quite wasteful of memory
+		//    so it is behind a DEBUG flag to enable/disable on that platform.
+		//
+		// 2. If the requested stack size is not a multiple of a page size
+		//    then stack underflow wont always be caught as there is some
+		//    extra space up until the next page boundary with the guard page.
+		//
+		// 3. The task_t is placed at the top of the stack so can be overwritten
+		//    just before the stack overflows and hits the guard page.
+		//
+
+		total_size = stack_size + (PAGE_SIZE * 2);
+		lower_guard = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANON, -1, 0);
+		if (lower_guard == MAP_FAILED) {
+			errprint("mmap of %lu bytes failed: %s\n", total_size,
+				 strerror(errno));
+			return NULL;
+		}
+
+		stack = lower_guard + PAGE_SIZE;
+		upper_guard = stack + stack_size;
+		mprotect(lower_guard, PAGE_SIZE, PROT_NONE);
+		mprotect(upper_guard, PAGE_SIZE, PROT_NONE);
+
+		dbprint(RPAL_DEBUG_FIBER,
+			"Total stack of size %lu bytes allocated @ %p\n",
+			total_size, stack);
+		dbprint(RPAL_DEBUG_FIBER,
+			"Underflow guard page %p - %p overflow guard page %p - %p\n",
+			lower_guard, lower_guard + PAGE_SIZE - 1, upper_guard,
+			upper_guard + PAGE_SIZE - 1);
+	} else {
+		stack = malloc(size);
+	}
+	return fiber_ctx_create(fn, ud, stack, size);
+}
+
+void fiber_ctx_free(task_t *fc)
+{
+	size_t stack_size;
+	size_t total_size;
+	void *addr;
+
+	if (STACK_DEBUG) {
+		stack_size = (fc->size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
+		total_size = stack_size + (PAGE_SIZE * 2);
+		addr = fc;
+		addr -= PAGE_SIZE;
+		if (munmap(addr, total_size) != 0) {
+			errprint("munmap of %lu bytes @ %p failed: %s\n",
+				 total_size, addr, strerror(errno));
+		}
+	} else {
+		free(fc);
+	}
+}
+#endif
diff --git a/samples/rpal/librpal/fiber.h b/samples/rpal/librpal/fiber.h
new file mode 100644
index 000000000000..b46485ba740f
--- /dev/null
+++ b/samples/rpal/librpal/fiber.h
@@ -0,0 +1,64 @@
+#ifndef FIBER_H
+#define FIBER_H
+
+#include <stdlib.h>
+
+typedef void *fcontext_t;
+typedef struct {
+	fcontext_t fctx;
+	void *ud;
+} transfer_t;
+
+typedef struct fiber_stack {
+	unsigned long padding;
+	unsigned long r12;
+	unsigned long r13;
+	unsigned long r14;
+	unsigned long r15;
+	unsigned long rbx;
+	unsigned long rbp;
+	unsigned long rip;
+} fiber_stack_t;
+
+#define NR_PADDING 8
+typedef struct fiber_ctx {
+	void *sp;
+	size_t size;
+	void (*fn)(void *fc);
+	void *ud;
+	fcontext_t fctx;
+	int padding[NR_PADDING];
+} task_t;
+
+task_t *fiber_ctx_alloc(void (*fn)(void *ud), void *ud, size_t size);
+void fiber_ctx_free(task_t *fc);
+
+/**
+ * @brief Make a context for jump_fcontext.
+ *
+ * @param sp The stack top pointer of context.
+ * @param size The size of stack, this argument is useless. But a second argument is neccessary.
+ * @param fn The function pointer of the context function.
+ *
+ * @return The pointer of the newly made context.
+ */
+extern fcontext_t make_fcontext(void *sp, size_t size, void (*fn)(transfer_t));
+
+/**
+ * @brief jump to target context and execute fn with argument ud
+ *
+ * @param to The pointer of target context.
+ * @param ud The data part of the argument of fn.
+ *
+ * @return the pointer of the prev transfer_t struct, where RAX store
+ *  previous context, RDX store ud passed by previous caller.
+ */
+extern transfer_t jump_fcontext(fcontext_t const to, void *ud);
+
+/**
+ * @brief To be written.
+ */
+extern transfer_t ontop_fcontext(fcontext_t const to, void *ud,
+				 transfer_t (*fn)(transfer_t));
+
+#endif
diff --git a/samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S b/samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
new file mode 100644
index 000000000000..43d3a8149c58
--- /dev/null
+++ b/samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
@@ -0,0 +1,81 @@
+/*
+            Copyright Oliver Kowalke 2009.
+   Distributed under the Boost Software License, Version 1.0.
+      (See accompanying file LICENSE_1_0.txt or copy at
+            http://www.boost.org/LICENSE_1_0.txt)
+*/
+
+/****************************************************************************************
+ *                                                                                      *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  | fc_mxcsr|fc_x87_cw|        R12        |         R13        |        R14        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |        R15        |        RBX        |         RBP        |        RIP        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *                                                                                      *
+ ****************************************************************************************/
+#ifdef __x86_64__
+.text
+.globl jump_fcontext
+.type jump_fcontext,@function
+.align 16
+jump_fcontext:
+	leaq  -0x38(%rsp), %rsp /* prepare stack */
+
+#if !defined(BOOST_USE_TSX)
+    stmxcsr  (%rsp)     /* save MMX control- and status-word */
+    fnstcw   0x4(%rsp)  /* save x87 control-word */
+#endif
+
+    movq  %r12, 0x8(%rsp)  /* save R12 */
+    movq  %r13, 0x10(%rsp)  /* save R13 */
+    movq  %r14, 0x18(%rsp)  /* save R14 */
+    movq  %r15, 0x20(%rsp)  /* save R15 */
+    movq  %rbx, 0x28(%rsp)  /* save RBX */
+    movq  %rbp, 0x30(%rsp)  /* save RBP */
+
+    /* store RSP (pointing to context-data) in RAX */
+    movq  %rsp, %rax
+
+    /* restore RSP (pointing to context-data) from RDI */
+    movq  %rdi, %rsp
+
+    movq  0x38(%rsp), %r8  /* restore return-address */
+
+#if !defined(BOOST_USE_TSX)
+    ldmxcsr  (%rsp)     /* restore MMX control- and status-word */
+    fldcw    0x4(%rsp)  /* restore x87 control-word */
+#endif
+
+    movq  0x8(%rsp), %r12  /* restore R12 */
+    movq  0x10(%rsp), %r13  /* restore R13 */
+    movq  0x18(%rsp), %r14  /* restore R14 */
+    movq  0x20(%rsp), %r15  /* restore R15 */
+    movq  0x28(%rsp), %rbx  /* restore RBX */
+    movq  0x30(%rsp), %rbp  /* restore RBP */
+
+    leaq  0x40(%rsp), %rsp /* prepare stack */
+
+    /* return transfer_t from jump */
+    /* RAX == fctx, RDX == data */
+    movq  %rsi, %rdx
+    /* pass transfer_t as first arg in context function */
+    /* RDI == fctx, RSI == data */
+    movq  %rax, %rdi
+
+    /* indirect jump to context */
+    jmp  *%r8
+.size jump_fcontext,.-jump_fcontext
+
+/* Mark that we don't need executable stack.  */
+.section .note.GNU-stack,"",%progbits
+#endif
diff --git a/samples/rpal/librpal/make_x86_64_sysv_elf_gas.S b/samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
new file mode 100644
index 000000000000..4f3af9247110
--- /dev/null
+++ b/samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
@@ -0,0 +1,82 @@
+/*
+            Copyright Oliver Kowalke 2009.
+   Distributed under the Boost Software License, Version 1.0.
+      (See accompanying file LICENSE_1_0.txt or copy at
+            http://www.boost.org/LICENSE_1_0.txt)
+*/
+
+/****************************************************************************************
+ *                                                                                      *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  | fc_mxcsr|fc_x87_cw|        R12        |         R13        |        R14        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |        R15        |        RBX        |         RBP        |        RIP        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *                                                                                      *
+ ****************************************************************************************/
+#ifdef __x86_64__
+.text
+.globl make_fcontext
+.type make_fcontext,@function
+.align 16
+make_fcontext:
+    /* first arg of make_fcontext() == top of context-stack */
+    movq  %rdi, %rax
+
+    /* shift address in RAX to lower 16 byte boundary */
+    andq  $-16, %rax
+
+    /* reserve space for context-data on context-stack */
+    /* on context-function entry: (RSP -0x8) % 16 == 0 */
+    leaq  -0x40(%rax), %rax
+
+    /* third arg of make_fcontext() == address of context-function */
+    /* stored in RBX */
+    movq  %rdx, 0x28(%rax)
+
+    /* save MMX control- and status-word */
+    stmxcsr  (%rax)
+    /* save x87 control-word */
+    fnstcw   0x4(%rax)
+
+    /* compute abs address of label trampoline */
+    leaq  trampoline(%rip), %rcx
+    /* save address of trampoline as return-address for context-function */
+    /* will be entered after calling jump_fcontext() first time */
+    movq  %rcx, 0x38(%rax)
+
+    /* compute abs address of label finish */
+    leaq  finish(%rip), %rcx
+    /* save address of finish as return-address for context-function */
+    /* will be entered after context-function returns */
+    movq  %rcx, 0x30(%rax)
+
+    ret /* return pointer to context-data */
+
+trampoline:
+    /* store return address on stack */
+    /* fix stack alignment */
+    push %rbp
+    /* jump to context-function */
+    jmp *%rbx
+
+finish:
+    /* exit code is zero */
+    xorq  %rdi, %rdi
+    /* exit application */
+    call  _exit@PLT
+    hlt
+.size make_fcontext,.-make_fcontext
+
+/* Mark that we don't need executable stack. */
+.section .note.GNU-stack,"",%progbits
+#endif
\ No newline at end of file
diff --git a/samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S b/samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
new file mode 100644
index 000000000000..9dce797c2541
--- /dev/null
+++ b/samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
@@ -0,0 +1,84 @@
+/*
+            Copyright Oliver Kowalke 2009.
+   Distributed under the Boost Software License, Version 1.0.
+      (See accompanying file LICENSE_1_0.txt or copy at
+            http://www.boost.org/LICENSE_1_0.txt)
+*/
+
+/****************************************************************************************
+ *                                                                                      *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  | fc_mxcsr|fc_x87_cw|        R12        |         R13        |        R14        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
+ *  ----------------------------------------------------------------------------------  *
+ *  |        R15        |        RBX        |         RBP        |        RIP        |  *
+ *  ----------------------------------------------------------------------------------  *
+ *                                                                                      *
+ ****************************************************************************************/
+#ifdef __x86_64__
+.text
+.globl ontop_fcontext
+.type ontop_fcontext,@function
+.align 16
+ontop_fcontext:
+    /* preserve ontop-function in R8 */
+    movq  %rdx, %r8
+
+    leaq  -0x38(%rsp), %rsp /* prepare stack */
+
+#if !defined(BOOST_USE_TSX)
+    stmxcsr  (%rsp)     /* save MMX control- and status-word */
+    fnstcw   0x4(%rsp)  /* save x87 control-word */
+#endif
+
+    movq  %r12, 0x8(%rsp)  /* save R12 */
+    movq  %r13, 0x10(%rsp)  /* save R13 */
+    movq  %r14, 0x18(%rsp)  /* save R14 */
+    movq  %r15, 0x20(%rsp)  /* save R15 */
+    movq  %rbx, 0x28(%rsp)  /* save RBX */
+    movq  %rbp, 0x30(%rsp)  /* save RBP */
+
+    /* store RSP (pointing to context-data) in RAX */
+    movq  %rsp, %rax
+
+    /* restore RSP (pointing to context-data) from RDI */
+    movq  %rdi, %rsp
+
+#if !defined(BOOST_USE_TSX)
+    ldmxcsr  (%rsp)     /* restore MMX control- and status-word */
+    fldcw    0x4(%rsp)  /* restore x87 control-word */
+#endif
+
+    movq  0x8(%rsp), %r12  /* restore R12 */
+    movq  0x10(%rsp), %r13  /* restore R13 */
+    movq  0x18(%rsp), %r14  /* restore R14 */
+    movq  0x20(%rsp), %r15  /* restore R15 */
+    movq  0x28(%rsp), %rbx  /* restore RBX */
+    movq  0x30(%rsp), %rbp  /* restore RBP */
+
+    leaq  0x38(%rsp), %rsp /* prepare stack */
+
+    /* return transfer_t from jump */
+    /* RAX == fctx, RDX == data */
+    movq  %rsi, %rdx
+    /* pass transfer_t as first arg in context function */
+    /* RDI == fctx, RSI == data */
+    movq  %rax, %rdi
+
+    /* keep return-address on stack */
+
+    /* indirect jump to context */
+    jmp  *%r8
+.size ontop_fcontext,.-ontop_fcontext
+
+/* Mark that we don't need executable stack.  */
+.section .note.GNU-stack,"",%progbits
+#endif
diff --git a/samples/rpal/librpal/private.h b/samples/rpal/librpal/private.h
new file mode 100644
index 000000000000..9dc78f449f0f
--- /dev/null
+++ b/samples/rpal/librpal/private.h
@@ -0,0 +1,341 @@
+#ifndef PRIVATE_H
+#define PRIVATE_H
+
+#include <unistd.h>
+#include <stdint.h>
+#include <sys/syscall.h>
+#include <sys/uio.h>
+#ifdef __x86_64__
+#include <immintrin.h>
+#endif
+#include <pthread.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <stddef.h>
+#include <sys/ioctl.h>
+
+#include "debug.h"
+#include "rpal_queue.h"
+#include "fiber.h"
+#include "rpal.h"
+
+#ifdef __x86_64__
+static inline void write_tls_base(unsigned long tls_base)
+{
+	asm volatile("wrfsbase %0" ::"r"(tls_base) : "memory");
+}
+
+static inline unsigned long read_tls_base(void)
+{
+	unsigned long fsbase;
+	asm volatile("rdfsbase %0" : "=r"(fsbase)::"memory");
+	return fsbase;
+}
+#endif
+
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+// | fd_timestamp |   pad   | rthread_id | server_fd |
+// |     16       |    8    |      8     |    32     |
+#define LOW32_MASK ((1UL << 32) - 1)
+#define MIDL8_MASK ((unsigned long)(((1UL << 8) - 1)) << 32)
+
+#define HIGH16_OFFSET 48
+#define HIGH32_OFFSET 32
+
+#define get_high16(val) ({ (val) >> HIGH16_OFFSET; })
+
+#define get_high32(val) ({ (val) >> HIGH32_OFFSET; })
+
+#define get_midl8(val) ({ ((val) & MIDL8_MASK) >> HIGH32_OFFSET; })
+#define get_low32(val) ({ (val) & LOW32_MASK; })
+
+#define get_fdtimestamp(rpalfd) get_high16(rpalfd)
+#define get_rid(rpalfd) get_midl8(rpalfd)
+#define get_sfd(rpalfd) get_low32(rpalfd)
+
+#define PAGE_SIZE 4096
+#define DEFUALT_STACK_SIZE (PAGE_SIZE * 4)
+#define TRAMPOLINE_SIZE (PAGE_SIZE * 1)
+
+#define BITS_PER_LONG 64
+#define BITS_TO_LONGS(x)                                                       \
+	(((x) + 8 * sizeof(unsigned long) - 1) / (8 * sizeof(unsigned long)))
+
+#define KEY_SIZE 16
+
+enum rpal_sender_state {
+	RPAL_SENDER_STATE_RUNNING,
+	RPAL_SENDER_STATE_CALL,
+	RPAL_SENDER_STATE_KERNEL_RET,
+};
+
+enum rpal_epoll_event {
+	RPAL_KERNEL_PENDING = 0x1,
+	RPAL_USER_PENDING = 0x2,
+};
+
+enum rpal_receiver_state {
+	RPAL_RECEIVER_STATE_RUNNING,
+	RPAL_RECEIVER_STATE_KERNEL_RET,
+	RPAL_RECEIVER_STATE_READY,
+	RPAL_RECEIVER_STATE_WAIT,
+	RPAL_RECEIVER_STATE_CALL,
+	RPAL_RECEIVER_STATE_LAZY_SWITCH,
+	RPAL_RECEIVER_STATE_MAX,
+};
+
+enum rpal_command_type {
+	RPAL_CMD_GET_API_VERSION_AND_CAP,
+	RPAL_CMD_GET_SERVICE_KEY,
+	RPAL_CMD_GET_SERVICE_ID,
+	RPAL_CMD_REGISTER_SENDER,
+	RPAL_CMD_UNREGISTER_SENDER,
+	RPAL_CMD_REGISTER_RECEIVER,
+	RPAL_CMD_UNREGISTER_RECEIVER,
+	RPAL_CMD_ENABLE_SERVICE,
+	RPAL_CMD_DISABLE_SERVICE,
+	RPAL_CMD_REQUEST_SERVICE,
+	RPAL_CMD_RELEASE_SERVICE,
+	RPAL_CMD_GET_SERVICE_PKEY,
+	RPAL_CMD_UDS_FDMAP,
+	RPAL_NR_CMD,
+};
+
+/* RPAL ioctl macro */
+#define RPAL_IOCTL_MAGIC 0x33
+#define RPAL_IOCTL_GET_API_VERSION_AND_CAP                        \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_API_VERSION_AND_CAP, \
+	      struct rpal_version_info *)
+#define RPAL_IOCTL_GET_SERVICE_KEY \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, unsigned long)
+#define RPAL_IOCTL_GET_SERVICE_ID \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *)
+#define RPAL_IOCTL_REGISTER_SENDER \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_SENDER, unsigned long)
+#define RPAL_IOCTL_UNREGISTER_SENDER \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_SENDER)
+#define RPAL_IOCTL_REGISTER_RECEIVER \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long)
+#define RPAL_IOCTL_UNREGISTER_RECEIVER \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER)
+#define RPAL_IOCTL_ENABLE_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long)
+#define RPAL_IOCTL_DISABLE_SERVICE \
+	_IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE)
+#define RPAL_IOCTL_REQUEST_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long)
+#define RPAL_IOCTL_RELEASE_SERVICE \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long)
+#define RPAL_IOCTL_GET_SERVICE_PKEY \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *)
+#define RPAL_IOCTL_UDS_FDMAP \
+	_IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_UDS_FDMAP, void *)
+
+typedef enum rpal_receiver_status {
+	RPAL_RECEIVER_UNINITIALIZED,
+	RPAL_RECEIVER_INITIALIZED,
+	RPAL_RECEIVER_AVAILABLE,
+} rpal_receiver_status_t;
+
+enum RPAL_CAPABILITIES {
+	RPAL_CAP_PKU,
+};
+
+#define RPAL_SID_SHIFT 24
+#define RPAL_ID_SHIFT 8
+#define RPAL_RECEIVER_STATE_MASK ((1 << RPAL_ID_SHIFT) - 1)
+#define RPAL_SID_MASK (~((1 << RPAL_SID_SHIFT) - 1))
+#define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK))
+#define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1)
+#define RPAL_BUILD_CALL_STATE(id, sid)                                             \
+	((sid << RPAL_SID_SHIFT) | (id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CALL)
+
+typedef struct rpal_capability {
+	int compat_version;
+	int api_version;
+	unsigned long cap;
+} rpal_capability_t;
+
+typedef struct task_context {
+	unsigned long r15;
+	unsigned long r14;
+	unsigned long r13;
+	unsigned long r12;
+	unsigned long rbx;
+	unsigned long rbp;
+	unsigned long rip;
+	unsigned long rsp;
+} task_context_t;
+
+typedef struct receiver_context {
+	task_context_t task_context;
+	int receiver_id;
+	int receiver_state;
+	int sender_state;
+	int ep_pending;
+	int rpal_ep_poll_magic;
+	int epfd;
+	void *ep_events;
+	int maxevents;
+	int timeout;
+	int64_t total_time;
+} receiver_context_t;
+
+typedef struct rpal_call_info {
+	unsigned long sender_tls_base;
+	uint32_t pkru;
+	fcontext_t sender_fctx;
+} rpal_call_info_t;
+
+enum thread_type {
+	RPAL_RECEIVER = 0x1,
+	RPAL_SENDER = 0x2,
+};
+typedef struct rpal_receiver_info {
+	long tid;
+	unsigned long tls_base;
+
+	int epfd;
+	rpal_receiver_status_t status;
+	epoll_uevent_queue_t ueventq;
+	volatile uint64_t uqlock;
+
+	fcontext_t main_ctx;
+	task_t *ep_stack;
+	task_t *trampoline;
+
+	rpal_call_info_t rci;
+
+	volatile receiver_context_t *rc;
+	struct rpal_thread_pool *rtp;
+} rpal_receiver_info_t;
+
+typedef struct fd_table fd_table_t;
+/* Keep it the same as kernel */
+struct rpal_thread_pool {
+	rpal_receiver_info_t *rris;
+	fd_table_t *fdt;
+	uint64_t service_key;
+	int nr_threads;
+	int service_id;
+	int pkey;
+};
+
+struct rpal_request_arg {
+	unsigned long version;
+	uint64_t key;
+	struct rpal_thread_pool **rtp;
+	int *id;
+	int *pkey;
+};
+
+struct rpal_uds_fdmap_arg {
+	int service_id;
+	int cfd;
+	unsigned long *res;
+};
+
+#define RPAL_ERROR_MAGIC 0x98CC98CC
+
+typedef struct rpal_error_context {
+	unsigned long tls_base;
+	uint64_t erip;
+	uint64_t ersp;
+	int state;
+	int magic;
+} rpal_error_context_t;
+
+typedef struct sender_context {
+	task_context_t task_context;
+	rpal_error_context_t ec;
+	int sender_id;
+	int64_t start_time;
+	int64_t total_time;
+} sender_context_t;
+
+#define RPAL_EP_POLL_MAGIC 0xCC98CC98
+
+typedef struct rpal_sender_info {
+	int idx;
+	int tid;
+	int pkey;
+	int inited;
+	sender_context_t sc;
+} rpal_sender_info_t;
+
+typedef struct fdt_node fdt_node_t;
+
+typedef struct fd_event {
+	int epfd;
+	int fd;
+	struct epoll_event epev;
+	uint32_t events;
+	int wait;
+
+	rpal_queue_t q;
+	int pkey; // unused
+	fdt_node_t *node;
+	struct fd_event *next;
+	uint16_t timestamp;
+	uint16_t outdated;
+	uint64_t service_key;
+} fd_event_t;
+
+struct fdt_node {
+	fd_event_t **events;
+	fdt_node_t *next;
+	int *ref_count;
+	uint16_t *timestamps;
+};
+
+// when sender calls fd_event_get, we must check this number to avoid
+// accessing outdated fdt_node definitions
+
+#define FDTAB_MAG1 0x4D414731UL // add fde lazyswitch
+#define FDTAB_MAG2 0x14D414731UL // add fde timestamp
+#define FDTAB_MAG3 0x34D414731UL // add fde outdated
+#define FDTAB_MAG4 0x74D414731UL // add automatic identification rpal mode
+
+enum fde_ref_status {
+	FDE_FREEING = -100,
+	FDE_FREED = -1,
+	FDE_AVAILABLE = 0,
+};
+
+#define DEFAULT_NODE_SHIFT 14 // 2^14 elements per node
+typedef struct fd_table {
+	fdt_node_t *head;
+	fdt_node_t *tail;
+	int max_fd;
+	unsigned int node_shift;
+	unsigned int node_mask;
+	pthread_mutex_t lock;
+	unsigned long magic;
+	fd_event_t *freelist;
+	pthread_mutex_t list_lock;
+} fd_table_t;
+
+typedef struct critical_section {
+	unsigned long ret_begin;
+	unsigned long ret_end;
+} critical_section_t;
+
+struct rpal_service_metadata {
+	unsigned long version;
+	struct rpal_thread_pool *rtp;
+	critical_section_t rcs;
+	int pkey;
+};
+
+#ifndef RPAL_DEBUG
+#define dbprint(category, format, args...) ((void)0)
+#else
+void dbprint(rpal_debug_flag_t category, char *format, ...)
+	__attribute__((format(printf, 2, 3)));
+#endif
+void errprint(const char *format, ...) __attribute__((format(printf, 1, 2)));
+void warnprint(const char *format, ...) __attribute__((format(printf, 1, 2)));
+
+#endif
diff --git a/samples/rpal/librpal/rpal.c b/samples/rpal/librpal/rpal.c
new file mode 100644
index 000000000000..64bd2b93bd67
--- /dev/null
+++ b/samples/rpal/librpal/rpal.c
@@ -0,0 +1,2351 @@
+#include "private.h"
+
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <sys/eventfd.h>
+#include <linux/futex.h>
+#include <signal.h>
+#include <stdarg.h>
+
+#include "rpal_pkru.h"
+
+/* prints an error message to stderr */
+void errprint(const char *format, ...)
+{
+	va_list args;
+
+	fprintf(stderr, "[RPAL_ERROR] ");
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+}
+
+/* prints a warning message to stderr */
+void warnprint(const char *format, ...)
+{
+	va_list args;
+
+	fprintf(stderr, "[RPAL_WARNING] ");
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+}
+
+#ifdef RPAL_DEBUG
+void dbprint(rpal_debug_flag_t category, char *format, ...)
+{
+	if (category & RPAL_DEBUG) {
+		va_list args;
+		fprintf(stderr, "[RPAL_DEBUG] ");
+		va_start(args, format);
+		vfprintf(stderr, format, args);
+		va_end(args);
+	}
+}
+#endif
+
+#define SAVE_FPU(mxcsr, fpucw)                                                 \
+	__asm__ __volatile__("stmxcsr %0;"                                     \
+			     "fnstcw %1;"                                      \
+			     : "=m"(mxcsr), "=m"(fpucw)                        \
+			     :)
+#define RESTORE_FPU(mxcsr, fpucw)                                              \
+	__asm__ __volatile__("ldmxcsr %0;"                                     \
+			     "fldcw %1;"                                       \
+			     :                                                 \
+			     : "m"(mxcsr), "m"(fpucw))
+
+#define ERRREPORT(EPTR, ECODE, ...)                                            \
+	if (EPTR) {                                                            \
+		*EPTR = ECODE;                                                 \
+	}                                                                      \
+	errprint(__VA_ARGS__);
+
+#define RPAL_MGT_FILE "/proc/rpal"
+#define MAX_SUPPROTED_CPUS 192
+
+static __always_inline unsigned long __ffs(unsigned long word)
+{
+	asm("rep; bsf %1,%0" : "=r"(word) : "rm"(word));
+
+	return word;
+}
+
+static void __set_bit(uint64_t *bitmap, int idx)
+{
+	int bit, i;
+	i = idx / 8;
+	bit = idx % 8;
+	bitmap[i] |= (1UL << bit);
+}
+
+static int clear_first_set_bit(uint64_t *bitmap, int size)
+{
+	int idx;
+	int bit, i;
+
+	for (i = 0; i * BITS_PER_LONG < size; i++) {
+		if (bitmap[i]) {
+			bit = __ffs(bitmap[i]);
+			idx = i * BITS_PER_LONG + bit;
+			if (idx >= size) {
+				return -1;
+			}
+			bitmap[i] &= ~(1UL << bit);
+			return idx;
+		}
+	}
+	return -1;
+}
+
+extern void rpal_get_critical_addr(critical_section_t *rcs);
+static critical_section_t rcs = { 0 };
+
+#define MAX_SERVICEID 254 // Intel MPK Limit
+#define MIN_RPAL_KERNEL_API_VERSION 1
+#define TARGET_RPAL_KERNEL_API_VERSION                                         \
+	1 // RPAL will disable when KERNEL_API < TARGET_RPAL_KERNEL_API_VERSION
+
+enum {
+	RCALL_IN = 0x1 << 0,
+	RCALL_OUT = 0x1 << 1,
+};
+
+enum {
+	FDE_NO_TRIGGER,
+	FDE_TRIGGER_OUT,
+};
+
+#define EPOLLRPALINOUT_BITS (EPOLLRPALIN | EPOLLRPALOUT)
+
+#define DEFAULT_QUEUE_SIZE 32U
+
+typedef struct rpal_requested_service {
+	struct rpal_thread_pool *service;
+	int pkey;
+	uint64_t key;
+} rpal_requeseted_service_t;
+
+static int rpal_mgtfd = -1;
+static int inited;
+int pkru_enabled = 0;
+
+static rpal_capability_t version;
+static pthread_key_t rpal_key;
+static rpal_requeseted_service_t requested_services[MAX_SERVICEID];
+static pthread_mutex_t release_lock;
+
+typedef struct rpal_local {
+	unsigned int tflag;
+	rpal_receiver_info_t *rri;
+	rpal_sender_info_t *rsi;
+} rpal_local_t;
+
+#define SENDERS_PAGE_ORDER 3
+#define RPALTHREAD_PAGE_ORDER 0
+
+typedef struct rpal_thread_metadata {
+	int rpal_receiver_idx;
+	int service_id;
+	const int epcpage_order;
+	uint64_t service_key;
+	struct rpal_thread_pool *rtp;
+	receiver_context_t *rc;
+	pid_t pid;
+	int *eventfds;
+} rpal_thread_metadata_t;
+
+static rpal_thread_metadata_t threads_md = {
+	.service_id = -1,
+	.epcpage_order = RPALTHREAD_PAGE_ORDER,
+};
+
+static inline rpal_sender_info_t *current_rpal_sender(void)
+{
+	rpal_local_t *local;
+
+	local = pthread_getspecific(rpal_key);
+	if (local && (local->tflag & RPAL_SENDER)) {
+		return local->rsi;
+	} else {
+		return NULL;
+	}
+}
+
+static inline rpal_receiver_info_t *current_rpal_thread(void)
+{
+	rpal_local_t *local;
+
+	local = pthread_getspecific(rpal_key);
+	if (local && (local->tflag & RPAL_RECEIVER)) {
+		return local->rri;
+	} else {
+		return NULL;
+	}
+}
+
+static status_t rpal_register_sender_local(rpal_sender_info_t *sender)
+{
+	rpal_local_t *local;
+	local = pthread_getspecific(rpal_key);
+	if (!local) {
+		local = malloc(sizeof(rpal_local_t));
+		if (!local)
+			return RPAL_FAILURE;
+		memset(local, 0, sizeof(rpal_local_t));
+		pthread_setspecific(rpal_key, local);
+	}
+	if (local->tflag & RPAL_SENDER) {
+		return RPAL_FAILURE;
+	}
+	local->rsi = sender;
+	local->tflag |= RPAL_SENDER;
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_unregister_sender_local(void)
+{
+	rpal_local_t *local;
+	local = pthread_getspecific(rpal_key);
+	if (!local || !(local->tflag & RPAL_SENDER))
+		return RPAL_FAILURE;
+
+	local->rsi = NULL;
+	local->tflag &= ~RPAL_SENDER;
+	if (!local->tflag) {
+		pthread_setspecific(rpal_key, NULL);
+		free(local);
+	}
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_register_receiver_local(rpal_receiver_info_t *thread)
+{
+	rpal_local_t *local;
+	local = pthread_getspecific(rpal_key);
+	if (!local) {
+		local = malloc(sizeof(rpal_local_t));
+		if (!local)
+			return RPAL_FAILURE;
+		memset(local, 0, sizeof(rpal_local_t));
+		pthread_setspecific(rpal_key, local);
+	}
+	if (local->tflag & RPAL_RECEIVER) {
+		return RPAL_FAILURE;
+	}
+	local->rri = thread;
+	local->tflag |= RPAL_RECEIVER;
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_unregister_receiver_local(void)
+{
+	rpal_local_t *local;
+	local = pthread_getspecific(rpal_key);
+	if (!local || !(local->tflag & RPAL_RECEIVER))
+		return RPAL_FAILURE;
+
+	local->rri = NULL;
+	local->tflag &= ~RPAL_RECEIVER;
+	if (!local->tflag) {
+		pthread_setspecific(rpal_key, NULL);
+		free(local);
+	}
+	return RPAL_SUCCESS;
+}
+
+#define MAX_SENDERS 256
+typedef struct rpal_senders_metadata {
+	uint64_t bitmap[BITS_TO_LONGS(MAX_SENDERS)];
+	pthread_mutex_t lock;
+	int sdpage_order;
+	rpal_sender_info_t *senders;
+} rpal_senders_metadata_t;
+
+static rpal_senders_metadata_t *senders_md;
+
+static long rpal_ioctl(unsigned long cmd, unsigned long arg)
+{
+	struct {
+		unsigned long *ret;
+		unsigned long cmd;
+		unsigned long arg0;
+		unsigned long arg1;
+	} args;
+	const int args_size = sizeof(args);
+	int ret;
+
+	if (rpal_mgtfd == -1) {
+		errprint("rpal_mgtfd is not opened\n");
+		return -1;
+	}
+
+	ret = ioctl(rpal_mgtfd, cmd, arg);
+
+	return ret;
+}
+
+static inline long rpal_register_sender(rpal_sender_info_t *sender)
+{
+	long ret;
+
+	if (rpal_register_sender_local(sender) == RPAL_FAILURE)
+		return RPAL_FAILURE;
+
+	ret = rpal_ioctl(RPAL_IOCTL_REGISTER_SENDER,
+			 (unsigned long)&sender->sc);
+	if (ret < 0) {
+		rpal_unregister_sender_local();
+	}
+	return ret;
+}
+
+static inline long rpal_register_receiver(rpal_receiver_info_t *rri)
+{
+	long ret;
+
+	if (rpal_register_receiver_local(rri) == RPAL_FAILURE)
+		return RPAL_FAILURE;
+	ret = rpal_ioctl(RPAL_IOCTL_REGISTER_RECEIVER,
+			 (unsigned long)rri->rc);
+	if (ret < 0) {
+		rpal_unregister_receiver_local();
+	}
+	return ret;
+}
+
+static inline long rpal_unregister_sender(void)
+{
+	if (rpal_unregister_sender_local() == RPAL_FAILURE)
+		return RPAL_FAILURE;
+	return rpal_ioctl(RPAL_IOCTL_UNREGISTER_SENDER, 0);
+}
+
+static inline long rpal_unregister_receiver(void)
+{
+	if (rpal_unregister_receiver_local() == RPAL_FAILURE)
+		return RPAL_FAILURE;
+	return rpal_ioctl(RPAL_IOCTL_UNREGISTER_RECEIVER, 0);
+}
+
+static int rpal_get_service_pkey(void)
+{
+	int pkey, ret;
+
+	ret = rpal_ioctl(RPAL_IOCTL_GET_SERVICE_PKEY, (unsigned long)&pkey);
+	if (ret < 0 || pkey == -1) {
+		warnprint("MPK not supported on this host, disabling PKRU\n");
+		return -1;
+	}
+	return pkey;
+}
+
+static int __rpal_get_service_id(void)
+{
+	int id, ret;
+
+	ret = rpal_ioctl(RPAL_IOCTL_GET_SERVICE_ID, (unsigned long)&id);
+
+	if (ret < 0)
+		return ret;
+	else
+		return id;
+}
+
+static uint64_t __rpal_get_service_key(void)
+{
+	int ret;
+	uint64_t key;
+
+	ret = rpal_ioctl(RPAL_IOCTL_GET_SERVICE_KEY, (unsigned long)&key);
+	if (ret < 0)
+		return 0;
+	else
+		return key;
+}
+
+static void *rpal_get_shared_page(int order)
+{
+	void *p;
+	int size;
+	int flags = MAP_SHARED;
+
+	if (rpal_mgtfd == -1) {
+		return NULL;
+	}
+	size = PAGE_SIZE * (1 << order);
+
+	p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, rpal_mgtfd, 0);
+
+	return p;
+}
+
+static int rpal_free_shared_page(void *page, int order)
+{
+	int ret = 0;
+	int size;
+
+	size = PAGE_SIZE * (1 << order);
+	ret = munmap(page, size);
+	if (ret) {
+		errprint("munmap fail: %d\n", ret);
+	}
+	return ret;
+}
+
+static inline int rpal_inited(void)
+{
+	return (inited == 1);
+}
+
+static inline int sender_idx_is_invalid(int idx)
+{
+	if (idx < 0 || idx >= MAX_SENDERS)
+		return 1;
+	return 0;
+}
+
+static int rpal_sender_info_alloc(rpal_sender_info_t **sender)
+{
+	int idx;
+
+	if (!senders_md)
+		return RPAL_FAILURE;
+	pthread_mutex_lock(&senders_md->lock);
+	idx = clear_first_set_bit(senders_md->bitmap, MAX_SENDERS);
+	if (idx < 0) {
+		errprint("sender data alloc failed: %d, bitmap: %lx\n", idx,
+			 senders_md->bitmap[0]);
+		goto unlock;
+	}
+	*sender = senders_md->senders + idx;
+
+unlock:
+	pthread_mutex_unlock(&senders_md->lock);
+	return idx;
+}
+
+static void rpal_sender_info_free(int idx)
+{
+	if (sender_idx_is_invalid(idx)) {
+		return;
+	}
+	pthread_mutex_lock(&senders_md->lock);
+	__set_bit(senders_md->bitmap, idx);
+	pthread_mutex_unlock(&senders_md->lock);
+}
+
+extern unsigned long rpal_get_ret_rip(void);
+
+static int rpal_sender_inited(rpal_sender_info_t *sender)
+{
+	return (sender->inited == 1);
+}
+
+status_t rpal_sender_init(rpal_error_code_t *error)
+{
+	int idx;
+	int ret = RPAL_FAILURE;
+	rpal_sender_info_t *sender;
+
+	if (!rpal_inited()) {
+		ERRREPORT(error, RPAL_DONT_INITED, "%s: rpal do not init\n",
+			  __FUNCTION__);
+		goto error_out;
+	}
+	sender = current_rpal_sender();
+	if (sender) {
+		goto error_out;
+	}
+	idx = rpal_sender_info_alloc(&sender);
+	if (idx < 0) {
+		if (error) {
+			*error = RPAL_ERR_SENDER_INIT;
+		}
+		goto error_out;
+	}
+	sender->idx = idx;
+	sender->sc.sender_id = idx;
+	sender->tid = syscall(SYS_gettid);
+	sender->pkey = rpal_get_service_pkey();
+	sender->sc.ec.erip = rpal_get_ret_rip();
+	ret = rpal_register_sender(sender);
+	if (ret) {
+		ERRREPORT(error, RPAL_ERR_SENDER_REG,
+			  "rpal_register_sender error: %d\n", ret);
+		goto sender_register_failed;
+	}
+	sender->inited = 1;
+	return RPAL_SUCCESS;
+
+sender_register_failed:
+	rpal_sender_info_free(idx);
+error_out:
+	return RPAL_FAILURE;
+}
+
+status_t rpal_sender_exit(void)
+{
+	int idx;
+	rpal_sender_info_t *sender;
+
+	sender = current_rpal_sender();
+
+	if (sender) {
+		idx = sender->idx;
+		sender->idx = 0;
+		sender->tid = 0;
+		rpal_unregister_sender();
+		rpal_sender_info_free(idx);
+		sender->pkey = 0;
+	}
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_enable_service(rpal_error_code_t *error)
+{
+	struct rpal_service_metadata rsm;
+	long ret = 0;
+
+	rsm.version = 0;
+	rsm.rtp = threads_md.rtp;
+	rsm.rcs = rcs;
+	rsm.pkey = -1;
+	ret = rpal_ioctl(RPAL_IOCTL_ENABLE_SERVICE, (unsigned long)&rsm);
+	if (ret) {
+		ERRREPORT(error, RPAL_ERR_ENABLE_SERVICE,
+			  "rpal enable service failed: %ld\n", ret)
+		return RPAL_FAILURE;
+	}
+	threads_md.rtp->pkey = rpal_get_service_pkey();
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_disable_service(void)
+{
+	long ret = 0;
+	ret = rpal_ioctl(RPAL_IOCTL_DISABLE_SERVICE, 0);
+	if (ret) {
+		errprint("rpal disable service failed: %ld\n", ret);
+		return RPAL_FAILURE;
+	}
+	return RPAL_SUCCESS;
+}
+
+static status_t add_requested_service(struct rpal_thread_pool *rtp, uint64_t key, int id, int pkey)
+{
+	struct rpal_thread_pool *expected = NULL;
+
+	if (!rtp) {
+		errprint("add requested service null\n");
+		return RPAL_FAILURE;
+	}
+
+	if (!__atomic_compare_exchange_n(&requested_services[id].service,
+					 &expected, rtp, 1, __ATOMIC_SEQ_CST,
+					 __ATOMIC_SEQ_CST)) {
+		errprint("rpal service %d already add, expected: %ld\n", id,
+			 expected->service_key);
+		return RPAL_FAILURE;
+	}
+	requested_services[id].key = key;
+	requested_services[id].pkey = pkey;
+	return RPAL_SUCCESS;
+}
+
+int rpal_get_request_service_id(uint64_t key)
+{
+	int i;
+
+	for (i = 0; i < MAX_SERVICEID; i++) {
+		if (requested_services[i].key == key)
+			return i;
+	}
+	return -1;
+}
+
+static struct rpal_thread_pool *get_service_from_key(uint64_t key)
+{
+	int i;
+	struct rpal_thread_pool *rtp;
+
+	for (i = 0; i < MAX_SERVICEID; i++) {
+		if (requested_services[i].key == key)
+			return requested_services[i].service;
+	}
+	return NULL;
+}
+
+static inline struct rpal_thread_pool *get_service_from_id(int id)
+{
+	return requested_services[id].service;
+}
+
+static inline int get_service_pkey_from_id(int id)
+{
+	return requested_services[id].pkey;
+}
+
+static struct rpal_thread_pool *del_requested_service(uint64_t key)
+{
+	int id;
+	struct rpal_thread_pool *rtp;
+
+	id = rpal_get_request_service_id(key);
+	if (id == -1)
+		return NULL;
+	rtp = __atomic_exchange_n(&requested_services[id].service, NULL,
+				  __ATOMIC_RELAXED);
+	return rtp;
+}
+
+int rpal_request_service(uint64_t key)
+{
+	struct rpal_request_arg rra;
+	long ret = RPAL_FAILURE;
+	struct rpal_thread_pool *rtp;
+	int id, pkey;
+
+	if (!rpal_inited()) {
+		errprint("%s: rpal do not init\n", __FUNCTION__);
+		goto error_out;
+	}
+
+	rra.version = 0;
+	rra.key = key;
+	rra.rtp = &rtp;
+	rra.id = &id;
+	rra.pkey = &pkey;
+	ret = rpal_ioctl(RPAL_IOCTL_REQUEST_SERVICE, (unsigned long)&rra);
+	if (ret) {
+		goto error_out;
+	}
+
+	ret = add_requested_service(rtp, key, id, pkey);
+	if (ret == RPAL_FAILURE) {
+		goto add_requested_failed;
+	}
+
+	return RPAL_SUCCESS;
+
+add_requested_failed:
+	rpal_ioctl(RPAL_IOCTL_RELEASE_SERVICE, key);
+error_out:
+	return (int)ret;
+}
+
+static void fdt_freelist_forcefree(fd_table_t *fdt, uint64_t service_key);
+
+status_t rpal_release_service(uint64_t key)
+{
+	long ret;
+	struct rpal_thread_pool *rtp;
+
+	if (!rpal_inited()) {
+		errprint("%s: rpal do not init\n", __FUNCTION__);
+		return RPAL_FAILURE;
+	}
+
+	rtp = del_requested_service(key);
+	ret = rpal_ioctl(RPAL_IOCTL_RELEASE_SERVICE, key);
+	if (ret) {
+		errprint("rpal release service failed: %ld\n", ret);
+		return RPAL_FAILURE;
+	}
+	fdt_freelist_forcefree(threads_md.rtp->fdt, key);
+	return RPAL_SUCCESS;
+}
+
+static void try_clean_lock(rpal_receiver_info_t *rri, uint64_t key)
+{
+	uint64_t lock_state = key | 1UL << 63;
+
+	if (__atomic_load_n(&rri->uqlock, __ATOMIC_RELAXED) == lock_state)
+		uevent_queue_fix(&rri->ueventq);
+
+	if (__atomic_compare_exchange_n(&rri->uqlock, &lock_state, (uint64_t)0,
+					1, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
+		dbprint(RPAL_DEBUG_MANAGEMENT,
+			"Serivce (key: %lu) does exit with holding lock\n",
+			key);
+}
+
+struct release_info {
+	uint64_t keys[KEY_SIZE];
+	int size;
+};
+
+status_t rpal_clean_service_start(int64_t *ptr)
+{
+	rpal_receiver_info_t *rri;
+	struct release_info *info;
+	int i, j;
+	int size;
+
+	if (!ptr) {
+		goto error_out;
+	}
+
+	info = malloc(sizeof(struct release_info));
+	if (info == NULL) {
+		errprint("alloc release_info fail\n");
+		goto error_out;
+	}
+
+	pthread_mutex_lock(&release_lock);
+	size = read(rpal_mgtfd, info->keys, KEY_SIZE * sizeof(uint64_t));
+	if (size <= 0) {
+		errprint("Read keys on rpal_mgtfd failed\n");
+		goto error_unlock;
+	}
+
+	size /= sizeof(uint64_t);
+	info->size = size;
+
+	for (i = 0; i < size; i++) {
+		for (j = 0; j < threads_md.rtp->nr_threads; j++) {
+			rri = threads_md.rtp->rris + j;
+			try_clean_lock(rri, info->keys[i]);
+		}
+	}
+	pthread_mutex_unlock(&release_lock);
+	*ptr = (int64_t)info;
+	return RPAL_SUCCESS;
+
+error_unlock:
+	pthread_mutex_unlock(&release_lock);
+	free(info);
+error_out:
+	return RPAL_FAILURE;
+}
+
+void rpal_clean_service_end(int64_t *ptr)
+{
+	int i;
+	struct release_info *info;
+
+	if (ptr == NULL)
+		return;
+	info = (struct release_info *)(*ptr);
+	if (info == NULL)
+		return;
+	for (i = 0; i < info->size; i++) {
+		dbprint(RPAL_DEBUG_MANAGEMENT, "release service: 0x%lx\n",
+			info->keys[i]);
+		rpal_release_service(info->keys[i]);
+	}
+	free(info);
+}
+int rpal_get_service_id(void)
+{
+	if (!rpal_inited()) {
+		return RPAL_FAILURE;
+	}
+	return threads_md.service_id;
+}
+
+status_t rpal_get_service_key(uint64_t *service_key)
+{
+	if (!rpal_inited() || !service_key) {
+		return RPAL_FAILURE;
+	}
+	*service_key = threads_md.service_key;
+	return RPAL_SUCCESS;
+}
+
+static fdt_node_t *fdt_node_alloc(fd_table_t *fdt)
+{
+	fdt_node_t *node;
+	fd_event_t **ev;
+	int *ref_count;
+	uint16_t *timestamps;
+	int size = 0;
+
+	node = malloc(sizeof(fdt_node_t));
+	if (!node)
+		goto node_alloc_failed;
+
+	size = sizeof(fd_event_t **) * (1 << fdt->node_shift);
+	ev = malloc(size);
+	if (!ev)
+		goto events_alloc_failed;
+	memset(ev, 0, size);
+
+	size = sizeof(int) * (1 << fdt->node_shift);
+	ref_count = malloc(size);
+	if (!ref_count)
+		goto used_alloc_failed;
+	memset(ref_count, 0xff, size);
+
+	size = sizeof(uint16_t) * (1 << fdt->node_shift);
+	timestamps = malloc(size);
+	if (!timestamps)
+		goto ts_alloc_failed;
+	memset(timestamps, 0, size);
+
+	node->events = ev;
+	node->ref_count = ref_count;
+	node->next = NULL;
+	node->timestamps = timestamps;
+	if (!fdt->head) {
+		fdt->head = node;
+		fdt->tail = node;
+	} else {
+		fdt->tail->next = node;
+		fdt->tail = node;
+	}
+	fdt->max_fd += (1 << fdt->node_shift);
+	return node;
+
+ts_alloc_failed:
+	free(ref_count);
+used_alloc_failed:
+	free(ev);
+events_alloc_failed:
+	free(node);
+node_alloc_failed:
+	errprint("%s Error!!! max_fd: %d\n", __FUNCTION__, fdt->max_fd);
+	return NULL;
+}
+
+static void fdt_node_free_all(fd_table_t *fdt)
+{
+	fdt_node_t *node, *ptr;
+
+	node = fdt->head;
+	while (node) {
+		free(node->timestamps);
+		free(node->ref_count);
+		free(node->events);
+		ptr = node;
+		node = node->next;
+		free(ptr);
+	}
+}
+
+static fdt_node_t *fdt_node_expand(fd_table_t *fdt, int fd)
+{
+	fdt_node_t *node = NULL;
+	while (fd >= fdt->max_fd) {
+		node = fdt_node_alloc(fdt);
+		if (!node)
+			break;
+	}
+	return node;
+}
+
+static fdt_node_t *fdt_node_search(fd_table_t *fdt, int fd)
+{
+	fdt_node_t *node = NULL;
+	int pos = 0;
+	if (fd >= fdt->max_fd)
+		return NULL;
+	pos = fd >> fdt->node_shift;
+	node = fdt->head;
+	while (pos) {
+		if (!node) {
+			errprint(
+				"fdt node search ERROR! fd: %d, pos: %d, fdt->max_fd: %d\n",
+				fd, pos, fdt->max_fd);
+			return NULL;
+		}
+		node = node->next;
+		pos--;
+	}
+	return node;
+}
+
+static fd_table_t *fd_table_alloc(unsigned int node_shift)
+{
+	fd_table_t *fdt;
+	pthread_mutexattr_t mattr;
+
+	fdt = malloc(sizeof(fd_table_t));
+	if (!fdt)
+		return NULL;
+	fdt->head = NULL;
+	fdt->tail = NULL;
+	fdt->max_fd = 0;
+	fdt->node_shift = node_shift;
+	fdt->node_mask = (1 << node_shift) - 1;
+	fdt->freelist = NULL;
+	pthread_mutex_init(&fdt->list_lock, NULL);
+
+	pthread_mutexattr_init(&mattr);
+	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
+	pthread_mutex_init(&fdt->lock, &mattr);
+	return fdt;
+}
+
+static void fd_table_free(fd_table_t *fdt)
+{
+	if (!fdt)
+		return;
+	fdt_node_free_all(fdt);
+	free(fdt);
+	return;
+}
+
+static inline fd_event_t *fd_event_alloc(int fd, int epfd,
+					 struct epoll_event *event)
+{
+	fd_event_t *fde;
+	uint64_t *qdata;
+
+	fde = (fd_event_t *)malloc(sizeof(fd_event_t));
+	if (!fde)
+		return NULL;
+
+	fde->fd = fd;
+	fde->epfd = epfd;
+	fde->epev = *event;
+	fde->events = 0;
+	fde->node = NULL;
+	fde->next = NULL;
+	fde->timestamp = 0;
+	fde->service_key = 0;
+	__atomic_store_n(&fde->outdated, (uint16_t)0, __ATOMIC_RELEASE);
+
+	qdata = malloc(DEFAULT_QUEUE_SIZE * sizeof(uint64_t));
+	if (!qdata) {
+		errprint("malloc queue data failed\n");
+		goto malloc_error;
+	}
+	if (rpal_queue_init(&fde->q, qdata, DEFAULT_QUEUE_SIZE)) {
+		errprint("fde queue alloc failed, fd: %d\n", fd);
+		goto init_error;
+	}
+	return fde;
+
+init_error:
+	free(qdata);
+malloc_error:
+	free(fde);
+	return NULL;
+}
+
+static inline void fd_event_free(fd_event_t *fde)
+{
+	uint64_t *qdata;
+
+	if (!fde)
+		return;
+	qdata = rpal_queue_destroy(&fde->q);
+	free(qdata);
+	free(fde);
+	return;
+}
+
+static void fdt_freelist_insert(fd_table_t *fdt, fd_event_t *fde)
+{
+	if (!fde)
+		return;
+
+	pthread_mutex_lock(&fdt->list_lock);
+	if (fdt->freelist == NULL) {
+		fdt->freelist = fde;
+	} else {
+		fde->next = fdt->freelist;
+		fdt->freelist = fde;
+	}
+	pthread_mutex_unlock(&fdt->list_lock);
+}
+
+static void fdt_freelist_forcefree(fd_table_t *fdt, uint64_t service_key)
+{
+	fd_event_t *prev, *pos, *f_fde;
+	fdt_node_t *node;
+	int idx;
+
+	pthread_mutex_lock(&fdt->list_lock);
+	prev = NULL;
+	pos = fdt->freelist;
+	while (pos) {
+		idx = pos->fd & fdt->node_mask;
+		node = pos->node;
+		if (pos->service_key == service_key) {
+			__atomic_exchange_n(&node->ref_count[idx], FDE_FREEING,
+					    __ATOMIC_RELAXED);
+			if (!prev) {
+				fdt->freelist = pos->next;
+			} else {
+				prev->next = pos->next;
+			}
+			f_fde = pos;
+			pos = pos->next;
+			node->events[idx] = NULL;
+			__atomic_store_n(&node->ref_count[idx], -1,
+					 __ATOMIC_RELEASE);
+			fd_event_free(f_fde);
+		} else {
+			prev = pos;
+			pos = pos->next;
+		}
+	}
+	pthread_mutex_unlock(&fdt->list_lock);
+	return;
+}
+
+static void fdt_freelist_lazyfree(fd_table_t *fdt)
+{
+	fd_event_t *prev, *pos, *f_fde;
+	fdt_node_t *node;
+	int idx;
+	int expected;
+
+	pthread_mutex_lock(&fdt->list_lock);
+	prev = NULL;
+	pos = fdt->freelist;
+
+	while (pos) {
+		idx = pos->fd & fdt->node_mask;
+		// do lazyfree when ref_count less than 0
+		expected = FDE_AVAILABLE;
+		node = pos->node;
+		if (__atomic_compare_exchange_n(
+			    &node->ref_count[idx], &expected, FDE_FREEING, 1,
+			    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
+			if (!prev) {
+				fdt->freelist = pos->next;
+			} else {
+				prev->next = pos->next;
+			}
+			f_fde = pos;
+			pos = pos->next;
+			node->events[idx] = NULL;
+			__atomic_store_n(&node->ref_count[idx], -1,
+					 __ATOMIC_RELEASE);
+			fd_event_free(f_fde);
+		} else {
+			if (expected < 0) {
+				errprint("error ref: %d, fd: %d\n", expected,
+					 pos->fd);
+			}
+			prev = pos;
+			pos = pos->next;
+		}
+	}
+	pthread_mutex_unlock(&fdt->list_lock);
+	return;
+}
+
+static uint16_t fde_timestamp_get(fd_table_t *fdt, int fd)
+{
+	fdt_node_t *node;
+	int idx;
+
+	node = fdt_node_search(fdt, fd);
+	if (!node) {
+		return 0;
+	}
+	idx = fd & fdt->node_mask;
+	return node->timestamps[idx];
+}
+
+static void fd_event_put(fd_table_t *fdt, fd_event_t *fde);
+
+static fd_event_t *fd_event_get(fd_table_t *fdt, int fd)
+{
+	fd_event_t *fde = NULL;
+	fdt_node_t *node;
+	int idx;
+	int val = -1;
+	int expected;
+
+	node = fdt_node_search(fdt, fd);
+	if (!node) {
+		return NULL;
+	}
+	idx = fd & fdt->node_mask;
+
+retry:
+	val = __atomic_load_n(&node->ref_count[idx], __ATOMIC_ACQUIRE);
+	if (val < 0)
+		return NULL;
+	expected = val;
+	val++;
+	if (!__atomic_compare_exchange_n(&node->ref_count[idx], &expected, val,
+					 1, __ATOMIC_SEQ_CST,
+					 __ATOMIC_SEQ_CST)) {
+		if (expected >= 0) {
+			goto retry;
+		} else {
+			return NULL;
+		}
+	}
+	fde = node->events[idx];
+	if (!fde) {
+		errprint("error get: %d, fd: %d\n", val, fd);
+	} else {
+		if (__atomic_load_n(&fde->outdated, __ATOMIC_ACQUIRE)) {
+			fd_event_put(fdt, fde);
+			fde = NULL;
+		}
+	}
+	return fde;
+}
+
+static void fd_event_put(fd_table_t *fdt, fd_event_t *fde)
+{
+	int idx;
+	int val;
+
+	if (!fde)
+		return;
+
+	idx = fde->fd & fdt->node_mask;
+	val = __atomic_sub_fetch(&fde->node->ref_count[idx], 1,
+				 __ATOMIC_RELEASE);
+	if (val < 0) {
+		errprint("error put: %d, fd: %d\n", val, fde->fd);
+	}
+	return;
+}
+
+int rpal_access(void *addr, access_fn do_access, int *ret, va_list va);
+
+int rpal_access(void *addr, access_fn do_access, int *ret, va_list va)
+{
+	int func_ret;
+
+	func_ret = do_access(va);
+	if (ret) {
+		*ret = func_ret;
+	}
+	return RPAL_SUCCESS;
+}
+
+extern status_t rpal_access_warpper(void *addr, access_fn do_access, int *ret,
+				    va_list va);
+
+#define rpal_write_access_safety(ACCESS_FUNC, FUNC_RET, ...)                   \
+	({                                                                     \
+		status_t __access = RPAL_FAILURE;                              \
+		uint32_t old_pkru = 0;                                         \
+		old_pkru = rdpkru();                                           \
+		__access = rpal_read_access_safety(ACCESS_FUNC, FUNC_RET,      \
+						   ##__VA_ARGS__);             \
+		wrpkru(old_pkru);                                              \
+		__access;                                                      \
+	})
+
+status_t rpal_read_access_safety(access_fn do_access, int *ret, ...)
+{
+	rpal_sender_info_t *sender;
+	sender_context_t *sc;
+	rpal_error_code_t error;
+	status_t access = RPAL_FAILURE;
+	va_list args;
+
+	sender = current_rpal_sender();
+	if (!sender || !rpal_sender_inited(sender)) {
+		dbprint(RPAL_DEBUG_SENDER, "%s: sender(%d) do not init\n",
+			__FUNCTION__, getpid());
+		if (RPAL_FAILURE == rpal_sender_init(&error)) {
+			return RPAL_FAILURE;
+		}
+		sender = current_rpal_sender();
+	}
+	sc = &sender->sc;
+	sc->ec.magic = RPAL_ERROR_MAGIC;
+	va_start(args, ret);
+	access = rpal_access_warpper(&(sc->ec.ersp), do_access, ret, args);
+	va_end(args);
+	sc->ec.magic = 0;
+
+	return access;
+}
+
+static int64_t __do_rpal_uds_fdmap(int service_id, int connfd)
+{
+	struct rpal_uds_fdmap_arg arg;
+	int64_t res;
+	int ret;
+
+	arg.cfd = connfd;
+	arg.service_id = service_id;
+	arg.res = &res;
+	ret = rpal_ioctl(RPAL_IOCTL_UDS_FDMAP, (unsigned long)&arg);
+	if (ret < 0)
+		return RPAL_FAILURE;
+
+	return res;
+}
+
+static status_t do_rpal_uds_fdmap(va_list va)
+{
+	int64_t ret;
+	int sfd, cfd, sid;
+	struct rpal_thread_pool *srtp;
+	uint64_t stamp = 0;
+	uint64_t sid_fd;
+	uint64_t *rpalfd;
+	fd_event_t *fde;
+
+	sid_fd = va_arg(va, uint64_t);
+	rpalfd = va_arg(va, uint64_t *);
+
+	if (!rpalfd) {
+		return RPAL_FAILURE;
+	}
+	sid = get_high32(sid_fd);
+	cfd = get_low32(sid_fd);
+
+	ret = __do_rpal_uds_fdmap(sid, cfd);
+	if (ret < 0) {
+		errprint("%s failed %ld, cfd: %d\n", __FUNCTION__, ret, cfd);
+		return RPAL_FAILURE;
+	}
+
+	srtp = get_service_from_id(sid);
+	if (!srtp) {
+		errprint("%s INVALID service_id: %d\n", __FUNCTION__, sid);
+		return RPAL_FAILURE;
+	}
+	sfd = get_sfd(ret);
+	stamp = fde_timestamp_get(srtp->fdt, sfd);
+	ret |= (stamp << HIGH16_OFFSET);
+
+	fde = fd_event_get(threads_md.rtp->fdt, cfd);
+	if (!fde) {
+		errprint("%s get self fde error, fd: %d\n", __FUNCTION__, cfd);
+		goto out;
+	}
+	fde->service_key = srtp->service_key;
+	fd_event_put(threads_md.rtp->fdt, fde);
+out:
+	*rpalfd = ret;
+	return RPAL_SUCCESS;
+}
+
+int rpal_get_peer_rid(uint64_t sid_fd)
+{
+	int64_t ret;
+	int sid, cfd;
+	int rid;
+
+	sid = get_high32(sid_fd);
+	cfd = get_low32(sid_fd);
+
+	ret = __do_rpal_uds_fdmap(sid, cfd);
+	if (ret < 0) {
+		errprint("%s failed %ld, cfd: %d\n", __FUNCTION__, ret, cfd);
+		return RPAL_FAILURE;
+	}
+	rid = get_rid(ret);
+	return rid;
+}
+
+status_t rpal_uds_fdmap(uint64_t sid_fd, uint64_t *rpalfd)
+{
+	status_t ret = RPAL_FAILURE;
+	status_t access;
+	uint32_t old_pkru;
+
+	old_pkru = rdpkru();
+	wrpkru(old_pkru & RPAL_PKRU_BASE_CODE_READ);
+	access = rpal_read_access_safety(do_rpal_uds_fdmap, &ret, sid_fd,
+					 rpalfd);
+	wrpkru(old_pkru);
+	if (access == RPAL_FAILURE) {
+		return RPAL_FAILURE;
+	}
+	return ret;
+}
+
+static status_t fd_event_install(fd_table_t *fdt, int fd, int epfd,
+				 struct epoll_event *event)
+{
+	fdt_node_t *node;
+	fd_event_t *fde;
+	int idx;
+	int expected;
+
+	fde = fd_event_alloc(fd, epfd, event);
+	if (!fde) {
+		goto fde_error;
+	}
+	pthread_mutex_lock(&fdt->lock);
+	if (fd >= fdt->max_fd) {
+		node = fdt_node_expand(fdt, fd);
+	} else {
+		node = fdt_node_search(fdt, fd);
+	}
+	pthread_mutex_unlock(&fdt->lock);
+
+	if (!node) {
+		errprint("fd node search failed, fd: %d\n", fd);
+		goto node_error;
+	}
+	idx = fd & fdt->node_mask;
+	fdt_freelist_lazyfree(fdt);
+	expected = __atomic_load_n(&node->ref_count[idx], __ATOMIC_ACQUIRE);
+	if (expected != FDE_FREED) {
+		goto node_error;
+	}
+	fde->timestamp =
+		__atomic_add_fetch(&node->timestamps[idx], 1, __ATOMIC_RELEASE);
+	fde->node = node;
+	node->events[idx] = fde;
+	if (!__atomic_compare_exchange_n(&node->ref_count[idx], &expected,
+					 FDE_AVAILABLE, 1, __ATOMIC_SEQ_CST,
+					 __ATOMIC_SEQ_CST)) {
+		errprint("may override fd: %d, val: %d\n", fd, expected);
+		node->events[idx] = NULL;
+		goto node_error;
+	}
+	return RPAL_SUCCESS;
+
+node_error:
+	fd_event_free(fde);
+fde_error:
+	return RPAL_FAILURE;
+}
+
+static status_t fd_event_uninstall(fd_table_t *fdt, int fd)
+{
+	fd_event_t *fde;
+	fdt_node_t *node;
+	int idx;
+	int ret = RPAL_SUCCESS;
+	int expected;
+
+	node = fdt_node_search(fdt, fd);
+	if (!node) {
+		ret = RPAL_FAILURE;
+		goto out;
+	}
+	idx = fd & fdt->node_mask;
+	fde = node->events[idx];
+	if (!fde) {
+		ret = RPAL_FAILURE;
+		goto out;
+	}
+	expected = FDE_AVAILABLE;
+	__atomic_store_n(&fde->outdated, (uint16_t)1, __ATOMIC_RELEASE);
+	if (__atomic_compare_exchange_n(&node->ref_count[idx], &expected,
+					FDE_FREEING, 1, __ATOMIC_SEQ_CST,
+					__ATOMIC_SEQ_CST)) {
+		node->events[idx] = NULL;
+		__atomic_store_n(&node->ref_count[idx], -1, __ATOMIC_RELEASE);
+		fd_event_free(fde);
+	} else {
+		if (expected < FDE_AVAILABLE) {
+			errprint("error cnt: %d, fd: %d\n", expected, fde->fd);
+		}
+		// link this fde for free_head
+		fdt_freelist_insert(fdt, fde);
+	}
+
+out:
+	fdt_freelist_lazyfree(fdt);
+	return ret;
+}
+
+static status_t fd_event_modify(fd_table_t *fdt, int fd,
+				struct epoll_event *event)
+{
+	fd_event_t *fde;
+
+	fde = fd_event_get(fdt, fd);
+	if (!fde) {
+		errprint("fde MOD fd(%d) ERROR!\n", fd);
+		return RPAL_FAILURE;
+	}
+	fde->fd = fd;
+	fde->epev = *event;
+	fde->events = 0;
+	fd_event_put(fdt, fde);
+	return RPAL_SUCCESS;
+}
+
+static int rpal_receiver_info_create(struct rpal_thread_pool *rtp, int id)
+{
+	rpal_receiver_info_t *rri = &rtp->rris[id];
+
+	rri->ep_stack = fiber_ctx_alloc(NULL, NULL, DEFUALT_STACK_SIZE);
+	if (!rri->ep_stack)
+		return -1;
+
+	rri->trampoline = fiber_ctx_alloc(NULL, NULL, TRAMPOLINE_SIZE);
+	if (!rri->trampoline) {
+		fiber_ctx_free(rri->ep_stack);
+		return -1;
+	}
+
+	rri->rc = threads_md.rc + id;
+	rri->rc->receiver_id = id;
+	rri->rtp = rtp;
+
+	return 0;
+}
+
+static void rpal_receiver_info_destroy(rpal_receiver_info_t *rri)
+{
+	fiber_ctx_free(rri->ep_stack);
+	fiber_ctx_free(rri->trampoline);
+	return;
+}
+
+static struct rpal_thread_pool *rpal_thread_pool_create(int nr_threads,
+						   rpal_thread_metadata_t *rtm)
+{
+	void *p;
+	int i, j;
+	struct rpal_thread_pool *rtp;
+
+	if (rpal_inited())
+		goto out;
+	rtp = malloc(sizeof(struct rpal_thread_pool));
+	if (rtp == NULL) {
+		goto out;
+	}
+	threads_md.eventfds = malloc(nr_threads * sizeof(int));
+	if (threads_md.eventfds == NULL) {
+		goto eventfds_alloc_fail;
+	}
+	rtp->nr_threads = nr_threads;
+	rtp->pkey = -1;
+	p = malloc(nr_threads * sizeof(rpal_receiver_info_t));
+	if (p == NULL) {
+		goto rri_alloc_fail;
+	}
+	rtp->rris = p;
+	memset(p, 0, nr_threads * sizeof(rpal_receiver_info_t));
+
+	rtp->fdt = fd_table_alloc(DEFAULT_NODE_SHIFT);
+	if (!rtp->fdt) {
+		goto fdt_alloc_fail;
+	}
+
+	p = rpal_get_shared_page(rtm->epcpage_order);
+
+	if (!p)
+		goto page_alloc_fail;
+	rtm->rc = p;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (rpal_receiver_info_create(rtp, i)) {
+			for (j = 0; j < i; j++) {
+				rpal_receiver_info_destroy(&rtp->rris[j]);
+			}
+			goto rri_create_fail;
+		}
+	}
+	return rtp;
+
+rri_create_fail:
+	rpal_free_shared_page(rtm->rc, rtm->epcpage_order);
+page_alloc_fail:
+	fd_table_free(rtp->fdt);
+fdt_alloc_fail:
+	free(rtp->rris);
+rri_alloc_fail:
+	free(threads_md.eventfds);
+eventfds_alloc_fail:
+	free(rtp);
+out:
+	return NULL;
+}
+
+static void rpal_thread_pool_destory(rpal_thread_metadata_t *rtm)
+{
+	int i;
+	struct rpal_thread_pool *rtp;
+
+	if (!rpal_inited()) {
+		errprint("thread pool is not created.\n");
+		return;
+	}
+	pthread_mutex_destroy(&release_lock);
+	rtp = threads_md.rtp;
+	fd_table_free(rtp->fdt);
+	for (i = 0; i < rtp->nr_threads; ++i) {
+		rpal_receiver_info_destroy(&rtp->rris[i]);
+	}
+	rpal_free_shared_page(threads_md.rc, threads_md.epcpage_order);
+	free(rtp->rris);
+	free(threads_md.eventfds);
+	free(rtp);
+}
+
+static inline int rpal_receiver_inited(rpal_receiver_info_t *rri)
+{
+	if (!rri)
+		return 0;
+	return (rri->status != RPAL_RECEIVER_UNINITIALIZED);
+}
+
+static inline int rpal_receiver_available(rpal_receiver_info_t *rri)
+{
+	return (rri->status == RPAL_RECEIVER_AVAILABLE);
+}
+
+static int rpal_receiver_idx_get(void)
+{
+	return __atomic_fetch_add(&threads_md.rpal_receiver_idx, 1,
+				  __ATOMIC_RELAXED);
+}
+
+int rpal_receiver_init(void)
+{
+	int ret = 0;
+	int receiver_idx;
+	rpal_receiver_info_t *rri;
+
+	if (!rpal_inited()) {
+		errprint("thread pool is not created.\n");
+		goto error_out;
+	}
+
+	receiver_idx = rpal_receiver_idx_get();
+	if (receiver_idx >= threads_md.rtp->nr_threads) {
+		errprint(
+			"rpal thread pool size exceeded. thread_idx: %d, thread pool capacity: %d\n",
+			receiver_idx, threads_md.rtp->nr_threads);
+		goto error_out;
+	}
+
+	rri = threads_md.rtp->rris + receiver_idx;
+	rri->status = RPAL_RECEIVER_UNINITIALIZED;
+	rri->tid = syscall(SYS_gettid);
+	rri->tls_base = read_tls_base();
+
+	rpal_uevent_queue_init(&rri->ueventq, &rri->uqlock);
+
+	rri->rc->rpal_ep_poll_magic = 0;
+	rri->rc->receiver_state = RPAL_RECEIVER_STATE_RUNNING;
+	rri->rc->ep_pending = 0;
+	__atomic_store_n(&rri->rc->sender_state, RPAL_SENDER_STATE_RUNNING,
+			 __ATOMIC_RELAXED);
+	ret = rpal_register_receiver(rri);
+	if (ret < 0) {
+		errprint("rpal thread %ld register failed %d\n", rri->tid, ret);
+		goto error_out;
+	}
+	ret = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
+	if (ret < 0) {
+		errprint("rpal thread %ld eventfd failed %d\n", rri->tid,
+			 errno);
+		goto eventfd_failed;
+	}
+	threads_md.eventfds[receiver_idx] = ret;
+	rri->status = RPAL_RECEIVER_INITIALIZED;
+	return ret;
+
+eventfd_failed:
+	rpal_unregister_receiver();
+error_out:
+	return RPAL_FAILURE;
+}
+
+void rpal_receiver_exit(void)
+{
+	rpal_receiver_info_t *rri = current_rpal_thread();
+	int id, fd;
+
+	if (!rpal_receiver_inited(rri))
+		return;
+	rri->status = RPAL_RECEIVER_UNINITIALIZED;
+	id = rri->rc->receiver_id;
+	fd = threads_md.eventfds[id];
+	close(fd);
+	threads_md.eventfds[id] = 0;
+	rpal_unregister_receiver();
+	return;
+}
+
+static inline void set_task_context(volatile task_context_t *tc, void *src)
+{
+	fiber_stack_t *fstack = src;
+	tc->r15 = fstack->r15;
+	tc->r14 = fstack->r14;
+	tc->r13 = fstack->r13;
+	tc->r12 = fstack->r12;
+	tc->rbx = fstack->rbx;
+	tc->rbp = fstack->rbp;
+	tc->rip = fstack->rip;
+	tc->rsp = (unsigned long)(src + 0x40);
+}
+
+static transfer_t _syscall_epoll_wait(transfer_t t)
+{
+	rpal_receiver_info_t *rri = t.ud;
+	volatile receiver_context_t *rc = rri->rc;
+	long ret;
+
+	rc->rpal_ep_poll_magic = RPAL_EP_POLL_MAGIC;
+	ret = epoll_wait(rc->epfd, rc->ep_events, rc->maxevents,
+			 rc->timeout);
+	t = jump_fcontext(rri->main_ctx, (void *)ret);
+	return t;
+}
+
+extern void rpal_ret_critical(volatile receiver_context_t *rc,
+			      rpal_call_info_t *rci);
+
+static transfer_t syscall_epoll_wait(transfer_t t)
+{
+	rpal_receiver_info_t *rri = t.ud;
+	volatile receiver_context_t *rc = rri->rc;
+	rpal_call_info_t *rci = &rri->rci;
+	task_t *estk = rri->ep_stack;
+
+	set_task_context(&rri->rc->task_context, t.fctx);
+	rri->main_ctx = t.fctx;
+
+	rpal_ret_critical(rc, rci);
+
+	estk->fctx = make_fcontext(estk->sp, 0, NULL);
+	t = ontop_fcontext(rri->ep_stack->fctx, rri, _syscall_epoll_wait);
+	return t;
+}
+
+static inline int ep_kernel_events_available(volatile int *ep_pending)
+{
+	return (RPAL_KERNEL_PENDING &
+		__atomic_load_n(ep_pending, __ATOMIC_ACQUIRE));
+}
+
+static inline int ep_user_events_available(volatile int *ep_pending)
+{
+	return (RPAL_USER_PENDING &
+		__atomic_load_n(ep_pending, __ATOMIC_ACQUIRE));
+}
+
+static inline int rpal_ep_send_events(epoll_uevent_queue_t *uq, fd_table_t *fdt,
+				      volatile receiver_context_t *rc,
+				      struct epoll_event *events, int maxevents)
+{
+	int fd = -1;
+	int ret = 0;
+	int res = 0;
+	fd_event_t *fde = NULL;
+
+	__atomic_and_fetch(&rc->ep_pending, ~RPAL_USER_PENDING,
+			   __ATOMIC_ACQUIRE);
+	while (uevent_queue_len(uq) && ret < maxevents) {
+		fd = uevent_queue_del(uq);
+		if (fd == -1) {
+			errprint("uevent get failed\n");
+			continue;
+		}
+		fde = fd_event_get(fdt, fd);
+		if (!fde)
+			continue;
+		res = __atomic_exchange_n(&fde->events, 0, __ATOMIC_RELAXED);
+		res &= fde->epev.events;
+		if (res) {
+			events[ret].data = fde->epev.data;
+			events[ret].events = res;
+			ret++;
+		}
+		fd_event_put(fdt, fde);
+	}
+	if (uevent_queue_len(uq) || ret == maxevents) {
+		dbprint(RPAL_DEBUG_RECVER,
+			"uevent queue still have events, len: %d, ret: %d, maxevents: %d\n",
+			uevent_queue_len(uq), ret, maxevents);
+		__atomic_fetch_or(&rc->ep_pending, RPAL_USER_PENDING,
+				  __ATOMIC_RELAXED);
+	}
+	return ret;
+}
+
+extern void rpal_call_critical(volatile receiver_context_t *rc,
+			       rpal_receiver_info_t *rri);
+
+int rpal_epoll_wait(int epfd, struct epoll_event *events, int maxevents,
+		    int timeout)
+{
+	transfer_t t;
+	rpal_call_info_t *rci;
+	task_t *estk, *trampoline;
+	volatile receiver_context_t *rc;
+	epoll_uevent_queue_t *ueventq;
+	rpal_receiver_info_t *rri = current_rpal_thread();
+	long ret = 0;
+	unsigned int mxcsr = 0, fpucw = 0;
+
+	if (!rpal_receiver_inited(rri))
+		return epoll_wait(epfd, events, maxevents, timeout);
+
+	rc = rri->rc;
+	estk = rri->ep_stack;
+	trampoline = rri->trampoline;
+	rci = &rri->rci;
+	ueventq = &rri->ueventq;
+
+	rc->epfd = epfd;
+	rc->ep_events = events;
+	rc->maxevents = maxevents;
+	rc->timeout = timeout;
+
+	if (!rpal_receiver_available(rri)) {
+		rri->status = RPAL_RECEIVER_AVAILABLE;
+		estk->fctx = make_fcontext(estk->sp, 0, NULL);
+		SAVE_FPU(mxcsr, fpucw);
+		trampoline->fctx = make_fcontext(trampoline->sp, 0, NULL);
+		t = ontop_fcontext(trampoline->fctx, rri, syscall_epoll_wait);
+	} else {
+		// kernel pending events
+		if (ep_kernel_events_available(&rc->ep_pending)) {
+			rc->rpal_ep_poll_magic =
+				RPAL_EP_POLL_MAGIC; // clear KERNEL_PENDING
+			ret = epoll_wait(epfd, events, maxevents, 0);
+			rc->rpal_ep_poll_magic = 0;
+			goto send_user_events;
+		}
+		// user pending events
+		if (ep_user_events_available(&rc->ep_pending)) {
+			goto send_user_events;
+		}
+		SAVE_FPU(mxcsr, fpucw);
+		trampoline->fctx = make_fcontext(trampoline->sp, 0, NULL);
+		t = ontop_fcontext(trampoline->fctx, rri, syscall_epoll_wait);
+	}
+	rc->rpal_ep_poll_magic = 0;
+
+	/*
+     * Here is where sender starts after user context switch.
+     * The TLS may still be sender's. We should not do anything
+     * that may use TLS, otherwise the result cannot be controlled.
+     */
+
+	switch (rc->receiver_state & RPAL_RECEIVER_STATE_MASK) {
+	case RPAL_RECEIVER_STATE_RUNNING: // syscall kernel ret
+		ret = (long)t.ud;
+		break;
+	case RPAL_RECEIVER_STATE_KERNEL_RET: // receiver kernel ret
+		RESTORE_FPU(mxcsr, fpucw);
+		ret = (long)t.fctx;
+		break;
+	case RPAL_RECEIVER_STATE_CALL: // rpalcall user jmp
+		rci->sender_tls_base = read_tls_base();
+		rci->pkru = rdpkru();
+		write_tls_base(rri->tls_base);
+		wrpkru(rpal_pkey_to_pkru(rri->rtp->pkey));
+		rci->sender_fctx = t.fctx;
+		break;
+	default:
+		errprint("Error ep_status: %ld\n",
+			 rc->receiver_state & RPAL_RECEIVER_STATE_MASK);
+		return -1;
+	}
+
+send_user_events:
+	if (ret < maxevents && ret >= 0)
+		ret += rpal_ep_send_events(ueventq, rri->rtp->fdt, rc,
+					   events + ret, maxevents - ret);
+	return ret;
+}
+
+int rpal_epoll_wait_user(int epfd, struct epoll_event *events, int maxevents,
+			 int timeout)
+{
+	volatile receiver_context_t *rc;
+	epoll_uevent_queue_t *ueventq;
+	rpal_receiver_info_t *rri = current_rpal_thread();
+
+	if (!rpal_receiver_inited(rri))
+		return 0;
+
+	if (!rpal_receiver_available(rri))
+		return 0;
+
+	rc = rri->rc;
+	ueventq = &rri->ueventq;
+	if (ep_user_events_available(&rc->ep_pending)) {
+		return rpal_ep_send_events(ueventq, rri->rtp->fdt, rc, events,
+					   maxevents);
+	}
+	return 0;
+}
+
+int rpal_epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)
+{
+	fd_table_t *fdt;
+	int ret;
+
+	ret = epoll_ctl(epfd, op, fd, event);
+	if (ret || !rpal_inited()) {
+		return ret;
+	}
+	fdt = threads_md.rtp->fdt;
+	switch (op) {
+	case EPOLL_CTL_ADD:
+		if (event->events & EPOLLRPALINOUT_BITS) {
+			ret = fd_event_install(fdt, fd, epfd, event);
+			if (ret == RPAL_FAILURE)
+				goto install_error;
+		}
+		break;
+	case EPOLL_CTL_MOD:
+		fd_event_modify(fdt, fd, event);
+		break;
+	case EPOLL_CTL_DEL:
+		fd_event_uninstall(fdt, fd);
+		break;
+	}
+	return ret;
+install_error:
+	epoll_ctl(epfd, EPOLL_CTL_DEL, fd, event);
+	return RPAL_FAILURE;
+}
+
+static transfer_t set_fcontext(transfer_t t)
+{
+	sender_context_t *sc = t.ud;
+
+	set_task_context(&sc->task_context, t.fctx);
+	return t;
+}
+
+static void uq_lock(volatile uint64_t *uqlock, uint64_t key)
+{
+	uint64_t init = 0;
+
+	while (1) {
+		if (__atomic_compare_exchange_n(
+			    uqlock, &init, (1UL << 63 | key), 1,
+			    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
+			return;
+		asm volatile("rep; nop");
+		init = 0;
+	}
+}
+
+static void uq_unlock(volatile uint64_t *uqlock)
+{
+	__atomic_store_n(uqlock, (uint64_t)0, __ATOMIC_RELAXED);
+}
+
+static status_t do_rpal_call_jump(rpal_sender_info_t *rsi,
+				  rpal_receiver_info_t *rri,
+				  volatile receiver_context_t *rc)
+{
+	int desired, expected;
+	int64_t diff;
+
+WAKE_AGAIN:
+	desired = RPAL_BUILD_CALL_STATE(rsi->sc.sender_id,
+				    threads_md.service_id);
+	expected = RPAL_RECEIVER_STATE_WAIT;
+	if (__atomic_compare_exchange_n(&rc->receiver_state, &expected, desired, 1,
+					__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
+		__atomic_store_n(&rc->sender_state, RPAL_SENDER_STATE_CALL,
+				 __ATOMIC_RELAXED);
+		rsi->sc.start_time = _rdtsc();
+		ontop_fcontext(rri->main_ctx, &rsi->sc, set_fcontext);
+
+		if (__atomic_load_n(&rc->sender_state, __ATOMIC_RELAXED) ==
+			RPAL_SENDER_STATE_RUNNING) {
+			if (rc->receiver_state == RPAL_RECEIVER_STATE_LAZY_SWITCH)
+				read(-1, NULL, 0);
+			diff = _rdtsc() - rsi->sc.start_time;
+			rsi->sc.total_time += diff;
+			rri->rc->total_time += diff;
+			expected = desired;
+			desired = RPAL_RECEIVER_STATE_WAIT;
+			__atomic_compare_exchange_n(&rc->receiver_state, &expected,
+						    desired, 1,
+						    __ATOMIC_SEQ_CST,
+						    __ATOMIC_SEQ_CST);
+
+			if (ep_user_events_available(&rc->ep_pending)) {
+				goto WAKE_AGAIN;
+			}
+		}
+		dbprint(RPAL_DEBUG_SENDER, "app return: 0x%x, %d, %d\n",
+			rc->receiver_state, rc->sender_state, sfd);
+	}
+	return RPAL_SUCCESS;
+}
+
+static inline void set_fde_trigger(fd_event_t *fde)
+{
+	__atomic_store_n(&fde->wait, FDE_TRIGGER_OUT, __ATOMIC_RELEASE);
+	return;
+}
+
+static inline int clear_fde_trigger(fd_event_t *fde)
+{
+	int expected = FDE_TRIGGER_OUT;
+
+	return __atomic_compare_exchange_n(&fde->wait, &expected,
+					   FDE_NO_TRIGGER, 1, __ATOMIC_SEQ_CST,
+					   __ATOMIC_SEQ_CST);
+}
+
+static int do_rpal_call(va_list va)
+{
+	rpal_sender_info_t *rsi;
+	rpal_receiver_info_t *rri;
+	fd_event_t *fde;
+	volatile receiver_context_t *rc;
+	struct rpal_thread_pool *srtp;
+	uint16_t stamp;
+	uint8_t rid;
+	int sfd;
+	int ret = 0;
+	int fall = 0;
+	int pkey;
+
+	int service_id = va_arg(va, int);
+	uint64_t rpalfd = va_arg(va, uint64_t);
+	int64_t *ptrs = va_arg(va, int64_t *);
+	int len = va_arg(va, int);
+	int flags = va_arg(va, int);
+
+	rsi = current_rpal_sender();
+	if (!rsi) {
+		ret = RPAL_INVAL_THREAD;
+		goto ERROR;
+	}
+	srtp = get_service_from_id(service_id);
+	if (!srtp) {
+		ret = RPAL_INVAL_SERVICE;
+		goto ERROR;
+	}
+	pkey = get_service_pkey_from_id(service_id);
+
+	rid = get_rid(rpalfd);
+	sfd = get_sfd(rpalfd);
+	wrpkru(rpal_pkru_union(rdpkru(), rpal_pkey_to_pkru(pkey)));
+	rri = srtp->rris + rid;
+	if (!rri) {
+		errprint("INVALID rid: %u, rri is NULL\n", rid);
+		ret = RPAL_INVALID_ARG;
+		goto ERROR;
+	}
+	rc = rri->rc;
+	rsi->sc.ec.tls_base = rri->tls_base;
+
+	fde = fd_event_get(srtp->fdt, sfd);
+	if (!fde) {
+		ret = RPAL_INVALID_ARG;
+		goto ERROR;
+	}
+	stamp = get_fdtimestamp(rpalfd);
+	if (fde->timestamp != stamp) {
+		ret = RPAL_FDE_OUTDATED;
+		goto FDE_PUT;
+	}
+
+	uq_lock(&rri->uqlock, threads_md.service_key);
+	if (uevent_queue_len(&rri->ueventq) == MAX_RDY) {
+		errprint("rdylist is full: [%u, %u]\n", rri->ueventq.l_beg,
+			 rri->ueventq.l_end);
+		ret = RPAL_CACHE_FULL;
+		goto UNLOCK;
+	}
+	if (likely(flags & RCALL_IN)) {
+		if (unlikely(rpal_queue_unused(&fde->q) < (uint32_t)len)) {
+			set_fde_trigger(fde);
+			fall = 1;
+			/* fall through: try to put data to queue */
+		}
+		ret = rpal_queue_put(&fde->q, ptrs, len);
+		if (ret != len) {
+			errprint("fde queue put error: %d, data: %lx\n", ret,
+				 (unsigned long)fde->q.data);
+			ret = RPAL_QUEUE_PUT_FAILED;
+			goto UNLOCK;
+		}
+		if (unlikely(fall)) {
+			clear_fde_trigger(fde);
+		}
+		fde->events |= EPOLLRPALIN;
+	} else if (unlikely(flags & RCALL_OUT)) {
+		ret = 0;
+		fde->events |= EPOLLRPALOUT;
+	} else {
+		errprint("rpal call failed, ptrs: %lx, len: %d",
+			 (unsigned long)ptrs, len);
+		ret = RPAL_INVALID_ARG;
+		goto UNLOCK;
+	}
+
+	uevent_queue_add(&rri->ueventq, sfd);
+	uq_unlock(&rri->uqlock);
+	fd_event_put(srtp->fdt, fde);
+
+	__atomic_fetch_or(&rc->ep_pending, RPAL_USER_PENDING,
+			  __ATOMIC_RELEASE);
+	do_rpal_call_jump(rsi, rri, rc);
+	return ret;
+
+UNLOCK:
+	uq_unlock(&rri->uqlock);
+FDE_PUT:
+	fd_event_put(srtp->fdt, fde);
+ERROR:
+	return -ret;
+}
+
+static int __rpal_write_ptrs_common(int service_id, uint64_t rpalfd,
+				    int64_t *ptrs, int len, int flags)
+{
+	int ret = RPAL_FAILURE;
+	status_t access = RPAL_FAILURE;
+
+	if (unlikely(NULL == ptrs)) {
+		dbprint(RPAL_DEBUG_SENDER, "%s: ptrs is NULL\n", __FUNCTION__);
+		return -RPAL_INVALID_ARG;
+	}
+	if (unlikely(len <= 0 || ((uint32_t)len) > DEFAULT_QUEUE_SIZE)) {
+		dbprint(RPAL_DEBUG_SENDER,
+			"%s: data len less than or equal to zero\n",
+			__FUNCTION__);
+		return -RPAL_INVALID_ARG;
+	}
+
+	access = rpal_write_access_safety(do_rpal_call, &ret, service_id,
+					  rpalfd, ptrs, len, flags);
+	if (access == RPAL_FAILURE) {
+		return -RPAL_ERR_PEER_MEM;
+	}
+	return ret;
+}
+
+int rpal_write_ptrs(int service_id, uint64_t rpalfd, int64_t *ptrs, int len)
+{
+	return __rpal_write_ptrs_common(service_id, rpalfd, ptrs, len,
+					RCALL_IN);
+}
+
+int rpal_read_ptrs(int fd, int64_t *dptrs, int len)
+{
+	fd_event_t *fde;
+	fd_table_t *fdt = threads_md.rtp->fdt;
+	int ret;
+
+	if (!rpal_inited())
+		return -1;
+
+	fde = fd_event_get(fdt, fd);
+	if (!fde)
+		return -1;
+
+	ret = rpal_queue_get(&fde->q, dptrs, len);
+	fd_event_put(fdt, fde);
+	return ret;
+}
+
+int rpal_read_ptrs_trigger_out(int fd, int64_t *dptrs, int len, int service_id,
+			       uint64_t rpalfd)
+{
+	fd_event_t *fde;
+	fd_table_t *fdt = threads_md.rtp->fdt;
+	int access, ret = -1;
+	int nread;
+
+	if (!rpal_inited())
+		return -1;
+
+	fde = fd_event_get(fdt, fd);
+	if (!fde)
+		return -1;
+
+	nread = rpal_queue_get(&fde->q, dptrs, len);
+	if (nread > 0 && clear_fde_trigger(fde)) {
+		access =
+			rpal_write_access_safety(do_rpal_call, &ret, service_id,
+						 rpalfd, NULL, 0, RCALL_OUT);
+		if (access == RPAL_FAILURE || ret < 0) {
+			set_fde_trigger(fde);
+			errprint(
+				"trigger out failed! access: %d, ret: %d, id: %d, rpalfd: %lx\n",
+				access, ret, service_id, rpalfd);
+		}
+	}
+	fd_event_put(fdt, fde);
+
+	return nread;
+}
+
+static inline int pkey_is_invalid(const int pkey)
+{
+	return (pkey < 0 || pkey > 15);
+}
+
+static status_t rpal_thread_metadata_init(int nr_rpalthread,
+					  rpal_error_code_t *error)
+{
+	uint64_t key;
+	struct rpal_thread_pool *rtp;
+	key = __rpal_get_service_key();
+	if (key >= 1UL << 63) {
+		ERRREPORT(
+			error, RPAL_ERR_SERVICE_KEY,
+			"rpal service key error. Service key: 0x%lx, oeverflow, should less than 2^63\n",
+			key);
+		goto error_out;
+	}
+	threads_md.service_key = key;
+	threads_md.service_id = __rpal_get_service_id();
+	pthread_mutex_init(&release_lock, NULL);
+	rpal_get_critical_addr(&rcs);
+	rtp = rpal_thread_pool_create(nr_rpalthread, &threads_md);
+	if (rtp == NULL) {
+		goto error_out;
+	}
+	rtp->service_key = threads_md.service_key;
+	rtp->service_id = threads_md.service_id;
+	threads_md.rtp = rtp;
+	if (rpal_enable_service(error) == RPAL_FAILURE)
+		goto destroy_thread_pool;
+	threads_md.pid = getpid();
+	return RPAL_SUCCESS;
+
+destroy_thread_pool:
+	rpal_thread_pool_destory(&threads_md);
+error_out:
+	return RPAL_FAILURE;
+}
+
+static void rpal_thread_metadata_exit(void)
+{
+	rpal_disable_service();
+	rpal_thread_pool_destory(&threads_md);
+}
+
+static status_t rpal_senders_metadata_init(rpal_error_code_t *error)
+{
+	if (senders_md) {
+		ERRREPORT(error, RPAL_ERR_SENDERS_METADATA,
+			  "senders metadata is already initialized.\n");
+		return RPAL_FAILURE;
+	}
+
+	senders_md = malloc(sizeof(struct rpal_senders_metadata));
+	if (!senders_md) {
+		ERRREPORT(error, RPAL_ERR_NOMEM,
+			  "senders metadata alloc failed.\n");
+		goto sendes_alloc_failed;
+	}
+	senders_md->sdpage_order = SENDERS_PAGE_ORDER;
+	memset(senders_md->bitmap, 0xFF,
+	       sizeof(unsigned long) * BITS_TO_LONGS(MAX_SENDERS));
+	pthread_mutex_init(&senders_md->lock, NULL);
+	senders_md->senders = rpal_get_shared_page(senders_md->sdpage_order);
+	if (!senders_md->senders) {
+		ERRREPORT(error, RPAL_ERR_SENDER_PAGES,
+			  "get senders share page error.\n");
+		goto pages_alloc_failed;
+	}
+	dbprint(RPAL_DEBUG_MANAGEMENT, "senders pages addr: 0x%016lx\n",
+		(unsigned long)senders_md->senders);
+	return RPAL_SUCCESS;
+
+pages_alloc_failed:
+	free(senders_md);
+sendes_alloc_failed:
+	return RPAL_FAILURE;
+}
+
+static void rpal_senders_metadata_exit(void)
+{
+	if (!senders_md)
+		return;
+
+	rpal_free_shared_page((void *)senders_md->senders,
+			      senders_md->sdpage_order);
+	pthread_mutex_destroy(&senders_md->lock);
+	free(senders_md);
+}
+
+static int rpal_get_version_cap(rpal_capability_t *version)
+{
+	return rpal_ioctl(RPAL_IOCTL_GET_API_VERSION_AND_CAP,
+			     (unsigned long)version);
+}
+
+static status_t rpal_version_check(rpal_capability_t *ver)
+{
+	if (ver->compat_version != MIN_RPAL_KERNEL_API_VERSION)
+		return RPAL_FAILURE;
+	if (ver->api_version < TARGET_RPAL_KERNEL_API_VERSION)
+		return RPAL_FAILURE;
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_capability_check(rpal_capability_t *ver)
+{
+	unsigned long cap = ver->cap;
+
+	if (!(cap & (1 << RPAL_CAP_PKU))) {
+		return RPAL_FAILURE;
+	}
+	return RPAL_SUCCESS;
+}
+
+static status_t rpal_check_version_cap(rpal_error_code_t *error)
+{
+	int ret;
+
+	ret = rpal_get_version_cap(&version);
+	if (ret < 0) {
+		ERRREPORT(error, RPAL_ERR_GET_CAP_VERSION,
+			  "rpal get version failed: %d\n", ret);
+		ret = RPAL_FAILURE;
+		goto out;
+	}
+	ret = rpal_version_check(&version);
+	if (ret == RPAL_FAILURE) {
+		ERRREPORT(
+			error, RPAL_KERNEL_API_NOTSUPPORT,
+			"kernel rpal(version: %d-%d) API is not compatible with librpal(version: %d-%d)\n",
+			version.compat_version, version.api_version,
+			MIN_RPAL_KERNEL_API_VERSION,
+			TARGET_RPAL_KERNEL_API_VERSION);
+		goto out;
+	}
+	ret = rpal_capability_check(&version);
+	if (ret == RPAL_FAILURE) {
+		ERRREPORT(error, RPAL_HARDWARE_NOTSUPPORT,
+			  "hardware do not support RPAL\n");
+		goto out;
+	}
+out:
+	return ret;
+}
+
+static status_t rpal_mgtfd_init(rpal_error_code_t *error)
+{
+	int err, n;
+	int mgtfd;
+	char name[1024];
+
+	mgtfd = open(RPAL_MGT_FILE, O_RDWR);
+	if (mgtfd == -1) {
+		err = errno;
+		switch (err) {
+		case EPERM:
+			n = readlink("/proc/self/exe", name, sizeof(name) - 1);
+			if (n < 0) {
+				n = 0;
+			}
+			name[n] = 0;
+			errprint("%s is not a RPAL binary\n", name);
+			break;
+		case ENOENT:
+			errprint("Not in RPAL Environment\n");
+			break;
+		default:
+			errprint("open %s fail, %d, %s\n", RPAL_MGT_FILE, err,
+				 strerror(err));
+		}
+		if (error) {
+			*error = RPAL_ERR_RPALFILE_OPS;
+		}
+		return RPAL_FAILURE;
+	}
+	rpal_mgtfd = mgtfd;
+	return RPAL_SUCCESS;
+}
+
+static void rpal_mgtfd_destroy(void)
+{
+	if (rpal_mgtfd != -1) {
+		close(rpal_mgtfd);
+	}
+	return;
+}
+
+#define RPAL_SECTION_SIZE (512 * 1024 * 1024 * 1024UL)
+
+static inline status_t rpal_check_address(uint64_t start, uint64_t end,
+					  uint64_t check)
+{
+	if (check >= start && check < end) {
+		return RPAL_SUCCESS;
+	}
+	return RPAL_FAILURE;
+}
+
+static status_t rpal_managment_init(rpal_error_code_t *error)
+{
+	int i = 0;
+
+	if (rpal_mgtfd_init(error) == RPAL_FAILURE) {
+		goto mgtfd_init_failed;
+	}
+	if (pthread_key_create(&rpal_key, NULL))
+		goto rpal_key_failed;
+
+	for (i = 0; i < MAX_SERVICEID; i++) {
+		requested_services[i].key = 0;
+		requested_services[i].service = NULL;
+		requested_services[i].pkey = -1;
+	}
+	if (rpal_check_version_cap(error) == RPAL_FAILURE) {
+		goto rpal_check_failed;
+	}
+	return RPAL_SUCCESS;
+
+rpal_check_failed:
+	pthread_key_delete(rpal_key);
+rpal_key_failed:
+	rpal_mgtfd_destroy();
+mgtfd_init_failed:
+	return RPAL_FAILURE;
+}
+
+static void rpal_managment_exit(void)
+{
+	pthread_key_delete(rpal_key);
+	rpal_mgtfd_destroy();
+	return;
+}
+
+int rpal_init(int nr_rpalthread, int flags, rpal_error_code_t *error)
+{
+	if (nr_rpalthread <= 0) {
+		dbprint(RPAL_DEBUG_MANAGEMENT,
+			"%s: nr_rpalthread(%d) less than or equal to 0\n",
+			__FUNCTION__, nr_rpalthread);
+		return RPAL_FAILURE;
+	}
+	if (rpal_managment_init(error) == RPAL_FAILURE) {
+		goto error_out;
+	}
+	if (rpal_thread_metadata_init(nr_rpalthread, error) == RPAL_FAILURE)
+		goto managment_exit;
+
+	if (rpal_senders_metadata_init(error) == RPAL_FAILURE)
+		goto thread_md_exit;
+
+	inited = 1;
+	dbprint(RPAL_DEBUG_MANAGEMENT,
+		"rpal init success, service key: 0x%lx, service id: %d, "
+		"critical_start: 0x%016lx, critical_end: 0x%016lx\n",
+		threads_md.service_key, threads_md.service_id, rcs.ret_begin,
+		rcs.ret_end);
+	return rpal_mgtfd;
+
+thread_md_exit:
+	rpal_thread_metadata_exit();
+managment_exit:
+	rpal_managment_exit();
+error_out:
+	return RPAL_FAILURE;
+}
+
+void rpal_exit(void)
+{
+	if (rpal_inited()) {
+		dbprint(RPAL_DEBUG_MANAGEMENT,
+			"rpal exit, service key: 0x%lx, service id: %d\n",
+			threads_md.service_key, threads_md.service_id);
+		rpal_senders_metadata_exit();
+		rpal_thread_metadata_exit();
+		rpal_managment_exit();
+	}
+}
diff --git a/samples/rpal/librpal/rpal.h b/samples/rpal/librpal/rpal.h
new file mode 100644
index 000000000000..e91a206b8370
--- /dev/null
+++ b/samples/rpal/librpal/rpal.h
@@ -0,0 +1,149 @@
+#ifndef RPAL_H_INCLUDED
+#define RPAL_H_INCLUDED
+
+#ifdef __cplusplus
+#if __cplusplus
+extern "C" {
+#endif
+#endif /* __cplusplus */
+
+#include <stdint.h>
+#include <stdarg.h>
+#include <sys/epoll.h>
+
+typedef enum rpal_error_code {
+	RPAL_ERR_NONE = 0,
+	RPAL_ERR_BAD_ARG = 1,
+	RPAL_ERR_NO_SERVICE = 2,
+	RPAL_ERR_MAPPED = 3,
+	RPAL_ERR_RETRY = 4,
+	RPAL_ERR_BAD_SERVICE_STATUS = 5,
+	RPAL_ERR_BAD_THREAD_STATUS = 6,
+	RPAL_ERR_REACH_LIMIT = 7,
+	RPAL_ERR_NOMEM = 8,
+	RPAL_ERR_NOMAPPING = 9,
+	RPAL_ERR_INVAL = 10,
+
+	RPAL_ERR_KERNEL_MAX_CODE = 100,
+
+	RPAL_ERR_RPALFILE_OPS, /**< Failed to open /proc/self/rpal */
+	RPAL_ERR_RPAL_DISABLED,
+	RPAL_ERR_GET_CAP_VERSION,
+	RPAL_KERNEL_API_NOTSUPPORT,
+	RPAL_HARDWARE_NOTSUPPORT,
+	RPAL_ERR_SERVICE_KEY, /**< Failed to get service key */
+	RPAL_ERR_SENDERS_METADATA,
+	RPAL_ERR_ENABLE_SERVICE,
+	RPAL_ERR_SENDER_PAGES,
+	RPAL_DONT_INITED,
+	RPAL_ERR_SENDER_INIT,
+	RPAL_ERR_SENDER_REG,
+	RPAL_INVALID_ARG,
+	RPAL_CACHE_FULL,
+	RPAL_FDE_OUTDATED,
+	RPAL_QUEUE_PUT_FAILED,
+	RPAL_ERR_PEER_MEM,
+	RPAL_ERR_NOTIFY_RECVER,
+	RPAL_INVAL_THREAD,
+	RPAL_INVAL_SERVICE,
+} rpal_error_code_t;
+
+#define EPOLLRPALIN 0x00020000
+#define EPOLLRPALOUT 0x00040000
+
+typedef enum rpal_features {
+	RPAL_SENDER_RECEIVER = 0x1 << 0,
+} rpal_features_t;
+
+typedef enum status {
+	RPAL_FAILURE = -1, /**< return value indicating failure */
+	RPAL_SUCCESS /**< return value indicating success */
+} status_t;
+
+#define RPAL_PUBLIC __attribute__((visibility("default")))
+
+RPAL_PUBLIC
+int rpal_init(int nr_rpalthread, int flags, rpal_error_code_t *error);
+
+RPAL_PUBLIC
+void rpal_exit(void);
+
+RPAL_PUBLIC
+int rpal_receiver_init(void);
+
+RPAL_PUBLIC
+void rpal_receiver_exit(void);
+
+RPAL_PUBLIC
+int rpal_request_service(uint64_t key);
+
+RPAL_PUBLIC
+status_t rpal_release_service(uint64_t key);
+
+RPAL_PUBLIC
+status_t rpal_clean_service_start(int64_t *ptr);
+
+RPAL_PUBLIC
+void rpal_clean_service_end(int64_t *ptr);
+
+RPAL_PUBLIC
+int rpal_get_service_id(void);
+
+RPAL_PUBLIC
+status_t rpal_get_service_key(uint64_t *service_key);
+
+RPAL_PUBLIC
+int rpal_get_request_service_id(uint64_t key);
+
+RPAL_PUBLIC
+status_t rpal_uds_fdmap(uint64_t sid_fd, uint64_t *rpalfd);
+
+RPAL_PUBLIC
+int rpal_get_peer_rid(uint64_t sid_fd);
+
+RPAL_PUBLIC
+status_t rpal_sender_init(rpal_error_code_t *error);
+
+RPAL_PUBLIC
+status_t rpal_sender_exit(void);
+
+/* Hook epoll syscall */
+RPAL_PUBLIC
+int rpal_epoll_wait(int epfd, struct epoll_event *events, int maxevents,
+		    int timeout);
+
+RPAL_PUBLIC
+int rpal_epoll_wait_user(int epfd, struct epoll_event *events, int maxevents,
+			 int timeout);
+
+RPAL_PUBLIC
+int rpal_epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
+
+RPAL_PUBLIC
+status_t rpal_copy_prepare(int service_id);
+
+RPAL_PUBLIC
+status_t rpal_copy_finish(void);
+
+RPAL_PUBLIC
+int rpal_write_ptrs(int service_id, uint64_t rpalfd, int64_t *ptrs, int len);
+
+RPAL_PUBLIC
+int rpal_read_ptrs(int fd, int64_t *ptrs, int len);
+
+typedef int (*access_fn)(va_list args);
+RPAL_PUBLIC
+status_t rpal_read_access_safety(access_fn do_access, int *do_access_ret, ...);
+
+RPAL_PUBLIC
+void rpal_recver_count_print(void);
+
+RPAL_PUBLIC
+void rpal_sender_count_print(void);
+
+#ifdef __cplusplus
+#if __cplusplus
+}
+#endif
+#endif
+#endif //!_RPAL_H_INCLUDED
diff --git a/samples/rpal/librpal/rpal_pkru.h b/samples/rpal/librpal/rpal_pkru.h
new file mode 100644
index 000000000000..9590aa7203bb
--- /dev/null
+++ b/samples/rpal/librpal/rpal_pkru.h
@@ -0,0 +1,78 @@
+#include <x86intrin.h>
+#include "private.h"
+
+#define RPAL_PKRU_BASE_CODE_READ 0xAAAAAAAA
+#define RPAL_PKRU_BASE_CODE 0xFFFFFFFF
+#define RPAL_NO_PKEY -1
+
+typedef uint32_t u32;
+/*
+ * extern __inline unsigned int
+ * __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+ * _rdpkru_u32 (void)
+ * {
+ *   return __builtin_ia32_rdpkru ();
+ * }
+ *
+ * extern __inline void
+ * __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+ * _wrpkru (unsigned int __key)
+ * {
+ *   __builtin_ia32_wrpkru (__key);
+ * }
+ */
+// #define rdpkru _rdpkru_u32
+// #define wrpkru _wrpkru
+static inline uint32_t rdpkru(void)
+{
+	uint32_t ecx = 0;
+	uint32_t edx, pkru;
+
+	/*
+	 * "rdpkru" instruction.  Places PKRU contents in to EAX,
+	 * clears EDX and requires that ecx=0.
+	 */
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a"(pkru), "=d"(edx)
+		     : "c"(ecx));
+	return pkru;
+}
+
+static inline void wrpkru(uint32_t pkru)
+{
+	uint32_t ecx = 0, edx = 0;
+
+	/*
+	 * "wrpkru" instruction.  Loads contents in EAX to PKRU,
+	 * requires that ecx = edx = 0.
+	 */
+	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+		     :
+		     : "a"(pkru), "c"(ecx), "d"(edx));
+}
+
+static inline u32 rpal_pkey_to_pkru(int pkey)
+{
+	int offset = pkey * 2;
+	u32 mask = 0x3 << offset;
+
+	return RPAL_PKRU_BASE_CODE & ~mask;
+}
+
+static inline u32 rpal_pkey_to_pkru_read(int pkey)
+{
+	int offset = pkey * 2;
+	u32 mask = 0x3 << offset;
+
+	return RPAL_PKRU_BASE_CODE_READ & ~mask;
+}
+
+static inline u32 rpal_pkru_union(u32 pkru0, u32 pkru1)
+{
+	return pkru0 & pkru1;
+}
+
+static inline u32 rpal_pkru_intersect(u32 pkru0, u32 pkru1)
+{
+	return pkru0 | pkru1;
+}
diff --git a/samples/rpal/librpal/rpal_queue.c b/samples/rpal/librpal/rpal_queue.c
new file mode 100644
index 000000000000..07a90122aa16
--- /dev/null
+++ b/samples/rpal/librpal/rpal_queue.c
@@ -0,0 +1,239 @@
+#include "rpal_queue.h"
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <assert.h>
+
+#define min(X, Y) ({ ((X) > (Y)) ? (Y) : (X); })
+
+static unsigned int roundup_pow_of_two(unsigned int data)
+{
+	unsigned int msb_position;
+
+	if (data <= 1)
+		return 1;
+	if (!(data & (data - 1)))
+		return data;
+
+	msb_position = 31 - __builtin_clz(data);
+	assert(msb_position < 31);
+	return 1 << (msb_position + 1);
+}
+
+QUEUE_UINT rpal_queue_unused(rpal_queue_t *q)
+{
+	return (q->mask + 1) - (q->tail - q->head);
+}
+
+QUEUE_UINT rpal_queue_len(rpal_queue_t *q)
+{
+	return (q->tail - q->head);
+}
+
+int rpal_queue_init(rpal_queue_t *q, void *data, QUEUE_UINT_INC usize)
+{
+	QUEUE_UINT_INC size;
+	if (usize > QUEUE_UINT_MAX || !data) {
+		return -1;
+	}
+	size = roundup_pow_of_two(usize);
+	if (usize != size) {
+		return -1;
+	}
+	q->data = data;
+	memset(q->data, 0, size * sizeof(int64_t));
+	q->head = 0;
+	q->tail = 0;
+	q->mask = size - 1;
+	return 0;
+}
+
+void *rpal_queue_destroy(rpal_queue_t *q)
+{
+	void *data = q->data;
+	if (q->data) {
+		q->data = NULL;
+	}
+	q->mask = 0;
+	q->head = 0;
+	q->tail = 0;
+	return data;
+}
+
+int rpal_queue_alloc(rpal_queue_t *q, QUEUE_UINT_INC size)
+{
+	assert(q && size);
+	if (size > QUEUE_UINT_MAX) {
+		return -1;
+	}
+	size = roundup_pow_of_two(size);
+	q->data = malloc(size * sizeof(int64_t));
+	if (!q->data)
+		return -1;
+	memset(q->data, 0, size * sizeof(int64_t));
+	q->head = 0;
+	q->tail = 0;
+	q->mask = size - 1;
+	return 0;
+}
+
+void rpal_queue_free(rpal_queue_t *q)
+{
+	if (q->data) {
+		free(q->data);
+		q->data = NULL;
+	}
+	q->mask = 0;
+	q->head = 0;
+	q->tail = 0;
+}
+
+static void rpal_queue_copy_in(rpal_queue_t *q, const int64_t *buf,
+			       QUEUE_UINT_INC len, QUEUE_UINT off)
+{
+	QUEUE_UINT_INC l;
+	QUEUE_UINT_INC size = q->mask + 1;
+
+	off &= q->mask;
+	l = min(len, size - off);
+
+	memcpy(q->data + off, buf, l << 3);
+	memcpy(q->data, buf + l, (len - l) << 3);
+	asm volatile("" : : : "memory");
+}
+
+QUEUE_UINT_INC rpal_queue_put(rpal_queue_t *q, const int64_t *buf,
+			      QUEUE_UINT_INC len)
+{
+	QUEUE_UINT_INC l;
+
+	if (!q->data) {
+		return 0;
+	}
+	l = rpal_queue_unused(q);
+	if (len > l) {
+		return 0;
+	}
+	l = len;
+	rpal_queue_copy_in(q, buf, l, q->tail);
+	q->tail += l;
+	return l;
+}
+
+static QUEUE_UINT_INC rpal_queue_copy_out(rpal_queue_t *q, int64_t *buf,
+					  QUEUE_UINT_INC len, QUEUE_UINT head)
+{
+	unsigned int l;
+	QUEUE_UINT tail;
+	QUEUE_UINT off;
+	QUEUE_UINT_INC size = q->mask + 1;
+
+	tail = __atomic_load_n(&q->tail, __ATOMIC_RELAXED);
+	len = min((QUEUE_UINT)(tail - head), len);
+	if (head == tail)
+		return 0;
+	off = head & q->mask;
+	l = min(len, size - off);
+
+	memcpy(buf, q->data + off, l << 3);
+	memcpy(buf + l, q->data, (len - l) << 3);
+
+	return len;
+}
+
+QUEUE_UINT_INC rpal_queue_peek(rpal_queue_t *q, int64_t *buf,
+			       QUEUE_UINT_INC len, QUEUE_UINT *phead)
+{
+	QUEUE_UINT_INC copied;
+	QUEUE_UINT head;
+
+	head = __atomic_load_n(&q->head, __ATOMIC_RELAXED);
+	copied = rpal_queue_copy_out(q, buf, len, head);
+	if (phead) {
+		*phead = head;
+	}
+	return copied;
+}
+
+QUEUE_UINT_INC rpal_queue_skip(rpal_queue_t *q, QUEUE_UINT head,
+			       QUEUE_UINT_INC skip)
+{
+	if (skip > rpal_queue_len(q)) {
+		return 0;
+	}
+	if (__atomic_compare_exchange_n(&q->head, &head, head + skip, 1,
+					__ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
+		return skip;
+	}
+	return 0;
+}
+
+QUEUE_UINT_INC rpal_queue_get(rpal_queue_t *q, int64_t *buf, QUEUE_UINT_INC len)
+{
+	QUEUE_UINT_INC copied;
+	QUEUE_UINT head;
+
+	while (1) {
+		head = __atomic_load_n(&q->head, __ATOMIC_RELAXED);
+		copied = rpal_queue_copy_out(q, buf, len, head);
+		if (__atomic_compare_exchange_n(&q->head, &head, head + copied,
+						1, __ATOMIC_RELAXED,
+						__ATOMIC_RELAXED)) {
+			return copied;
+		}
+	}
+}
+
+void rpal_uevent_queue_init(epoll_uevent_queue_t *ueventq,
+			    volatile uint64_t *uqlock)
+{
+	int i;
+	__atomic_store_n(uqlock, (uint64_t)0, __ATOMIC_RELAXED);
+	ueventq->l_beg = 0;
+	ueventq->l_end = 0;
+	ueventq->l_end_cache = 0;
+	for (i = 0; i < MAX_RDY; ++i) {
+		ueventq->fds[i] = -1;
+	}
+	return;
+}
+
+QUEUE_UINT uevent_queue_len(epoll_uevent_queue_t *ueventq)
+{
+	return (ueventq->l_end - ueventq->l_beg);
+}
+
+QUEUE_UINT uevent_queue_add(epoll_uevent_queue_t *ueventq, int fd)
+{
+	unsigned int pos;
+	if (uevent_queue_len(ueventq) == MAX_RDY)
+		return MAX_RDY;
+	pos = __sync_fetch_and_add(&ueventq->l_end_cache, 1);
+	pos %= MAX_RDY;
+	ueventq->fds[pos] = fd;
+	asm volatile("" : : : "memory");
+	__sync_fetch_and_add(&ueventq->l_end, 1);
+	return (pos);
+}
+
+int uevent_queue_del(epoll_uevent_queue_t *ueventq)
+{
+	int fd = -1;
+	int pos;
+	if (uevent_queue_len(ueventq) == 0) {
+		return -1;
+	}
+	pos = ueventq->l_beg % MAX_RDY;
+	fd = ueventq->fds[pos];
+	asm volatile("" : : : "memory");
+	__sync_fetch_and_add(&ueventq->l_beg, 1);
+	return fd;
+}
+
+int uevent_queue_fix(epoll_uevent_queue_t *ueventq)
+{
+	__atomic_store_n(&ueventq->l_end_cache, ueventq->l_end,
+			 __ATOMIC_SEQ_CST);
+	return 0;
+}
diff --git a/samples/rpal/librpal/rpal_queue.h b/samples/rpal/librpal/rpal_queue.h
new file mode 100644
index 000000000000..224e7b449d50
--- /dev/null
+++ b/samples/rpal/librpal/rpal_queue.h
@@ -0,0 +1,55 @@
+#ifndef RPAL_QUEUE_H
+#define RPAL_QUEUE_H
+
+#include <stdint.h>
+
+// typedef uint8_t QUEUE_UINT;
+// typedef uint16_t QUEUE_UINT_INC;
+// #define QUEUE_UINT_MAX UINT8_MAX
+
+// typedef uint16_t QUEUE_UINT;
+// typedef uint32_t QUEUE_UINT_INC;
+// #define QUEUE_UINT_MAX UINT16_MAX
+
+typedef uint32_t QUEUE_UINT;
+typedef uint64_t QUEUE_UINT_INC;
+#define QUEUE_UINT_MAX UINT32_MAX
+
+typedef struct rpal_queue {
+	QUEUE_UINT head;
+	QUEUE_UINT tail;
+	QUEUE_UINT mask;
+	uint64_t *data;
+} rpal_queue_t;
+
+QUEUE_UINT rpal_queue_len(rpal_queue_t *q);
+QUEUE_UINT rpal_queue_unused(rpal_queue_t *q);
+int rpal_queue_init(rpal_queue_t *q, void *data, QUEUE_UINT_INC usize);
+void *rpal_queue_destroy(rpal_queue_t *q);
+int rpal_queue_alloc(rpal_queue_t *q, QUEUE_UINT_INC size);
+void rpal_queue_free(rpal_queue_t *q);
+QUEUE_UINT_INC rpal_queue_put(rpal_queue_t *q, const int64_t *buf,
+			      QUEUE_UINT_INC len);
+QUEUE_UINT_INC rpal_queue_get(rpal_queue_t *q, int64_t *buf,
+			      QUEUE_UINT_INC len);
+QUEUE_UINT_INC rpal_queue_peek(rpal_queue_t *q, int64_t *buf,
+			       QUEUE_UINT_INC len, QUEUE_UINT *phead);
+QUEUE_UINT_INC rpal_queue_skip(rpal_queue_t *q, QUEUE_UINT head,
+			       QUEUE_UINT_INC skip);
+
+#define MAX_RDY 4096
+typedef struct epoll_uevent_queue {
+	int fds[MAX_RDY];
+	volatile QUEUE_UINT l_beg;
+	volatile QUEUE_UINT l_end;
+	volatile QUEUE_UINT l_end_cache;
+} epoll_uevent_queue_t;
+
+void rpal_uevent_queue_init(epoll_uevent_queue_t *ueventq,
+			    volatile uint64_t *uqlock);
+QUEUE_UINT uevent_queue_len(epoll_uevent_queue_t *ueventq);
+QUEUE_UINT uevent_queue_add(epoll_uevent_queue_t *ueventq, int fd);
+int uevent_queue_del(epoll_uevent_queue_t *ueventq);
+int uevent_queue_fix(epoll_uevent_queue_t *ueventq);
+
+#endif
diff --git a/samples/rpal/librpal/rpal_x86_64_call_ret.S b/samples/rpal/librpal/rpal_x86_64_call_ret.S
new file mode 100644
index 000000000000..a7c09a1b033d
--- /dev/null
+++ b/samples/rpal/librpal/rpal_x86_64_call_ret.S
@@ -0,0 +1,45 @@
+#ifdef __x86_64__
+#define __ASSEMBLY__
+#include "asm_define.h"
+#define RPAL_SENDER_STATE_RUNNING $0x0
+#define RPAL_SENDER_STATE_CALL $0x1
+
+.text
+.globl rpal_ret_critical
+.type rpal_ret_critical,@function
+.align 16
+
+//void rpal_ret_criticalreceiver_context_t *rc, rpal_call_info_t *rci
+
+rpal_ret_critical:
+    mov     RPAL_SENDER_STATE_CALL, %eax
+    mov     RPAL_SENDER_STATE_RUNNING, %ecx
+    lock cmpxchg   %ecx, RC_SENDER_STATE(%rdi)
+ret_begin:
+    jne 2f
+    movq    RCI_PKRU(%rsi), %rax
+    xor     %edx, %edx
+    .byte 0x0f,0x01,0xef
+    movq    RCI_SENDER_TLS_BASE(%rsi), %rax
+    wrfsbase %rax
+ret_end:
+    movq    RCI_SENDER_FCTX(%rsi), %rdi
+    call    jump_fcontext@plt
+2:
+    ret
+
+.globl rpal_get_critical_addr
+.type rpal_get_critical_addr,@function
+.align 16
+rpal_get_critical_addr:
+    leaq    ret_begin(%rip), %rax
+    movq    %rax, RET_BEGIN(%rdi)
+    leaq    ret_end(%rip), %rax
+    movq    %rax, RET_END(%rdi)
+    ret
+
+.size rpal_ret_critical,.-rpal_ret_critical
+
+/* Mark that we don't need executable stack. */
+.section .note.GNU-stack,"",%progbits
+#endif
diff --git a/samples/rpal/offset.sh b/samples/rpal/offset.sh
new file mode 100755
index 000000000000..f5ae77b893e8
--- /dev/null
+++ b/samples/rpal/offset.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+
+set -e
+CUR_DIR=$(dirname $(realpath -s "$0"))
+gcc -masm=intel -S $CUR_DIR/asm_define.c -o - | awk '($1 == "->") { print "#define " $2 " " $3 }' > $CUR_DIR/librpal/asm_define.h
\ No newline at end of file
diff --git a/samples/rpal/server.c b/samples/rpal/server.c
new file mode 100644
index 000000000000..82c5c9dec922
--- /dev/null
+++ b/samples/rpal/server.c
@@ -0,0 +1,249 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <sys/epoll.h>
+#include <x86intrin.h>
+#include "librpal/rpal.h"
+
+#define SOCKET_PATH "/tmp/rpal_socket"
+#define MAX_EVENTS 10
+#define BUFFER_SIZE 1025
+#define MSG_LEN 32
+
+#define INIT_MSG "INIT"
+#define SUCC_MSG "SUCC"
+#define FAIL_MSG "FAIL"
+
+#define handle_error(s)                                                        \
+	do {                                                                   \
+		perror(s);                                                     \
+		exit(EXIT_FAILURE);                                            \
+	} while (0)
+
+uint64_t service_key;
+int server_fd;
+int epoll_fd;
+
+int rpal_epoll_add(int epfd, int fd)
+{
+	struct epoll_event ev;
+
+	ev.events = EPOLLRPALIN | EPOLLIN | EPOLLRDHUP | EPOLLET;
+	ev.data.fd = fd;
+
+	return rpal_epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
+}
+
+void rpal_server_init(int fd, int epoll_fd)
+{
+	char buffer[BUFFER_SIZE];
+	rpal_error_code_t err;
+	uint64_t remote_key, service_key;
+	int remote_id;
+	int proc_fd;
+	int ret;
+
+	proc_fd = rpal_init(1, 0, &err);
+	if (proc_fd < 0)
+		handle_error("rpal init fail");
+	rpal_get_service_key(&service_key);
+
+	rpal_epoll_add(epoll_fd, fd);
+
+	ret = read(fd, buffer, BUFFER_SIZE);
+	if (ret < 0)
+		handle_error("rpal init: read");
+
+	if (strncmp(buffer, INIT_MSG, strlen(INIT_MSG)) != 0) {
+		buffer[BUFFER_SIZE - 1] = 0;
+		handle_error("Invalid msg\n");
+		return;
+	}
+
+	remote_key = *(uint64_t *)(buffer + strlen(INIT_MSG));
+	ret = rpal_request_service(remote_key);
+	if (ret) {
+		uint64_t service_key = 0;
+		ret = write(fd, (char *)&service_key, sizeof(uint64_t));
+		handle_error("request service fail");
+		return;
+	}
+	ret = write(fd, (char *)&service_key, sizeof(uint64_t));
+	if (ret < 0)
+		handle_error("write error");
+
+	ret = read(fd, buffer, BUFFER_SIZE);
+	if (ret < 0)
+		handle_error("handshake read");
+
+	if (strncmp(SUCC_MSG, buffer, strlen(SUCC_MSG)) != 0)
+		handle_error("handshake");
+
+	remote_id = rpal_get_request_service_id(remote_key);
+	if (remote_id < 0)
+		handle_error("remote id get fail");
+	rpal_receiver_init();
+}
+
+void run_rpal_server(int msg_len)
+{
+	struct epoll_event ev, events[MAX_EVENTS];
+	int new_socket;
+	int nfds;
+	uint64_t tsc, total_tsc = 0;
+	int count = 0;
+
+	while (1) {
+		nfds = rpal_epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
+		if (nfds == -1) {
+			perror("epoll_wait");
+			exit(EXIT_FAILURE);
+		}
+
+		for (int n = 0; n < nfds; ++n) {
+			if (events[n].data.fd == server_fd) {
+				new_socket = accept(server_fd, NULL, NULL);
+				if (new_socket == -1) {
+					perror("accept");
+					continue;
+				}
+
+				rpal_server_init(new_socket, epoll_fd);
+			} else if (events[n].events & EPOLLRDHUP) {
+				close(events[n].data.fd);
+				goto finish;
+			} else if (events[n].events & EPOLLRPALIN) {
+				char buffer[BUFFER_SIZE] = { 0 };
+
+				ssize_t valread = rpal_read_ptrs(
+					events[n].data.fd, (int64_t *)buffer,
+					MSG_LEN / sizeof(int64_t *));
+				if (valread <= 0) {
+					close(events[n].data.fd);
+					epoll_ctl(epoll_fd, EPOLL_CTL_DEL,
+						  events[n].data.fd, NULL);
+					goto finish;
+				} else {
+					count++;
+					sscanf(buffer, "0x%016lx", &tsc);
+					total_tsc += __rdtsc() - tsc;
+					send(events[n].data.fd, buffer, msg_len,
+					     0);
+				}
+			} else {
+				perror("bad request\n");
+			}
+		}
+	}
+finish:
+	printf("RPAL: Message length: %d bytes, Total TSC cycles: %lu, "
+	       "Message count: %d, Average latency: %lu cycles\n",
+	       MSG_LEN, total_tsc, count, total_tsc / count);
+}
+
+void run_server(int msg_len)
+{
+	struct epoll_event ev, events[MAX_EVENTS];
+	int new_socket;
+	int nfds;
+	uint64_t tsc, total_tsc = 0;
+	int count = 0;
+
+	while (1) {
+		nfds = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
+		if (nfds == -1) {
+			perror("epoll_wait");
+			exit(EXIT_FAILURE);
+		}
+
+		for (int n = 0; n < nfds; ++n) {
+			if (events[n].data.fd == server_fd) {
+				new_socket = accept(server_fd, NULL, NULL);
+				if (new_socket == -1) {
+					perror("accept");
+					continue;
+				}
+
+				ev.events = EPOLLIN;
+				ev.data.fd = new_socket;
+				if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD,
+					      new_socket, &ev) == -1) {
+					close(new_socket);
+					perror("epoll_ctl: add new socket");
+				}
+			} else if (events[n].events & EPOLLRDHUP) {
+				close(events[n].data.fd);
+				goto finish;
+			} else {
+				char buffer[BUFFER_SIZE] = { 0 };
+
+				ssize_t valread = read(events[n].data.fd,
+						       buffer, BUFFER_SIZE);
+				if (valread <= 0) {
+					close(events[n].data.fd);
+					epoll_ctl(epoll_fd, EPOLL_CTL_DEL,
+						  events[n].data.fd, NULL);
+					goto finish;
+				} else {
+					count++;
+					sscanf(buffer, "0x%016lx", &tsc);
+					total_tsc += __rdtsc() - tsc;
+					send(events[n].data.fd, buffer, msg_len,
+					     0);
+				}
+			}
+		}
+	}
+finish:
+	printf("EPOLL: Message length: %d bytes, Total TSC cycles: %lu, "
+	       "Message count: %d, Average latency: %lu cycles\n",
+	       MSG_LEN, total_tsc, count, total_tsc / count);
+}
+
+int main()
+{
+	struct sockaddr_un address;
+	struct epoll_event ev;
+
+	if ((server_fd = socket(AF_UNIX, SOCK_STREAM, 0)) == 0) {
+		perror("socket failed");
+		exit(EXIT_FAILURE);
+	}
+
+	memset(&address, 0, sizeof(address));
+	address.sun_family = AF_UNIX;
+	strncpy(address.sun_path, SOCKET_PATH, sizeof(SOCKET_PATH));
+
+	if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) < 0) {
+		perror("bind failed");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(server_fd, 3) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	epoll_fd = epoll_create(1024);
+	if (epoll_fd == -1) {
+		perror("epoll_create");
+		exit(EXIT_FAILURE);
+	}
+
+	ev.events = EPOLLIN;
+	ev.data.fd = server_fd;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev) == -1) {
+		perror("epoll_ctl: listen_sock");
+		exit(EXIT_FAILURE);
+	}
+
+	run_server(MSG_LEN);
+	run_rpal_server(MSG_LEN);
+
+	close(server_fd);
+	unlink(SOCKET_PATH);
+	return 0;
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (34 preceding siblings ...)
  2025-05-30  9:28 ` [RFC v2 35/35] samples/rpal: add RPAL samples Bo Li
@ 2025-05-30  9:33 ` Lorenzo Stoakes
  2025-06-03  8:22   ` Bo Li
  2025-05-30  9:41 ` Pedro Falcato
                   ` (4 subsequent siblings)
  40 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30  9:33 UTC (permalink / raw)
  To: Bo Li
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz, dietmar.eggemann, hpa, acme,
	namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, viro, brauner, jack, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, rostedt, bsegall, mgorman, vschneid,
	jannh, pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

Bo,

You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
quite sure why you're sending a v2 without responding to that.

This isn't how the upstream kernel works...

Thanks, Lorenzo

On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
>
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
>
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
>
> --------------------------------------------------------------------------
>
> # Introduction
>
> We mainly apply RPAL to the service mesh architecture widely adopted in
> modern cloud-native data centers. Before the rise of the service mesh
> architecture, network functions were usually integrated into monolithic
> applications as libraries, and the main business programs invoked them
> through function calls. However, to facilitate the independent development
> and operation and maintenance of the main business programs and network
> functions, the service mesh removed the network functions from the main
> business programs and made them independent processes (called sidecars).
> Inter-process communication (IPC) is used for interaction between the main
> business program and the sidecar, and the introduced inter-process
> communication has led to a sharp increase in resource consumption in
> cloud-native data centers, and may even occupy more than 10% of the CPU of
> the entire microservice cluster.
>
> To achieve the efficient function call mechanism of the monolithic
> architecture under the service mesh architecture, we introduced the RPAL
> (Running Process As Library) architecture, which implements the sharing of
> the virtual address space of processes and the switching threads in user
> mode. Through the analysis of the service mesh architecture, we found that
> the process memory isolation between the main business program and the
> sidecar is not particularly important because they are split from one
> application and were an integral part of the original monolithic
> application. It is more important for the two processes to be independent
> of each other because they need to be independently developed and
> maintained to ensure the architectural advantages of the service mesh.
> Therefore, RPAL breaks the isolation between processes while preserving the
> independence between them.  We think that RPAL can also be applied to other
> scenarios featuring sidecar-like architectures, such as distributed file
> storage systems in LLM infra.
>
> In RPAL architecture, multiple processes share a virtual address space, so
> this architecture can be regarded as an advanced version of the Linux
> shared memory mechanism:
>
> 1. Traditional shared memory requires two processes to negotiate to ensure
> the mapping of the same piece of memory. In RPAL architecture, two RPAL
> processes still need to reach a consensus before they can successfully
> invoke the relevant system calls of RPAL to share the virtual address
> space.
> 2. Traditional shared memory only shares part of the data. However, in RPAL
> architecture, processes that have established an RPAL communication
> relationship share a virtual address space, and all user memory (such as
> data segments and code segments) of each RPAL process is shared among these
> processes. However, a process cannot access the memory of other processes
> at any time. We use the MPK mechanism to ensure that the memory of other
> processes can only be accessed when special RPAL functions are called.
> Otherwise, a page fault will be triggered.
> 3. In RPAL architecture, to ensure the consistency of the execution context
> of the shared code (such as the stack and thread local storage), we further
> implement the thread context switching in user mode based on the ability to
> share the virtual address space of different processes, enabling the
> threads of different processes to directly perform fast switching in user
> mode without falling into kernel mode for slow switching.
>
> # Background
>
> In traditional inter-process communication (IPC) scenarios, Unix domain
> sockets are commonly used in conjunction with the epoll() family for event
> multiplexing. IPC operations involve system calls on both the data and
> control planes, thereby imposing a non-trivial overhead on the interacting
> processes. Even when shared memory is employed to optimize the data plane,
> two data copies still remain. Specifically, data is initially copied from
> a process's private memory space into the shared memory area, and then it
> is copied from the shared memory into the private memory of another
> process.
>
> This poses a question: Is it possible to reduce the overhead of IPC with
> only minimal modifications at the application level? To address this, we
> observed that the functionality of IPC, which encompasses data transfer
> and invocation of the target thread, is similar to a function call, where
> arguments are passed and the callee function is invoked to process them.
> Inspired by this analogy, we introduce RPAL (Run Process As Library), a
> framework designed to enable one process to invoke another as if making
> a local function call, all without going through the kernel.
>
> # Design
>
> First, let’s formalize RPAL’s core objectives:
>
> 1. Data-plane efficiency: Reduce the number of data copies from two (in the
>    shared memory solution) to one.
> 2. Control-plane optimization: Eliminate the overhead of system calls and
>    kernel's thread switches.
> 3. Application compatibility: Minimize the modifications to existing
>    applications that utilize Unix domain sockets and the epoll() family.
>
> To attain the first objective, processes that use RPAL share the same
> virtual address space. So one process can access another's data directly
> via a data pointer. This means data can be transferred from one process to
> another with just one copy operation.
>
> To meet the second goal, RPAL relies on the shared address space to do
> lightweight context switching in user space, which we call an "RPAL call".
> This allows one process to execute another process's code just like a
> local function call.
>
> To achieve the third target, RPAL stays compatible with the epoll family
> of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
> application uses epoll for IPC, developers can switch to RPAL with just a
> few small changes. For instance, you can just replace epoll_wait() with
> rpal_epoll_wait(). The basic epoll procedure, where a process waits for
> another to write to a monitored descriptor using an epoll file descriptor,
> still works fine with RPAL.
>
> ## Address space sharing
>
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64 architectures, RPAL uses a memory range size covered by a
> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
> each process’s virtual address space to 512GB on x86_64, sufficient for
> most applications in our scenario. The rationale is straightforward:
> address space sharing can be simply achieved by copying the PUD from one
> process’s page table to another’s. So one process can directly use the
> data pointer to access another's memory.
>
>
>  |------------| <- 0
>  |------------| <- 512 GB
>  |  Process A |
>  |------------| <- 2*512 GB
>  |------------| <- n*512 GB
>  |  Process B |
>  |------------| <- (n+1)*512 GB
>  |------------| <- STACK_TOP
>  |  Kernel    |
>  |------------|
>
> ## RPAL call
>
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process.
>
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
>
>
>  |------------|                |---------------------|
>  |  Process A |                |  Process B          |
>  | |-------|  |                | |-------|           |
>  | | caller| --- RPAL call --> | | callee|    handle |
>  | | thread| <------------------ | thread| -> event  |
>  | |-------|  |                | |-------|           |
>  |------------|                |---------------------|
>
> # Security and compatibility with kernel subsystems
>
> ## Memory protection between processes
>
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
>
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
>
> ## Page fault handling and TLB flushing
>
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.
>
> TLB flushing is more complex. When a thread flushes the TLB, since the
> address space is shared, not only other threads in the current process but
> also other processes that share the address space may access the
> corresponding memory (related to the TLB flush). Therefore, the cpuset used
> for TLB flushing should be the union of the mm_cpumasks of all processes
> that share the address space.
>
> ## Lazy switch of kernel context
>
> In RPAL, a mismatch may arise between the user context and the kernel
> context. The RPAL call is designed solely to switch the user context,
> leaving the kernel context unchanged. For instance, when a RPAL call takes
> place, transitioning from caller thread to callee thread, and subsequently
> a system call is initiated within callee thread, the kernel will
> incorrectly utilize the caller's kernel context (such as the kernel stack)
> to process the system call.
>
> To resolve context mismatch issues, a kernel context switch is triggered at
> the kernel entry point when the callee initiates a syscall or an
> exception/interrupt occurs. This mechanism ensures context consistency
> before processing system calls, interrupts, or exceptions. We refer to this
> kernel context switch as a "lazy switch" because it defers the switching
> operation from the traditional thread switch point to the next kernel entry
> point.
>
> Lazy switch should be minimized as much as possible, as it significantly
> degrades performance. We currently utilize RPAL in an RPC framework, in
> which the RPC sender thread relies on the RPAL call to invoke the RPC
> receiver thread entirely in user space. In most cases, the receiver
> thread is free of system calls and the code execution time is relatively
> short. This characteristic effectively reduces the probability of a lazy
> switch occurring.
>
> ## Time slice correction
>
> After an RPAL call, the callee's user mode code executes. However, the
> kernel incorrectly attributes this CPU time to the caller due to the
> unchanged kernel context.
>
> To resolve this, we use the Time Stamp Counter (TSC) register to measure
> CPU time consumed by the callee thread in user space. The kernel then uses
> this user-reported timing data to adjust the CPU accounting for both the
> caller and callee thread, similar to how CPU steal time is implemented.
>
> ## Process recovery
>
> Since processes can access each other’s memory, there is a risk that the
> target process’s memory may become invalid at the access time (e.g., if
> the target process has exited unexpectedly). The kernel must handle such
> cases; otherwise, the accessing process could be terminated due to
> failures originating from another process.
>
> To address this issue, each thread of the process should pre-establish a
> recovery point when accessing the memory of other processes. When such an
> invalid access occurs, the thread traps into the kernel. Inside the page
> fault handler, the kernel restores the user context of the thread to the
> recovery point. This mechanism ensures that processes maintain mutual
> independence, preventing cascading failures caused by cross-process memory
> issues.
>
> # Performance
>
> To quantify the performance improvements driven by RPAL, we measured
> latency both before and after its deployment. Experiments were conducted on
> a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
> and 1 TB of memory. Latency was defined as the duration from when the
> client thread initiates a message to when the server thread is invoked and
> receives it.
>
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
>
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
>
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.
>
> We have applied RPAL to an RPC framework that is widely used in our data
> center. With RPAL, we have successfully achieved up to 15.5% reduction in
> the CPU utilization of processes in real-world microservice scenario. The
> gains primarily stem from minimizing control plane overhead through the
> utilization of userspace context switches. Additionally, by leveraging
> address space sharing, the number of memory copies is significantly
> reduced.
>
> # Future Work
>
> Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
> which is supported by a range of Intel CPUs. For AMD architectures, MPK is
> supported only on the latest processor, specifically, 3th Generation AMD
> EPYC™ Processors and subsequent generations. Patch sets that extend RPAL
> support to systems lacking MPK hardware will be provided later.
>
> Accompanying test programs are also provided in the samples/rpal/
> directory. And the user-mode RPAL library, which realizes user-space RPAL
> call, is in the samples/rpal/librpal directory.
>
> We hope to get some community discussions and feedback on RPAL's
> optimization approaches and architecture.
>
> Look forward to your comments.
>
> Bo Li (35):
>   Kbuild: rpal support
>   RPAL: add struct rpal_service
>   RPAL: add service registration interface
>   RPAL: add member to task_struct and mm_struct
>   RPAL: enable virtual address space partitions
>   RPAL: add user interface
>   RPAL: enable shared page mmap
>   RPAL: enable sender/receiver registration
>   RPAL: enable address space sharing
>   RPAL: allow service enable/disable
>   RPAL: add service request/release
>   RPAL: enable service disable notification
>   RPAL: add tlb flushing support
>   RPAL: enable page fault handling
>   RPAL: add sender/receiver state
>   RPAL: add cpu lock interface
>   RPAL: add a mapping between fsbase and tasks
>   sched: pick a specified task
>   RPAL: add lazy switch main logic
>   RPAL: add rpal_ret_from_lazy_switch
>   RPAL: add kernel entry handling for lazy switch
>   RPAL: rebuild receiver state
>   RPAL: resume cpumask when fork
>   RPAL: critical section optimization
>   RPAL: add MPK initialization and interface
>   RPAL: enable MPK support
>   RPAL: add epoll support
>   RPAL: add rpal_uds_fdmap() support
>   RPAL: fix race condition in pkru update
>   RPAL: fix pkru setup when fork
>   RPAL: add receiver waker
>   RPAL: fix unknown nmi on AMD CPU
>   RPAL: enable time slice correction
>   RPAL: enable fast epoll wait
>   samples/rpal: add RPAL samples
>
>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++
>  fs/binfmt_elf.c                               |   98 +-
>  fs/eventpoll.c                                |  320 +++
>  fs/exec.c                                     |   11 +
>  include/linux/mm_types.h                      |    3 +
>  include/linux/rpal.h                          |  633 +++++
>  include/linux/sched.h                         |   21 +
>  init/init_task.c                              |    6 +
>  kernel/exit.c                                 |    5 +
>  kernel/fork.c                                 |   32 +
>  kernel/sched/core.c                           |  676 +++++
>  kernel/sched/fair.c                           |  109 +
>  kernel/sched/sched.h                          |    8 +
>  mm/mmap.c                                     |   16 +
>  mm/mprotect.c                                 |  106 +
>  mm/rmap.c                                     |    4 +
>  mm/vma.c                                      |   18 +
>  samples/rpal/Makefile                         |   17 +
>  samples/rpal/asm_define.c                     |   14 +
>  samples/rpal/client.c                         |  178 ++
>  samples/rpal/librpal/asm_define.h             |    6 +
>  samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
>  samples/rpal/librpal/debug.h                  |   12 +
>  samples/rpal/librpal/fiber.c                  |  119 +
>  samples/rpal/librpal/fiber.h                  |   64 +
>  .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
>  .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
>  .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
>  samples/rpal/librpal/private.h                |  341 +++
>  samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
>  samples/rpal/librpal/rpal.h                   |  149 ++
>  samples/rpal/librpal/rpal_pkru.h              |   78 +
>  samples/rpal/librpal/rpal_queue.c             |  239 ++
>  samples/rpal/librpal/rpal_queue.h             |   55 +
>  samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
>  samples/rpal/offset.sh                        |    5 +
>  samples/rpal/server.c                         |  249 ++
>  61 files changed, 9710 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/rpal/Kconfig
>  create mode 100644 arch/x86/rpal/Makefile
>  create mode 100644 arch/x86/rpal/core.c
>  create mode 100644 arch/x86/rpal/internal.h
>  create mode 100644 arch/x86/rpal/mm.c
>  create mode 100644 arch/x86/rpal/pku.c
>  create mode 100644 arch/x86/rpal/proc.c
>  create mode 100644 arch/x86/rpal/service.c
>  create mode 100644 arch/x86/rpal/thread.c
>  create mode 100644 include/linux/rpal.h
>  create mode 100644 samples/rpal/Makefile
>  create mode 100644 samples/rpal/asm_define.c
>  create mode 100644 samples/rpal/client.c
>  create mode 100644 samples/rpal/librpal/asm_define.h
>  create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
>  create mode 100644 samples/rpal/librpal/debug.h
>  create mode 100644 samples/rpal/librpal/fiber.c
>  create mode 100644 samples/rpal/librpal/fiber.h
>  create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/private.h
>  create mode 100644 samples/rpal/librpal/rpal.c
>  create mode 100644 samples/rpal/librpal/rpal.h
>  create mode 100644 samples/rpal/librpal/rpal_pkru.h
>  create mode 100644 samples/rpal/librpal/rpal_queue.c
>  create mode 100644 samples/rpal/librpal/rpal_queue.h
>  create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
>  create mode 100755 samples/rpal/offset.sh
>  create mode 100644 samples/rpal/server.c
>
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:33 ` [RFC v2 00/35] optimize cost of inter-process communication Lorenzo Stoakes
@ 2025-06-03  8:22   ` Bo Li
  2025-06-03  9:22     ` Lorenzo Stoakes
  0 siblings, 1 reply; 46+ messages in thread
From: Bo Li @ 2025-06-03  8:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz, dietmar.eggemann, hpa, acme,
	namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, viro, brauner, jack, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, rostedt, bsegall, mgorman, vschneid,
	jannh, pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

Hi Lorenzo,

On 5/30/25 5:33 PM, Lorenzo Stoakes wrote:
> Bo,
>
> You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
> quite sure why you're sending a v2 without responding to that.
>
> This isn't how the upstream kernel works...
>
> Thanks, Lorenzo
>
> On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
>> Changelog:
>>
>> v2:
>> - Port the RPAL functions to the latest v6.15 kernel.
>> - Add a supplementary introduction to the application scenarios and
>>    security considerations of RPAL.
>>
>> link to v1:
>> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
>>
>> --------------------------------------------------------------------------
>>
>> # Introduction
>>
>> We mainly apply RPAL to the service mesh architecture widely adopted in
>> modern cloud-native data centers. Before the rise of the service mesh
>> architecture, network functions were usually integrated into monolithic
>> applications as libraries, and the main business programs invoked them
>> through function calls. However, to facilitate the independent development
>> and operation and maintenance of the main business programs and network
>> functions, the service mesh removed the network functions from the main
>> business programs and made them independent processes (called sidecars).
>> Inter-process communication (IPC) is used for interaction between the main
>> business program and the sidecar, and the introduced inter-process
>> communication has led to a sharp increase in resource consumption in
>> cloud-native data centers, and may even occupy more than 10% of the CPU of
>> the entire microservice cluster.
>>
>> To achieve the efficient function call mechanism of the monolithic
>> architecture under the service mesh architecture, we introduced the RPAL
>> (Running Process As Library) architecture, which implements the sharing of
>> the virtual address space of processes and the switching threads in user
>> mode. Through the analysis of the service mesh architecture, we found that
>> the process memory isolation between the main business program and the
>> sidecar is not particularly important because they are split from one
>> application and were an integral part of the original monolithic
>> application. It is more important for the two processes to be independent
>> of each other because they need to be independently developed and
>> maintained to ensure the architectural advantages of the service mesh.
>> Therefore, RPAL breaks the isolation between processes while preserving the
>> independence between them.  We think that RPAL can also be applied to other
>> scenarios featuring sidecar-like architectures, such as distributed file
>> storage systems in LLM infra.
>>
>> In RPAL architecture, multiple processes share a virtual address space, so
>> this architecture can be regarded as an advanced version of the Linux
>> shared memory mechanism:
>>
>> 1. Traditional shared memory requires two processes to negotiate to ensure
>> the mapping of the same piece of memory. In RPAL architecture, two RPAL
>> processes still need to reach a consensus before they can successfully
>> invoke the relevant system calls of RPAL to share the virtual address
>> space.
>> 2. Traditional shared memory only shares part of the data. However, in RPAL
>> architecture, processes that have established an RPAL communication
>> relationship share a virtual address space, and all user memory (such as
>> data segments and code segments) of each RPAL process is shared among these
>> processes. However, a process cannot access the memory of other processes
>> at any time. We use the MPK mechanism to ensure that the memory of other
>> processes can only be accessed when special RPAL functions are called.
>> Otherwise, a page fault will be triggered.
>> 3. In RPAL architecture, to ensure the consistency of the execution context
>> of the shared code (such as the stack and thread local storage), we further
>> implement the thread context switching in user mode based on the ability to
>> share the virtual address space of different processes, enabling the
>> threads of different processes to directly perform fast switching in user
>> mode without falling into kernel mode for slow switching.
>>
>> # Background
>>
>> In traditional inter-process communication (IPC) scenarios, Unix domain
>> sockets are commonly used in conjunction with the epoll() family for event
>> multiplexing. IPC operations involve system calls on both the data and
>> control planes, thereby imposing a non-trivial overhead on the interacting
>> processes. Even when shared memory is employed to optimize the data plane,
>> two data copies still remain. Specifically, data is initially copied from
>> a process's private memory space into the shared memory area, and then it
>> is copied from the shared memory into the private memory of another
>> process.
>>
>> This poses a question: Is it possible to reduce the overhead of IPC with
>> only minimal modifications at the application level? To address this, we
>> observed that the functionality of IPC, which encompasses data transfer
>> and invocation of the target thread, is similar to a function call, where
>> arguments are passed and the callee function is invoked to process them.
>> Inspired by this analogy, we introduce RPAL (Run Process As Library), a
>> framework designed to enable one process to invoke another as if making
>> a local function call, all without going through the kernel.
>>
>> # Design
>>
>> First, let’s formalize RPAL’s core objectives:
>>
>> 1. Data-plane efficiency: Reduce the number of data copies from two (in the
>>     shared memory solution) to one.
>> 2. Control-plane optimization: Eliminate the overhead of system calls and
>>     kernel's thread switches.
>> 3. Application compatibility: Minimize the modifications to existing
>>     applications that utilize Unix domain sockets and the epoll() family.
>>
>> To attain the first objective, processes that use RPAL share the same
>> virtual address space. So one process can access another's data directly
>> via a data pointer. This means data can be transferred from one process to
>> another with just one copy operation.
>>
>> To meet the second goal, RPAL relies on the shared address space to do
>> lightweight context switching in user space, which we call an "RPAL call".
>> This allows one process to execute another process's code just like a
>> local function call.
>>
>> To achieve the third target, RPAL stays compatible with the epoll family
>> of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
>> application uses epoll for IPC, developers can switch to RPAL with just a
>> few small changes. For instance, you can just replace epoll_wait() with
>> rpal_epoll_wait(). The basic epoll procedure, where a process waits for
>> another to write to a monitored descriptor using an epoll file descriptor,
>> still works fine with RPAL.
>>
>> ## Address space sharing
>>
>> For address space sharing, RPAL partitions the entire userspace virtual
>> address space and allocates non-overlapping memory ranges to each process.
>> On x86_64 architectures, RPAL uses a memory range size covered by a
>> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
>> each process’s virtual address space to 512GB on x86_64, sufficient for
>> most applications in our scenario. The rationale is straightforward:
>> address space sharing can be simply achieved by copying the PUD from one
>> process’s page table to another’s. So one process can directly use the
>> data pointer to access another's memory.
>>
>>
>>   |------------| <- 0
>>   |------------| <- 512 GB
>>   |  Process A |
>>   |------------| <- 2*512 GB
>>   |------------| <- n*512 GB
>>   |  Process B |
>>   |------------| <- (n+1)*512 GB
>>   |------------| <- STACK_TOP
>>   |  Kernel    |
>>   |------------|
>>
>> ## RPAL call
>>
>> We refer to the lightweight userspace context switching mechanism as RPAL
>> call. It enables the caller (or sender) thread of one process to directly
>> switch to the callee (or receiver) thread of another process.
>>
>> When Process A’s caller thread initiates an RPAL call to Process B’s
>> callee thread, the CPU saves the caller’s context and loads the callee’s
>> context. This enables direct userspace control flow transfer from the
>> caller to the callee. After the callee finishes data processing, the CPU
>> saves Process B’s callee context and switches back to Process A’s caller
>> context, completing a full IPC cycle.
>>
>>
>>   |------------|                |---------------------|
>>   |  Process A |                |  Process B          |
>>   | |-------|  |                | |-------|           |
>>   | | caller| --- RPAL call --> | | callee|    handle |
>>   | | thread| <------------------ | thread| -> event  |
>>   | |-------|  |                | |-------|           |
>>   |------------|                |---------------------|
>>
>> # Security and compatibility with kernel subsystems
>>
>> ## Memory protection between processes
>>
>> Since processes using RPAL share the address space, unintended
>> cross-process memory access may occur and corrupt the data of another
>> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
>> architectures.
>>
>> MPK assigns 4 bits in each page table entry to a "protection key", which
>> is paired with a userspace register (PKRU). The PKRU register defines
>> access permissions for memory regions protected by specific keys (for
>> detailed implementation, refer to the kernel documentation "Memory
>> Protection Keys"). With MPK, even though the address space is shared
>> among processes, cross-process access is restricted: a process can only
>> access the memory protected by a key if its PKRU register is configured
>> with the corresponding permission. This ensures that processes cannot
>> access each other’s memory unless an explicit PKRU configuration is set.
>>
>> ## Page fault handling and TLB flushing
>>
>> Due to the shared address space architecture, both page fault handling and
>> TLB flushing require careful consideration. For instance, when Process A
>> accesses Process B’s memory, a page fault may occur in Process A's
>> context, but the faulting address belongs to Process B. In this case, we
>> must pass Process B's mm_struct to the page fault handler.
>>
>> TLB flushing is more complex. When a thread flushes the TLB, since the
>> address space is shared, not only other threads in the current process but
>> also other processes that share the address space may access the
>> corresponding memory (related to the TLB flush). Therefore, the cpuset used
>> for TLB flushing should be the union of the mm_cpumasks of all processes
>> that share the address space.
>>
>> ## Lazy switch of kernel context
>>
>> In RPAL, a mismatch may arise between the user context and the kernel
>> context. The RPAL call is designed solely to switch the user context,
>> leaving the kernel context unchanged. For instance, when a RPAL call takes
>> place, transitioning from caller thread to callee thread, and subsequently
>> a system call is initiated within callee thread, the kernel will
>> incorrectly utilize the caller's kernel context (such as the kernel stack)
>> to process the system call.
>>
>> To resolve context mismatch issues, a kernel context switch is triggered at
>> the kernel entry point when the callee initiates a syscall or an
>> exception/interrupt occurs. This mechanism ensures context consistency
>> before processing system calls, interrupts, or exceptions. We refer to this
>> kernel context switch as a "lazy switch" because it defers the switching
>> operation from the traditional thread switch point to the next kernel entry
>> point.
>>
>> Lazy switch should be minimized as much as possible, as it significantly
>> degrades performance. We currently utilize RPAL in an RPC framework, in
>> which the RPC sender thread relies on the RPAL call to invoke the RPC
>> receiver thread entirely in user space. In most cases, the receiver
>> thread is free of system calls and the code execution time is relatively
>> short. This characteristic effectively reduces the probability of a lazy
>> switch occurring.
>>
>> ## Time slice correction
>>
>> After an RPAL call, the callee's user mode code executes. However, the
>> kernel incorrectly attributes this CPU time to the caller due to the
>> unchanged kernel context.
>>
>> To resolve this, we use the Time Stamp Counter (TSC) register to measure
>> CPU time consumed by the callee thread in user space. The kernel then uses
>> this user-reported timing data to adjust the CPU accounting for both the
>> caller and callee thread, similar to how CPU steal time is implemented.
>>
>> ## Process recovery
>>
>> Since processes can access each other’s memory, there is a risk that the
>> target process’s memory may become invalid at the access time (e.g., if
>> the target process has exited unexpectedly). The kernel must handle such
>> cases; otherwise, the accessing process could be terminated due to
>> failures originating from another process.
>>
>> To address this issue, each thread of the process should pre-establish a
>> recovery point when accessing the memory of other processes. When such an
>> invalid access occurs, the thread traps into the kernel. Inside the page
>> fault handler, the kernel restores the user context of the thread to the
>> recovery point. This mechanism ensures that processes maintain mutual
>> independence, preventing cascading failures caused by cross-process memory
>> issues.
>>
>> # Performance
>>
>> To quantify the performance improvements driven by RPAL, we measured
>> latency both before and after its deployment. Experiments were conducted on
>> a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
>> and 1 TB of memory. Latency was defined as the duration from when the
>> client thread initiates a message to when the server thread is invoked and
>> receives it.
>>
>> During testing, the client transmitted 1 million 32-byte messages, and we
>> computed the per-message average latency. The results are as follows:
>>
>> *****************
>> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>>   Message count: 1000000, Average latency: 19616 cycles
>> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>>   Message count: 1000000, Average latency: 1703 cycles
>> *****************
>>
>> These results confirm that RPAL delivers substantial latency improvements
>> over the current epoll implementation—achieving a 17,913-cycle reduction
>> (an ~91.3% improvement) for 32-byte messages.
>>
>> We have applied RPAL to an RPC framework that is widely used in our data
>> center. With RPAL, we have successfully achieved up to 15.5% reduction in
>> the CPU utilization of processes in real-world microservice scenario. The
>> gains primarily stem from minimizing control plane overhead through the
>> utilization of userspace context switches. Additionally, by leveraging
>> address space sharing, the number of memory copies is significantly
>> reduced.
>>
>> # Future Work
>>
>> Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
>> which is supported by a range of Intel CPUs. For AMD architectures, MPK is
>> supported only on the latest processor, specifically, 3th Generation AMD
>> EPYC™ Processors and subsequent generations. Patch sets that extend RPAL
>> support to systems lacking MPK hardware will be provided later.
>>
>> Accompanying test programs are also provided in the samples/rpal/
>> directory. And the user-mode RPAL library, which realizes user-space RPAL
>> call, is in the samples/rpal/librpal directory.
>>
>> We hope to get some community discussions and feedback on RPAL's
>> optimization approaches and architecture.
>>
>> Look forward to your comments.
>>
>> Bo Li (35):
>>    Kbuild: rpal support
>>    RPAL: add struct rpal_service
>>    RPAL: add service registration interface
>>    RPAL: add member to task_struct and mm_struct
>>    RPAL: enable virtual address space partitions
>>    RPAL: add user interface
>>    RPAL: enable shared page mmap
>>    RPAL: enable sender/receiver registration
>>    RPAL: enable address space sharing
>>    RPAL: allow service enable/disable
>>    RPAL: add service request/release
>>    RPAL: enable service disable notification
>>    RPAL: add tlb flushing support
>>    RPAL: enable page fault handling
>>    RPAL: add sender/receiver state
>>    RPAL: add cpu lock interface
>>    RPAL: add a mapping between fsbase and tasks
>>    sched: pick a specified task
>>    RPAL: add lazy switch main logic
>>    RPAL: add rpal_ret_from_lazy_switch
>>    RPAL: add kernel entry handling for lazy switch
>>    RPAL: rebuild receiver state
>>    RPAL: resume cpumask when fork
>>    RPAL: critical section optimization
>>    RPAL: add MPK initialization and interface
>>    RPAL: enable MPK support
>>    RPAL: add epoll support
>>    RPAL: add rpal_uds_fdmap() support
>>    RPAL: fix race condition in pkru update
>>    RPAL: fix pkru setup when fork
>>    RPAL: add receiver waker
>>    RPAL: fix unknown nmi on AMD CPU
>>    RPAL: enable time slice correction
>>    RPAL: enable fast epoll wait
>>    samples/rpal: add RPAL samples
>>
>>   arch/x86/Kbuild                               |    2 +
>>   arch/x86/Kconfig                              |    2 +
>>   arch/x86/entry/entry_64.S                     |  160 ++
>>   arch/x86/events/amd/core.c                    |   14 +
>>   arch/x86/include/asm/pgtable.h                |   25 +
>>   arch/x86/include/asm/pgtable_types.h          |   11 +
>>   arch/x86/include/asm/tlbflush.h               |   10 +
>>   arch/x86/kernel/asm-offsets.c                 |    3 +
>>   arch/x86/kernel/cpu/common.c                  |    8 +-
>>   arch/x86/kernel/fpu/core.c                    |    8 +-
>>   arch/x86/kernel/nmi.c                         |   20 +
>>   arch/x86/kernel/process.c                     |   25 +-
>>   arch/x86/kernel/process_64.c                  |  118 +
>>   arch/x86/mm/fault.c                           |  271 ++
>>   arch/x86/mm/mmap.c                            |   10 +
>>   arch/x86/mm/tlb.c                             |  172 ++
>>   arch/x86/rpal/Kconfig                         |   21 +
>>   arch/x86/rpal/Makefile                        |    6 +
>>   arch/x86/rpal/core.c                          |  477 ++++
>>   arch/x86/rpal/internal.h                      |   69 +
>>   arch/x86/rpal/mm.c                            |  426 +++
>>   arch/x86/rpal/pku.c                           |  196 ++
>>   arch/x86/rpal/proc.c                          |  279 ++
>>   arch/x86/rpal/service.c                       |  776 ++++++
>>   arch/x86/rpal/thread.c                        |  313 +++
>>   fs/binfmt_elf.c                               |   98 +-
>>   fs/eventpoll.c                                |  320 +++
>>   fs/exec.c                                     |   11 +
>>   include/linux/mm_types.h                      |    3 +
>>   include/linux/rpal.h                          |  633 +++++
>>   include/linux/sched.h                         |   21 +
>>   init/init_task.c                              |    6 +
>>   kernel/exit.c                                 |    5 +
>>   kernel/fork.c                                 |   32 +
>>   kernel/sched/core.c                           |  676 +++++
>>   kernel/sched/fair.c                           |  109 +
>>   kernel/sched/sched.h                          |    8 +
>>   mm/mmap.c                                     |   16 +
>>   mm/mprotect.c                                 |  106 +
>>   mm/rmap.c                                     |    4 +
>>   mm/vma.c                                      |   18 +
>>   samples/rpal/Makefile                         |   17 +
>>   samples/rpal/asm_define.c                     |   14 +
>>   samples/rpal/client.c                         |  178 ++
>>   samples/rpal/librpal/asm_define.h             |    6 +
>>   samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
>>   samples/rpal/librpal/debug.h                  |   12 +
>>   samples/rpal/librpal/fiber.c                  |  119 +
>>   samples/rpal/librpal/fiber.h                  |   64 +
>>   .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
>>   .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
>>   .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
>>   samples/rpal/librpal/private.h                |  341 +++
>>   samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
>>   samples/rpal/librpal/rpal.h                   |  149 ++
>>   samples/rpal/librpal/rpal_pkru.h              |   78 +
>>   samples/rpal/librpal/rpal_queue.c             |  239 ++
>>   samples/rpal/librpal/rpal_queue.h             |   55 +
>>   samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
>>   samples/rpal/offset.sh                        |    5 +
>>   samples/rpal/server.c                         |  249 ++
>>   61 files changed, 9710 insertions(+), 4 deletions(-)
>>   create mode 100644 arch/x86/rpal/Kconfig
>>   create mode 100644 arch/x86/rpal/Makefile
>>   create mode 100644 arch/x86/rpal/core.c
>>   create mode 100644 arch/x86/rpal/internal.h
>>   create mode 100644 arch/x86/rpal/mm.c
>>   create mode 100644 arch/x86/rpal/pku.c
>>   create mode 100644 arch/x86/rpal/proc.c
>>   create mode 100644 arch/x86/rpal/service.c
>>   create mode 100644 arch/x86/rpal/thread.c
>>   create mode 100644 include/linux/rpal.h
>>   create mode 100644 samples/rpal/Makefile
>>   create mode 100644 samples/rpal/asm_define.c
>>   create mode 100644 samples/rpal/client.c
>>   create mode 100644 samples/rpal/librpal/asm_define.h
>>   create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
>>   create mode 100644 samples/rpal/librpal/debug.h
>>   create mode 100644 samples/rpal/librpal/fiber.c
>>   create mode 100644 samples/rpal/librpal/fiber.h
>>   create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
>>   create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
>>   create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
>>   create mode 100644 samples/rpal/librpal/private.h
>>   create mode 100644 samples/rpal/librpal/rpal.c
>>   create mode 100644 samples/rpal/librpal/rpal.h
>>   create mode 100644 samples/rpal/librpal/rpal_pkru.h
>>   create mode 100644 samples/rpal/librpal/rpal_queue.c
>>   create mode 100644 samples/rpal/librpal/rpal_queue.h
>>   create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
>>   create mode 100755 samples/rpal/offset.sh
>>   create mode 100644 samples/rpal/server.c
>>
>> --
>> 2.20.1
>>

Thank you for your feedback! There might be some misunderstanding.
According to the feedback in RPAL V1, we rebased the RPAL to the latest
stable kernel and added an introduction section to explain our
considerations regarding the process isolation of the RPAL architecture.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-06-03  8:22   ` Bo Li
@ 2025-06-03  9:22     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2025-06-03  9:22 UTC (permalink / raw)
  To: Bo Li
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz, dietmar.eggemann, hpa, acme,
	namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, viro, brauner, jack, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, rostedt, bsegall, mgorman, vschneid,
	jannh, pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

On Tue, Jun 03, 2025 at 03:22:39AM -0500, Bo Li wrote:
> Hi Lorenzo,
>
> On 5/30/25 5:33 PM, Lorenzo Stoakes wrote:
> > Bo,
> >
> > You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
> > quite sure why you're sending a v2 without responding to that.
> >
> > This isn't how the upstream kernel works...
> >
> > Thanks, Lorenzo
> >
> > On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:

[snip]

> Thank you for your feedback! There might be some misunderstanding.
> According to the feedback in RPAL V1, we rebased the RPAL to the latest
> stable kernel and added an introduction section to explain our
> considerations regarding the process isolation of the RPAL architecture.
>
> Thanks!

Hi Bo,

You need to engage in _conversation_ with maintainers, not simply resend
giant RFC's with changes made based on your interpretation of the feedback.

You've not addressed my comments, you've interpreted them to be 'ok do X,
Y, Z', then done them without a word. This is, again, not how upstream
works. You've seemingly ignored Dave altogether.

Others have highlighted it, but let me repeat what they have (in effect)
said - this is just not mergeable upstream in any way shape or form,
sorry.

It's a NAK and there's just no way it's not a NAK, you're doing too many
crazy things here that are just not acceptable, not to mention the issues
people have raised.

You should have engaged with upstream WAY earlier.

It's a pity you've put so much work into this without having done so, but
I'm afraid you're going to have to maintain this out-of-tree indefinitely.

I hope you can at least can take some lessons from this on how best in
future to engage with upstream (early and often! :)

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (35 preceding siblings ...)
  2025-05-30  9:33 ` [RFC v2 00/35] optimize cost of inter-process communication Lorenzo Stoakes
@ 2025-05-30  9:41 ` Pedro Falcato
  2025-05-30  9:56 ` David Hildenbrand
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: Pedro Falcato @ 2025-05-30  9:41 UTC (permalink / raw)
  To: Bo Li
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz, dietmar.eggemann, hpa, acme,
	namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, viro, brauner, jack, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, rostedt, bsegall,
	mgorman, vschneid, jannh, riel, harry.yoo, linux-kernel,
	linux-perf-users, linux-fsdevel, linux-mm, duanxiongchun,
	yinhongbo, dengliang.1214, xieyongji, chaiwen.cc, songmuchun,
	yuanzhu, chengguozhu, sunjiadong.lff

On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
> 
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
> 
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
> 
> --------------------------------------------------------------------------
> 
> # Introduction
> 
> We mainly apply RPAL to the service mesh architecture widely adopted in
> modern cloud-native data centers. Before the rise of the service mesh
> architecture, network functions were usually integrated into monolithic
> applications as libraries, and the main business programs invoked them
> through function calls. However, to facilitate the independent development
> and operation and maintenance of the main business programs and network
> functions, the service mesh removed the network functions from the main
> business programs and made them independent processes (called sidecars).
> Inter-process communication (IPC) is used for interaction between the main
> business program and the sidecar, and the introduced inter-process
> communication has led to a sharp increase in resource consumption in
> cloud-native data centers, and may even occupy more than 10% of the CPU of
> the entire microservice cluster.
> 
> To achieve the efficient function call mechanism of the monolithic
> architecture under the service mesh architecture, we introduced the RPAL
> (Running Process As Library) architecture, which implements the sharing of
> the virtual address space of processes and the switching threads in user
> mode. Through the analysis of the service mesh architecture, we found that
> the process memory isolation between the main business program and the
> sidecar is not particularly important because they are split from one
> application and were an integral part of the original monolithic
> application. It is more important for the two processes to be independent
> of each other because they need to be independently developed and
> maintained to ensure the architectural advantages of the service mesh.
> Therefore, RPAL breaks the isolation between processes while preserving the
> independence between them.  We think that RPAL can also be applied to other
> scenarios featuring sidecar-like architectures, such as distributed file
> storage systems in LLM infra.
> 
> In RPAL architecture, multiple processes share a virtual address space, so
> this architecture can be regarded as an advanced version of the Linux
> shared memory mechanism:
> 
> 1. Traditional shared memory requires two processes to negotiate to ensure
> the mapping of the same piece of memory. In RPAL architecture, two RPAL
> processes still need to reach a consensus before they can successfully
> invoke the relevant system calls of RPAL to share the virtual address
> space.
> 2. Traditional shared memory only shares part of the data. However, in RPAL
> architecture, processes that have established an RPAL communication
> relationship share a virtual address space, and all user memory (such as
> data segments and code segments) of each RPAL process is shared among these
> processes. However, a process cannot access the memory of other processes
> at any time. We use the MPK mechanism to ensure that the memory of other
> processes can only be accessed when special RPAL functions are called.
> Otherwise, a page fault will be triggered.
> 3. In RPAL architecture, to ensure the consistency of the execution context
> of the shared code (such as the stack and thread local storage), we further
> implement the thread context switching in user mode based on the ability to
> share the virtual address space of different processes, enabling the
> threads of different processes to directly perform fast switching in user
> mode without falling into kernel mode for slow switching.
> 
> # Background
> 
> In traditional inter-process communication (IPC) scenarios, Unix domain
> sockets are commonly used in conjunction with the epoll() family for event
> multiplexing. IPC operations involve system calls on both the data and
> control planes, thereby imposing a non-trivial overhead on the interacting
> processes. Even when shared memory is employed to optimize the data plane,
> two data copies still remain. Specifically, data is initially copied from
> a process's private memory space into the shared memory area, and then it
> is copied from the shared memory into the private memory of another
> process.
> 
> This poses a question: Is it possible to reduce the overhead of IPC with
> only minimal modifications at the application level? To address this, we
> observed that the functionality of IPC, which encompasses data transfer
> and invocation of the target thread, is similar to a function call, where
> arguments are passed and the callee function is invoked to process them.
> Inspired by this analogy, we introduce RPAL (Run Process As Library), a
> framework designed to enable one process to invoke another as if making
> a local function call, all without going through the kernel.
> 
> # Design
> 
> First, let’s formalize RPAL’s core objectives:
> 
> 1. Data-plane efficiency: Reduce the number of data copies from two (in the
>    shared memory solution) to one.
> 2. Control-plane optimization: Eliminate the overhead of system calls and
>    kernel's thread switches.
> 3. Application compatibility: Minimize the modifications to existing
>    applications that utilize Unix domain sockets and the epoll() family.
> 
> To attain the first objective, processes that use RPAL share the same
> virtual address space. So one process can access another's data directly
> via a data pointer. This means data can be transferred from one process to
> another with just one copy operation. 
> 
> To meet the second goal, RPAL relies on the shared address space to do
> lightweight context switching in user space, which we call an "RPAL call".
> This allows one process to execute another process's code just like a
> local function call.
> 
> To achieve the third target, RPAL stays compatible with the epoll family
> of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
> application uses epoll for IPC, developers can switch to RPAL with just a
> few small changes. For instance, you can just replace epoll_wait() with
> rpal_epoll_wait(). The basic epoll procedure, where a process waits for
> another to write to a monitored descriptor using an epoll file descriptor,
> still works fine with RPAL.
> 
> ## Address space sharing
> 
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64 architectures, RPAL uses a memory range size covered by a
> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
> each process’s virtual address space to 512GB on x86_64, sufficient for
> most applications in our scenario. The rationale is straightforward: 
> address space sharing can be simply achieved by copying the PUD from one
> process’s page table to another’s. So one process can directly use the
> data pointer to access another's memory.
> 
> 
>  |------------| <- 0
>  |------------| <- 512 GB
>  |  Process A |
>  |------------| <- 2*512 GB
>  |------------| <- n*512 GB
>  |  Process B |
>  |------------| <- (n+1)*512 GB
>  |------------| <- STACK_TOP
>  |  Kernel    |
>  |------------|
> 
> ## RPAL call
> 
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process. 
> 
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
> 
> 
>  |------------|                |---------------------|  
>  |  Process A |                |  Process B          |
>  | |-------|  |                | |-------|           |     
>  | | caller| --- RPAL call --> | | callee|    handle |
>  | | thread| <------------------ | thread| -> event  |
>  | |-------|  |                | |-------|           |
>  |------------|                |---------------------|
> 
> # Security and compatibility with kernel subsystems
> 
> ## Memory protection between processes
> 
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
> 
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
> 
> ## Page fault handling and TLB flushing
> 
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.
> 
> TLB flushing is more complex. When a thread flushes the TLB, since the
> address space is shared, not only other threads in the current process but
> also other processes that share the address space may access the
> corresponding memory (related to the TLB flush). Therefore, the cpuset used
> for TLB flushing should be the union of the mm_cpumasks of all processes
> that share the address space.
> 
> ## Lazy switch of kernel context
> 
> In RPAL, a mismatch may arise between the user context and the kernel
> context. The RPAL call is designed solely to switch the user context,
> leaving the kernel context unchanged. For instance, when a RPAL call takes
> place, transitioning from caller thread to callee thread, and subsequently
> a system call is initiated within callee thread, the kernel will
> incorrectly utilize the caller's kernel context (such as the kernel stack)
> to process the system call.
> 
> To resolve context mismatch issues, a kernel context switch is triggered at
> the kernel entry point when the callee initiates a syscall or an
> exception/interrupt occurs. This mechanism ensures context consistency
> before processing system calls, interrupts, or exceptions. We refer to this
> kernel context switch as a "lazy switch" because it defers the switching
> operation from the traditional thread switch point to the next kernel entry
> point.
> 
> Lazy switch should be minimized as much as possible, as it significantly
> degrades performance. We currently utilize RPAL in an RPC framework, in
> which the RPC sender thread relies on the RPAL call to invoke the RPC
> receiver thread entirely in user space. In most cases, the receiver
> thread is free of system calls and the code execution time is relatively
> short. This characteristic effectively reduces the probability of a lazy
> switch occurring.
> 
> ## Time slice correction
> 
> After an RPAL call, the callee's user mode code executes. However, the
> kernel incorrectly attributes this CPU time to the caller due to the
> unchanged kernel context.
> 
> To resolve this, we use the Time Stamp Counter (TSC) register to measure
> CPU time consumed by the callee thread in user space. The kernel then uses
> this user-reported timing data to adjust the CPU accounting for both the
> caller and callee thread, similar to how CPU steal time is implemented.
> 
> ## Process recovery
> 
> Since processes can access each other’s memory, there is a risk that the
> target process’s memory may become invalid at the access time (e.g., if
> the target process has exited unexpectedly). The kernel must handle such
> cases; otherwise, the accessing process could be terminated due to
> failures originating from another process.
> 
> To address this issue, each thread of the process should pre-establish a
> recovery point when accessing the memory of other processes. When such an
> invalid access occurs, the thread traps into the kernel. Inside the page
> fault handler, the kernel restores the user context of the thread to the
> recovery point. This mechanism ensures that processes maintain mutual
> independence, preventing cascading failures caused by cross-process memory
> issues.
> 
> # Performance
> 
> To quantify the performance improvements driven by RPAL, we measured
> latency both before and after its deployment. Experiments were conducted on
> a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
> and 1 TB of memory. Latency was defined as the duration from when the
> client thread initiates a message to when the server thread is invoked and
> receives it.
> 
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.
> 
> We have applied RPAL to an RPC framework that is widely used in our data
> center. With RPAL, we have successfully achieved up to 15.5% reduction in
> the CPU utilization of processes in real-world microservice scenario. The
> gains primarily stem from minimizing control plane overhead through the
> utilization of userspace context switches. Additionally, by leveraging
> address space sharing, the number of memory copies is significantly
> reduced.
> 
> # Future Work
> 
> Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
> which is supported by a range of Intel CPUs. For AMD architectures, MPK is
> supported only on the latest processor, specifically, 3th Generation AMD
> EPYC™ Processors and subsequent generations. Patch sets that extend RPAL
> support to systems lacking MPK hardware will be provided later.
> 
> Accompanying test programs are also provided in the samples/rpal/
> directory. And the user-mode RPAL library, which realizes user-space RPAL
> call, is in the samples/rpal/librpal directory.
>             
> We hope to get some community discussions and feedback on RPAL's
> optimization approaches and architecture.
> 
> Look forward to your comments.

The first time you posted, you got two NACKs (from Dave Hansen and Lorenzo).
You didn't reply and now you post this flood of patches? Please don't?

From my end it's also a Big Ol' NACK.

> 
> Bo Li (35):
>   Kbuild: rpal support
>   RPAL: add struct rpal_service
>   RPAL: add service registration interface
>   RPAL: add member to task_struct and mm_struct
>   RPAL: enable virtual address space partitions
>   RPAL: add user interface
>   RPAL: enable shared page mmap
>   RPAL: enable sender/receiver registration
>   RPAL: enable address space sharing
>   RPAL: allow service enable/disable
>   RPAL: add service request/release
>   RPAL: enable service disable notification
>   RPAL: add tlb flushing support
>   RPAL: enable page fault handling
>   RPAL: add sender/receiver state
>   RPAL: add cpu lock interface
>   RPAL: add a mapping between fsbase and tasks
>   sched: pick a specified task
>   RPAL: add lazy switch main logic
>   RPAL: add rpal_ret_from_lazy_switch
>   RPAL: add kernel entry handling for lazy switch
>   RPAL: rebuild receiver state
>   RPAL: resume cpumask when fork
>   RPAL: critical section optimization
>   RPAL: add MPK initialization and interface
>   RPAL: enable MPK support
>   RPAL: add epoll support
>   RPAL: add rpal_uds_fdmap() support
>   RPAL: fix race condition in pkru update
>   RPAL: fix pkru setup when fork
>   RPAL: add receiver waker
>   RPAL: fix unknown nmi on AMD CPU
>   RPAL: enable time slice correction
>   RPAL: enable fast epoll wait
>   samples/rpal: add RPAL samples
> 
>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++
>  fs/binfmt_elf.c                               |   98 +-
>  fs/eventpoll.c                                |  320 +++
>  fs/exec.c                                     |   11 +
>  include/linux/mm_types.h                      |    3 +
>  include/linux/rpal.h                          |  633 +++++
>  include/linux/sched.h                         |   21 +
>  init/init_task.c                              |    6 +
>  kernel/exit.c                                 |    5 +
>  kernel/fork.c                                 |   32 +
>  kernel/sched/core.c                           |  676 +++++
>  kernel/sched/fair.c                           |  109 +
>  kernel/sched/sched.h                          |    8 +
>  mm/mmap.c                                     |   16 +
>  mm/mprotect.c                                 |  106 +
>  mm/rmap.c                                     |    4 +
>  mm/vma.c                                      |   18 +
>  samples/rpal/Makefile                         |   17 +
>  samples/rpal/asm_define.c                     |   14 +
>  samples/rpal/client.c                         |  178 ++
>  samples/rpal/librpal/asm_define.h             |    6 +
>  samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
>  samples/rpal/librpal/debug.h                  |   12 +
>  samples/rpal/librpal/fiber.c                  |  119 +
>  samples/rpal/librpal/fiber.h                  |   64 +
>  .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
>  .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
>  .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
>  samples/rpal/librpal/private.h                |  341 +++
>  samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
>  samples/rpal/librpal/rpal.h                   |  149 ++
>  samples/rpal/librpal/rpal_pkru.h              |   78 +
>  samples/rpal/librpal/rpal_queue.c             |  239 ++
>  samples/rpal/librpal/rpal_queue.h             |   55 +
>  samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
>  samples/rpal/offset.sh                        |    5 +
>  samples/rpal/server.c                         |  249 ++
>  61 files changed, 9710 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/rpal/Kconfig
>  create mode 100644 arch/x86/rpal/Makefile
>  create mode 100644 arch/x86/rpal/core.c
>  create mode 100644 arch/x86/rpal/internal.h
>  create mode 100644 arch/x86/rpal/mm.c
>  create mode 100644 arch/x86/rpal/pku.c
>  create mode 100644 arch/x86/rpal/proc.c
>  create mode 100644 arch/x86/rpal/service.c
>  create mode 100644 arch/x86/rpal/thread.c
>  create mode 100644 include/linux/rpal.h
>  create mode 100644 samples/rpal/Makefile
>  create mode 100644 samples/rpal/asm_define.c
>  create mode 100644 samples/rpal/client.c
>  create mode 100644 samples/rpal/librpal/asm_define.h
>  create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
>  create mode 100644 samples/rpal/librpal/debug.h
>  create mode 100644 samples/rpal/librpal/fiber.c
>  create mode 100644 samples/rpal/librpal/fiber.h
>  create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/private.h
>  create mode 100644 samples/rpal/librpal/rpal.c
>  create mode 100644 samples/rpal/librpal/rpal.h
>  create mode 100644 samples/rpal/librpal/rpal_pkru.h
>  create mode 100644 samples/rpal/librpal/rpal_queue.c
>  create mode 100644 samples/rpal/librpal/rpal_queue.h
>  create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
>  create mode 100755 samples/rpal/offset.sh
>  create mode 100644 samples/rpal/server.c

Seriously, look at all the files you're touching. All the lines you're changing.
All the maintainers you had to CC. All for a random new RPC method you developed.
This is _not_ mergeable.

-- 
Pedro

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (36 preceding siblings ...)
  2025-05-30  9:41 ` Pedro Falcato
@ 2025-05-30  9:56 ` David Hildenbrand
  2025-05-30 22:42 ` Andrew Morton
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 46+ messages in thread
From: David Hildenbrand @ 2025-05-30  9:56 UTC (permalink / raw)
  To: Bo Li, tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, hpa, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff


> 
> ## Address space sharing
> 
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64 architectures, RPAL uses a memory range size covered by a
> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
> each process’s virtual address space to 512GB on x86_64, sufficient for
> most applications in our scenario. The rationale is straightforward:
> address space sharing can be simply achieved by copying the PUD from one
> process’s page table to another’s. So one process can directly use the
> data pointer to access another's memory.
> 
> 
>   |------------| <- 0
>   |------------| <- 512 GB
>   |  Process A |
>   |------------| <- 2*512 GB
>   |------------| <- n*512 GB
>   |  Process B |
>   |------------| <- (n+1)*512 GB
>   |------------| <- STACK_TOP
>   |  Kernel    |
>   |------------|

Oh my.

It reminds me a bit about mshare -- just that mshare tries to do it in a 
less hacky way..

> 
> ## RPAL call
> 
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process.
> 
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
> 
> 
>   |------------|                |---------------------|
>   |  Process A |                |  Process B          |
>   | |-------|  |                | |-------|           |
>   | | caller| --- RPAL call --> | | callee|    handle |
>   | | thread| <------------------ | thread| -> event  |
>   | |-------|  |                | |-------|           |
>   |------------|                |---------------------|
> 
> # Security and compatibility with kernel subsystems
> 
> ## Memory protection between processes
> 
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
> 
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
> 
> ## Page fault handling and TLB flushing
> 
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.

In an mshare region, all faults would be rerouted to the mshare MM 
either way.

> 
> TLB flushing is more complex. When a thread flushes the TLB, since the
> address space is shared, not only other threads in the current process but
> also other processes that share the address space may access the
> corresponding memory (related to the TLB flush). Therefore, the cpuset used
> for TLB flushing should be the union of the mm_cpumasks of all processes
> that share the address space.

Oh my.

It all reminds me of mshare, just the context switch handling is 
different (and significantly ... more problematic).

Maybe something could be built on top of mshare, but I'm afraid the real 
magic is the address space sharing combined with the context switching 
... which sounds like a big can of worms.

So in the current form, I understand all the NACKs.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (37 preceding siblings ...)
  2025-05-30  9:56 ` David Hildenbrand
@ 2025-05-30 22:42 ` Andrew Morton
  2025-05-31  7:16 ` Ingo Molnar
  2025-06-03 17:49 ` H. Peter Anvin
  40 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2025-05-30 22:42 UTC (permalink / raw)
  To: Bo Li
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, david, juri.lelli,
	vincent.guittot, peterz, dietmar.eggemann, hpa, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, viro, brauner, jack, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, rostedt, bsegall, mgorman, vschneid,
	jannh, pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

On Fri, 30 May 2025 17:27:28 +0800 Bo Li <libo.gcs85@bytedance.com> wrote:

> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.

Noted ;)

Quick question:

>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++

The changes are very x86-heavy.  Is that a necessary thing?  Would
another architecture need to implement a similar amount to enable RPAL?
IOW, how much of the above could be made arch-neutral?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (38 preceding siblings ...)
  2025-05-30 22:42 ` Andrew Morton
@ 2025-05-31  7:16 ` Ingo Molnar
  2025-06-03 17:49 ` H. Peter Anvin
  40 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2025-05-31  7:16 UTC (permalink / raw)
  To: Bo Li
  Cc: tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz, dietmar.eggemann, hpa, acme,
	namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, viro, brauner, jack, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, rostedt, bsegall,
	mgorman, vschneid, jannh, pfalcato, riel, harry.yoo, linux-kernel,
	linux-perf-users, linux-fsdevel, linux-mm, duanxiongchun,
	yinhongbo, dengliang.1214, xieyongji, chaiwen.cc, songmuchun,
	yuanzhu, chengguozhu, sunjiadong.lff

* Bo Li <libo.gcs85@bytedance.com> wrote:

> # Performance
> 
> To quantify the performance improvements driven by RPAL, we measured 
> latency both before and after its deployment. Experiments were 
> conducted on a server equipped with two Intel(R) Xeon(R) Platinum 
> 8336C CPUs (2.30 GHz) and 1 TB of memory. Latency was defined as the 
> duration from when the client thread initiates a message to when the 
> server thread is invoked and receives it.
> 
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency 
> improvements over the current epoll implementation—achieving a 
> 17,913-cycle reduction (an ~91.3% improvement) for 32-byte messages.

No, these results do not necessarily confirm that.

19,616 cycles per message on a vanilla kernel on a 2.3 GHz CPU suggests 
a messaging performance of 117k messages/second or 8.5 usecs/message, 
which is *way* beyond typical kernel interprocess communication 
latencies on comparable CPUs:

  root@localhost:~# taskset 1 perf bench sched pipe
  # Running 'sched/pipe' benchmark:
  # Executed 1000000 pipe operations between two processes

       Total time: 2.790 [sec]

       2.790614 usecs/op
         358344 ops/sec

And my 2.8 usecs result was from a kernel running inside a KVM sandbox 
...

( I used 'taskset' to bind the benchmark to a single CPU, to remove any 
  inter-CPU migration noise from the measurement. )

The scheduler parts of your series simply try to remove much of 
scheduler and context switching functionality to create a special 
fast-path with no FPU context switching and TLB flushing AFAICS, for 
the purposes of message latency benchmarking in essence, and you then 
compare it against the full scheduling and MM context switching costs 
of full-blown Linux processes.

I'm not convinced, at all, that this many changes are required to speed 
up the usecase you are trying to optimize:

  >  61 files changed, 9710 insertions(+), 4 deletions(-)

Nor am I convinced that 9,700 lines of *new* code of a parallel 
facility are needed, crudely wrapped in 1970s technology (#ifdefs), 
instead of optimizing/improving facilities we already have...

So NAK for the scheduler bits, until proven otherwise (and presented in 
a clean fashion, which the current series is very far from).

I'll be the first one to acknowledge that our process and MM context 
switching overhead is too high and could be improved, and I have no 
objections against the general goal of improving Linux inter-process 
messaging performance either, I only NAK this particular 
implementation/approach.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 00/35] optimize cost of inter-process communication
  2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
                   ` (39 preceding siblings ...)
  2025-05-31  7:16 ` Ingo Molnar
@ 2025-06-03 17:49 ` H. Peter Anvin
  40 siblings, 0 replies; 46+ messages in thread
From: H. Peter Anvin @ 2025-06-03 17:49 UTC (permalink / raw)
  To: Bo Li, tglx, mingo, bp, dave.hansen, x86, luto, kees, akpm, david,
	juri.lelli, vincent.guittot, peterz
  Cc: dietmar.eggemann, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	viro, brauner, jack, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, rostedt, bsegall, mgorman, vschneid, jannh,
	pfalcato, riel, harry.yoo, linux-kernel, linux-perf-users,
	linux-fsdevel, linux-mm, duanxiongchun, yinhongbo, dengliang.1214,
	xieyongji, chaiwen.cc, songmuchun, yuanzhu, chengguozhu,
	sunjiadong.lff

On 5/30/25 02:27, Bo Li wrote:
> Changelog:
> 
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
> 
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
> 

Okay,

First of all, I agree with most of the other reviewers that this is
insane. Second of all, calling this "optimize cost of inter-process
communication" is *extremely* misleading, to the point that one could
worry about it being malicious.

What you are doing is attempting to provide isolation between threads
running in the same memory space. *By definition* those are not processes.

Secondly, doing function calls from one thread to another in the same
memory space isn't really IPC at all, as the scheduler is not involved.

Third, this is something that should be possible to do entirely in user
space (mostly in a modified libc). Most of the facilities that you seem
to implement already have equivalents (/dev/shm, ET_DYN, ld.so, ...)

This isn't a new idea; this is where the microkernel people eventually
ended up when they tried to get performant. It didn't work well for the
same reason -- without involving the kernel (or dedicated hardware
facilities; x86 segments and MPK are *not* designed for this), the
isolation *can't* be enforced. You can, of course, have a kernel
interface to switch the address space around -- and you have just
(re)invented processes.

From what I can see, a saner version of this would probably be something
like a sched_yield_to(X) system call, basically a request to the
scheduler "if possible, give the rest of my time slice to process/thread
<X>, as if I had been continuing to run." The rest of the communication
can be done with shared memory.

The other option is that if you actually are OK with your workloads
living in the same privilege domain to simply use threads.

If this somehow isn't what you're doing, and I (and others) have somehow
misread the intentions entirely, we will need a whole lot of additional
explanations.

	-hpa

	-hpa

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2025-06-03 17:51 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-30  9:27 [RFC v2 00/35] optimize cost of inter-process communication Bo Li
2025-05-30  9:27 ` [RFC v2 01/35] Kbuild: rpal support Bo Li
2025-05-30  9:27 ` [RFC v2 02/35] RPAL: add struct rpal_service Bo Li
2025-05-30  9:27 ` [RFC v2 03/35] RPAL: add service registration interface Bo Li
2025-05-30  9:27 ` [RFC v2 04/35] RPAL: add member to task_struct and mm_struct Bo Li
2025-05-30  9:27 ` [RFC v2 05/35] RPAL: enable virtual address space partitions Bo Li
2025-05-30  9:27 ` [RFC v2 06/35] RPAL: add user interface Bo Li
2025-05-30  9:27 ` [RFC v2 07/35] RPAL: enable shared page mmap Bo Li
2025-05-30  9:27 ` [RFC v2 08/35] RPAL: enable sender/receiver registration Bo Li
2025-05-30  9:27 ` [RFC v2 09/35] RPAL: enable address space sharing Bo Li
2025-05-30  9:27 ` [RFC v2 10/35] RPAL: allow service enable/disable Bo Li
2025-05-30  9:27 ` [RFC v2 11/35] RPAL: add service request/release Bo Li
2025-05-30  9:27 ` [RFC v2 12/35] RPAL: enable service disable notification Bo Li
2025-05-30  9:27 ` [RFC v2 13/35] RPAL: add tlb flushing support Bo Li
2025-05-30  9:27 ` [RFC v2 14/35] RPAL: enable page fault handling Bo Li
2025-05-30 13:59   ` Dave Hansen
2025-05-30  9:27 ` [RFC v2 15/35] RPAL: add sender/receiver state Bo Li
2025-05-30  9:27 ` [RFC v2 16/35] RPAL: add cpu lock interface Bo Li
2025-05-30  9:27 ` [RFC v2 17/35] RPAL: add a mapping between fsbase and tasks Bo Li
2025-05-30  9:27 ` [RFC v2 18/35] sched: pick a specified task Bo Li
2025-05-30  9:27 ` [RFC v2 19/35] RPAL: add lazy switch main logic Bo Li
2025-05-30  9:27 ` [RFC v2 20/35] RPAL: add rpal_ret_from_lazy_switch Bo Li
2025-05-30  9:27 ` [RFC v2 21/35] RPAL: add kernel entry handling for lazy switch Bo Li
2025-05-30  9:27 ` [RFC v2 22/35] RPAL: rebuild receiver state Bo Li
2025-05-30  9:27 ` [RFC v2 23/35] RPAL: resume cpumask when fork Bo Li
2025-05-30  9:27 ` [RFC v2 24/35] RPAL: critical section optimization Bo Li
2025-05-30  9:27 ` [RFC v2 25/35] RPAL: add MPK initialization and interface Bo Li
2025-05-30  9:27 ` [RFC v2 26/35] RPAL: enable MPK support Bo Li
2025-05-30 17:03   ` Dave Hansen
2025-05-30  9:27 ` [RFC v2 27/35] RPAL: add epoll support Bo Li
2025-05-30  9:27 ` [RFC v2 28/35] RPAL: add rpal_uds_fdmap() support Bo Li
2025-05-30  9:27 ` [RFC v2 29/35] RPAL: fix race condition in pkru update Bo Li
2025-05-30  9:27 ` [RFC v2 30/35] RPAL: fix pkru setup when fork Bo Li
2025-05-30  9:27 ` [RFC v2 31/35] RPAL: add receiver waker Bo Li
2025-05-30  9:28 ` [RFC v2 32/35] RPAL: fix unknown nmi on AMD CPU Bo Li
2025-05-30  9:28 ` [RFC v2 33/35] RPAL: enable time slice correction Bo Li
2025-05-30  9:28 ` [RFC v2 34/35] RPAL: enable fast epoll wait Bo Li
2025-05-30  9:28 ` [RFC v2 35/35] samples/rpal: add RPAL samples Bo Li
2025-05-30  9:33 ` [RFC v2 00/35] optimize cost of inter-process communication Lorenzo Stoakes
2025-06-03  8:22   ` Bo Li
2025-06-03  9:22     ` Lorenzo Stoakes
2025-05-30  9:41 ` Pedro Falcato
2025-05-30  9:56 ` David Hildenbrand
2025-05-30 22:42 ` Andrew Morton
2025-05-31  7:16 ` Ingo Molnar
2025-06-03 17:49 ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).