[RFC v2 00/35] optimize cost of inter-process communication
Posted by Bo Li 8 months, 2 weeks ago
Changelog:

v2:
- Port the RPAL functions to the latest v6.15 kernel.
- Add a supplementary introduction to the application scenarios and
  security considerations of RPAL.

link to v1:
https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/

--------------------------------------------------------------------------

# Introduction

We mainly apply RPAL to the service mesh architecture widely adopted in
modern cloud-native data centers. Before the rise of the service mesh,
network functions were usually integrated into monolithic applications as
libraries, and the main business programs invoked them through function
calls. To allow the main business programs and the network functions to be
developed, operated, and maintained independently, the service mesh moved
the network functions out of the main business programs and into separate
processes (called sidecars). The main business program and the sidecar now
interact through inter-process communication (IPC), and this added IPC has
sharply increased resource consumption in cloud-native data centers; it can
occupy more than 10% of the CPU of an entire microservice cluster.

To regain the efficiency of the monolithic architecture's function calls
under the service mesh architecture, we introduce the RPAL (Run Process As
Library) architecture, which shares the virtual address space between
processes and switches threads in user mode. Analyzing the service mesh
architecture, we found that memory isolation between the main business
program and the sidecar is not particularly important: both are split from
the same application and were an integral part of the original monolith.
What matters more is that the two processes remain independent, because
they must be developed and maintained separately to preserve the
architectural advantages of the service mesh. RPAL therefore breaks the
isolation between processes while preserving their independence. We think
RPAL can also be applied to other scenarios featuring sidecar-like
architectures, such as distributed file storage systems in LLM
infrastructure.

In the RPAL architecture, multiple processes share one virtual address
space, so the architecture can be regarded as an advanced version of the
Linux shared memory mechanism:

1. Traditional shared memory requires two processes to negotiate so that
the same piece of memory is mapped into both. In the RPAL architecture, two
RPAL processes likewise have to reach a consensus before they can invoke
the relevant RPAL system calls to share the virtual address space.
2. Traditional shared memory shares only part of the data. In the RPAL
architecture, processes that have established an RPAL communication
relationship share one virtual address space, and all user memory (such as
data segments and code segments) of each RPAL process is shared among these
processes. A process still cannot access another process's memory at
arbitrary times: we use the MPK mechanism to ensure that the memory of
other processes can only be accessed while special RPAL functions are being
executed. Any other access triggers a page fault.
3. In the RPAL architecture, to keep the execution context of the shared
code (such as the stack and thread-local storage) consistent, we build
user-mode thread context switching on top of the shared virtual address
space, so that threads of different processes can switch to each other
quickly in user mode without falling into the kernel for a slow switch.

# Background

In traditional inter-process communication (IPC) scenarios, Unix domain
sockets are commonly used in conjunction with the epoll() family for event
multiplexing. IPC operations involve system calls on both the data and
control planes, thereby imposing a non-trivial overhead on the interacting
processes. Even when shared memory is employed to optimize the data plane,
two data copies still remain. Specifically, data is initially copied from
a process's private memory space into the shared memory area, and then it
is copied from the shared memory into the private memory of another
process.

This poses a question: Is it possible to reduce the overhead of IPC with
only minimal modifications at the application level? To address this, we
observed that the functionality of IPC, which encompasses data transfer
and invocation of the target thread, is similar to a function call, where
arguments are passed and the callee function is invoked to process them.
Inspired by this analogy, we introduce RPAL (Run Process As Library), a
framework designed to enable one process to invoke another as if making
a local function call, all without going through the kernel.

# Design

First, let’s formalize RPAL’s core objectives:

1. Data-plane efficiency: Reduce the number of data copies from two (in the
   shared memory solution) to one.
2. Control-plane optimization: Eliminate the overhead of system calls and
   kernel's thread switches.
3. Application compatibility: Minimize the modifications to existing
   applications that utilize Unix domain sockets and the epoll() family.

To attain the first objective, processes that use RPAL share the same
virtual address space. So one process can access another's data directly
via a data pointer. This means data can be transferred from one process to
another with just one copy operation. 

To meet the second goal, RPAL relies on the shared address space to do
lightweight context switching in user space, which we call an "RPAL call".
This allows one process to execute another process's code just like a
local function call.

To achieve the third target, RPAL stays compatible with the epoll family
of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
application uses epoll for IPC, developers can switch to RPAL with just a
few small changes. For instance, you can just replace epoll_wait() with
rpal_epoll_wait(). The basic epoll procedure, where a process waits for
another to write to a monitored descriptor using an epoll file descriptor,
still works fine with RPAL.
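
For illustration, a minimal sketch of the change in an epoll-based receive
loop might look like this. It assumes rpal_epoll_wait() mirrors the
epoll_wait() prototype (the cover letter only names the function); error
handling is omitted:

  #include <sys/epoll.h>
  #include <stdio.h>

  /* Assumed prototype for the user-mode RPAL library's replacement. */
  extern int rpal_epoll_wait(int epfd, struct epoll_event *events,
                             int maxevents, int timeout);

  #define MAX_EVENTS 16

  static void receive_loop(int epfd)
  {
          struct epoll_event events[MAX_EVENTS];

          for (;;) {
                  /* Was: epoll_wait(epfd, events, MAX_EVENTS, -1); */
                  int n = rpal_epoll_wait(epfd, events, MAX_EVENTS, -1);

                  for (int i = 0; i < n; i++)
                          printf("event on fd %d\n", events[i].data.fd);
          }
  }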

## Address space sharing

For address space sharing, RPAL partitions the entire userspace virtual
address space and allocates non-overlapping memory ranges to each process.
On x86_64, each process is assigned the range covered by a single
top-level (PGD) entry, i.e. one full PUD page table, which spans 512GB.
This restricts each process's virtual address space to 512GB on x86_64,
which is sufficient for most applications in our scenario. The rationale
is straightforward: address space sharing can then be achieved simply by
copying that top-level entry from one process's page table to another's
(a sketch follows the diagram below), after which one process can directly
use a data pointer to access another's memory.


 |------------| <- 0
 |------------| <- 512 GB
 |  Process A |
 |------------| <- 2*512 GB
 |------------| <- n*512 GB
 |  Process B |
 |------------| <- (n+1)*512 GB
 |------------| <- STACK_TOP
 |  Kernel    |
 |------------|
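
Conceptually, sharing one 512GB slot boils down to copying a single
top-level page table entry. The following is an illustrative kernel-side
sketch only, not code from this series (the helper name is made up);
locking, reference counting, 5-level paging and TLB maintenance are all
omitted:

  #include <linux/mm.h>

  /* Make dst_mm's top-level entry point at src_mm's PUD page, so every
   * mapping in [slot_base, slot_base + 512GB) becomes visible to both. */
  static void rpal_share_slot(struct mm_struct *dst_mm,
                              struct mm_struct *src_mm,
                              unsigned long slot_base)
  {
          pgd_t *src_pgd = pgd_offset(src_mm, slot_base);
          pgd_t *dst_pgd = pgd_offset(dst_mm, slot_base);

          set_pgd(dst_pgd, *src_pgd);
  }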

## RPAL call

We refer to the lightweight userspace context switching mechanism as RPAL
call. It enables the caller (or sender) thread of one process to directly
switch to the callee (or receiver) thread of another process. 

When Process A’s caller thread initiates an RPAL call to Process B’s
callee thread, the CPU saves the caller’s context and loads the callee’s
context. This enables direct userspace control flow transfer from the
caller to the callee. After the callee finishes data processing, the CPU
saves Process B’s callee context and switches back to Process A’s caller
context, completing a full IPC cycle.


 |------------|                |---------------------|  
 |  Process A |                |  Process B          |
 | |-------|  |                | |-------|           |     
 | | caller| --- RPAL call --> | | callee|    handle |
 | | thread| <------------------ | thread| -> event  |
 | |-------|  |                | |-------|           |
 |------------|                |---------------------|
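
The user-level state involved in such a switch can be pictured roughly as
follows. All names and the structure layout are illustrative; the actual
switch is performed by the assembly in samples/rpal/librpal (e.g.
asm_x86_64_rpal_call.S):

  /* Per-thread user context that an RPAL call saves and restores. */
  struct rpal_user_context {
          void *rsp;           /* stack pointer; callee-saved registers
                                  are spilled onto this stack */
          void *fsbase;        /* TLS pointer of the thread */
          unsigned int pkru;   /* MPK permissions to install */
  };

  /* Implemented in assembly: save the caller's context into 'from',
   * install 'to' (stack, TLS, PKRU) and jump into the callee. */
  extern void rpal_switch_context(struct rpal_user_context *from,
                                  const struct rpal_user_context *to);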

# Security and compatibility with kernel subsystems

## Memory protection between processes

Since processes using RPAL share the address space, unintended
cross-process memory access may occur and corrupt the data of another
process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
architectures.

MPK assigns 4 bits in each page table entry to a "protection key", which
is paired with a userspace register (PKRU). The PKRU register defines
access permissions for memory regions protected by specific keys (for
detailed implementation, refer to the kernel documentation "Memory
Protection Keys"). With MPK, even though the address space is shared
among processes, cross-process access is restricted: a process can only
access the memory protected by a key if its PKRU register is configured
with the corresponding permission. This ensures that processes cannot
access each other’s memory unless an explicit PKRU configuration is set.
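
At the user level, the mechanics are simply PKRU updates around the
cross-process access. The snippet below sketches that idea and is not
RPAL's actual code; the key number (1) and the helper names are made up:

  #include <stdint.h>

  static inline uint32_t rdpkru(void)
  {
          uint32_t eax, edx;

          asm volatile("rdpkru" : "=a"(eax), "=d"(edx) : "c"(0));
          return eax;
  }

  static inline void wrpkru(uint32_t pkru)
  {
          asm volatile("wrpkru" : : "a"(pkru), "c"(0), "d"(0) : "memory");
  }

  /* PKRU holds two bits per key: bit 2*key disables access (AD), bit
   * 2*key + 1 disables writes (WD). Grant access to memory tagged with
   * key 1 only while fn() runs, then revoke it again. */
  static void with_remote_access(void (*fn)(void *), void *arg)
  {
          uint32_t old = rdpkru();

          wrpkru(old & ~(uint32_t)(0x3 << (2 * 1)));
          fn(arg);
          wrpkru(old);
  }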

## Page fault handling and TLB flushing

Due to the shared address space architecture, both page fault handling and
TLB flushing require careful consideration. For instance, when Process A
accesses Process B’s memory, a page fault may occur in Process A's
context, but the faulting address belongs to Process B. In this case, we
must pass Process B's mm_struct to the page fault handler.

TLB flushing is more complex. When a thread flushes the TLB, because the
address space is shared, the affected memory may be accessed not only by
other threads of the current process but also by every other process that
shares the address space. Therefore, the CPU mask used for the TLB flush
should be the union of the mm_cpumasks of all processes that share the
address space.
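
A sketch of building that union (how RPAL actually tracks the sharing
processes is not shown here, and the helper name is illustrative):

  #include <linux/cpumask.h>
  #include <linux/mm_types.h>

  static void rpal_build_flush_mask(struct cpumask *mask,
                                    struct mm_struct **mms, int nr)
  {
          int i;

          cpumask_clear(mask);
          for (i = 0; i < nr; i++)
                  cpumask_or(mask, mask, mm_cpumask(mms[i]));
          /* The flush IPIs then target 'mask' instead of only the
           * current process's mm_cpumask(). */
  }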

## Lazy switch of kernel context

In RPAL, a mismatch may arise between the user context and the kernel
context. The RPAL call is designed solely to switch the user context,
leaving the kernel context unchanged. For instance, when an RPAL call takes
place, transitioning from the caller thread to the callee thread, and a
system call is subsequently issued by the callee thread, the kernel would
incorrectly use the caller's kernel context (such as the kernel stack) to
process that system call.

To resolve context mismatch issues, a kernel context switch is triggered at
the kernel entry point when the callee initiates a syscall or an
exception/interrupt occurs. This mechanism ensures context consistency
before processing system calls, interrupts, or exceptions. We refer to this
kernel context switch as a "lazy switch" because it defers the switching
operation from the traditional thread switch point to the next kernel entry
point.
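
In pseudo-code, the check at kernel entry looks roughly like this. Both
helpers are hypothetical stand-ins for what the series wires into
entry_64.S and the scheduler:

  static void rpal_lazy_switch_if_needed(void)
  {
          /* Resolve the user thread that just entered the kernel from
           * its FS base, via the fsbase->task mapping. */
          struct task_struct *owner = rpal_current_user_task();

          if (owner && owner != current)
                  rpal_do_lazy_switch(owner);  /* switch kernel context */
  }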

Lazy switch should be minimized as much as possible, as it significantly
degrades performance. We currently utilize RPAL in an RPC framework, in
which the RPC sender thread relies on the RPAL call to invoke the RPC
receiver thread entirely in user space. In most cases, the receiver
thread is free of system calls and the code execution time is relatively
short. This characteristic effectively reduces the probability of a lazy
switch occurring.

## Time slice correction

After an RPAL call, the callee's user mode code executes. However, the
kernel incorrectly attributes this CPU time to the caller due to the
unchanged kernel context.

To resolve this, we use the Time Stamp Counter (TSC) register to measure
CPU time consumed by the callee thread in user space. The kernel then uses
this user-reported timing data to adjust the CPU accounting for both the
caller and callee thread, similar to how CPU steal time is implemented.
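
A user-space sketch of the measurement side follows; the reporting
interface shown here (a plain variable) is a placeholder rather than the
real ABI:

  #include <stdint.h>
  #include <x86intrin.h>

  static uint64_t callee_cycles_to_report;

  static void run_callee(void (*handler)(void *), void *arg)
  {
          uint64_t start = __rdtsc();

          handler(arg);           /* callee's user-mode work */

          /* The kernel later uses this value to move the accounted CPU
           * time from the caller thread to the callee thread. */
          callee_cycles_to_report += __rdtsc() - start;
  }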

## Process recovery

Since processes can access each other’s memory, there is a risk that the
target process’s memory may become invalid at the access time (e.g., if
the target process has exited unexpectedly). The kernel must handle such
cases; otherwise, the accessing process could be terminated due to
failures originating from another process.

To address this issue, each thread of the process should pre-establish a
recovery point when accessing the memory of other processes. When such an
invalid access occurs, the thread traps into the kernel. Inside the page
fault handler, the kernel restores the user context of the thread to the
recovery point. This mechanism ensures that processes maintain mutual
independence, preventing cascading failures caused by cross-process memory
issues.
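
As a conceptual analogy only, the control flow resembles sigsetjmp() and
siglongjmp(); in RPAL the recovery point is registered with the kernel and
the page fault handler itself rewinds the user context:

  #include <setjmp.h>
  #include <stddef.h>
  #include <string.h>

  static __thread sigjmp_buf rpal_recovery_point;

  static int rpal_guarded_copy(void *dst, const void *src, size_t len)
  {
          if (sigsetjmp(rpal_recovery_point, 1))
                  return -1;      /* peer memory vanished; resumed here */

          memcpy(dst, src, len);  /* may fault if the peer has exited */
          return 0;
  }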

# Performance

To quantify the performance improvements driven by RPAL, we measured
latency both before and after its deployment. Experiments were conducted on
a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
and 1 TB of memory. Latency was defined as the duration from when the
client thread initiates a message to when the server thread is invoked and
receives it.

During testing, the client transmitted 1 million 32-byte messages, and we
computed the per-message average latency. The results are as follows:

*****************
Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
 Message count: 1000000, Average latency: 19616 cycles
With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
 Message count: 1000000, Average latency: 1703 cycles
*****************

These results confirm that RPAL delivers substantial latency improvements
over the current epoll implementation: a 17,913-cycle reduction (an ~91.3%
improvement) for 32-byte messages. At 2.30 GHz, the two averages correspond
to roughly 8.5 us and 0.74 us per message, respectively.

We have applied RPAL to an RPC framework that is widely used in our data
center. With RPAL, we achieved up to a 15.5% reduction in the CPU
utilization of the involved processes in a real-world microservice
scenario. The gains primarily stem from cutting control-plane overhead by
using userspace context switches; in addition, address space sharing
significantly reduces the number of memory copies.

# Future Work

Currently, RPAL requires the MPK (Memory Protection Keys) hardware feature,
which is supported by a range of Intel CPUs. On AMD, MPK is supported only
on recent processors, specifically 3rd Generation AMD EPYC™ processors and
later. Patch sets that extend RPAL support to systems lacking MPK hardware
will be provided later.

Accompanying test programs are provided in the samples/rpal/ directory,
and the user-mode RPAL library, which implements the user-space RPAL call,
is in the samples/rpal/librpal directory.
            
We hope to get some community discussions and feedback on RPAL's
optimization approaches and architecture.

Looking forward to your comments.

Bo Li (35):
  Kbuild: rpal support
  RPAL: add struct rpal_service
  RPAL: add service registration interface
  RPAL: add member to task_struct and mm_struct
  RPAL: enable virtual address space partitions
  RPAL: add user interface
  RPAL: enable shared page mmap
  RPAL: enable sender/receiver registration
  RPAL: enable address space sharing
  RPAL: allow service enable/disable
  RPAL: add service request/release
  RPAL: enable service disable notification
  RPAL: add tlb flushing support
  RPAL: enable page fault handling
  RPAL: add sender/receiver state
  RPAL: add cpu lock interface
  RPAL: add a mapping between fsbase and tasks
  sched: pick a specified task
  RPAL: add lazy switch main logic
  RPAL: add rpal_ret_from_lazy_switch
  RPAL: add kernel entry handling for lazy switch
  RPAL: rebuild receiver state
  RPAL: resume cpumask when fork
  RPAL: critical section optimization
  RPAL: add MPK initialization and interface
  RPAL: enable MPK support
  RPAL: add epoll support
  RPAL: add rpal_uds_fdmap() support
  RPAL: fix race condition in pkru update
  RPAL: fix pkru setup when fork
  RPAL: add receiver waker
  RPAL: fix unknown nmi on AMD CPU
  RPAL: enable time slice correction
  RPAL: enable fast epoll wait
  samples/rpal: add RPAL samples

 arch/x86/Kbuild                               |    2 +
 arch/x86/Kconfig                              |    2 +
 arch/x86/entry/entry_64.S                     |  160 ++
 arch/x86/events/amd/core.c                    |   14 +
 arch/x86/include/asm/pgtable.h                |   25 +
 arch/x86/include/asm/pgtable_types.h          |   11 +
 arch/x86/include/asm/tlbflush.h               |   10 +
 arch/x86/kernel/asm-offsets.c                 |    3 +
 arch/x86/kernel/cpu/common.c                  |    8 +-
 arch/x86/kernel/fpu/core.c                    |    8 +-
 arch/x86/kernel/nmi.c                         |   20 +
 arch/x86/kernel/process.c                     |   25 +-
 arch/x86/kernel/process_64.c                  |  118 +
 arch/x86/mm/fault.c                           |  271 ++
 arch/x86/mm/mmap.c                            |   10 +
 arch/x86/mm/tlb.c                             |  172 ++
 arch/x86/rpal/Kconfig                         |   21 +
 arch/x86/rpal/Makefile                        |    6 +
 arch/x86/rpal/core.c                          |  477 ++++
 arch/x86/rpal/internal.h                      |   69 +
 arch/x86/rpal/mm.c                            |  426 +++
 arch/x86/rpal/pku.c                           |  196 ++
 arch/x86/rpal/proc.c                          |  279 ++
 arch/x86/rpal/service.c                       |  776 ++++++
 arch/x86/rpal/thread.c                        |  313 +++
 fs/binfmt_elf.c                               |   98 +-
 fs/eventpoll.c                                |  320 +++
 fs/exec.c                                     |   11 +
 include/linux/mm_types.h                      |    3 +
 include/linux/rpal.h                          |  633 +++++
 include/linux/sched.h                         |   21 +
 init/init_task.c                              |    6 +
 kernel/exit.c                                 |    5 +
 kernel/fork.c                                 |   32 +
 kernel/sched/core.c                           |  676 +++++
 kernel/sched/fair.c                           |  109 +
 kernel/sched/sched.h                          |    8 +
 mm/mmap.c                                     |   16 +
 mm/mprotect.c                                 |  106 +
 mm/rmap.c                                     |    4 +
 mm/vma.c                                      |   18 +
 samples/rpal/Makefile                         |   17 +
 samples/rpal/asm_define.c                     |   14 +
 samples/rpal/client.c                         |  178 ++
 samples/rpal/librpal/asm_define.h             |    6 +
 samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
 samples/rpal/librpal/debug.h                  |   12 +
 samples/rpal/librpal/fiber.c                  |  119 +
 samples/rpal/librpal/fiber.h                  |   64 +
 .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
 .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
 .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
 samples/rpal/librpal/private.h                |  341 +++
 samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
 samples/rpal/librpal/rpal.h                   |  149 ++
 samples/rpal/librpal/rpal_pkru.h              |   78 +
 samples/rpal/librpal/rpal_queue.c             |  239 ++
 samples/rpal/librpal/rpal_queue.h             |   55 +
 samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
 samples/rpal/offset.sh                        |    5 +
 samples/rpal/server.c                         |  249 ++
 61 files changed, 9710 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/rpal/Kconfig
 create mode 100644 arch/x86/rpal/Makefile
 create mode 100644 arch/x86/rpal/core.c
 create mode 100644 arch/x86/rpal/internal.h
 create mode 100644 arch/x86/rpal/mm.c
 create mode 100644 arch/x86/rpal/pku.c
 create mode 100644 arch/x86/rpal/proc.c
 create mode 100644 arch/x86/rpal/service.c
 create mode 100644 arch/x86/rpal/thread.c
 create mode 100644 include/linux/rpal.h
 create mode 100644 samples/rpal/Makefile
 create mode 100644 samples/rpal/asm_define.c
 create mode 100644 samples/rpal/client.c
 create mode 100644 samples/rpal/librpal/asm_define.h
 create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
 create mode 100644 samples/rpal/librpal/debug.h
 create mode 100644 samples/rpal/librpal/fiber.c
 create mode 100644 samples/rpal/librpal/fiber.h
 create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
 create mode 100644 samples/rpal/librpal/private.h
 create mode 100644 samples/rpal/librpal/rpal.c
 create mode 100644 samples/rpal/librpal/rpal.h
 create mode 100644 samples/rpal/librpal/rpal_pkru.h
 create mode 100644 samples/rpal/librpal/rpal_queue.c
 create mode 100644 samples/rpal/librpal/rpal_queue.h
 create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
 create mode 100755 samples/rpal/offset.sh
 create mode 100644 samples/rpal/server.c

-- 
2.20.1

Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by H. Peter Anvin 8 months, 1 week ago
On 5/30/25 02:27, Bo Li wrote:
> Changelog:
> 
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
> 
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
> 

Okay,

First of all, I agree with most of the other reviewers that this is
insane. Second of all, calling this "optimize cost of inter-process
communication" is *extremely* misleading, to the point that one could
worry about it being malicious.

What you are doing is attempting to provide isolation between threads
running in the same memory space. *By definition* those are not processes.

Secondly, doing function calls from one thread to another in the same
memory space isn't really IPC at all, as the scheduler is not involved.

Third, this is something that should be possible to do entirely in user
space (mostly in a modified libc). Most of the facilities that you seem
to implement already have equivalents (/dev/shm, ET_DYN, ld.so, ...)

This isn't a new idea; this is where the microkernel people eventually
ended up when they tried to get performant. It didn't work well for the
same reason -- without involving the kernel (or dedicated hardware
facilities; x86 segments and MPK are *not* designed for this), the
isolation *can't* be enforced. You can, of course, have a kernel
interface to switch the address space around -- and you have just
(re)invented processes.

From what I can see, a saner version of this would probably be something
like a sched_yield_to(X) system call, basically a request to the
scheduler "if possible, give the rest of my time slice to process/thread
<X>, as if I had been continuing to run." The rest of the communication
can be done with shared memory.

The other option is that if you actually are OK with your workloads
living in the same privilege domain to simply use threads.

If this somehow isn't what you're doing, and I (and others) have somehow
misread the intentions entirely, we will need a whole lot of additional
explanations.

	-hpa


Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Ingo Molnar 8 months, 2 weeks ago
* Bo Li <libo.gcs85@bytedance.com> wrote:

> # Performance
> 
> To quantify the performance improvements driven by RPAL, we measured 
> latency both before and after its deployment. Experiments were 
> conducted on a server equipped with two Intel(R) Xeon(R) Platinum 
> 8336C CPUs (2.30 GHz) and 1 TB of memory. Latency was defined as the 
> duration from when the client thread initiates a message to when the 
> server thread is invoked and receives it.
> 
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency 
> improvements over the current epoll implementation—achieving a 
> 17,913-cycle reduction (an ~91.3% improvement) for 32-byte messages.

No, these results do not necessarily confirm that.

19,616 cycles per message on a vanilla kernel on a 2.3 GHz CPU suggests 
a messaging performance of 117k messages/second or 8.5 usecs/message, 
which is *way* beyond typical kernel interprocess communication 
latencies on comparable CPUs:

  root@localhost:~# taskset 1 perf bench sched pipe
  # Running 'sched/pipe' benchmark:
  # Executed 1000000 pipe operations between two processes

       Total time: 2.790 [sec]

       2.790614 usecs/op
         358344 ops/sec

And my 2.8 usecs result was from a kernel running inside a KVM sandbox 
...

( I used 'taskset' to bind the benchmark to a single CPU, to remove any 
  inter-CPU migration noise from the measurement. )

The scheduler parts of your series simply try to remove much of 
scheduler and context switching functionality to create a special 
fast-path with no FPU context switching and TLB flushing AFAICS, for 
the purposes of message latency benchmarking in essence, and you then 
compare it against the full scheduling and MM context switching costs 
of full-blown Linux processes.

I'm not convinced, at all, that this many changes are required to speed 
up the usecase you are trying to optimize:

  >  61 files changed, 9710 insertions(+), 4 deletions(-)

Nor am I convinced that 9,700 lines of *new* code of a parallel 
facility are needed, crudely wrapped in 1970s technology (#ifdefs), 
instead of optimizing/improving facilities we already have...

So NAK for the scheduler bits, until proven otherwise (and presented in 
a clean fashion, which the current series is very far from).

I'll be the first one to acknowledge that our process and MM context 
switching overhead is too high and could be improved, and I have no 
objections against the general goal of improving Linux inter-process 
messaging performance either, I only NAK this particular 
implementation/approach.

Thanks,

	Ingo
Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Andrew Morton 8 months, 2 weeks ago
On Fri, 30 May 2025 17:27:28 +0800 Bo Li <libo.gcs85@bytedance.com> wrote:

> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.

Noted ;)

Quick question:

>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++

The changes are very x86-heavy.  Is that a necessary thing?  Would
another architecture need to implement a similar amount to enable RPAL?
IOW, how much of the above could be made arch-neutral?
Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by David Hildenbrand 8 months, 2 weeks ago
> 
> ## Address space sharing
> 
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64 architectures, RPAL uses a memory range size covered by a
> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
> each process’s virtual address space to 512GB on x86_64, sufficient for
> most applications in our scenario. The rationale is straightforward:
> address space sharing can be simply achieved by copying the PUD from one
> process’s page table to another’s. So one process can directly use the
> data pointer to access another's memory.
> 
> 
>   |------------| <- 0
>   |------------| <- 512 GB
>   |  Process A |
>   |------------| <- 2*512 GB
>   |------------| <- n*512 GB
>   |  Process B |
>   |------------| <- (n+1)*512 GB
>   |------------| <- STACK_TOP
>   |  Kernel    |
>   |------------|

Oh my.

It reminds me a bit about mshare -- just that mshare tries to do it in a 
less hacky way..

> 
> ## RPAL call
> 
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process.
> 
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
> 
> 
>   |------------|                |---------------------|
>   |  Process A |                |  Process B          |
>   | |-------|  |                | |-------|           |
>   | | caller| --- RPAL call --> | | callee|    handle |
>   | | thread| <------------------ | thread| -> event  |
>   | |-------|  |                | |-------|           |
>   |------------|                |---------------------|
> 
> # Security and compatibility with kernel subsystems
> 
> ## Memory protection between processes
> 
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
> 
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
> 
> ## Page fault handling and TLB flushing
> 
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.

In an mshare region, all faults would be rerouted to the mshare MM 
either way.

> 
> TLB flushing is more complex. When a thread flushes the TLB, since the
> address space is shared, not only other threads in the current process but
> also other processes that share the address space may access the
> corresponding memory (related to the TLB flush). Therefore, the cpuset used
> for TLB flushing should be the union of the mm_cpumasks of all processes
> that share the address space.

Oh my.

It all reminds me of mshare, just the context switch handling is 
different (and significantly ... more problematic).

Maybe something could be built on top of mshare, but I'm afraid the real 
magic is the address space sharing combined with the context switching 
... which sounds like a big can of worms.

So in the current form, I understand all the NACKs.

-- 
Cheers,

David / dhildenb

Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Pedro Falcato 8 months, 2 weeks ago
On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
> 
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
> 
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
> 
> --------------------------------------------------------------------------
> 
> # Introduction
> 
> We mainly apply RPAL to the service mesh architecture widely adopted in
> modern cloud-native data centers. Before the rise of the service mesh
> architecture, network functions were usually integrated into monolithic
> applications as libraries, and the main business programs invoked them
> through function calls. However, to facilitate the independent development
> and operation and maintenance of the main business programs and network
> functions, the service mesh removed the network functions from the main
> business programs and made them independent processes (called sidecars).
> Inter-process communication (IPC) is used for interaction between the main
> business program and the sidecar, and the introduced inter-process
> communication has led to a sharp increase in resource consumption in
> cloud-native data centers, and may even occupy more than 10% of the CPU of
> the entire microservice cluster.
> 
> To achieve the efficient function call mechanism of the monolithic
> architecture under the service mesh architecture, we introduced the RPAL
> (Running Process As Library) architecture, which implements the sharing of
> the virtual address space of processes and the switching threads in user
> mode. Through the analysis of the service mesh architecture, we found that
> the process memory isolation between the main business program and the
> sidecar is not particularly important because they are split from one
> application and were an integral part of the original monolithic
> application. It is more important for the two processes to be independent
> of each other because they need to be independently developed and
> maintained to ensure the architectural advantages of the service mesh.
> Therefore, RPAL breaks the isolation between processes while preserving the
> independence between them.  We think that RPAL can also be applied to other
> scenarios featuring sidecar-like architectures, such as distributed file
> storage systems in LLM infra.
> 
> In RPAL architecture, multiple processes share a virtual address space, so
> this architecture can be regarded as an advanced version of the Linux
> shared memory mechanism:
> 
> 1. Traditional shared memory requires two processes to negotiate to ensure
> the mapping of the same piece of memory. In RPAL architecture, two RPAL
> processes still need to reach a consensus before they can successfully
> invoke the relevant system calls of RPAL to share the virtual address
> space.
> 2. Traditional shared memory only shares part of the data. However, in RPAL
> architecture, processes that have established an RPAL communication
> relationship share a virtual address space, and all user memory (such as
> data segments and code segments) of each RPAL process is shared among these
> processes. However, a process cannot access the memory of other processes
> at any time. We use the MPK mechanism to ensure that the memory of other
> processes can only be accessed when special RPAL functions are called.
> Otherwise, a page fault will be triggered.
> 3. In RPAL architecture, to ensure the consistency of the execution context
> of the shared code (such as the stack and thread local storage), we further
> implement the thread context switching in user mode based on the ability to
> share the virtual address space of different processes, enabling the
> threads of different processes to directly perform fast switching in user
> mode without falling into kernel mode for slow switching.
> 
> # Background
> 
> In traditional inter-process communication (IPC) scenarios, Unix domain
> sockets are commonly used in conjunction with the epoll() family for event
> multiplexing. IPC operations involve system calls on both the data and
> control planes, thereby imposing a non-trivial overhead on the interacting
> processes. Even when shared memory is employed to optimize the data plane,
> two data copies still remain. Specifically, data is initially copied from
> a process's private memory space into the shared memory area, and then it
> is copied from the shared memory into the private memory of another
> process.
> 
> This poses a question: Is it possible to reduce the overhead of IPC with
> only minimal modifications at the application level? To address this, we
> observed that the functionality of IPC, which encompasses data transfer
> and invocation of the target thread, is similar to a function call, where
> arguments are passed and the callee function is invoked to process them.
> Inspired by this analogy, we introduce RPAL (Run Process As Library), a
> framework designed to enable one process to invoke another as if making
> a local function call, all without going through the kernel.
> 
> # Design
> 
> First, let’s formalize RPAL’s core objectives:
> 
> 1. Data-plane efficiency: Reduce the number of data copies from two (in the
>    shared memory solution) to one.
> 2. Control-plane optimization: Eliminate the overhead of system calls and
>    kernel's thread switches.
> 3. Application compatibility: Minimize the modifications to existing
>    applications that utilize Unix domain sockets and the epoll() family.
> 
> To attain the first objective, processes that use RPAL share the same
> virtual address space. So one process can access another's data directly
> via a data pointer. This means data can be transferred from one process to
> another with just one copy operation. 
> 
> To meet the second goal, RPAL relies on the shared address space to do
> lightweight context switching in user space, which we call an "RPAL call".
> This allows one process to execute another process's code just like a
> local function call.
> 
> To achieve the third target, RPAL stays compatible with the epoll family
> of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
> application uses epoll for IPC, developers can switch to RPAL with just a
> few small changes. For instance, you can just replace epoll_wait() with
> rpal_epoll_wait(). The basic epoll procedure, where a process waits for
> another to write to a monitored descriptor using an epoll file descriptor,
> still works fine with RPAL.
> 
> ## Address space sharing
> 
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64 architectures, RPAL uses a memory range size covered by a
> single PUD (Page Upper Directory) entry, which is 512GB. This restricts
> each process’s virtual address space to 512GB on x86_64, sufficient for
> most applications in our scenario. The rationale is straightforward: 
> address space sharing can be simply achieved by copying the PUD from one
> process’s page table to another’s. So one process can directly use the
> data pointer to access another's memory.
> 
> 
>  |------------| <- 0
>  |------------| <- 512 GB
>  |  Process A |
>  |------------| <- 2*512 GB
>  |------------| <- n*512 GB
>  |  Process B |
>  |------------| <- (n+1)*512 GB
>  |------------| <- STACK_TOP
>  |  Kernel    |
>  |------------|
> 
> ## RPAL call
> 
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process. 
> 
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
> 
> 
>  |------------|                |---------------------|  
>  |  Process A |                |  Process B          |
>  | |-------|  |                | |-------|           |     
>  | | caller| --- RPAL call --> | | callee|    handle |
>  | | thread| <------------------ | thread| -> event  |
>  | |-------|  |                | |-------|           |
>  |------------|                |---------------------|
> 
> # Security and compatibility with kernel subsystems
> 
> ## Memory protection between processes
> 
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
> 
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
> 
> ## Page fault handling and TLB flushing
> 
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.
> 
> TLB flushing is more complex. When a thread flushes the TLB, since the
> address space is shared, not only other threads in the current process but
> also other processes that share the address space may access the
> corresponding memory (related to the TLB flush). Therefore, the cpuset used
> for TLB flushing should be the union of the mm_cpumasks of all processes
> that share the address space.
> 
> ## Lazy switch of kernel context
> 
> In RPAL, a mismatch may arise between the user context and the kernel
> context. The RPAL call is designed solely to switch the user context,
> leaving the kernel context unchanged. For instance, when a RPAL call takes
> place, transitioning from caller thread to callee thread, and subsequently
> a system call is initiated within callee thread, the kernel will
> incorrectly utilize the caller's kernel context (such as the kernel stack)
> to process the system call.
> 
> To resolve context mismatch issues, a kernel context switch is triggered at
> the kernel entry point when the callee initiates a syscall or an
> exception/interrupt occurs. This mechanism ensures context consistency
> before processing system calls, interrupts, or exceptions. We refer to this
> kernel context switch as a "lazy switch" because it defers the switching
> operation from the traditional thread switch point to the next kernel entry
> point.
> 
> Lazy switch should be minimized as much as possible, as it significantly
> degrades performance. We currently utilize RPAL in an RPC framework, in
> which the RPC sender thread relies on the RPAL call to invoke the RPC
> receiver thread entirely in user space. In most cases, the receiver
> thread is free of system calls and the code execution time is relatively
> short. This characteristic effectively reduces the probability of a lazy
> switch occurring.
> 
> ## Time slice correction
> 
> After an RPAL call, the callee's user mode code executes. However, the
> kernel incorrectly attributes this CPU time to the caller due to the
> unchanged kernel context.
> 
> To resolve this, we use the Time Stamp Counter (TSC) register to measure
> CPU time consumed by the callee thread in user space. The kernel then uses
> this user-reported timing data to adjust the CPU accounting for both the
> caller and callee thread, similar to how CPU steal time is implemented.
> 
> ## Process recovery
> 
> Since processes can access each other’s memory, there is a risk that the
> target process’s memory may become invalid at the access time (e.g., if
> the target process has exited unexpectedly). The kernel must handle such
> cases; otherwise, the accessing process could be terminated due to
> failures originating from another process.
> 
> To address this issue, each thread of the process should pre-establish a
> recovery point when accessing the memory of other processes. When such an
> invalid access occurs, the thread traps into the kernel. Inside the page
> fault handler, the kernel restores the user context of the thread to the
> recovery point. This mechanism ensures that processes maintain mutual
> independence, preventing cascading failures caused by cross-process memory
> issues.
> 
> # Performance
> 
> To quantify the performance improvements driven by RPAL, we measured
> latency both before and after its deployment. Experiments were conducted on
> a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
> and 1 TB of memory. Latency was defined as the duration from when the
> client thread initiates a message to when the server thread is invoked and
> receives it.
> 
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.
> 
> We have applied RPAL to an RPC framework that is widely used in our data
> center. With RPAL, we have successfully achieved up to 15.5% reduction in
> the CPU utilization of processes in real-world microservice scenario. The
> gains primarily stem from minimizing control plane overhead through the
> utilization of userspace context switches. Additionally, by leveraging
> address space sharing, the number of memory copies is significantly
> reduced.
> 
> # Future Work
> 
> Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
> which is supported by a range of Intel CPUs. For AMD architectures, MPK is
> supported only on the latest processor, specifically, 3th Generation AMD
> EPYC™ Processors and subsequent generations. Patch sets that extend RPAL
> support to systems lacking MPK hardware will be provided later.
> 
> Accompanying test programs are also provided in the samples/rpal/
> directory. And the user-mode RPAL library, which realizes user-space RPAL
> call, is in the samples/rpal/librpal directory.
>             
> We hope to get some community discussions and feedback on RPAL's
> optimization approaches and architecture.
> 
> Look forward to your comments.

The first time you posted, you got two NACKs (from Dave Hansen and Lorenzo).
You didn't reply and now you post this flood of patches? Please don't?

From my end it's also a Big Ol' NACK.

> 
> Bo Li (35):
>   Kbuild: rpal support
>   RPAL: add struct rpal_service
>   RPAL: add service registration interface
>   RPAL: add member to task_struct and mm_struct
>   RPAL: enable virtual address space partitions
>   RPAL: add user interface
>   RPAL: enable shared page mmap
>   RPAL: enable sender/receiver registration
>   RPAL: enable address space sharing
>   RPAL: allow service enable/disable
>   RPAL: add service request/release
>   RPAL: enable service disable notification
>   RPAL: add tlb flushing support
>   RPAL: enable page fault handling
>   RPAL: add sender/receiver state
>   RPAL: add cpu lock interface
>   RPAL: add a mapping between fsbase and tasks
>   sched: pick a specified task
>   RPAL: add lazy switch main logic
>   RPAL: add rpal_ret_from_lazy_switch
>   RPAL: add kernel entry handling for lazy switch
>   RPAL: rebuild receiver state
>   RPAL: resume cpumask when fork
>   RPAL: critical section optimization
>   RPAL: add MPK initialization and interface
>   RPAL: enable MPK support
>   RPAL: add epoll support
>   RPAL: add rpal_uds_fdmap() support
>   RPAL: fix race condition in pkru update
>   RPAL: fix pkru setup when fork
>   RPAL: add receiver waker
>   RPAL: fix unknown nmi on AMD CPU
>   RPAL: enable time slice correction
>   RPAL: enable fast epoll wait
>   samples/rpal: add RPAL samples
> 
>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++
>  fs/binfmt_elf.c                               |   98 +-
>  fs/eventpoll.c                                |  320 +++
>  fs/exec.c                                     |   11 +
>  include/linux/mm_types.h                      |    3 +
>  include/linux/rpal.h                          |  633 +++++
>  include/linux/sched.h                         |   21 +
>  init/init_task.c                              |    6 +
>  kernel/exit.c                                 |    5 +
>  kernel/fork.c                                 |   32 +
>  kernel/sched/core.c                           |  676 +++++
>  kernel/sched/fair.c                           |  109 +
>  kernel/sched/sched.h                          |    8 +
>  mm/mmap.c                                     |   16 +
>  mm/mprotect.c                                 |  106 +
>  mm/rmap.c                                     |    4 +
>  mm/vma.c                                      |   18 +
>  samples/rpal/Makefile                         |   17 +
>  samples/rpal/asm_define.c                     |   14 +
>  samples/rpal/client.c                         |  178 ++
>  samples/rpal/librpal/asm_define.h             |    6 +
>  samples/rpal/librpal/asm_x86_64_rpal_call.S   |   57 +
>  samples/rpal/librpal/debug.h                  |   12 +
>  samples/rpal/librpal/fiber.c                  |  119 +
>  samples/rpal/librpal/fiber.h                  |   64 +
>  .../rpal/librpal/jump_x86_64_sysv_elf_gas.S   |   81 +
>  .../rpal/librpal/make_x86_64_sysv_elf_gas.S   |   82 +
>  .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S  |   84 +
>  samples/rpal/librpal/private.h                |  341 +++
>  samples/rpal/librpal/rpal.c                   | 2351 +++++++++++++++++
>  samples/rpal/librpal/rpal.h                   |  149 ++
>  samples/rpal/librpal/rpal_pkru.h              |   78 +
>  samples/rpal/librpal/rpal_queue.c             |  239 ++
>  samples/rpal/librpal/rpal_queue.h             |   55 +
>  samples/rpal/librpal/rpal_x86_64_call_ret.S   |   45 +
>  samples/rpal/offset.sh                        |    5 +
>  samples/rpal/server.c                         |  249 ++
>  61 files changed, 9710 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/rpal/Kconfig
>  create mode 100644 arch/x86/rpal/Makefile
>  create mode 100644 arch/x86/rpal/core.c
>  create mode 100644 arch/x86/rpal/internal.h
>  create mode 100644 arch/x86/rpal/mm.c
>  create mode 100644 arch/x86/rpal/pku.c
>  create mode 100644 arch/x86/rpal/proc.c
>  create mode 100644 arch/x86/rpal/service.c
>  create mode 100644 arch/x86/rpal/thread.c
>  create mode 100644 include/linux/rpal.h
>  create mode 100644 samples/rpal/Makefile
>  create mode 100644 samples/rpal/asm_define.c
>  create mode 100644 samples/rpal/client.c
>  create mode 100644 samples/rpal/librpal/asm_define.h
>  create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
>  create mode 100644 samples/rpal/librpal/debug.h
>  create mode 100644 samples/rpal/librpal/fiber.c
>  create mode 100644 samples/rpal/librpal/fiber.h
>  create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/private.h
>  create mode 100644 samples/rpal/librpal/rpal.c
>  create mode 100644 samples/rpal/librpal/rpal.h
>  create mode 100644 samples/rpal/librpal/rpal_pkru.h
>  create mode 100644 samples/rpal/librpal/rpal_queue.c
>  create mode 100644 samples/rpal/librpal/rpal_queue.h
>  create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
>  create mode 100755 samples/rpal/offset.sh
>  create mode 100644 samples/rpal/server.c

Seriously, look at all the files you're touching. All the lines you're changing.
All the maintainers you had to CC. All for a random new RPC method you developed.
This is _not_ mergeable.

-- 
Pedro
Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Lorenzo Stoakes 8 months, 2 weeks ago
Bo,

You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
quite sure why you're sending a v2 without responding to that.

This isn't how the upstream kernel works...

Thanks, Lorenzo

On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
>
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
>
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/

[snip]

Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Bo Li 8 months, 1 week ago
Hi Lorenzo,

On 5/30/25 5:33 PM, Lorenzo Stoakes wrote:
> Bo,
>
> You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
> quite sure why you're sending a v2 without responding to that.
>
> This isn't how the upstream kernel works...
>
> Thanks, Lorenzo
>
> On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:

[snip]

Thank you for your feedback! There might be some misunderstanding.
Based on the feedback on RPAL v1, we rebased RPAL onto the latest
stable kernel and added an introduction section to explain our
considerations regarding process isolation in the RPAL architecture.

Thanks!
Re: [RFC v2 00/35] optimize cost of inter-process communication
Posted by Lorenzo Stoakes 8 months, 1 week ago
On Tue, Jun 03, 2025 at 03:22:39AM -0500, Bo Li wrote:
> Hi Lorenzo,
>
> On 5/30/25 5:33 PM, Lorenzo Stoakes wrote:
> > Bo,
> >
> > You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
> > quite sure why you're sending a v2 without responding to that.
> >
> > This isn't how the upstream kernel works...
> >
> > Thanks, Lorenzo
> >
> > On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:

[snip]

> Thank you for your feedback! There might be some misunderstanding.
> Based on the feedback on RPAL v1, we rebased RPAL onto the latest
> stable kernel and added an introduction section to explain our
> considerations regarding process isolation in the RPAL architecture.
>
> Thanks!

Hi Bo,

You need to engage in _conversation_ with maintainers, not simply resend
giant RFCs with changes made based on your interpretation of the feedback.

You've not addressed my comments, you've interpreted them to be 'ok do X,
Y, Z', then done them without a word. This is, again, not how upstream
works. You've seemingly ignored Dave altogether.

Others have highlighted it, but let me repeat what they have (in effect)
said - this is just not mergeable upstream in any way shape or form,
sorry.

It's a NAK and there's just no way it's not a NAK, you're doing too many
crazy things here that are just not acceptable, not to mention the issues
people have raised.

You should have engaged with upstream WAY earlier.

It's a pity you've put so much work into this without having done so, but
I'm afraid you're going to have to maintain this out-of-tree indefinitely.

I hope you can at least take some lessons from this on how best to engage
with upstream in future (early and often! :)

Thanks, Lorenzo