From: Bo Li
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li
Subject: [RFC v2 27/35] RPAL: add epoll support
Date: Fri, 30 May 2025 17:27:55 +0800
Message-Id: <7eb30a577e2c6a4f582515357aea25260105eb18.1748594841.git.libo.gcs85@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
To support the epoll family, RPAL needs to add new logic for RPAL services
to the existing epoll logic, ensuring that user mode can execute RPAL
service-related logic through identical interfaces. When the receiver
thread calls epoll_wait(), it can set RPAL_EP_POLL_MAGIC to notify the
kernel to invoke RPAL-related logic. The kernel then sets the receiver's
state to RPAL_RECEIVER_STATE_READY and transitions it to
RPAL_RECEIVER_STATE_WAIT when the receiver is actually removed from the
runqueue, allowing the sender to perform RPAL calls on the receiver
thread.

Signed-off-by: Bo Li
---
 arch/x86/rpal/core.c |   4 +
 fs/eventpoll.c       | 200 +++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h |  21 +++++
 kernel/sched/core.c  |  17 ++++
 4 files changed, 242 insertions(+)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 47c9e551344e..6a22b9faa100 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include

 #include "internal.h"
@@ -63,6 +64,7 @@ void rpal_kernel_ret(struct pt_regs *regs)

 	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
 		rcc = current->rpal_rd->rcc;
+		regs->ax = rpal_try_send_events(current->rpal_rd->ep, rcc);
 		atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET);
 	} else {
 		tsk = current->rpal_sd->receiver;
@@ -142,6 +144,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 	struct task_struct *prev = current;

 	if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) {
+		rpal_resume_ep(next);
 		current->rpal_sd->receiver = next;
 		rpal_lock_cpu(current);
 		rpal_lock_cpu(next);
@@ -154,6 +157,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 	 */
 	rebuild_sender_stack(current->rpal_sd, regs);
 	rpal_schedule(next);
+	fdput(next->rpal_rd->f);
 	} else {
 		update_dst_stack(next, regs);
 		/*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d4dbffdedd08..437cd5764c03 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include

 /*
@@ -2141,6 +2142,187 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	}
 }

+#ifdef CONFIG_RPAL
+
+void rpal_resume_ep(struct task_struct *tsk)
+{
+	struct rpal_receiver_data *rrd = tsk->rpal_rd;
+	struct eventpoll *ep = (struct eventpoll *)rrd->ep;
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+
+	if (rcc->timeout > 0) {
+		hrtimer_cancel(&rrd->ep_sleeper.timer);
+		destroy_hrtimer_on_stack(&rrd->ep_sleeper.timer);
+	}
+	if (!list_empty_careful(&rrd->ep_wait.entry)) {
+		write_lock(&ep->lock);
+		__remove_wait_queue(&ep->wq, &rrd->ep_wait);
+		write_unlock(&ep->lock);
+	}
+}
+
+int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc)
+{
+	int eavail;
+	int res = 0;
+
+	res = ep_send_events(ep, rcc->events, rcc->maxevents);
+	if (res > 0)
+		ep_suspend_napi_irqs(ep);
+
+	eavail = ep_events_available(ep);
+	if (!eavail) {
+		atomic_and(~RPAL_KERNEL_PENDING, &rcc->ep_pending);
+		/* check again to avoid data race on RPAL_KERNEL_PENDING */
+		eavail = ep_events_available(ep);
+		if (eavail)
+			atomic_or(RPAL_KERNEL_PENDING, &rcc->ep_pending);
+	}
+	return res;
+}
+
+static int rpal_schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
+					       const enum hrtimer_mode mode,
+					       clockid_t clock_id)
+{
+	struct hrtimer_sleeper *t = &current->rpal_rd->ep_sleeper;
+
+	/*
+	 * Optimize when a zero timeout value is given. It does not
+	 * matter whether this is an absolute or a relative time.
+	 */
+	if (expires && *expires == 0) {
+		__set_current_state(TASK_RUNNING);
+		return 0;
+	}
+
+	/*
+	 * A NULL parameter means "infinite"
+	 */
+	if (!expires) {
+		schedule();
+		return -EINTR;
+	}
+
+	hrtimer_setup_sleeper_on_stack(t, clock_id, mode);
+	hrtimer_set_expires_range_ns(&t->timer, *expires, delta);
+	hrtimer_sleeper_start_expires(t, mode);
+
+	if (likely(t->task))
+		schedule();
+
+	hrtimer_cancel(&t->timer);
+	destroy_hrtimer_on_stack(&t->timer);
+
+	__set_current_state(TASK_RUNNING);
+
+	return !t->task ? 0 : -EINTR;
+}
+
+static int rpal_ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
+			int maxevents, struct timespec64 *timeout)
+{
+	int res = 0, eavail, timed_out = 0;
+	u64 slack = 0;
+	struct rpal_receiver_data *rrd = current->rpal_rd;
+	wait_queue_entry_t *wait = &rrd->ep_wait;
+	ktime_t expires, *to = NULL;
+
+	rrd->ep = ep;
+
+	lockdep_assert_irqs_enabled();
+
+	if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
+		slack = select_estimate_accuracy(timeout);
+		to = &expires;
+		*to = timespec64_to_ktime(*timeout);
+	} else if (timeout) {
+		timed_out = 1;
+	}
+
+	eavail = ep_events_available(ep);
+
+	while (1) {
+		if (eavail) {
+			res = rpal_try_send_events(ep, rrd->rcc);
+			if (res) {
+				atomic_xchg(&rrd->rcc->receiver_state,
+					    RPAL_RECEIVER_STATE_RUNNING);
+				return res;
+			}
+		}
+
+		if (timed_out) {
+			atomic_xchg(&rrd->rcc->receiver_state,
+				    RPAL_RECEIVER_STATE_RUNNING);
+			return 0;
+		}
+
+		eavail = ep_busy_loop(ep);
+		if (eavail)
+			continue;
+
+		if (signal_pending(current)) {
+			atomic_xchg(&rrd->rcc->receiver_state,
+				    RPAL_RECEIVER_STATE_RUNNING);
+			return -EINTR;
+		}
+
+		init_wait(wait);
+		wait->func = rpal_ep_autoremove_wake_function;
+		wait->private = rrd;
+		write_lock_irq(&ep->lock);
+
+		atomic_xchg(&rrd->rcc->receiver_state,
+			    RPAL_RECEIVER_STATE_READY);
+		__set_current_state(TASK_INTERRUPTIBLE);
+
+		eavail = ep_events_available(ep);
+		if (!eavail)
+			__add_wait_queue_exclusive(&ep->wq, wait);
+
+		write_unlock_irq(&ep->lock);
+
+		if (!eavail && ep_schedule_timeout(to)) {
+			if (RPAL_USER_PENDING & atomic_read(&rrd->rcc->ep_pending)) {
+				timed_out = 1;
+			} else {
+				timed_out =
+					!rpal_schedule_hrtimeout_range_clock(
+						to, slack, HRTIMER_MODE_ABS,
+						CLOCK_MONOTONIC);
+			}
+		}
+		atomic_cmpxchg(&rrd->rcc->receiver_state,
+			       RPAL_RECEIVER_STATE_READY,
+			       RPAL_RECEIVER_STATE_RUNNING);
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * We were woken up, thus go and try to harvest some events.
+		 * If timed out and still on the wait queue, recheck eavail
+		 * carefully under lock, below.
+		 */
+		eavail = 1;
+
+		if (!list_empty_careful(&wait->entry)) {
+			write_lock_irq(&ep->lock);
+			/*
+			 * If the thread timed out and is not on the wait queue,
+			 * it means that the thread was woken up after its
+			 * timeout expired before it could reacquire the lock.
+			 * Thus, when wait.entry is empty, it needs to harvest
+			 * events.
+			 */
+			if (timed_out)
+				eavail = list_empty(&wait->entry);
+			__remove_wait_queue(&ep->wq, wait);
+			write_unlock_irq(&ep->lock);
+		}
+	}
+}
+#endif
+
 /**
  * ep_loop_check_proc - verify that adding an epoll file inside another
  *                      epoll structure does not violate the constraints, in
@@ -2529,7 +2711,25 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	ep = fd_file(f)->private_data;

 	/* Time to fish for events ... */
+#ifdef CONFIG_RPAL
+	/*
+	 * For an RPAL task, if it is a receiver and it has set the magic
+	 * value in shared memory, we assume it is prepared for RPAL calls
+	 * and handle it differently.
+	 *
+	 * In all other cases, an RPAL task behaves like a normal task.
+	 */
+	if (rpal_current_service() &&
+	    rpal_test_current_thread_flag(RPAL_RECEIVER_BIT) &&
+	    current->rpal_rd->rcc->rpal_ep_poll_magic == RPAL_EP_POLL_MAGIC) {
+		current->rpal_rd->f = f;
+		return rpal_ep_poll(ep, events, maxevents, to);
+	} else {
+		return ep_poll(ep, events, maxevents, to);
+	}
+#else
 	return ep_poll(ep, events, maxevents, to);
+#endif
 }

 SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index f2474cb53abe..5912ffec6e28 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -16,6 +16,8 @@
 #include
 #include
 #include
+#include
+#include

 #define RPAL_ERROR_MSG "rpal error: "
 #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
@@ -89,6 +91,7 @@ enum {
 };

 #define RPAL_ERROR_MAGIC 0x98CC98CC
+#define RPAL_EP_POLL_MAGIC 0xCC98CC98

 #define RPAL_SID_SHIFT 24
 #define RPAL_ID_SHIFT 8
@@ -103,6 +106,9 @@ enum {
 #define RPAL_PKRU_UNION 1
 #define RPAL_PKRU_INTERSECT 2

+#define RPAL_KERNEL_PENDING 0x1
+#define RPAL_USER_PENDING 0x2
+
 extern unsigned long rpal_cap;

 enum rpal_task_flag_bits {
@@ -282,6 +288,12 @@ struct rpal_receiver_call_context {
 	int receiver_id;
 	atomic_t receiver_state;
 	atomic_t sender_state;
+	atomic_t ep_pending;
+	int rpal_ep_poll_magic;
+	int epfd;
+	void __user *events;
+	int maxevents;
+	int timeout;
 };

 /* recovery point for sender */
@@ -325,6 +337,10 @@ struct rpal_receiver_data {
 	struct rpal_shared_page *rsp;
 	struct rpal_receiver_call_context *rcc;
 	struct task_struct *sender;
+	void *ep;
+	struct fd f;
+	struct hrtimer_sleeper ep_sleeper;
+	wait_queue_entry_t ep_wait;
 };

 struct rpal_sender_data {
@@ -574,4 +590,9 @@ __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p);
 asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev);
 int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey);
 void rpal_set_pku_schedule_tail(struct task_struct *prev);
+int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
+				     unsigned int mode, int wake_flags,
+				     void *key);
+void rpal_resume_ep(struct task_struct *tsk);
+int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb5d5bd51597..486d59bdd3fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6794,6 +6794,23 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 #define SM_RTLOCK_WAIT 2

 #ifdef CONFIG_RPAL
+int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr,
+				     unsigned int mode, int wake_flags,
+				     void *key)
+{
+	struct rpal_receiver_data *rrd = curr->private;
+	struct task_struct *tsk = rrd->rcd.bp_task;
+	int ret;
+
+	ret = try_to_wake_up(tsk, mode, wake_flags);
+
+	list_del_init_careful(&curr->entry);
+	if (!ret)
+		atomic_or(RPAL_KERNEL_PENDING, &rrd->rcc->ep_pending);
+
+	return 1;
+}
+
 static inline void rpal_check_ready_state(struct task_struct *tsk, int state)
 {
 	if (rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) {
-- 
2.20.1