From: Cong Wang
To: Kees Cook, linux-kernel@vger.kernel.org
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
Subject: [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
Date: Sun, 3 May 2026 18:12:05 -0700
Message-ID: <20260504011207.539408-2-xiyou.wangcong@gmail.com>
In-Reply-To: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
References: <20260504011207.539408-1-xiyou.wangcong@gmail.com>

From: Cong Wang

seccomp_unotify(2) leaves a documented TOCTOU window for unprivileged
supervisors: a sibling thread or CLONE_VM peer can mutate pointer-arg
buffers between the supervisor's process_vm_readv() and the kernel's
re-read on SECCOMP_USER_NOTIF_FLAG_CONTINUE. ptrace() and /proc/pid/mem
are not available to unprivileged supervisors, so today there is no
race-free path for argument-content policy on CONTINUE.

This patch adds SECCOMP_IOCTL_NOTIF_PIN_ARGS, which atomically copies
designated pointer-arg payloads from the trapped task's address space
into kernel-owned buffers and binds those buffers to the task's next
syscall execution.
On SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the syscall-body fetch
points consume from the kernel buffer instead of re-reading user memory;
mutations after PIN_ARGS returns have no effect.

Three v1 shapes are supported: a fixed-size copy (sockaddr,
single-buffer write/read content), a NUL-bounded C string (paths), and a
NULL-terminated array of C strings (argv/envp). Each per-arg descriptor
caps copy size; total cumulative bytes per request are bounded at a
hardcoded 1 MiB. Pinned-buffer allocations are tagged GFP_KERNEL_ACCOUNT
so the trapped task's memcg pays the cost.

Pin orchestration uses a three-phase lock dance: validate the notif and
snapshot register args under the filter notify lock, walk the trapped
task's mm without locks, then re-validate and attach the snapshot. The
pin is one-shot: a task_work clears it on the next return-to-userspace
after the resumed syscall body completes, with fallback paths for task
exit, listener release, and explicit discard (CONTINUE without
CONTINUE_PINNED). The syscall number is captured at pin time and
verified at consumption so a signal-handler-issued syscall during
-ERESTART* resolution will not consume the pin.
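
For orientation, the supervisor-side request described above might be
assembled roughly as in the sketch below. This is an illustration under
stated assumptions, not part of the patch: the two structs are local
mirrors of the uapi layout proposed here (restated with fixed-width
types because no released header carries them), pin_req_init() and
pin_req_add() are hypothetical helper names, and PIN_MAX_TOTAL_BYTES
assumes the 1 MiB cap mentioned above. The checks in pin_req_add() mirror
the descriptor validation the ioctl performs before taking any lock.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirrors of the uapi definitions proposed in this patch. They are
 * not in any released <linux/seccomp.h>; restated here for illustration.
 */
#define SECCOMP_PIN_FIXED		0
#define SECCOMP_PIN_CSTRING		1
#define SECCOMP_PIN_CSTRING_ARRAY	2
#define SECCOMP_PIN_KIND_MAX		2
#define PIN_MAX_ARGS			6		/* capacity of args[] */
#define PIN_MAX_TOTAL_BYTES		(1u << 20)	/* assumed 1 MiB cap */

struct seccomp_pin_arg {
	/* in */
	uint8_t  arg_idx;
	uint8_t  kind;
	uint16_t _reserved;
	uint32_t max_bytes;
	uint32_t max_entries;
	uint32_t _reserved2;
	/* out */
	uint32_t actual_size;
	uint32_t actual_entries;
	uint32_t truncated;
	uint32_t _reserved3;
	uint64_t user_addr;
	uint64_t buf_offset;
};

struct seccomp_notif_pin_args {
	uint64_t id;
	uint32_t nr_args;
	uint32_t buf_size;
	uint64_t buf;
	struct seccomp_pin_arg args[PIN_MAX_ARGS];
};

/* Hypothetical helper: start an empty request for notification @id. */
static void pin_req_init(struct seccomp_notif_pin_args *req, uint64_t id,
			 void *buf, uint32_t buf_size)
{
	memset(req, 0, sizeof(*req));
	req->id = id;
	req->buf = (uint64_t)(uintptr_t)buf;
	req->buf_size = buf_size;
}

/*
 * Hypothetical helper: append one descriptor. Returns 0 on success, or -1
 * if the descriptor would fail the up-front validation in the ioctl.
 */
static int pin_req_add(struct seccomp_notif_pin_args *req, uint8_t arg_idx,
		       uint8_t kind, uint32_t max_bytes, uint32_t max_entries)
{
	struct seccomp_pin_arg *d;

	if (req->nr_args >= PIN_MAX_ARGS || arg_idx >= 6)
		return -1;
	if (kind > SECCOMP_PIN_KIND_MAX)
		return -1;
	if (max_bytes == 0 || max_bytes > PIN_MAX_TOTAL_BYTES)
		return -1;

	d = &req->args[req->nr_args++];
	memset(d, 0, sizeof(*d));	/* out fields start zeroed */
	d->arg_idx = arg_idx;
	d->kind = kind;
	d->max_bytes = max_bytes;
	d->max_entries = max_entries;
	return 0;
}
```

A supervisor handling openat() could call pin_req_add(&req, 1,
SECCOMP_PIN_CSTRING, 4096, 0) to pin the path argument, pass the request
to SECCOMP_IOCTL_NOTIF_PIN_ARGS on the notify fd, inspect the bytes at
args[0].buf_offset in its buffer, and then reply with CONTINUE |
CONTINUE_PINNED if policy allows the call.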
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang
---
 MAINTAINERS                   |   2 +
 fs/exec.c                     |  63 +++++
 fs/namei.c                    |  19 ++
 fs/read_write.c               |   8 +-
 include/linux/mm.h            |   2 +-
 include/linux/seccomp.h       |  35 +++
 include/linux/seccomp_types.h |  33 +++
 include/uapi/linux/seccomp.h  |  73 ++++++
 kernel/Makefile               |   1 +
 kernel/exit.c                 |   1 +
 kernel/fork.c                 |   5 +
 kernel/seccomp.c              | 189 +++++++++++++-
 kernel/seccomp_pin.c          | 453 ++++++++++++++++++++++++++++++++++
 kernel/seccomp_pin.h          | 109 ++++++++
 lib/iov_iter.c                |  22 ++
 mm/memory.c                   |   4 +-
 mm/nommu.c                    |   4 +-
 net/socket.c                  |  16 ++
 18 files changed, 1026 insertions(+), 13 deletions(-)
 create mode 100644 kernel/seccomp_pin.c
 create mode 100644 kernel/seccomp_pin.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db..d7904e8989ca 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -24086,6 +24086,8 @@ F:	Documentation/userspace-api/seccomp_filter.rst
 F:	include/linux/seccomp.h
 F:	include/uapi/linux/seccomp.h
 F:	kernel/seccomp.c
+F:	kernel/seccomp_pin.c
+F:	kernel/seccomp_pin.h
 F:	tools/testing/selftests/kselftest_harness.h
 F:	tools/testing/selftests/kselftest_harness/
 F:	tools/testing/selftests/seccomp/*
diff --git a/fs/exec.c b/fs/exec.c
index ba12b4c466f6..99d4a3daaeeb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -445,6 +446,63 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
  * processes's memory to the new process's stack. The call to get_user_pages()
  * ensures the destination page is created and not swapped out.
  */
+/*
+ * If a seccomp PIN_ARGS snapshot covers this argv/envp pointer table,
+ * push each pinned string onto the bprm stack directly via
+ * copy_string_kernel(), bypassing the per-string strnlen_user() and
+ * copy_from_user() that would otherwise re-read mutated user memory.
+ *
+ * Returns 0 on success, a negative errno on failure, or +1 if no pin
+ * applied and the caller should run the normal user-memory walk.
+ */
+static int copy_strings_from_pin(struct user_arg_ptr argv,
+				 struct linux_binprm *bprm)
+{
+	const struct seccomp_pinned_arg *pin;
+	const u32 *header;
+	const char *strings;
+	u32 count, i;
+	u64 user_argv;
+
+#ifdef CONFIG_COMPAT
+	user_argv = (u64)(uintptr_t)(argv.is_compat ?
+			(const void __user *)argv.ptr.compat :
+			(const void __user *)argv.ptr.native);
+#else
+	user_argv = (u64)(uintptr_t)argv.ptr.native;
+#endif
+	if (!user_argv)
+		return 1;
+
+	pin = seccomp_pin_lookup_current(user_argv);
+	if (!pin || pin->kind != SECCOMP_PIN_CSTRING_ARRAY)
+		return 1;
+
+	header = pin->data;
+	count = header[0];
+	strings = (const char *)pin->data;
+
+	/*
+	 * copy_strings() processes argv backwards (highest index first)
+	 * because it grows the bprm stack downward. Match that ordering
+	 * so the resulting stack layout is identical.
+	 */
+	for (i = count; i-- > 0; ) {
+		u32 off = header[1 + i];
+		int ret;
+
+		if (off >= pin->size)
+			return -EINVAL;
+		ret = copy_string_kernel(strings + off, bprm);
+		if (ret < 0)
+			return ret;
+		if (fatal_signal_pending(current))
+			return -ERESTARTNOHAND;
+		cond_resched();
+	}
+	return 0;
+}
+
 static int copy_strings(int argc, struct user_arg_ptr argv,
 			struct linux_binprm *bprm)
 {
@@ -453,6 +511,11 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
 	unsigned long kpos = 0;
 	int ret;
 
+	ret = copy_strings_from_pin(argv, bprm);
+	if (ret <= 0)
+		return ret;
+	/* No pin matched; continue with the normal user-memory walk. */
+
 	while (argc-- > 0) {
 		const char __user *str;
 		int len;
diff --git a/fs/namei.c b/fs/namei.c
index c7fac83c9a85..ee86f4c91cae 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -222,6 +223,24 @@ do_getname(const char __user *filename, int flags, bool incomplete)
 struct filename *
 getname_flags(const char __user *filename, int flags)
 {
+	const struct seccomp_pinned_arg *pin;
+
+	/*
+	 * If a seccomp supervisor pinned this path via PIN_ARGS and sent
+	 * CONTINUE_PINNED, build the struct filename from the kernel-side
+	 * snapshot instead of re-reading user memory. The pinned buffer
+	 * is NUL-terminated by copy_remote_vm_str() in the walker, so
+	 * getname_kernel() can consume it directly.
+	 *
+	 * The empty-path-with-LOOKUP_EMPTY policy is handled here because
+	 * getname_kernel() does not reject empty strings.
+	 */
+	pin = seccomp_pin_lookup_current((u64)(uintptr_t)filename);
+	if (pin && pin->kind == SECCOMP_PIN_CSTRING) {
+		if (pin->size <= 1 && !(flags & LOOKUP_EMPTY))
+			return ERR_PTR(-ENOENT);
+		return getname_kernel(pin->data);
+	}
 	return do_getname(filename, flags, false);
 }
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 50bff7edc91f..59877e8422a8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -488,7 +488,9 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = (ppos ? *ppos : 0);
-	iov_iter_ubuf(&iter, ITER_DEST, buf, len);
+	ret = import_ubuf(ITER_DEST, buf, len, &iter);
+	if (unlikely(ret))
+		return ret;
 
 	ret = filp->f_op->read_iter(&kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
@@ -590,7 +592,9 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = (ppos ? *ppos : 0);
-	iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len);
+	ret = import_ubuf(ITER_SOURCE, (void __user *)buf, len, &iter);
+	if (unlikely(ret))
+		return ret;
 
 	ret = filp->f_op->write_iter(&kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index af23453e9dbd..b0116e8ed407 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3187,7 +3187,7 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 
-#ifdef CONFIG_BPF_SYSCALL
+#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER)
 extern int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 #endif
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..fcc369d3dfca 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -75,6 +75,35 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void seccomp_filter_release(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern void seccomp_clear_pinned_args(struct task_struct *tsk);
+
+/**
+ * seccomp_pin_lookup_current - find a live PIN_ARGS snapshot for current().
+ * @user_addr: the userspace address the syscall body is about to read.
+ *
+ * Called from syscall fetch points (getname_flags, copy_strings,
+ * move_addr_to_kernel, import_ubuf). Returns a pinned-arg entry whose
+ * @data / @size the caller may consume in place of re-reading user
+ * memory, or NULL if there is no live snapshot, the current syscall
+ * does not match the one captured at pin time, or no entry matches
+ * @user_addr.
+ *
+ * Safe to call lockless: current owns its seccomp.pinned_args field
+ * once the PIN_ARGS orchestrator has installed it via WRITE_ONCE.
+ */
+const struct seccomp_pinned_arg *seccomp_pin_lookup_current(u64 user_addr);
+
+/**
+ * seccomp_pin_kvec_for - return a stable kvec for the given pin entry.
+ * @pin: a pin returned by seccomp_pin_lookup_current(); must belong
+ *       to the current task.
+ *
+ * The returned pointer references kvec storage that outlives the pin
+ * (freed at syscall exit), suitable for iov_iter_kvec() callers whose
+ * iov_iter consumes after the wrapping function returns.
+ */
+struct kvec;
+const struct kvec *seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin);
 #else /* CONFIG_SECCOMP_FILTER */
 static inline void seccomp_filter_release(struct task_struct *tsk)
 {
@@ -84,6 +113,12 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 {
 	return;
 }
+static inline void seccomp_clear_pinned_args(struct task_struct *tsk) { }
+static inline const struct seccomp_pinned_arg *
+seccomp_pin_lookup_current(u64 user_addr) { return NULL; }
+struct kvec;
+static inline const struct kvec *
+seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin) { return NULL; }
 #endif /* CONFIG_SECCOMP_FILTER */
 
 #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
diff --git a/include/linux/seccomp_types.h b/include/linux/seccomp_types.h
index cf0a0355024f..bd3fe17e659a 100644
--- a/include/linux/seccomp_types.h
+++ b/include/linux/seccomp_types.h
@@ -7,6 +7,34 @@
 #ifdef CONFIG_SECCOMP
 
 struct seccomp_filter;
+struct seccomp_pinned_args;
+
+#define SECCOMP_PIN_MAX_ARGS 6
+
+/**
+ * struct seccomp_pinned_arg - one kernel-owned snapshot of a user-pointer arg.
+ * @user_addr: the original userspace address (key for lookup at consumption).
+ * @size: bytes actually populated in @data.
+ * @arg_idx: syscall register slot 0..5.
+ * @kind: one of SECCOMP_PIN_*.
+ * @data: kvmalloc'd buffer holding the snapshotted bytes.
+ *
+ * Consumption sites (getname_flags, copy_strings, move_addr_to_kernel,
+ * import_ubuf) inspect @data and @size after a successful
+ * seccomp_pin_lookup_current(). For sites that need a stable kvec
+ * pointer outliving the call (import_ubuf -> vfs_write iter),
+ * seccomp_pin_kvec_for() returns a kvec stored alongside the pin
+ * with matching lifetime.
+ */
+struct seccomp_pinned_arg {
+	u64 user_addr;
+	u32 size;
+	u8 arg_idx;
+	u8 kind;
+	u16 _pad;
+	void *data;
+};
+
 /**
  * struct seccomp - the state of a seccomp'ed process
  *
@@ -18,11 +46,16 @@ struct seccomp_filter;
  *
  *          @filter must only be accessed from the context of current as there
  *          is no read locking.
+ * @pinned_args: NULL except during a PIN_ARGS window. Owned by the trapped
+ *          task itself; populated by SECCOMP_IOCTL_NOTIF_PIN_ARGS, consumed
+ *          on CONTINUE_PINNED, freed at syscall exit, listener release, or
+ *          task exit. See kernel/seccomp_pin.c.
  */
 struct seccomp {
 	int mode;
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
+	struct seccomp_pinned_args *pinned_args;
 };
 
 #else
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcae..51cf081cbc5a 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -154,4 +154,77 @@ struct seccomp_notif_addfd {
 
 #define SECCOMP_IOCTL_NOTIF_SET_FLAGS	SECCOMP_IOW(4, __u64)
 
+/*
+ * SECCOMP_IOCTL_NOTIF_PIN_ARGS - atomically snapshot the trapped child's
+ * pointer-arg payloads into kernel buffers, populate the supervisor's
+ * byte buffer, and bind the snapshot to the child for re-execution.
+ *
+ * On NOTIF_SEND with SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the kernel
+ * consumes from the pinned buffers instead of re-reading user memory,
+ * closing the documented TOCTOU race in seccomp_unotify(2).
+ */
+
+/* Shape of a pointer-arg to be pinned. */
+#define SECCOMP_PIN_FIXED		0	/* exactly max_bytes from user_addr */
+#define SECCOMP_PIN_CSTRING		1	/* walk to NUL, capped at max_bytes */
+#define SECCOMP_PIN_CSTRING_ARRAY	2	/* NULL-term array of CSTRINGs */
+#define SECCOMP_PIN_KIND_MAX		2
+
+/* New NOTIF_SEND response flag (paired with CONTINUE). */
+#define SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED	(1UL << 1)
+
+/* Bits for seccomp_pin_arg.truncated. */
+#define SECCOMP_PIN_TRUNCATED_BYTES	(1U << 0)
+#define SECCOMP_PIN_TRUNCATED_ENTRIES	(1U << 1)
+
+/**
+ * struct seccomp_pin_arg - per-arg pin descriptor (in/out).
+ * @arg_idx: syscall register slot (0..5).
+ * @kind: one of SECCOMP_PIN_*.
+ * @max_bytes: hard cap on bytes copied for this arg; kernel may copy less.
+ * @max_entries: hard cap on pointer-table entries (CSTRING_ARRAY only).
+ * @actual_size: bytes the kernel actually populated for this arg (out).
+ * @actual_entries: entries actually walked (CSTRING_ARRAY only, out).
+ * @truncated: bitmask of SECCOMP_PIN_TRUNCATED_* (out).
+ * @user_addr: the userspace address the kernel snapshotted (out, echoed).
+ * @buf_offset: offset into the supervisor's buf where this arg's bytes
+ *              begin (out).
+ */
+struct seccomp_pin_arg {
+	/* in */
+	__u8 arg_idx;
+	__u8 kind;
+	__u16 _reserved;
+	__u32 max_bytes;
+	__u32 max_entries;
+	__u32 _reserved2;
+	/* out */
+	__u32 actual_size;
+	__u32 actual_entries;
+	__u32 truncated;
+	__u32 _reserved3;
+	__u64 user_addr;
+	__u64 buf_offset;
+};
+
+/**
+ * struct seccomp_notif_pin_args - PIN_ARGS ioctl payload (in/out).
+ * @id: notification id from NOTIF_RECV.
+ * @nr_args: count of valid entries in @args (1..6).
+ * @buf_size: size in bytes of @buf.
+ * @buf: user pointer to the bulk byte buffer; the kernel writes
+ *       copied bytes here, indexed by args[i].buf_offset.
+ * @args: per-arg descriptors; only args[0..nr_args-1] are read/written.
+ */
+struct seccomp_notif_pin_args {
+	__u64 id;
+	__u32 nr_args;
+	__u32 buf_size;
+	__u64 buf;
+	struct seccomp_pin_arg args[6];
+};
+
+#define SECCOMP_IOCTL_NOTIF_PIN_ARGS	SECCOMP_IOWR(5, \
+					struct seccomp_notif_pin_args)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..7fb35fa1b43a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_HARDLOCKUP_DETECTOR_BUDDY) += watchdog_buddy.o
 obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_perf.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_pin.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
diff --git a/kernel/exit.c b/kernel/exit.c
index 25e9cb6de7e7..5d1c54000405 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -917,6 +917,7 @@ void __noreturn do_exit(long code)
 	exit_signals(tsk); /* sets PF_EXITING */
 
 	seccomp_filter_release(tsk);
+	seccomp_clear_pinned_args(tsk);
 
 	acct_update_integrals(tsk);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f3fdfdb14c7..a5b7dbf21932 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1763,6 +1763,11 @@ static void copy_seccomp(struct task_struct *p)
 	/* Ref-count the new filter user, and assign it. */
 	get_seccomp_filter(current);
 	p->seccomp = current->seccomp;
+	/*
+	 * pinned_args is a per-trapped-task transient that belongs to the
+	 * outstanding notification on the parent (if any). Don't inherit it.
+	 */
+	p->seccomp.pinned_args = NULL;
 
 	/*
 	 * Explicitly enable no_new_privs here in case it got set
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38..66b7a8e4fcab 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -44,6 +44,8 @@
 #include
 #include
 
+#include "seccomp_pin.h"
+
 /*
  * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
  * wrong direction flag in the ioctl number. This is the broken one,
@@ -97,6 +99,13 @@ struct seccomp_knotif {
 
 	/* outstanding addfd requests */
 	struct list_head addfd;
+
+	/*
+	 * A SECCOMP_IOCTL_NOTIF_PIN_ARGS for this notification is mid-walk
+	 * (i.e. inside Phase B's lockless mm scan). Concurrent PIN_ARGS
+	 * ioctls for the same id bail with -EBUSY rather than racing.
+	 */
+	bool pin_in_progress;
 };
 
 /**
@@ -1475,6 +1484,13 @@ static void seccomp_notify_detach(struct seccomp_filter *filter)
 		knotif->error = -ENOSYS;
 		knotif->val = 0;
 
+		/*
+		 * Drop any PIN_ARGS snapshot held on the trapped task; the
+		 * supervisor that owned this notif fd is gone, so the pin
+		 * can never be consumed via CONTINUE_PINNED.
+		 */
+		seccomp_clear_pinned_args(knotif->task);
+
 		/*
 		 * We do not need to wake up any pending addfd messages, as
 		 * the notifier will do that for us, as this just looks
@@ -1498,7 +1514,7 @@ static int seccomp_notify_release(struct inode *inode, struct file *file)
 
 /* must be called with notif_lock held */
 static inline struct seccomp_knotif *
-find_notification(struct seccomp_filter *filter, u64 id)
+seccomp_find_notification(struct seccomp_filter *filter, u64 id)
 {
 	struct seccomp_knotif *cur;
 
@@ -1607,7 +1623,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 		 * sure it's still around.
 		 */
 		mutex_lock(&filter->notify_lock);
-		knotif = find_notification(filter, unotif.id);
+		knotif = seccomp_find_notification(filter, unotif.id);
 		if (knotif) {
 			/* Reset the process to make sure it's not stuck */
 			if (should_sleep_killable(filter, knotif))
@@ -1632,18 +1648,27 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	if (copy_from_user(&resp, buf, sizeof(resp)))
 		return -EFAULT;
 
-	if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+	if (resp.flags & ~(SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+			   SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED))
 		return -EINVAL;
 
 	if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
 	    (resp.error || resp.val))
 		return -EINVAL;
 
+	/*
+	 * CONTINUE_PINNED is only valid alongside CONTINUE, and is a no-op
+	 * until the consumption-side hooks land in subsequent patches.
+	 */
+	if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED) &&
+	    !(resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE))
+		return -EINVAL;
+
 	ret = mutex_lock_interruptible(&filter->notify_lock);
 	if (ret < 0)
 		return ret;
 
-	knotif = find_notification(filter, resp.id);
+	knotif = seccomp_find_notification(filter, resp.id);
 	if (!knotif) {
 		ret = -ENOENT;
 		goto out;
@@ -1660,6 +1685,37 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	knotif->error = resp.error;
 	knotif->val = resp.val;
 	knotif->flags = resp.flags;
+
+	/*
+	 * If CONTINUE_PINNED was set, arm the snapshot so that the
+	 * syscall-body fetch points consume from kernel buffers instead of
+	 * re-reading user memory. If CONTINUE was set without PINNED, the
+	 * supervisor explicitly opted out of the snapshot and we discard
+	 * it (re-read from user memory as today).
+	 */
+	if (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) {
+		struct seccomp_pinned_args *kpa =
+			READ_ONCE(knotif->task->seccomp.pinned_args);
+
+		if (kpa && (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED)) {
+			WRITE_ONCE(kpa->live, true);
+			/*
+			 * Schedule a one-shot clear that fires when the
+			 * trapped task next returns to user mode (after the
+			 * resumed syscall body completes). Failure here
+			 * means the task is exiting; cleanup happens via
+			 * seccomp_filter_release / do_exit instead.
+			 */
+			seccomp_pin_queue_clear(knotif->task);
+		} else if (kpa) {
+			seccomp_clear_pinned_args(knotif->task);
+		}
+	} else if (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED) {
+		/* Already rejected at the top of this function, but be defensive. */
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
 		complete_on_current_cpu(&knotif->ready);
 	else
@@ -1683,7 +1739,7 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
 	if (ret < 0)
 		return ret;
 
-	knotif = find_notification(filter, id);
+	knotif = seccomp_find_notification(filter, id);
 	if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
 		ret = 0;
 	else
@@ -1751,7 +1807,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
 	if (ret < 0)
 		goto out;
 
-	knotif = find_notification(filter, addfd.id);
+	knotif = seccomp_find_notification(filter, addfd.id);
 	if (!knotif) {
 		ret = -ENOENT;
 		goto out_unlock;
@@ -1823,6 +1879,125 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
 	return ret;
 }
 
+static long seccomp_notif_pin_args(struct seccomp_filter *filter,
+				   struct seccomp_notif_pin_args __user *uargs)
+{
+	struct seccomp_notif_pin_args kargs;
+	struct seccomp_pinned_args *kpa = NULL;
+	struct seccomp_knotif *knotif;
+	struct task_struct *task = NULL;
+	void __user *user_buf;
+	u64 args[6];
+	int syscall_nr = 0;
+	int i;
+	long ret;
+
+	if (copy_from_user(&kargs, uargs, sizeof(kargs)))
+		return -EFAULT;
+	if (kargs.nr_args == 0 || kargs.nr_args > SECCOMP_PIN_MAX_ARGS)
+		return -EINVAL;
+	if (kargs.buf_size > SECCOMP_PIN_MAX_TOTAL_BYTES)
+		return -E2BIG;
+
+	/* Validate descriptor inputs before any allocation. */
+	for (i = 0; i < kargs.nr_args; i++) {
+		struct seccomp_pin_arg *d = &kargs.args[i];
+
+		if (d->arg_idx >= 6)
+			return -EINVAL;
+		if (d->kind > SECCOMP_PIN_KIND_MAX)
+			return -EINVAL;
+		if (d->max_bytes == 0)
+			return -EINVAL;
+		if (d->max_bytes > SECCOMP_PIN_MAX_TOTAL_BYTES)
+			return -E2BIG;
+	}
+
+	user_buf = (void __user *)(uintptr_t)kargs.buf;
+	if (kargs.buf_size && !user_buf)
+		return -EINVAL;
+
+	/*
+	 * Phase A: validate notif state, snapshot the args we need under
+	 * the lock, take task ref, mark pin_in_progress so a concurrent
+	 * PIN_ARGS for the same id bails with -EBUSY.
+	 */
+	mutex_lock(&filter->notify_lock);
+	knotif = seccomp_find_notification(filter, kargs.id);
+	if (!knotif) {
+		ret = -ENOENT;
+		goto unlock_a;
+	}
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto unlock_a;
+	}
+	if (knotif->task->seccomp.pinned_args) {
+		ret = -EEXIST;
+		goto unlock_a;
+	}
+	if (knotif->pin_in_progress) {
+		ret = -EBUSY;
+		goto unlock_a;
+	}
+	knotif->pin_in_progress = true;
+	memcpy(args, knotif->data->args, sizeof(args));
+	syscall_nr = knotif->data->nr;
+	task = get_task_struct(knotif->task);
+	mutex_unlock(&filter->notify_lock);
+
+	/* Phase B: lockless mm walk + supervisor copy. */
+	ret = seccomp_pin_args_walk(task, &kargs, args, syscall_nr,
+				    user_buf, kargs.buf_size, &kpa);
+	if (ret)
+		goto cleanup;
+
+	if (copy_to_user(uargs, &kargs, sizeof(kargs))) {
+		ret = -EFAULT;
+		goto cleanup;
+	}
+
+	/*
+	 * Phase C: re-validate (the notif may have been replied to or the
+	 * supervisor may have released the listener) and attach the
+	 * snapshot.
+	 */
+	mutex_lock(&filter->notify_lock);
+	knotif = seccomp_find_notification(filter, kargs.id);
+	if (!knotif || knotif->state != SECCOMP_NOTIFY_SENT) {
+		mutex_unlock(&filter->notify_lock);
+		ret = -ENOENT;
+		goto cleanup;
+	}
+	WRITE_ONCE(task->seccomp.pinned_args, kpa);
+	knotif->pin_in_progress = false;
+	kpa = NULL; /* ownership transferred to task */
+	mutex_unlock(&filter->notify_lock);
+	put_task_struct(task);
+	return 0;
+
+cleanup:
+	/*
+	 * Best-effort: clear pin_in_progress so a subsequent PIN_ARGS can
+	 * proceed. The notif may already be gone, in which case there is
+	 * nothing to clear.
+	 */
+	mutex_lock(&filter->notify_lock);
+	knotif = seccomp_find_notification(filter, kargs.id);
+	if (knotif)
+		knotif->pin_in_progress = false;
+	mutex_unlock(&filter->notify_lock);
+
+	seccomp_free_pinned_args(kpa);
+	if (task)
+		put_task_struct(task);
+	return ret;
+
+unlock_a:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
 static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 				 unsigned long arg)
 {
@@ -1840,6 +2015,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 		return seccomp_notify_id_valid(filter, buf);
 	case SECCOMP_IOCTL_NOTIF_SET_FLAGS:
 		return seccomp_notify_set_flags(filter, arg);
+	case SECCOMP_IOCTL_NOTIF_PIN_ARGS:
+		return seccomp_notif_pin_args(filter, buf);
 	}
 
 	/* Extensible Argument ioctls */
diff --git a/kernel/seccomp_pin.c b/kernel/seccomp_pin.c
new file mode 100644
index 000000000000..a206fde3d806
--- /dev/null
+++ b/kernel/seccomp_pin.c
@@ -0,0 +1,453 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Pin-args lifecycle and walker for SECCOMP_IOCTL_NOTIF_PIN_ARGS.
+ *
+ * The supervisor calls PIN_ARGS to atomically copy designated pointer-arg
+ * payloads of a trapped child into kernel-owned buffers, then sends
+ * NOTIF_SEND with CONTINUE | CONTINUE_PINNED. The kernel re-executes the
+ * syscall using the pinned bytes instead of re-reading user memory,
+ * closing the documented seccomp_unotify(2) TOCTOU race.
+ *
+ * The lock-and-validate dance lives in kernel/seccomp.c (where
+ * struct seccomp_knotif and filter->notify_lock are defined). This file
+ * owns the per-arg walker (Phase B) and the lifecycle primitives.
+ *
+ * Only SECCOMP_PIN_FIXED is implemented in v1's first cut; CSTRING and
+ * CSTRING_ARRAY arrive in subsequent patches.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include "seccomp_pin.h"
+
+struct seccomp_pinned_args *seccomp_alloc_pinned_args(u8 nr_args)
+{
+	struct seccomp_pinned_args *kpa;
+
+	if (nr_args == 0 || nr_args > SECCOMP_PIN_MAX_ARGS)
+		return ERR_PTR(-EINVAL);
+
+	kpa = kzalloc_obj(*kpa, GFP_KERNEL_ACCOUNT);
+	if (!kpa)
+		return ERR_PTR(-ENOMEM);
+	kpa->nr_args = nr_args;
+	return kpa;
+}
+
+void seccomp_free_pinned_args(struct seccomp_pinned_args *kpa)
+{
+	int i;
+
+	if (!kpa)
+		return;
+	for (i = 0; i < kpa->nr_args; i++)
+		kvfree(kpa->args[i].data);
+	kfree(kpa);
+}
+
+void seccomp_clear_pinned_args(struct task_struct *task)
+{
+	struct seccomp_pinned_args *kpa;
+
+	/*
+	 * Atomically claim ownership of the kpa: this can be called
+	 * concurrently from the task's own task_work callback (returning
+	 * to userspace after a CONTINUE_PINNED'd syscall), from a
+	 * listener-release path on the supervisor side, and from task
+	 * exit. Only the xchg winner frees.
+	 */
+	kpa = xchg(&task->seccomp.pinned_args, NULL);
+	if (!kpa)
+		return;
+	/*
+	 * Cancel any queued post-syscall clear; its callback_head lives
+	 * inside @kpa and would otherwise dangle. If task_work_cancel
+	 * returns false the callback has already started running on @task,
+	 * but it does its work via current->seccomp.pinned_args (already
+	 * NULL) so the in-flight callback observes nothing-to-do.
+ */ + if (kpa->clear_queued) + task_work_cancel(task, &kpa->clear_work); + seccomp_free_pinned_args(kpa); +} + +/* + * task_work callback: runs on the trapped task when it returns to user + * mode after the resumed syscall body has completed. The pin is single- + * shot; subsequent traps must call PIN_ARGS again. + */ +static void seccomp_pin_clear_cb(struct callback_head *cb) +{ + seccomp_clear_pinned_args(current); +} + +int seccomp_pin_queue_clear(struct task_struct *task) +{ + struct seccomp_pinned_args *kpa =3D task->seccomp.pinned_args; + int ret; + + if (!kpa || kpa->clear_queued) + return 0; + init_task_work(&kpa->clear_work, seccomp_pin_clear_cb); + ret =3D task_work_add(task, &kpa->clear_work, TWA_RESUME); + if (ret =3D=3D 0) + kpa->clear_queued =3D true; + return ret; +} + +/* Snapshot SECCOMP_PIN_FIXED: copy exactly @desc->max_bytes from @user_addr + * in the trapped child's mm into a freshly-allocated kernel buffer. + * + * On success, @out is populated and @desc->actual_size / .truncated are + * filled. The caller is responsible for chaining the bytes into the + * supervisor's bulk buffer. + */ +static long pin_one_fixed(struct task_struct *task, u64 user_addr, + struct seccomp_pin_arg *desc, + struct seccomp_pinned_arg *out) +{ + struct mm_struct *mm; + void *kbuf; + int read; + + kbuf =3D kvmalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT); + if (!kbuf) + return -ENOMEM; + + mm =3D get_task_mm(task); + if (!mm) { + kvfree(kbuf); + return -ESRCH; + } + + read =3D access_remote_vm(mm, user_addr, kbuf, desc->max_bytes, 0); + mmput(mm); + + if (read <=3D 0) { + kvfree(kbuf); + return read ? read : -EFAULT; + } + + out->user_addr =3D user_addr; + out->size =3D read; + out->arg_idx =3D desc->arg_idx; + out->kind =3D SECCOMP_PIN_FIXED; + out->data =3D kbuf; + + desc->actual_size =3D read; + desc->truncated =3D (read < desc->max_bytes) ? + SECCOMP_PIN_TRUNCATED_BYTES : 0; + return 0; +} + +/* MAX_ARG_STRINGS is fs/exec.c-private; redefine our own ceiling.
*/ +#define SECCOMP_PIN_DEFAULT_MAX_ENTRIES 0x7FFFFFFF + +/* + * Packed CSTRING_ARRAY layout: + * + * [u32 count][u32 offsets[count]][u8 strings[]] + * + * Each offset is from the start of the buffer; each string at + * data + offsets[i] is NUL-terminated. + */ + +/* Snapshot SECCOMP_PIN_CSTRING: NUL-bounded copy from the trapped child's + * mm via the existing copy_remote_vm_str() primitive. The result is + * always NUL-terminated; truncation is reported when the byte cap was + * hit before the source NUL. + */ +static long pin_one_cstring(struct task_struct *task, u64 user_addr, + struct seccomp_pin_arg *desc, + struct seccomp_pinned_arg *out) +{ + void *kbuf; + int copied; + + kbuf =3D kvmalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT); + if (!kbuf) + return -ENOMEM; + + copied =3D copy_remote_vm_str(task, user_addr, kbuf, desc->max_bytes, 0); + if (copied < 0) { + kvfree(kbuf); + return copied; + } + + /* + * copy_remote_vm_str() returns bytes not including the trailing NUL, + * which it always writes on success. If we filled the buffer all the + * way (copied =3D=3D max_bytes - 1) the source NUL may not have been + * reached; flag that as truncation. + */ + out->user_addr =3D user_addr; + out->size =3D copied + 1; /* include the trailing NUL */ + out->arg_idx =3D desc->arg_idx; + out->kind =3D SECCOMP_PIN_CSTRING; + out->data =3D kbuf; + + desc->actual_size =3D copied + 1; + desc->truncated =3D (copied =3D=3D desc->max_bytes - 1) ? + SECCOMP_PIN_TRUNCATED_BYTES : 0; + return 0; +} + +/* + * Snapshot SECCOMP_PIN_CSTRING_ARRAY: walk the NULL-terminated pointer + * table at @user_addr in the trapped child's mm; for each non-NULL ptr, + * copy its NUL-bounded string into a packed kernel buffer. Format: + * + * [u32 count][u32 offsets[count]][u8 strings[]] + * + * Caps on both byte total (@desc->max_bytes) and entry count + * (@desc->max_entries; 0 means default cap). 
The pointer table is + * walked first to determine count, *before* any string copy, so a + * hostile child can't tie up the kernel walking a giant table. + * + * v1: native pointer width only. Compat (32-bit pointer table read by + * a native supervisor) is a TODO. + */ +static long pin_one_cstring_array(struct task_struct *task, u64 user_addr, + struct seccomp_pin_arg *desc, + struct seccomp_pinned_arg *out) +{ + struct mm_struct *mm; + void *kbuf =3D NULL; + u32 max_entries; + u32 *header; + u32 count =3D 0; + u32 byte_off; + u32 truncated =3D 0; + u32 i; + long ret; + + max_entries =3D desc->max_entries ?: SECCOMP_PIN_DEFAULT_MAX_ENTRIES; + /* Cap entries by what fits in the supervisor's max_bytes assuming + * even the smallest header (count + per-entry offset + 1 NUL). + * Each entry costs at least 4 (offset) + 1 (NUL) =3D 5 bytes. + */ + if (max_entries > (desc->max_bytes / 5)) + max_entries =3D desc->max_bytes / 5; + + if (desc->max_bytes < sizeof(u32)) + return -EINVAL; + + kbuf =3D kvmalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT); + if (!kbuf) + return -ENOMEM; + + mm =3D get_task_mm(task); + if (!mm) { + ret =3D -ESRCH; + goto err_free; + } + + /* Phase 1: count entries by walking the pointer table. */ + for (i =3D 0; i < max_entries; i++) { + unsigned long ptr; + int got; + + got =3D access_remote_vm(mm, user_addr + i * sizeof(ptr), + &ptr, sizeof(ptr), 0); + if (got !=3D sizeof(ptr)) { + mmput(mm); + ret =3D -EFAULT; + goto err_free; + } + if (ptr =3D=3D 0) + break; + count++; + } + if (i =3D=3D max_entries) { + /* Hit the entry cap before the NULL terminator: still report + * what we have, flag truncation. + */ + truncated |=3D SECCOMP_PIN_TRUNCATED_ENTRIES; + } + + /* Header layout fits in max_bytes? 
*/ + if ((u64)sizeof(u32) + (u64)count * sizeof(u32) > desc->max_bytes) { + mmput(mm); + ret =3D -EINVAL; + goto err_free; + } + + header =3D kbuf; + header[0] =3D count; + byte_off =3D sizeof(u32) + count * sizeof(u32); + + /* Phase 2: copy each string into the packed area. */ + for (i =3D 0; i < count; i++) { + unsigned long ptr; + u32 remaining; + int got, copied; + + if (access_remote_vm(mm, user_addr + i * sizeof(ptr), + &ptr, sizeof(ptr), 0) !=3D sizeof(ptr)) { + mmput(mm); + ret =3D -EFAULT; + goto err_free; + } + if (byte_off >=3D desc->max_bytes) { + truncated |=3D SECCOMP_PIN_TRUNCATED_BYTES; + count =3D i; + header[0] =3D count; + break; + } + remaining =3D desc->max_bytes - byte_off; + copied =3D copy_remote_vm_str(task, ptr, + (char *)kbuf + byte_off, + remaining, 0); + if (copied < 0) { + mmput(mm); + ret =3D copied; + goto err_free; + } + header[1 + i] =3D byte_off; + got =3D copied + 1; /* include the NUL written by helper */ + if (got >=3D remaining) + truncated |=3D SECCOMP_PIN_TRUNCATED_BYTES; + byte_off +=3D got; + } + mmput(mm); + + out->user_addr =3D user_addr; + out->size =3D byte_off; + out->arg_idx =3D desc->arg_idx; + out->kind =3D SECCOMP_PIN_CSTRING_ARRAY; + out->data =3D kbuf; + + desc->actual_size =3D byte_off; + desc->actual_entries =3D count; + desc->truncated =3D truncated; + return 0; + +err_free: + kvfree(kbuf); + return ret; +} + +const struct kvec *seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin) +{ + struct seccomp_pinned_args *kpa; + long idx; + + kpa =3D READ_ONCE(current->seccomp.pinned_args); + if (!kpa) + return NULL; + idx =3D pin - kpa->args; + if (idx < 0 || idx >=3D kpa->nr_args) + return NULL; + return &kpa->arg_kvecs[idx]; +} + +const struct seccomp_pinned_arg *seccomp_pin_lookup_current(u64 user_addr) +{ + struct seccomp_pinned_args *kpa; + int i; + + kpa =3D READ_ONCE(current->seccomp.pinned_args); + if (!kpa || !kpa->live) + return NULL; + + /* + * If the current syscall doesn't match the one
snapshotted at pin + * time, return NULL so the caller reads user memory. This guards + * against a signal handler issuing an unrelated syscall during + * -ERESTART* resolution =E2=80=94 that syscall has its own user pointers + * and must not be served from the pin. + */ + if (kpa->syscall_nr !=3D + syscall_get_nr(current, task_pt_regs(current))) + return NULL; + + for (i =3D 0; i < kpa->nr_args; i++) { + if (kpa->args[i].user_addr =3D=3D user_addr) + return &kpa->args[i]; + } + return NULL; +} + +long seccomp_pin_args_walk(struct task_struct *task, + struct seccomp_notif_pin_args *kargs, + const u64 *args, int syscall_nr, + void __user *user_buf, u32 user_buf_size, + struct seccomp_pinned_args **out) +{ + struct seccomp_pinned_args *kpa; + u32 buf_off =3D 0; + int i; + long ret; + + kpa =3D seccomp_alloc_pinned_args(kargs->nr_args); + if (IS_ERR(kpa)) + return PTR_ERR(kpa); + kpa->notif_id =3D kargs->id; + kpa->syscall_nr =3D syscall_nr; + + for (i =3D 0; i < kargs->nr_args; i++) { + struct seccomp_pin_arg *d =3D &kargs->args[i]; + u64 user_addr =3D args[d->arg_idx]; + + d->user_addr =3D user_addr; + d->actual_size =3D 0; + d->actual_entries =3D 0; + d->truncated =3D 0; + d->buf_offset =3D buf_off; + + /* NULL pointers (e.g. execveat with AT_EMPTY_PATH): record + * a zero-size pin and move on without faulting. + */ + if (user_addr =3D=3D 0) + continue; + + switch (d->kind) { + case SECCOMP_PIN_FIXED: + ret =3D pin_one_fixed(task, user_addr, d, &kpa->args[i]); + break; + case SECCOMP_PIN_CSTRING: + ret =3D pin_one_cstring(task, user_addr, d, &kpa->args[i]); + break; + case SECCOMP_PIN_CSTRING_ARRAY: + ret =3D pin_one_cstring_array(task, user_addr, d, + &kpa->args[i]); + break; + default: + ret =3D -EOPNOTSUPP; + break; + } + if (ret < 0) + goto err_free; + + /* Stable kvec for iov_iter_kvec consumers (import_ubuf). 
*/ + kpa->arg_kvecs[i].iov_base =3D kpa->args[i].data; + kpa->arg_kvecs[i].iov_len =3D kpa->args[i].size; + + if (kpa->args[i].size > user_buf_size - buf_off) { + ret =3D -ENOSPC; + goto err_free; + } + if (copy_to_user(user_buf + buf_off, + kpa->args[i].data, kpa->args[i].size)) { + ret =3D -EFAULT; + goto err_free; + } + d->buf_offset =3D buf_off; + buf_off +=3D kpa->args[i].size; + } + + *out =3D kpa; + return 0; + +err_free: + seccomp_free_pinned_args(kpa); + return ret; +} diff --git a/kernel/seccomp_pin.h b/kernel/seccomp_pin.h new file mode 100644 index 000000000000..ea699bc09645 --- /dev/null +++ b/kernel/seccomp_pin.h @@ -0,0 +1,109 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Internal interfaces for SECCOMP_IOCTL_NOTIF_PIN_ARGS. + * + * The pin lifecycle and walker live in kernel/seccomp_pin.c to keep + * kernel/seccomp.c focused on the existing notify machinery. + */ +#ifndef _KERNEL_SECCOMP_PIN_H +#define _KERNEL_SECCOMP_PIN_H + +#include +#include + +#include /* struct seccomp_pinned_arg, SECCOMP_PIN_MAX_ARGS */ +#include /* struct kvec */ +#include /* struct callback_head */ + +struct task_struct; +struct seccomp_filter; +struct seccomp_knotif; +struct seccomp_notif_pin_args; + +/* + * Maximum cumulative bytes a single PIN_ARGS request may snapshot on + * behalf of one notification. Defensive bound only =E2=80=94 typical pins are + * a few KiB (one PATH_MAX path; argv up to MAX_ARG_STRLEN). Hardcoded + * rather than a sysctl: there is no legitimate use case for runtime + * tuning. Smaller is always reachable via desc->max_bytes; larger + * indicates a policy bug. + */ +#define SECCOMP_PIN_MAX_TOTAL_BYTES (1UL << 20) /* 1 MiB */ + +/** + * struct seccomp_pinned_args - the per-task pin record. + * @notif_id: id of the outstanding notification this pin belongs to. + * @syscall_nr: syscall number captured at pin time; consumption checks this + * against current to skip pinned data on a mismatched syscall + * (e.g.
one issued from a signal handler during restart). + * @nr_args: number of populated entries in @args. + * @live: false during the pin-decision window, set to true on + * CONTINUE_PINNED so consumption hooks know to use the snapshot. + * @args: per-slot pinned data; only the first @nr_args entries are valid. + */ +struct seccomp_pinned_args { + u64 notif_id; + int syscall_nr; + u8 nr_args; + bool live; + bool clear_queued; /* clear_work has been task_work_add()'d */ + struct callback_head clear_work; + struct seccomp_pinned_arg args[SECCOMP_PIN_MAX_ARGS]; + /* + * Per-arg stable kvec storage. Populated by the walker for kinds + * whose consumption hooks build an iov_iter (currently FIXED -> + * import_ubuf). The kvec must outlive the iter; this struct lives + * until syscall exit, which is after the iter is fully consumed. + */ + struct kvec arg_kvecs[SECCOMP_PIN_MAX_ARGS]; +}; + +#ifdef CONFIG_SECCOMP_FILTER + +struct seccomp_pinned_args *seccomp_alloc_pinned_args(u8 nr_args); +void seccomp_free_pinned_args(struct seccomp_pinned_args *kpa); +void seccomp_clear_pinned_args(struct task_struct *task); + +/* + * Queue a one-shot task_work that will clear @task's pinned_args when + * @task next returns to userspace, i.e. after the trapped-and-resumed + * syscall body has completed. Called from NOTIF_SEND on CONTINUE_PINNED. + */ +int seccomp_pin_queue_clear(struct task_struct *task); + +/** + * seccomp_pin_args_walk - per-arg snapshot phase (no seccomp locks). + * @task: the trapped child whose mm we're reading; caller must hold a + * reference (via get_task_struct). + * @kargs: in/out ioctl payload; the walker reads .nr_args / .args[i] inputs + * and writes back .args[i] outputs (actual_size, truncated, etc.). + * @args: syscall register args (knotif->data->args). + * @syscall_nr: syscall number captured at notif time. + * @user_buf: the supervisor's bulk byte buffer (user pointer). + * @user_buf_size: capacity of @user_buf.
+ * @out: on success, *@out is a freshly-allocated kpa with the snapshot; + * caller takes ownership and must seccomp_free_pinned_args() if + * the attach step fails. + * + * Return: 0 on success, negative errno on failure. + * + * Phase B of PIN_ARGS: this runs without seccomp locks held. Phase A (notif + * validation) and Phase C (attach) live in kernel/seccomp.c. + */ +long seccomp_pin_args_walk(struct task_struct *task, + struct seccomp_notif_pin_args *kargs, + const u64 *args, int syscall_nr, + void __user *user_buf, u32 user_buf_size, + struct seccomp_pinned_args **out); + +/* seccomp_pin_lookup_current() lives in include/linux/seccomp.h; it is + * called from consumption sites outside kernel/seccomp/ (fs/, net/, lib/). + */ + +#else + +static inline void seccomp_clear_pinned_args(struct task_struct *task) { } + +#endif /* CONFIG_SECCOMP_FILTER */ + +#endif /* _KERNEL_SECCOMP_PIN_H */ diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 243662af1af7..e0b038b54ce9 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -1444,8 +1445,29 @@ EXPORT_SYMBOL(import_iovec); =20 int import_ubuf(int rw, void __user *buf, size_t len, struct iov_iter *i) { + const struct seccomp_pinned_arg *pin; + const struct kvec *kvec; + if (len > MAX_RW_COUNT) len =3D MAX_RW_COUNT; + + /* + * Pinned by a seccomp PIN_ARGS supervisor on this task? Build the + * iov_iter over the kernel snapshot rather than re-reading user + * memory. The kvec storage is owned by current->seccomp.pinned_args + * and lives until syscall exit, so it outlasts @i's consumption.
+ */ + pin =3D seccomp_pin_lookup_current((u64)(uintptr_t)buf); + if (pin && pin->kind =3D=3D SECCOMP_PIN_FIXED) { + kvec =3D seccomp_pin_kvec_for(pin); + if (kvec) { + size_t n =3D min_t(size_t, len, pin->size); + + iov_iter_kvec(i, rw, kvec, 1, n); + return 0; + } + } + if (unlikely(!access_ok(buf, len))) return -EFAULT; =20 diff --git a/mm/memory.c b/mm/memory.c index ea6568571131..766ea403d983 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -7168,7 +7168,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, } EXPORT_SYMBOL_GPL(access_process_vm); =20 -#ifdef CONFIG_BPF_SYSCALL +#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER) /* * Copy a string from another process's address space as given in mm. * If there is any error return -EFAULT. @@ -7286,7 +7286,7 @@ int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr, return ret; } EXPORT_SYMBOL_GPL(copy_remote_vm_str); -#endif /* CONFIG_BPF_SYSCALL */ +#endif /* CONFIG_BPF_SYSCALL || CONFIG_SECCOMP_FILTER */ =20 /* * Print the name of a VMA. diff --git a/mm/nommu.c b/mm/nommu.c index ed3934bc2de4..4c14ed97d661 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1711,7 +1711,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in } EXPORT_SYMBOL_GPL(access_process_vm); =20 -#ifdef CONFIG_BPF_SYSCALL +#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER) /* * Copy a string from another process's address space as given in mm. * If there is any error return -EFAULT.
@@ -1788,7 +1788,7 @@ int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr, return ret; } EXPORT_SYMBOL_GPL(copy_remote_vm_str); -#endif /* CONFIG_BPF_SYSCALL */ +#endif /* CONFIG_BPF_SYSCALL || CONFIG_SECCOMP_FILTER */ =20 /** * nommu_shrink_inode_mappings - Shrink the shared mappings on an inode diff --git a/net/socket.c b/net/socket.c index 22a412fdec07..6e3af6114a60 100644 --- a/net/socket.c +++ b/net/socket.c @@ -82,6 +82,7 @@ #include #include #include +#include #include #include #include @@ -248,10 +249,25 @@ static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly; =20 int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr) { + const struct seccomp_pinned_arg *pin; + if (ulen < 0 || ulen > sizeof(struct sockaddr_storage)) return -EINVAL; if (ulen =3D=3D 0) return 0; + + /* If a seccomp supervisor pinned this sockaddr via PIN_ARGS and + * sent CONTINUE_PINNED, consume from the kernel snapshot instead + * of re-reading user memory. Closes the unotify TOCTOU.
+ */ + pin =3D seccomp_pin_lookup_current((u64)(uintptr_t)uaddr); + if (pin) { + size_t n =3D min_t(size_t, (size_t)ulen, pin->size); + + memcpy(kaddr, pin->data, n); + return audit_sockaddr(ulen, kaddr); + } + if (copy_from_user(kaddr, uaddr, ulen)) return -EFAULT; return audit_sockaddr(ulen, kaddr); --=20 2.43.0
From nobody Sun Jun 14 04:09:40 2026
From: Cong Wang
To: Kees Cook , linux-kernel@vger.kernel.org
Cc: Andy Lutomirski , Will Drewry , Christian Brauner , Cong Wang
Subject: [RFC PATCH 2/3] selftests/seccomp: add seccomp_pin_args end-to-end coverage
Date: Sun, 3 May 2026 18:12:06 -0700
Message-ID: <20260504011207.539408-3-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
References: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

From: Cong Wang

Add a standalone selftest binary for SECCOMP_IOCTL_NOTIF_PIN_ARGS exercising all three v1 shapes (fixed/cstring/cstring-array) on real syscalls (bind, openat, execve, write), plus negative paths (CONTINUE without PINNED, double pin, mismatched flags) and the single-shot lifecycle (post-syscall clear, SIGKILL teardown).
The tests use MAP_SHARED to mirror the documented CLONE_VM peer attack: the supervisor pins the trapped child's pointer arg, the parent mutates the underlying bytes, and the test verifies the kernel acted on the pinned snapshot rather than the mutation.

Lives in its own file rather than seccomp_bpf.c since the feature is unrelated to the BPF filter machinery.

Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang
---
tools/testing/selftests/seccomp/.gitignore | 1 + tools/testing/selftests/seccomp/Makefile | 2 +- .../selftests/seccomp/seccomp_pin_args.c | 857 ++++++++++++++++++ 3 files changed, 859 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/seccomp/seccomp_pin_args.c diff --git a/tools/testing/selftests/seccomp/.gitignore b/tools/testing/selftests/seccomp/.gitignore index dec678577f9c..0e39a7297b0a 100644 --- a/tools/testing/selftests/seccomp/.gitignore +++ b/tools/testing/selftests/seccomp/.gitignore @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only seccomp_bpf seccomp_benchmark +seccomp_pin_args diff --git a/tools/testing/selftests/seccomp/Makefile b/tools/testing/selftests/seccomp/Makefile index 584fba487037..26abbb3126a5 100644 --- a/tools/testing/selftests/seccomp/Makefile +++ b/tools/testing/selftests/seccomp/Makefile @@ -3,5 +3,5 @@ CFLAGS +=3D -Wl,-no-as-needed -Wall $(KHDR_INCLUDES) LDFLAGS +=3D -lpthread LDLIBS +=3D -lcap =20 -TEST_GEN_PROGS :=3D seccomp_bpf seccomp_benchmark +TEST_GEN_PROGS :=3D seccomp_bpf seccomp_benchmark seccomp_pin_args include ../lib.mk diff --git a/tools/testing/selftests/seccomp/seccomp_pin_args.c b/tools/testing/selftests/seccomp/seccomp_pin_args.c new file mode 100644 index 000000000000..df21bd0781d3 --- /dev/null +++ b/tools/testing/selftests/seccomp/seccomp_pin_args.c @@ -0,0 +1,857 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Selftests for SECCOMP_IOCTL_NOTIF_PIN_ARGS =E2=80=94 atomic snapshot of + * pointer-arg payloads for seccomp_unotify(2) supervisors.
+ * + * The motivating attack (see Documentation/userspace-api/seccomp_filter.rst): + * an unprivileged supervisor inspects bytes that a sibling thread (or + * CLONE_VM peer) mutates between supervisor read and kernel re-read, + * defeating any decision the supervisor made on the bytes it saw. + * + * Each test sets up a USER_NOTIF filter, traps a syscall, calls + * PIN_ARGS to atomically copy designated pointer-arg payloads into + * kernel buffers, mutates the underlying user memory (simulating a + * racy peer), sends NOTIF_SEND with CONTINUE | CONTINUE_PINNED, and + * verifies the kernel used the snapshotted bytes rather than the + * mutated ones. + */ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest_harness.h" + +#ifndef SECCOMP_IOCTL_NOTIF_PIN_ARGS +# error "kernel UAPI lacks SECCOMP_IOCTL_NOTIF_PIN_ARGS" +#endif + +#ifndef ARRAY_SIZE +# define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0])) +#endif + +/* Install a USER_NOTIF filter that traps the given syscall number and + * allows everything else; returns the listener fd. + */ +static int install_user_notif_filter(int nr) +{ + struct sock_filter filter[] =3D { + BPF_STMT(BPF_LD | BPF_W | BPF_ABS, + offsetof(struct seccomp_data, nr)), + BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1), + BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF), + BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog prog =3D { + .len =3D (unsigned short)ARRAY_SIZE(filter), + .filter =3D filter, + }; + + return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, + SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog); +} + +/* + * Helpers shared by the bind-on-shared-sockaddr tests below. + * MAP_SHARED gives parent and child the same physical bytes, mirroring + * the CLONE_VM peer in the documented attack scenario.
+ */ +struct bind_race { + int listener; + pid_t child; + struct sockaddr_un *shared; /* mmap'd MAP_SHARED, sockaddr_un */ + char path_a[64]; /* original path (set before fork) */ + char path_b[64]; /* path the parent mutates to before SEND */ +}; + +/* Set up filter, mmap, fill path_a; fork the child to bind() against + * @shared. On return, the child is trapped in the seccomp wait and the + * supervisor (caller) is ready to NOTIF_RECV. Returns 0 on success or + * -1 on a setup failure (with errno preserved). + */ +static int bind_race_setup(struct bind_race *r) +{ + r->listener =3D -1; + r->child =3D -1; + r->shared =3D MAP_FAILED; + + snprintf(r->path_a, sizeof(r->path_a), + "/tmp/seccomp-pin-%d-A", getpid()); + snprintf(r->path_b, sizeof(r->path_b), + "/tmp/seccomp-pin-%d-B", getpid()); + unlink(r->path_a); + unlink(r->path_b); + + if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) !=3D 0) + return -1; + + r->shared =3D mmap(NULL, sizeof(*r->shared), PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, -1, 0); + if (r->shared =3D=3D MAP_FAILED) + return -1; + memset(r->shared, 0, sizeof(*r->shared)); + r->shared->sun_family =3D AF_UNIX; + strcpy(r->shared->sun_path, r->path_a); + + r->listener =3D install_user_notif_filter(__NR_bind); + if (r->listener < 0) + return -1; + + r->child =3D fork(); + if (r->child < 0) + return -1; + if (r->child =3D=3D 0) { + int fd =3D socket(AF_UNIX, SOCK_DGRAM, 0); + + if (fd < 0) + _exit(10); + if (bind(fd, (struct sockaddr *)r->shared, + sizeof(*r->shared)) < 0) + _exit(11); + _exit(0); + } + return 0; +} + +static void bind_race_teardown(struct bind_race *r) +{ + if (r->child > 0) + waitpid(r->child, NULL, WNOHANG); + if (r->listener >=3D 0) + close(r->listener); + if (r->shared !=3D MAP_FAILED) + munmap(r->shared, sizeof(*r->shared)); + unlink(r->path_a); + unlink(r->path_b); +} + +/* Pin arg 1 (the sockaddr*) of the outstanding bind() notif. On success, + * @readback (>=3D sizeof(sockaddr_un)) holds the snapshotted bytes. 
+ */ +static int do_pin_sockaddr(int listener, __u64 id, + void *readback, size_t readback_size) +{ + struct seccomp_notif_pin_args pinreq; + + memset(&pinreq, 0, sizeof(pinreq)); + pinreq.id =3D id; + pinreq.nr_args =3D 1; + pinreq.buf_size =3D readback_size; + pinreq.buf =3D (uintptr_t)readback; + pinreq.args[0].arg_idx =3D 1; + pinreq.args[0].kind =3D SECCOMP_PIN_FIXED; + pinreq.args[0].max_bytes =3D sizeof(struct sockaddr_un); + + return ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq); +} + +/* + * Pin a sockaddr the trapped child is about to bind(), mutate the + * underlying shared memory, send CONTINUE | CONTINUE_PINNED, and verify + * that the kernel binds against the *pinned* path rather than the + * mutated one. + */ +TEST(pin_args_sockaddr_bind) +{ + struct bind_race r; + struct seccomp_notif req =3D {}; + struct seccomp_notif_resp resp =3D {}; + char readback[sizeof(struct sockaddr_un)]; + struct sockaddr_un *seen; + struct stat st; + int status; + + ASSERT_EQ(0, bind_race_setup(&r)); + + EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + EXPECT_EQ(req.data.nr, __NR_bind); + + memset(readback, 0, sizeof(readback)); + ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id, + readback, sizeof(readback))); + + seen =3D (struct sockaddr_un *)readback; + EXPECT_EQ(seen->sun_family, (sa_family_t)AF_UNIX); + EXPECT_STREQ(seen->sun_path, r.path_a); + + /* Race: mutate shared memory before SEND. */ + strcpy(r.shared->sun_path, r.path_b); + + resp.id =3D req.id; + resp.flags =3D SECCOMP_USER_NOTIF_FLAG_CONTINUE | + SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED; + EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(r.child, &status, 0), r.child); + r.child =3D -1; + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + /* Pinned path won. 
*/ + EXPECT_EQ(stat(r.path_a, &st), 0); + EXPECT_EQ(stat(r.path_b, &st), -1); + EXPECT_EQ(errno, ENOENT); + + bind_race_teardown(&r); +} + +/* + * Negative pair of the above: pin then send CONTINUE *without* PINNED. + * The pin must be discarded and the kernel re-read user memory, so the + * bind should land at the mutated path (path_b) =E2=80=94 the existing + * SECCOMP_USER_NOTIF_FLAG_CONTINUE behavior is preserved. + */ +TEST(pin_args_continue_without_pinned) +{ + struct bind_race r; + struct seccomp_notif req =3D {}; + struct seccomp_notif_resp resp =3D {}; + char readback[sizeof(struct sockaddr_un)]; + struct stat st; + int status; + + ASSERT_EQ(0, bind_race_setup(&r)); + + EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + EXPECT_EQ(req.data.nr, __NR_bind); + + memset(readback, 0, sizeof(readback)); + ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id, + readback, sizeof(readback))); + + strcpy(r.shared->sun_path, r.path_b); + + resp.id =3D req.id; + resp.flags =3D SECCOMP_USER_NOTIF_FLAG_CONTINUE; /* no PINNED */ + EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(r.child, &status, 0), r.child); + r.child =3D -1; + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + /* Pin discarded; mutated path won. */ + EXPECT_EQ(stat(r.path_a, &st), -1); + EXPECT_EQ(errno, ENOENT); + EXPECT_EQ(stat(r.path_b, &st), 0); + + bind_race_teardown(&r); +} + +/* + * CONTINUE_PINNED without CONTINUE must be rejected with -EINVAL by + * NOTIF_SEND (the flag is meaningless in isolation). After the rejection + * the supervisor can still send a normal CONTINUE to let the child run. 
+ */
+TEST(pin_args_continue_pinned_alone)
+{
+	struct bind_race r;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[sizeof(struct sockaddr_un)];
+	int status;
+
+	ASSERT_EQ(0, bind_race_setup(&r));
+
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	memset(readback, 0, sizeof(readback));
+	ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+				     readback, sizeof(readback)));
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED; /* alone: invalid */
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/* Recover by sending a regular CONTINUE so the child can finish. */
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+	r.child = -1;
+	EXPECT_EQ(true, WIFEXITED(status));
+
+	bind_race_teardown(&r);
+}
+
+/*
+ * Two PIN_ARGS calls for the same notif id: the second must be rejected
+ * with -EEXIST. The original snapshot stays in effect.
+ */
+TEST(pin_args_double_pin)
+{
+	struct bind_race r;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[sizeof(struct sockaddr_un)];
+	char readback2[sizeof(struct sockaddr_un)];
+	int status;
+
+	ASSERT_EQ(0, bind_race_setup(&r));
+
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	memset(readback, 0, sizeof(readback));
+	ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+				     readback, sizeof(readback)));
+
+	memset(readback2, 0, sizeof(readback2));
+	EXPECT_EQ(do_pin_sockaddr(r.listener, req.id,
+				  readback2, sizeof(readback2)), -1);
+	EXPECT_EQ(errno, EEXIST);
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+	r.child = -1;
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	bind_race_teardown(&r);
+}
+
+/*
+ * SECCOMP_PIN_CSTRING: pin the path passed to openat(), mutate the
+ * shared user-memory copy of the path between PIN_ARGS and SEND, and
+ * verify that the kernel opens the *pinned* path rather than the
+ * mutated one.
+ *
+ * Matches the motivating attack against path-based filters: supervisor
+ * blesses /tmp/pin-A; sibling rewrites the path to /tmp/pin-B; the
+ * kernel must still open /tmp/pin-A.
+ */
+TEST(pin_args_openat_cstring)
+{
+	char *shared_path;
+	char path_a[64], path_b[64];
+	struct seccomp_notif_pin_args pinreq;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[PATH_MAX];
+	int listener, status, fd_a, fd_b;
+	pid_t pid;
+	long ret;
+
+	snprintf(path_a, sizeof(path_a), "/tmp/seccomp-pin-cstr-%d-A", getpid());
+	snprintf(path_b, sizeof(path_b), "/tmp/seccomp-pin-cstr-%d-B", getpid());
+
+	/* Pre-create both targets so openat() succeeds either way; we
+	 * verify *which* file got opened, not whether open succeeded.
+	 */
+	fd_a = open(path_a, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+	ASSERT_GE(fd_a, 0);
+	write(fd_a, "A", 1);
+	close(fd_a);
+
+	fd_b = open(path_b, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+	ASSERT_GE(fd_b, 0);
+	write(fd_b, "B", 1);
+	close(fd_b);
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	shared_path = mmap(NULL, PATH_MAX, PROT_READ | PROT_WRITE,
+			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(MAP_FAILED, shared_path);
+	memset(shared_path, 0, PATH_MAX);
+	strcpy(shared_path, path_a);
+
+	listener = install_user_notif_filter(__NR_openat);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		char buf[2] = {0};
+		int fd = openat(AT_FDCWD, shared_path, O_RDONLY);
+
+		if (fd < 0)
+			_exit(10);
+		if (read(fd, buf, 1) != 1)
+			_exit(11);
+		close(fd);
+		/* Encode which file we read in the exit code. */
+		_exit(buf[0] == 'A' ? 0 : (buf[0] == 'B' ? 1 : 12));
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.data.nr, __NR_openat);
+
+	memset(&pinreq, 0, sizeof(pinreq));
+	memset(readback, 0, sizeof(readback));
+	pinreq.id = req.id;
+	pinreq.nr_args = 1;
+	pinreq.buf_size = sizeof(readback);
+	pinreq.buf = (uintptr_t)readback;
+	pinreq.args[0].arg_idx = 1; /* openat: pathname */
+	pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+	pinreq.args[0].max_bytes = PATH_MAX;
+
+	ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+	}
+	EXPECT_STREQ(readback, path_a);
+	EXPECT_EQ(pinreq.args[0].truncated, 0);
+
+	/* Race: mutate the path before SEND. */
+	strcpy(shared_path, path_b);
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	/* Child read 'A' if pin won, 'B' if mutation won. */
+	EXPECT_EQ(0, WEXITSTATUS(status)) {
+		TH_LOG("opened %s instead of pinned %s",
+		       WEXITSTATUS(status) == 1 ? "path_b" : "?", path_a);
+	}
+
+	unlink(path_a);
+	unlink(path_b);
+	munmap(shared_path, PATH_MAX);
+	close(listener);
+}
+
+/* CSTRING truncation: ask for fewer bytes than the actual path; verify
+ * the truncation flag is set and actual_size == max_bytes.
+ */
+TEST(pin_args_cstring_truncated)
+{
+	char *shared_path;
+	char path[128];
+	struct seccomp_notif_pin_args pinreq;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[16];
+	int listener, status;
+	pid_t pid;
+
+	snprintf(path, sizeof(path),
+		 "/tmp/seccomp-pin-trunc-%d-LONG-PATH-NAME", getpid());
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	shared_path = mmap(NULL, PATH_MAX, PROT_READ | PROT_WRITE,
+			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(MAP_FAILED, shared_path);
+	memset(shared_path, 0, PATH_MAX);
+	strcpy(shared_path, path);
+
+	listener = install_user_notif_filter(__NR_openat);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Will fail with ENOENT; we don't care, we just want to
+		 * trigger the trap so the supervisor can run PIN_ARGS.
+		 */
+		openat(AT_FDCWD, shared_path, O_RDONLY);
+		_exit(0);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	memset(&pinreq, 0, sizeof(pinreq));
+	memset(readback, 0, sizeof(readback));
+	pinreq.id = req.id;
+	pinreq.nr_args = 1;
+	pinreq.buf_size = sizeof(readback);
+	pinreq.buf = (uintptr_t)readback;
+	pinreq.args[0].arg_idx = 1;
+	pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+	pinreq.args[0].max_bytes = sizeof(readback); /* deliberately small */
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq), 0);
+	EXPECT_EQ(pinreq.args[0].truncated, SECCOMP_PIN_TRUNCATED_BYTES);
+	EXPECT_EQ(pinreq.args[0].actual_size, sizeof(readback));
+	/* Buffer is NUL-terminated even when truncated. */
+	EXPECT_EQ(readback[sizeof(readback) - 1], '\0');
+
+	/* Just continue normally so the child completes. */
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+
+	munmap(shared_path, PATH_MAX);
+	close(listener);
+}
+
+/*
+ * SECCOMP_PIN_CSTRING_ARRAY: pin argv at execve(), mutate the argv
+ * pointer table (and the strings it points to) between PIN_ARGS and
+ * SEND, and verify the kernel execs against the *pinned* argv.
+ *
+ * Reproduces the §1 attack from the design doc: the supervisor sees
+ * a blessed argv, a shared peer rewrites argv between supervisor read
+ * and kernel re-read, and without PIN_ARGS the kernel would exec
+ * against the rewritten bytes.
+ */
+TEST(pin_args_execve_argv)
+{
+	char *shared;
+	char *strA, *strB;
+	char **argv_ptrs;
+	struct seccomp_notif_pin_args pinreq;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[1024];
+	static const char *const envp_a[] = {"CHK=A", NULL};
+	int listener, status;
+	pid_t pid;
+	long ret;
+
+	/*
+	 * Set up the argv table and string storage in shared memory so
+	 * the supervisor can mutate them between PIN_ARGS and SEND.
+	 */
+	shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(MAP_FAILED, shared);
+
+	/* Layout in the shared page:
+	 * [argv_ptrs: char*[3]] [strA: 32 bytes] [strB: 32 bytes]
+	 */
+	argv_ptrs = (char **)shared;
+	strA = shared + sizeof(char *) * 3;
+	strB = strA + 32;
+	strcpy(strA, "/bin/true");
+	strcpy(strB, "/bin/false");
+	argv_ptrs[0] = strA;
+	argv_ptrs[1] = NULL; /* will mutate to strB before SEND */
+	argv_ptrs[2] = NULL;
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	listener = install_user_notif_filter(__NR_execve);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* execve("/bin/true", {strA, NULL}, {"CHK=A", NULL}).
+		 * The supervisor will mutate argv to point at strB before
+		 * CONTINUE_PINNED. With PIN_ARGS working, the kernel still
+		 * execs /bin/true (filename is also pinned in this test),
+		 * exit code 0. Without it, the kernel would re-read argv
+		 * and exec /bin/false, exit code 1.
+		 *
+		 * We pin the *filename* (arg 0) too so the mutation can't
+		 * change which binary runs by changing argv[0].
+		 */
+		execve(strA, argv_ptrs, (char *const *)envp_a);
+		_exit(99);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.data.nr, __NR_execve);
+
+	/* Pin filename (CSTRING) and argv (CSTRING_ARRAY). */
+	memset(&pinreq, 0, sizeof(pinreq));
+	memset(readback, 0, sizeof(readback));
+	pinreq.id = req.id;
+	pinreq.nr_args = 2;
+	pinreq.buf_size = sizeof(readback);
+	pinreq.buf = (uintptr_t)readback;
+
+	pinreq.args[0].arg_idx = 0;
+	pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+	pinreq.args[0].max_bytes = 64;
+
+	pinreq.args[1].arg_idx = 1;
+	pinreq.args[1].kind = SECCOMP_PIN_CSTRING_ARRAY;
+	pinreq.args[1].max_bytes = 512;
+	pinreq.args[1].max_entries = 8;
+
+	ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+	}
+	EXPECT_EQ(pinreq.args[1].actual_entries, 1);
+
+	/* Mutate the argv pointer table to swap in strB ("/bin/false"). */
+	argv_ptrs[0] = strB;
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	/* /bin/true exits 0; /bin/false exits 1; execve failure exits 99. */
+	EXPECT_EQ(0, WEXITSTATUS(status)) {
+		TH_LOG("expected /bin/true (pinned) but got exit code %d",
+		       WEXITSTATUS(status));
+	}
+
+	munmap(shared, 4096);
+	close(listener);
+}
+
+/*
+ * SECCOMP_PIN_FIXED applied to write(fd, buf, count): pin @buf via
+ * PIN_ARGS, mutate the underlying shared bytes between PIN_ARGS and
+ * SEND, and verify the bytes the kernel actually writes to disk are
+ * the *pinned* ones, not the mutated ones.
+ */
+TEST(pin_args_write_buf)
+{
+	char *shared_buf;
+	char file_path[64];
+	struct seccomp_notif_pin_args pinreq;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	const char *pinned_msg = "PINNED";
+	const char *mutated_msg = "MUTATED";
+	size_t msg_len = strlen(pinned_msg);
+	char readback[16];
+	char file_content[16];
+	int listener, status, file_fd;
+	pid_t pid;
+	long ret;
+
+	snprintf(file_path, sizeof(file_path),
+		 "/tmp/seccomp-pin-write-%d", getpid());
+	unlink(file_path);
+
+	file_fd = open(file_path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+	ASSERT_GE(file_fd, 0);
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	shared_buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(MAP_FAILED, shared_buf);
+	memcpy(shared_buf, pinned_msg, msg_len);
+
+	listener = install_user_notif_filter(__NR_write);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ssize_t n;
+
+		n = write(file_fd, shared_buf, msg_len);
+		_exit(n == (ssize_t)msg_len ? 0 : 10);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.data.nr, __NR_write);
+
+	memset(&pinreq, 0, sizeof(pinreq));
+	memset(readback, 0, sizeof(readback));
+	pinreq.id = req.id;
+	pinreq.nr_args = 1;
+	pinreq.buf_size = sizeof(readback);
+	pinreq.buf = (uintptr_t)readback;
+	pinreq.args[0].arg_idx = 1; /* write: buf */
+	pinreq.args[0].kind = SECCOMP_PIN_FIXED;
+	pinreq.args[0].max_bytes = msg_len;
+
+	ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+	}
+	EXPECT_EQ(pinreq.args[0].actual_size, msg_len);
+	EXPECT_EQ(0, memcmp(readback, pinned_msg, msg_len));
+
+	/* Race: rewrite the buffer the child is about to write. */
+	memcpy(shared_buf, mutated_msg, msg_len);
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(file_fd);
+	file_fd = open(file_path, O_RDONLY);
+	ASSERT_GE(file_fd, 0);
+	memset(file_content, 0, sizeof(file_content));
+	EXPECT_EQ((ssize_t)msg_len,
+		  read(file_fd, file_content, sizeof(file_content)));
+	close(file_fd);
+
+	/* The pinned bytes should be on disk. */
+	EXPECT_EQ(0, memcmp(file_content, pinned_msg, msg_len)) {
+		TH_LOG("file contained '%.*s'; expected '%s'",
+		       (int)msg_len, file_content, pinned_msg);
+	}
+
+	unlink(file_path);
+	munmap(shared_buf, 4096);
+	close(listener);
+}
+
+/*
+ * The pin is single-shot: after CONTINUE_PINNED, the subsequent
+ * task_work-driven clear must run before the trapped task issues its
+ * *next* filtered syscall, so a second PIN_ARGS for the new notif id
+ * succeeds (no stale -EEXIST). Validates the post-syscall lifecycle.
+ */
+TEST(pin_args_one_shot)
+{
+	struct sockaddr_un *shared;
+	char path_a[64], path_b[64];
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char readback[sizeof(struct sockaddr_un)];
+	int listener, status;
+	pid_t pid;
+
+	snprintf(path_a, sizeof(path_a), "/tmp/seccomp-pin-1shot-%d-A", getpid());
+	snprintf(path_b, sizeof(path_b), "/tmp/seccomp-pin-1shot-%d-B", getpid());
+	unlink(path_a);
+	unlink(path_b);
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(MAP_FAILED, shared);
+
+	listener = install_user_notif_filter(__NR_bind);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd1, fd2;
+
+		memset(shared, 0, sizeof(*shared));
+		shared->sun_family = AF_UNIX;
+		strcpy(shared->sun_path, path_a);
+		fd1 = socket(AF_UNIX, SOCK_DGRAM, 0);
+		if (fd1 < 0)
+			_exit(10);
+		if (bind(fd1, (struct sockaddr *)shared,
+			 sizeof(*shared)) < 0)
+			_exit(11);
+
+		strcpy(shared->sun_path, path_b);
+		fd2 = socket(AF_UNIX, SOCK_DGRAM, 0);
+		if (fd2 < 0)
+			_exit(12);
+		if (bind(fd2, (struct sockaddr *)shared,
+			 sizeof(*shared)) < 0)
+			_exit(13);
+		_exit(0);
+	}
+
+	/* First trap: bind(path_a). Pin and CONTINUE_PINNED. */
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	memset(readback, 0, sizeof(readback));
+	ASSERT_EQ(0, do_pin_sockaddr(listener, req.id,
+				     readback, sizeof(readback)));
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	/* Second trap: bind(path_b). PIN_ARGS must succeed (no stale pin
+	 * from the first trap leaking via -EEXIST).
+	 */
+	memset(&req, 0, sizeof(req));
+	memset(&resp, 0, sizeof(resp));
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.data.nr, __NR_bind);
+	memset(readback, 0, sizeof(readback));
+	ASSERT_EQ(0, do_pin_sockaddr(listener, req.id,
+				     readback, sizeof(readback))) {
+		TH_LOG("second PIN_ARGS failed (errno=%d %s); pin from prior trap may have leaked",
+		       errno, strerror(errno));
+	}
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+		     SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	struct stat st;
+
+	EXPECT_EQ(stat(path_a, &st), 0);
+	EXPECT_EQ(stat(path_b, &st), 0);
+
+	unlink(path_a);
+	unlink(path_b);
+	munmap(shared, sizeof(*shared));
+	close(listener);
+}
+
+/* SIGKILL the trapped child while a pin is attached but not yet armed.
+ * The kpa must be freed; supervisor's listener fd must remain healthy.
+ */
+TEST(pin_args_sigkill_child)
+{
+	struct bind_race r;
+	struct seccomp_notif req = {};
+	char readback[sizeof(struct sockaddr_un)];
+	int status;
+
+	ASSERT_EQ(0, bind_race_setup(&r));
+
+	EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	memset(readback, 0, sizeof(readback));
+	ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+				     readback, sizeof(readback)));
+
+	/* Pin attached, not armed. Kill the child mid-wait. */
+	kill(r.child, SIGKILL);
+
+	EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+	r.child = -1;
+	EXPECT_EQ(true, WIFSIGNALED(status));
+	EXPECT_EQ(SIGKILL, WTERMSIG(status));
+
+	/*
+	 * Listener fd is still valid. F_GETFD returns the FD flags
+	 * (FD_CLOEXEC is set on the listener by seccomp), so the
+	 * health-check is "not -1", not "== 0".
+	 */
+	EXPECT_NE(-1, fcntl(r.listener, F_GETFD));
+
+	bind_race_teardown(&r);
+}
+
+TEST_HARNESS_MAIN
-- 
2.43.0

From nobody Sun Jun 14 04:09:40 2026
From: Cong Wang
To: Kees Cook, linux-kernel@vger.kernel.org
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
Subject: [RFC PATCH 3/3] Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
Date: Sun, 3 May 2026 18:12:07 -0700
Message-ID: <20260504011207.539408-4-xiyou.wangcong@gmail.com>
In-Reply-To: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
References: <20260504011207.539408-1-xiyou.wangcong@gmail.com>

From: Cong Wang

Add a "Pinned arguments" section to the userspace API doc covering the
motivation (closing the documented TOCTOU window for unprivileged
supervisors), the pin/consume flow via SECCOMP_IOCTL_NOTIF_PIN_ARGS and
SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the three v1 shapes with their
per-shape semantics, the single-shot lifecycle, the syscall_nr mismatch
check, and the explicitly-not-covered cases left for follow-ups (vector
I/O, nested-pointer payloads).

Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang
---
 .../userspace-api/seccomp_filter.rst          | 76 +++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..8bbbd923c31d 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,82 @@ above in this document: all arguments being read from the tracee's memory
 should be read into the tracer's memory before any policy decisions are made.
 This allows for an atomic decision on syscall arguments.
 
+Pinned arguments
+----------------
+
+For unprivileged supervisors, ``ptrace()``/``/proc/pid/mem`` are not
+available, and reading the tracee's memory via ``process_vm_readv()``
+remains racy: a sibling thread or ``CLONE_VM`` peer can mutate the
+buffer between supervisor read and the kernel's re-read on
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE``. ``SECCOMP_IOCTL_NOTIF_PIN_ARGS``
+closes that race by atomically copying designated pointer-arg payloads
+from the tracee's address space into kernel-owned buffers, and binding
+those buffers to the tracee's next-syscall execution.
+
+The supervisor receives a notification as today, then issues
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_ARGS, &payload)`` with a
+``struct seccomp_notif_pin_args`` describing which pointer-args to
+snapshot. Each per-arg descriptor names a syscall register slot
+(``arg_idx``, 0..5), one of three shapes (``SECCOMP_PIN_FIXED``,
+``SECCOMP_PIN_CSTRING``, ``SECCOMP_PIN_CSTRING_ARRAY``), and a
+``max_bytes`` cap. The kernel walks the trapped task's mm, copies
+the bytes into kernel buffers, and writes them back to a supervisor-
+provided byte buffer (``buf`` / ``buf_size``) plus per-arg metadata
+(``actual_size``, ``buf_offset``, ``truncated``).
+
+To consume the snapshot on syscall re-execution, the supervisor sends
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` with both
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` and
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED`` set. The kernel's syscall
+fetch points (``getname_flags``, ``copy_strings``,
+``move_addr_to_kernel``, ``import_ubuf``) check
+``current->seccomp.pinned_args`` and consume from the kernel buffer
+instead of re-reading user memory; mutations to the original buffer
+after ``PIN_ARGS`` returns have no effect.
+
+The pin is single-shot: it is cleared automatically when the trapped
+task next returns to user mode after the resumed syscall body
+completes, when the task exits, when the listener fd is closed, or
+when the supervisor sends ``CONTINUE`` without ``CONTINUE_PINNED``
+(an explicit "I changed my mind" path). Subsequent traps require a
+fresh ``PIN_ARGS`` for the new notification id.
+
+Per-shape semantics:
+
+* ``SECCOMP_PIN_FIXED`` copies exactly ``max_bytes`` from
+  ``args[arg_idx]``. Suitable for ``struct sockaddr`` (``bind``,
+  ``connect``, ``sendto``) and for ``write(fd, buf, count)`` (the
+  supervisor sets ``max_bytes = count`` from
+  ``seccomp_data.args[2]``).
+
+* ``SECCOMP_PIN_CSTRING`` walks to the trailing NUL, capped at
+  ``max_bytes``. The pinned buffer is always NUL-terminated; if the
+  cap was hit before the source NUL, ``truncated`` carries
+  ``SECCOMP_PIN_TRUNCATED_BYTES``. Suitable for paths
+  (``open``/``openat``/``execve`` filename, etc.).
+
+* ``SECCOMP_PIN_CSTRING_ARRAY`` walks a NULL-terminated pointer table
+  at ``args[arg_idx]`` and copies each non-NULL string. Suitable for
+  ``execve``'s argv and envp. Bounded by both ``max_bytes`` and
+  ``max_entries``. Result is packed as
+  ``[u32 count][u32 offsets[count]][u8 strings[]]``.
+ +The kernel records the syscall number at pin time and verifies a +match at consumption: a signal handler running on the trapped task +during ``-ERESTART*`` resolution that issues an unrelated syscall +will not consume the pin. + +Cumulative scope of v1: ``SECCOMP_PIN_FIXED`` covers sockaddr and +single-buffer write content; ``SECCOMP_PIN_CSTRING`` covers paths; +``SECCOMP_PIN_CSTRING_ARRAY`` covers argv and envp. Vector I/O +(``readv``/``writev``) and nested-pointer payloads +(``sendmsg``/``recvmsg`` ``msghdr``, ``futex_waitv``) are not covered +in v1. + Sysctls =3D=3D=3D=3D=3D=3D=3D =20 --=20 2.43.0