From nobody Mon Jun  8 07:22:03 2026
Received: from mail-dy1-f196.google.com (mail-dy1-f196.google.com
 [74.125.82.196])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 838BE31AABC
	for <linux-kernel@vger.kernel.org>; Thu,  4 Jun 2026 19:47:07 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=74.125.82.196
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780602429; cv=none;
 b=dkbAMlpvshA1w68ydE0fr+bJyDO3muTlSOJiYQLcaGu4dyz2AEHVuRUrjPlnDz7e3sXtWft59RpU14A2WX40r2JngEIb7mY1fh9JUco9r5Ds5op8zxMpqqYnkvyN3bWycyZ5kLQn9RW+Qi+DrOlpTvevjoee8Vppi1efmElSmek=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780602429; c=relaxed/simple;
	bh=M+ZzsX5p85PGrfpITsWUg6sfDTAyZPfmyRgLp49+MxI=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=h9o8jkjoY4JvCnXNZzq9F34BhLAWZD9jekxa9prg4OfyAYXyahSm6UpYUxgGxHAYL93Q+yRNBsStxp9SV8sl01i4pbC6XLcBERv4lRBsucbAldKUDryUjhi+QXJhDKwGdZvXVMKBjax7bUSPMEbR/vvz0uBARPEeSgdSLTr6df8=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com;
 spf=pass smtp.mailfrom=gmail.com;
 dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b=gom5bWBo; arc=none smtp.client-ip=74.125.82.196
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="gom5bWBo"
Received: by mail-dy1-f196.google.com with SMTP id
 5a478bee46e88-304df7ff4c2so1490777eec.0
        for <linux-kernel@vger.kernel.org>;
 Thu, 04 Jun 2026 12:47:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20251104; t=1780602426; x=1781207226;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=mejW8PaA8z97cDibT5Pd0sISLOYkfEMXCte0NLxtt+o=;
        b=gom5bWBoeBuN+6hqN4cXjROXUamWGY30yBNYRdNz3Sn+JR441BdSa9fUp2c2R1FgvG
         cVqeQu4J+TSxse08FTuJfWEGr6X0FWpnCKh8RadmsYF7c0Mj+hVEqto1MAgl093/RsGc
         4zq2TKllOTjCuxuHBaikZoUfvGyjwGcSaC8H/F850cztnQrTuYFY+3BhSAgt0PjllSOf
         FEMTpOozrrUPyZP3GGKZJIYFI1iFu5rv1OLSkZWMNsFehZdSWfWjVrt1l50JJC+LAgf7
         0rpV4G5N6WaF+iK2KlR3UfqEiYZzsCGMKnVSSzbJeVl1fFGkgSFHTwA2ySUb4JR+Pwr1
         C9eA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1780602426; x=1781207226;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=mejW8PaA8z97cDibT5Pd0sISLOYkfEMXCte0NLxtt+o=;
        b=dFV/OAQhOHzgSwTJsLMMSXvw3Du7qcPVPsnfDhIx0cvpwyV3zGYFRrwX7ZKTZL8xba
         phzr5O789Kzm3wxO17GwNUgUQ3IxKnnqdAOg36JTg1A6/Xy8Ux+MxUJvRTSgTPZbsrP3
         oFV+TthtpXN1t42vlJy5BavyM6xiqDzqbDWMdhxLi7BaQT8gE2d8sn7k4EBZsVe8iMs8
         0V5csHgyLXcMufM9U8DJM0Mucp0FP1ZiNKj/jAzWXbsOzjUA0MUdx9tOtd4XssjKMpjA
         DEK/OoYnO6RnuC5iDXwezdE1W8wCF9kApEI91V1X87a4vNUhttW6Ud4zWZD1EhGi4KvI
         GM/Q==
X-Forwarded-Encrypted: i=1;
 AFNElJ8IG77R7qrzpTElDKhzJIx+pfnC0mFKqsJvPvBHgh226jqekuTBRFTsxppZJSvASP1ni8Ugcohgt6hrZ3o=@vger.kernel.org
X-Gm-Message-State: AOJu0YxbjtUYAiMRf0PH+uTMyp4nIC6CybMNwon2jLOYxt9SLV9jRl1p
	cM9GiTkAV75/xnH0n1KhJ/8sIIrzVcjANn0OwvslY5Q2wNEmCwVAFcSQ
X-Gm-Gg: Acq92OFKHzybBWq2z2J1LCHAKpVrw3Vj3gFxlddhrihZPj5PnzeLKXl0Gam3+EC+OU9
	MNUqnRFOlPfmG2KtmWgKwAJ2tNDaJsrk9npTATnfY77cARKlCGhOFfsjHwaBWhXTLEswkNx1SVc
	MSjF9H5CT6+bPUVM0x6gnDExZ6ssiOF00iQ1PcMmQj4waxxt4LDXyT5W3yfyleLezmRT9dVTYzu
	M88nufnRZboI+nYDeCQ39bHwqOtZoLjXqdIxco1ui0YdjcClsjjr643jmWqb6F5iybL49enVcic
	Omha5PlwLLQDOhVoJ2wL+9QzPpivoPTay+AjzXbd7/vMTqIClORg8weBUg2IKN3DUyPTlPWLvyp
	g9GQvCcEieX6lpXCxYLk8fX+FcPorToAaW22fFgcN2nDBYWAzNaDtoKksnlEoKpy53VuosD0Xny
	q3pf4y8YagOH1WjadDtH2m9ybXYGYmFQ==
X-Received: by 2002:a05:7301:188b:b0:2c5:c532:1fe1 with SMTP id
 5a478bee46e88-30761d0ad83mr1687234eec.3.1780602426335;
        Thu, 04 Jun 2026 12:47:06 -0700 (PDT)
Received: from anonyme ([2605:52c0:2:2f27:be24:11ff:fe89:6f0f])
        by smtp.gmail.com with ESMTPSA id
 5a478bee46e88-3074df9bbd4sm5159499eec.30.2026.06.04.12.47.04
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 04 Jun 2026 12:47:05 -0700 (PDT)
From: AnonymeMeow <anonymemeow@gmail.com>
To: amir73il@gmail.com
Cc: brauner@kernel.org,
	jack@suse.cz,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	AnonymeMeow <anonymemeow@gmail.com>
Subject: [DRAFT][PATCH] fanotify: lift pidfd reporting restrictions
Date: Fri,  5 Jun 2026 03:46:50 +0800
Message-ID: <20260604194650.2826-1-anonymemeow@gmail.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: 
 <CAOQ4uxgudh_W2Gcs5qqGce8yO6503yjrGzEwbjTKkT+DLAqufA@mail.gmail.com>
References: 
 <CAOQ4uxgudh_W2Gcs5qqGce8yO6503yjrGzEwbjTKkT+DLAqufA@mail.gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The FAN_REPORT_PIDFD and FAN_REPORT_TID flags used to be mutually
exclusive because by the time the pidfd support was introduced to
fanotify, pidfds could only be created for thread group leaders. Now
that the pidfd API supports thread-specific pidfds via PIDFD_THREAD,
this restriction can be lifted.

Fanotify used to refuse to report pidfds for reaped tasks by applying a
pid_has_task() check before calling pidfd_prepare(). This prevented
userspace from obtaining information about the task.

Register the event pid with pidfs when creating the fanotify event if
pidfd reporting was requested, so pidfd_prepare() can later create a
pidfd for the reaped task.

Signed-off-by: AnonymeMeow <anonymemeow@gmail.com>
---

On 2026-06-03 12:00 +0200, Amir Goldstein wrote:
>=20
> Thanks for noticing the oddity and explaining the options.
>=20
> We could fail the event in case on ENOMEM because the system
> is likely is a bad shape anyway and other events are likely to fail
> allocation themselves, so I am not objecting to your patch as is,

The existing behavior of fanotify is that, for groups with unlimited
event queues, event allocation is performed with __GFP_NOFAIL and
therefore must not fail. For groups with limited queues, event allocation
uses __GFP_RETRY_MAYFAIL, which avoids entering the OOM path.

Currently, pidfs_register_pid() uses hardcoded GFP_KERNEL. For non-costly
allocations(pid attr falls into this category, am I right?), this is
almost no-fail, but it will not avoid the OOM path. That means it can
bypass fanotify's OOM-killer avoidance mitigation for limited queues
and potentially trigger the OOM killer in the monitoring memcg under
memory pressure, breaking the intended security isolation, as sashiko
mentioned. In the meanwhile, GFP_KERNEL does not guarantee no-fail in
all corner cases, so for unlimited queues, it can still cause events
to be dropped under memory pressure, this is also reported by sashiko.

Since this patch set is meant to be a functional enhancement, I think
dropping events from an unlimited queue is not acceptable and can be
a regression.

The cleanest solution I can come up with is something like this patch.
(I squashed the commits so that sashiko can apply the patch correctly)
It adds a pidfs registration helper that takes a gfp_t argument, and
fanotify registers the pid before allocating the event. With an unlimited
queue, both pid registration and event allocation are no-fail, the
existing behavior is preserved. With a limited queue, if pid registration
fails with ENOMEM, fanotify simply returns NULL and drops the event,
just like it would do if event allocation itself failed with
__GFP_RETRY_MAYFAIL.

This also avoids having to destroy an already allocated event on pid
registration failure. Since the pid reference is only taken when
assigning the event->pid field, there is no extra put_pid() path needed
either. And this guarantees that whenever event allocation succeeds,
the pid has already been registered with pidfs, so userspace can
reliably get a pidfd later. The API semantics are no longer ambiguous.

I also originally thought pidfs_register_pid() did not need to run in
the target memcg because it does not explicitly use __GFP_ACCOUNT.
However, pidfs_attr_cachep is created with SLAB_ACCOUNT, so the
registration should also happen inside the set_active_memcg() scope.
This version does that.

But I don't know whether this amount of churn is justified for a
relatively small feature enhancement. For the next revision, I am
still inclined to send the simpler best-effort registration approach,
unless reviewers think this stricter version is preferable.

In the best-effort version, fanotify would allocate the event first
and then try to register the pid with pidfs using the existing
GFP_KERNEL flag. If registration fails, the event is still queued.
This does mean that pidfd creation at read time is not guarenteed to
succeed if the task has already been reaped, and it also means the
registration allocation may enter the OOM path even though the event
allocation used __GFP_RETRY_MAYFAIL. But I think accepting that very
small corner cases may be preferable if it avoids most of the
additional code churn.

With Best Regards,
Anonymemeow

>=20
> but I am slightly leaning towards the semantically ambiguous API.
>=20
> I don't think that we are going to change the man page to say that
> FAN_REPORT_PIDFD is guaranteed to return a pidfd, maybe we just
> add a NOTE about increased likelihood of getting a pidfd since kernel XXX
> and mention the known cases of pid namespace and ENOMEM.
>=20
> Thanks,
> Amir.
>=20

---
 fs/notify/fanotify/fanotify.c      | 17 +++++++++------
 fs/notify/fanotify/fanotify_user.c | 33 +++++++-----------------------
 fs/pidfs.c                         | 23 +++++++++++++++++----
 include/linux/pidfs.h              |  3 +++
 4 files changed, 40 insertions(+), 36 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 38290b9c07f7..ece9523e775b 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -14,6 +14,7 @@
 #include <linux/sched/mm.h>
 #include <linux/statfs.h>
 #include <linux/stringhash.h>
+#include <linux/pidfs.h>
=20
 #include "fanotify.h"
=20
@@ -842,6 +843,15 @@ static struct fanotify_event *fanotify_alloc_event(
 	/* Whoever is interested in the event, pays for the allocation. */
 	old_memcg =3D set_active_memcg(group->memcg);
=20
+	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
+		pid =3D task_pid(current);
+	else
+		pid =3D task_tgid(current);
+
+	if (FAN_GROUP_FLAG(group, FAN_REPORT_PIDFD) &&
+	    pidfs_register_pid_gfp(pid, gfp))
+		goto out;
+
 	if (fanotify_is_perm_event(mask)) {
 		event =3D fanotify_alloc_perm_event(data, data_type, gfp);
 	} else if (fanotify_is_error_event(mask)) {
@@ -863,15 +873,10 @@ static struct fanotify_event *fanotify_alloc_event(
 	if (!event)
 		goto out;
=20
-	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
-		pid =3D get_pid(task_pid(current));
-	else
-		pid =3D get_pid(task_tgid(current));
-
 	/* Mix event info, FAN_ONDIR flag and pid into event merge key */
 	hash ^=3D hash_long((unsigned long)pid | ondir, FANOTIFY_EVENT_HASH_BITS);
 	fanotify_init_event(event, hash, mask);
-	event->pid =3D pid;
+	event->pid =3D get_pid(pid);
=20
 out:
 	set_active_memcg(old_memcg);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanoti=
fy_user.c
index ae904451dfc0..b604e3da58ad 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -19,6 +19,7 @@
 #include <linux/memcontrol.h>
 #include <linux/statfs.h>
 #include <linux/exportfs.h>
+#include <linux/pidfd.h>
=20
 #include <asm/ioctls.h>
=20
@@ -903,25 +904,13 @@ static ssize_t copy_event_to_user(struct fsnotify_gro=
up *group,
 		metadata.fd =3D fd >=3D 0 ? fd : FAN_NOFD;
=20
 	if (pidfd_mode) {
-		/*
-		 * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
-		 * exclusion is ever lifted. At the time of incoporating pidfd
-		 * support within fanotify, the pidfd API only supported the
-		 * creation of pidfds for thread-group leaders.
-		 */
-		WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
+		unsigned int pidfd_flags =3D PIDFD_STALE;
=20
-		/*
-		 * The PIDTYPE_TGID check for an event->pid is performed
-		 * preemptively in an attempt to catch out cases where the event
-		 * listener reads events after the event generating process has
-		 * already terminated.  Depending on flag FAN_REPORT_FD_ERROR,
-		 * report either -ESRCH or FAN_NOPIDFD to the event listener in
-		 * those cases with all other pidfd creation errors reported as
-		 * the error code itself or as FAN_EPIDFD.
-		 */
-		if (metadata.pid && pid_has_task(event->pid, PIDTYPE_TGID))
-			pidfd =3D pidfd_prepare(event->pid, 0, &pidfd_file);
+		if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
+			pidfd_flags |=3D PIDFD_THREAD;
+
+		if (metadata.pid)
+			pidfd =3D pidfd_prepare(event->pid, pidfd_flags, &pidfd_file);
=20
 		if (!FAN_GROUP_FLAG(group, FAN_REPORT_FD_ERROR) && pidfd < 0)
 			pidfd =3D pidfd =3D=3D -ESRCH ? FAN_NOPIDFD : FAN_EPIDFD;
@@ -1628,14 +1617,6 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, =
unsigned int, event_f_flags)
 #endif
 		return -EINVAL;
=20
-	/*
-	 * A pidfd can only be returned for a thread-group leader; thus
-	 * FAN_REPORT_PIDFD and FAN_REPORT_TID need to remain mutually
-	 * exclusive.
-	 */
-	if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
-		return -EINVAL;
-
 	/* Don't allow mixing mnt events with inode events for now */
 	if (flags & FAN_REPORT_MNT) {
 		if (class !=3D FAN_CLASS_NOTIF)
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 1cce4f34a051..e9f6113efc5b 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -991,14 +991,16 @@ static void pidfs_put_data(void *data)
 }
=20
 /**
- * pidfs_register_pid - register a struct pid in pidfs
+ * pidfs_register_pid_gfp - register a struct pid in pidfs with custom GFP
+ * flags
  * @pid: pid to pin
+ * @gfp: GFP flags for memory allocation
  *
- * Register a struct pid in pidfs.
+ * Register a struct pid in pidfs with custom GFP flags.
  *
  * Return: On success zero, on error a negative error code is returned.
  */
-int pidfs_register_pid(struct pid *pid)
+int pidfs_register_pid_gfp(struct pid *pid, gfp_t gfp)
 {
 	struct pidfs_attr *new_attr __free(kfree) =3D NULL;
 	struct pidfs_attr *attr;
@@ -1014,7 +1016,7 @@ int pidfs_register_pid(struct pid *pid)
 	if (attr)
 		return 0;
=20
-	new_attr =3D kmem_cache_zalloc(pidfs_attr_cachep, GFP_KERNEL);
+	new_attr =3D kmem_cache_zalloc(pidfs_attr_cachep, gfp);
 	if (!new_attr)
 		return -ENOMEM;
=20
@@ -1031,6 +1033,19 @@ int pidfs_register_pid(struct pid *pid)
 	return 0;
 }
=20
+/**
+ * pidfs_register_pid - register a struct pid in pidfs
+ * @pid: pid to pin
+ *
+ * Register a struct pid in pidfs.
+ *
+ * Return: On success zero, on error a negative error code is returned.
+ */
+int pidfs_register_pid(struct pid *pid)
+{
+	return pidfs_register_pid_gfp(pid, GFP_KERNEL);
+}
+
 static struct dentry *pidfs_stash_dentry(struct dentry **stashed,
 					 struct dentry *dentry)
 {
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 416bdff4d6ce..8054f49bc139 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_PID_FS_H
 #define _LINUX_PID_FS_H
=20
+#include <linux/types.h>
+
 struct coredump_params;
=20
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
@@ -14,6 +16,7 @@ void pidfs_exit(struct task_struct *tsk);
 void pidfs_coredump(const struct coredump_params *cprm);
 #endif
 extern const struct dentry_operations pidfs_dentry_operations;
+int pidfs_register_pid_gfp(struct pid *pid, gfp_t gfp);
 int pidfs_register_pid(struct pid *pid);
 void pidfs_free_pid(struct pid *pid);
=20
--=20
2.54.0