From: wen.yang@linux.dev
To: Christian Brauner, Jan Kara, Alexander Viro
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Wen Yang, Jens Axboe
Subject: [RFC PATCH v5 1/2] eventfd: add configurable per-fd counter maximum for flow control
Date: Thu, 9 Apr 2026 01:24:48 +0800
Message-Id: <530e8b5e22e08f8459d335eaf31ff78b999fa5cf.1775668339.git.wen.yang@linux.dev>

From: Wen Yang

In non-semaphore mode, write(2) accumulates into the counter and a single
read(2) drains it entirely. A producer issuing repeated write(2) calls of
value 1 coalesces N signals into the counter; each write succeeds
immediately regardless of whether the consumer has processed earlier
events. With no bound below ULLONG_MAX (~1.8×10¹⁹), the counter grows
effectively without bound, consumer lag is invisible to the producer, and
in tight loops both sides burn CPU at 100% even though the consumer is not
keeping up.

Without a maximum, the batch size seen by each read(2) is also unbounded:
a slow consumer may drain thousands of accumulated signals in one call,
losing visibility into how far behind it has fallen.
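The coalescing behaviour described above can be reproduced against the
existing eventfd(2) API, with no new ioctl involved. The following is a
minimal sketch; the helper name coalesced_read() is hypothetical and exists
only for illustration. It writes the value 1 to a fresh eventfd n times and
then performs a single read.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: write the value 1 to a fresh
 * non-semaphore eventfd `n` times, then read once.
 * Returns the value seen by that single read, or 0 on any error.
 */
uint64_t coalesced_read(unsigned int n)
{
	uint64_t one = 1, seen = 0;
	int fd = eventfd(0, EFD_NONBLOCK);

	if (fd < 0)
		return 0;

	for (unsigned int i = 0; i < n; i++) {
		/* Each write succeeds immediately; nothing tells the producer
		 * whether the consumer has caught up. */
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return 0;
		}
	}

	/* One read(2) drains the whole counter and resets it to zero. */
	if (read(fd, &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
		seen = 0;

	close(fd);
	return seen;
}
```

With n = 1000 the single read returns 1000: the consumer sees one wakeup
for a thousand events, which is the lag-hiding effect this patch targets.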
Introduce two ioctl commands:

  EFD_IOC_SET_MAXIMUM (_IOW('J', 0, __u64))
    Set the overflow threshold. A write(2) that would push the counter
    to or beyond this value blocks (EAGAIN for O_NONBLOCK fds). Returns
    -EINVAL if the requested maximum is <= the current counter. Wakes
    any blocked writers so they re-evaluate the new limit without
    waiting for the next read(2).

  EFD_IOC_GET_MAXIMUM (_IOR('J', 1, __u64))
    Return the current threshold. Defaults to ULLONG_MAX, preserving
    the original unlimited behaviour. The value is also visible in
    /proc/self/fdinfo as "eventfd-maximum".

The maximum acts as the overflow level, exactly as ULLONG_MAX did in the
original design: the kernel-internal eventfd_signal() path may still raise
the counter to maximum (triggering EPOLLERR), while userspace writes are
capped at maximum-1.

This follows the backpressure pattern established by pipe(2): writers
block when the buffer is full, and capacity is adjustable via
fcntl(F_SETPIPE_SZ). POSIX message queues apply the same model:
mq_send(3) blocks when the queue depth reaches mq_maxmsg.

The following self-contained program covers three benchmarks.
Build and run with:

  gcc -O2 -lpthread bench.c -o bench && ./bench

/* bench.c */
#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#define SECS   5
#define MAX    10ULL
#define LAT_N  5000
#define COAL_N 10000ULL
#define WINT   100000ULL /* 100 µs → 10 K events/s */
#define RSLT   125000ULL /* 125 µs → ~8 K events/s */

/* helpers */
static uint64_t cpu_ms(void) {
	struct timespec t; clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t);
	return (uint64_t)t.tv_sec * 1000 + t.tv_nsec / 1000000;
}
static uint64_t mono_ns(void) {
	struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
	return (uint64_t)t.tv_sec * 1000000000ULL + t.tv_nsec;
}
static void set_max(int fd, uint64_t m) {
	if (m) ioctl(fd, EFD_IOC_SET_MAXIMUM, &m);
}
static void maxstr(char *b, uint64_t m) {
	if (!m) snprintf(b, 24, "ULLONG_MAX");
	else snprintf(b, 24, "%llu", (unsigned long long)m);
}

/* bench 1: burst/CPU savings */
enum mode { BLOCKING, SPIN, POLL_OUT };
static int burst_fd;
static volatile int stop;
static enum mode wmode;
static uint64_t wcpu, rcpu, neagain, nwrites, nreads;

static void *burst_writer(void *_) {
	(void)_;
	uint64_t v=1, n=0, ea=0, t0=cpu_ms();
	struct pollfd p={.fd=burst_fd,.events=POLLOUT};
	while (!stop) {
		if (wmode==BLOCKING) {
			if (write(burst_fd,&v,8)==8) n++;
		} else if (wmode==SPIN) {
			if (write(burst_fd,&v,8)<0 && errno==EAGAIN) ea++; else n++;
		} else {
			while (!stop && !(poll(&p,1,20)>0 && p.revents&POLLOUT));
			if (write(burst_fd,&v,8)==8) n++;
		}
	}
	wcpu=cpu_ms()-t0; neagain=ea; nwrites=n;
	return NULL;
}

static void *burst_reader(void *_) {
	(void)_;
	struct pollfd p={.fd=burst_fd,.events=POLLIN};
	uint64_t v, nr=0, t0=cpu_ms();
	while (stop==0 || (poll(&p,1,0)>0 && p.revents&POLLIN))
		if (poll(&p,1,5)>0 && read(burst_fd,&v,8)==8) { nr++; usleep(1000); }
	rcpu=cpu_ms()-t0; nreads=nr;
	return NULL;
}

static void run_burst(const char *lbl, enum mode m, uint64_t max) {
	burst_fd=eventfd(0, m!=BLOCKING ? EFD_CLOEXEC|EFD_NONBLOCK : EFD_CLOEXEC);
	set_max(burst_fd, max); wmode=m; stop=0;
	pthread_t w,r;
	pthread_create(&r,NULL,burst_reader,NULL);
	pthread_create(&w,NULL,burst_writer,NULL);
	cpu_set_t c;
	CPU_ZERO(&c); CPU_SET(0,&c); pthread_setaffinity_np(r,sizeof(c),&c);
	CPU_ZERO(&c); CPU_SET(1,&c); pthread_setaffinity_np(w,sizeof(c),&c);
	sleep(SECS); stop=1;
	pthread_join(w,NULL); pthread_join(r,NULL); close(burst_fd);
	char mb[24]; maxstr(mb, max);
	printf(" %-22s %-12s %8llu %8llu %10llu %10llu %8llu\n", lbl, mb,
	       (unsigned long long)wcpu, (unsigned long long)rcpu,
	       (unsigned long long)neagain, (unsigned long long)nwrites,
	       (unsigned long long)nreads);
}

/* bench 2: latency tail (EFD_SEMAPHORE) */
static int latency_fd;
static uint64_t wts[LAT_N], rts[LAT_N];

static void *latency_writer(void *_) {
	(void)_;
	uint64_t v=1, next=mono_ns();
	for (int i=0; iy)-(x

97% (5002 ms → 133 ms); latency p999 drops ~60x (142 ms → 2.4 ms);
coalescing batch size is bounded to 9 (vs 127 without a limit), so the
consumer always knows the backlog is small.

O_NONBLOCK+spin bypasses flow control entirely; use poll(POLLOUT)+write to
get the same benefit as a blocking write while still multiplexing other
fds in a single poll(2) call.
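The poll(POLLOUT)+write pattern recommended above might look roughly like
the sketch below. The helper names efd_set_maximum() and
efd_post_when_writable() are hypothetical. The ioctl number is copied from
the uapi header change proposed in this patch, since released headers do
not define it; on a kernel without this patch the ioctl is expected to
fail, and the wrapper simply reports the error.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Assumption: mirrors the definition proposed in this series,
 * _IOW('J', 0, __u64). uint64_t has the same size as __u64.
 */
#ifndef EFD_IOC_SET_MAXIMUM
#define EFD_IOC_SET_MAXIMUM _IOW('J', 0, uint64_t)
#endif

/* Hypothetical wrapper: returns 0 on success, -errno on failure. */
int efd_set_maximum(int fd, uint64_t max)
{
	return ioctl(fd, EFD_IOC_SET_MAXIMUM, &max) == 0 ? 0 : -errno;
}

/*
 * Hypothetical helper: wait up to timeout_ms for POLLOUT, then post one
 * event. Returns 0 on success, -EAGAIN if the fd did not become writable
 * in time, or -errno if poll(2) or write(2) fails.
 */
int efd_post_when_writable(int fd, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = POLLOUT };
	uint64_t one = 1;
	int n = poll(&p, 1, timeout_ms);

	if (n < 0)
		return -errno;
	if (n == 0 || !(p.revents & POLLOUT))
		return -EAGAIN;	/* consumer is behind; caller may retry */
	if (write(fd, &one, sizeof(one)) < 0)
		return -errno;
	return 0;
}
```

In a real event loop the pollfd for the eventfd would sit in the same
array as the other descriptors, so the producer sleeps in one poll(2) call
instead of spinning on EAGAIN.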
Signed-off-by: Wen Yang
Cc: Christian Brauner
Cc: Jan Kara
Cc: Alexander Viro
Cc: Jens Axboe
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 .../userspace-api/ioctl/ioctl-number.rst |  1 +
 fs/eventfd.c                              | 74 ++++++++++++++++---
 include/uapi/linux/eventfd.h              |  6 ++
 3 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 331223761fff..d233559179b1 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -170,6 +170,7 @@ Code  Seq#    Include File                                           Comments
 'I'   all    linux/isdn.h                                              conflict!
 'I'   00-0F  drivers/isdn/divert/isdn_divert.h                         conflict!
 'I'   40-4F  linux/mISDNif.h                                           conflict!
+'J'   00-01  linux/eventfd.h                                           eventfd ioctl
 'K'   all    linux/kd.h
 'L'   00-1F  linux/loop.h                                              conflict!
 'L'   10-1F  drivers/scsi/mpt3sas/mpt3sas_ctl.h                        conflict!
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3219e0d596fe..11985d07e904 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -39,6 +39,7 @@ struct eventfd_ctx {
 	 * also, adds to the "count" counter and issue a wakeup.
 	 */
 	__u64 count;
+	__u64 maximum;
 	unsigned int flags;
 	int id;
 };
@@ -49,9 +50,9 @@ struct eventfd_ctx {
  * @mask: [in] poll mask
  *
  * This function is supposed to be called by the kernel in paths that do not
- * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
- * value, and we signal this as overflow condition by returning a EPOLLERR
- * to poll(2).
+ * allow sleeping. In this function we allow the counter to reach the maximum
+ * value (ctx->maximum), and we signal this as overflow condition by returning
+ * a EPOLLERR to poll(2).
  */
 void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
 {
@@ -70,7 +71,7 @@ void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	current->in_eventfd = 1;
-	if (ctx->count < ULLONG_MAX)
+	if (ctx->count < ctx->maximum)
 		ctx->count++;
 	if (waitqueue_active(&ctx->wqh))
 		wake_up_locked_poll(&ctx->wqh, EPOLLIN | mask);
@@ -119,7 +120,7 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 	__poll_t events = 0;
-	u64 count;
+	u64 count, max;
 
 	poll_wait(file, &ctx->wqh, wait);
 
@@ -162,12 +163,13 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 	 * eventfd_poll returns 0
 	 */
 	count = READ_ONCE(ctx->count);
+	max = READ_ONCE(ctx->maximum);
 
 	if (count > 0)
 		events |= EPOLLIN;
-	if (count == ULLONG_MAX)
+	if (count == max)
 		events |= EPOLLERR;
-	if (ULLONG_MAX - 1 > count)
+	if (max - 1 > count)
 		events |= EPOLLOUT;
 
 	return events;
@@ -244,6 +246,11 @@ static ssize_t eventfd_read(struct kiocb *iocb, struct iov_iter *to)
 	return sizeof(ucnt);
 }
 
+static inline bool eventfd_is_writable(struct eventfd_ctx *ctx, __u64 cnt)
+{
+	return ctx->maximum > ctx->count && ctx->maximum - ctx->count > cnt;
+}
+
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
 {
@@ -259,11 +266,11 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
 	res = -EAGAIN;
-	if (ULLONG_MAX - ctx->count > ucnt)
+	if (eventfd_is_writable(ctx, ucnt))
 		res = sizeof(ucnt);
 	else if (!(file->f_flags & O_NONBLOCK)) {
 		res = wait_event_interruptible_locked_irq(ctx->wqh,
-				ULLONG_MAX - ctx->count > ucnt);
+				eventfd_is_writable(ctx, ucnt));
 		if (!res)
 			res = sizeof(ucnt);
 	}
@@ -283,22 +290,62 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
 	struct eventfd_ctx *ctx = f->private_data;
-	__u64 cnt;
+	__u64 cnt, max;
 
 	spin_lock_irq(&ctx->wqh.lock);
 	cnt = ctx->count;
+	max = ctx->maximum;
 	spin_unlock_irq(&ctx->wqh.lock);
 
 	seq_printf(m,
 		   "eventfd-count: %16llx\n"
 		   "eventfd-id: %d\n"
-		   "eventfd-semaphore: %d\n",
+		   "eventfd-semaphore: %d\n"
+		   "eventfd-maximum: %16llx\n",
 		   cnt,
 		   ctx->id,
-		   !!(ctx->flags & EFD_SEMAPHORE));
+		   !!(ctx->flags & EFD_SEMAPHORE),
+		   max);
 }
 #endif
 
+static long eventfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	void __user *argp = (void __user *)arg;
+	__u64 max;
+	int ret;
+
+	switch (cmd) {
+	case EFD_IOC_SET_MAXIMUM:
+		if (copy_from_user(&max, argp, sizeof(max)))
+			return -EFAULT;
+
+		spin_lock_irq(&ctx->wqh.lock);
+		if (ctx->count >= max) {
+			ret = -EINVAL;
+		} else {
+			ctx->maximum = max;
+			ret = 0;
+			/* wake blocked writers that may now fit within the new maximum */
+			if (waitqueue_active(&ctx->wqh))
+				wake_up_locked_poll(&ctx->wqh, EPOLLOUT);
+		}
+		spin_unlock_irq(&ctx->wqh.lock);
+		return ret;
+
+	case EFD_IOC_GET_MAXIMUM:
+		spin_lock_irq(&ctx->wqh.lock);
+		max = ctx->maximum;
+		spin_unlock_irq(&ctx->wqh.lock);
+
+		return copy_to_user(argp, &max, sizeof(max)) ? -EFAULT : 0;
+
+	default:
+		return -ENOTTY;
+	}
+}
+
 static const struct file_operations eventfd_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= eventfd_show_fdinfo,
@@ -307,6 +354,8 @@ static const struct file_operations eventfd_fops = {
 	.poll		= eventfd_poll,
 	.read_iter	= eventfd_read,
 	.write		= eventfd_write,
+	.unlocked_ioctl	= eventfd_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 	.llseek		= noop_llseek,
 };
 
@@ -395,6 +444,7 @@ static int do_eventfd(unsigned int count, int flags)
 	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
 	ctx->count = count;
+	ctx->maximum = ULLONG_MAX;
 	ctx->flags = flags;
 
 	flags &= EFD_SHARED_FCNTL_FLAGS;
diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h
index 2eb9ab6c32f3..ba46b746f597 100644
--- a/include/uapi/linux/eventfd.h
+++ b/include/uapi/linux/eventfd.h
@@ -3,9 +3,15 @@
 #define _UAPI_LINUX_EVENTFD_H
 
 #include
+#include
+#include
 
 #define EFD_SEMAPHORE (1 << 0)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
+/* Flow-control ioctls: configure the per-fd counter maximum. */
+#define EFD_IOC_SET_MAXIMUM	_IOW('J', 0, __u64)
+#define EFD_IOC_GET_MAXIMUM	_IOR('J', 1, __u64)
+
 #endif /* _UAPI_LINUX_EVENTFD_H */
-- 
2.25.1

From: wen.yang@linux.dev
To: Christian Brauner, Jan Kara, Alexander Viro
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Wen Yang
Subject: [RFC PATCH v5 2/2] selftests/eventfd: add EFD_IOC_{SET,GET}_MAXIMUM tests
Date: Thu, 9 Apr 2026 01:24:49 +0800
Message-Id: <183b5d6987f8a23b5912c3f8a8e618ef70f2fbd4.1775668339.git.wen.yang@linux.dev>

From: Wen Yang

Add correctness tests for the flow-control ioctls introduced in the
preceding commit. Cover the GET/SET round-trip, EINVAL when the requested
maximum does not exceed the current counter, EAGAIN on an O_NONBLOCK fd
when a write would reach the configured maximum, EPOLLOUT gating at
maximum-1, /proc/self/fdinfo exposure of the "eventfd-maximum" field, and
ENOTTY for unrecognised ioctl commands.
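The boundary conditions listed above follow from two small predicates in
patch 1/2. As a rough aid for reasoning about them, here is a userspace
model; the names model_is_writable() and model_poll_mask() are
hypothetical, and the POLL* constants stand in for the kernel's EPOLL*
bits. It mirrors eventfd_is_writable() and the checks at the end of
eventfd_poll(), but is not the kernel code itself.

```c
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of eventfd_is_writable() from patch 1/2. */
bool model_is_writable(uint64_t count, uint64_t maximum, uint64_t add)
{
	/* A write is accepted only if count + add stays below maximum. */
	return maximum > count && maximum - count > add;
}

/* Userspace model of the event mask computed in eventfd_poll(). */
short model_poll_mask(uint64_t count, uint64_t maximum)
{
	short events = 0;

	if (count > 0)
		events |= POLLIN;	/* something to read */
	if (count == maximum)
		events |= POLLERR;	/* overflow level reached */
	if (maximum - 1 > count)
		events |= POLLOUT;	/* a write of at least 1 would fit */
	return events;
}
```

With maximum = 5, a write of 4 into an empty counter is accepted, while a
further write of 1 at count 4 is not, and the model reports POLLIN without
POLLOUT at that point. These are the cases the EAGAIN and EPOLLOUT tests
exercise.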
Signed-off-by: Wen Yang
Cc: Christian Brauner
Cc: Jan Kara
Cc: Alexander Viro
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 .../filesystems/eventfd/eventfd_test.c        | 238 +++++++++++++++++-
 1 file changed, 237 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/filesystems/eventfd/eventfd_test.c b/tools/testing/selftests/filesystems/eventfd/eventfd_test.c
index 1b48f267157d..9e33780f5330 100644
--- a/tools/testing/selftests/filesystems/eventfd/eventfd_test.c
+++ b/tools/testing/selftests/filesystems/eventfd/eventfd_test.c
@@ -5,12 +5,22 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
 #include
 #include
 #include
-#include
+#include
+/*
+ * Prevent (pulled in via ->
+ * -> ) from redefining struct flock and
+ * friends that are already provided by the system above.
+ */
+#define _ASM_GENERIC_FCNTL_H
+#include
 #include "kselftest_harness.h"
 
 #define EVENTFD_TEST_ITERATIONS 100000UL
@@ -308,4 +318,230 @@ TEST(eventfd_check_read_with_semaphore)
 	close(fd);
 }
 
+/*
+ * The default maximum is ULLONG_MAX, matching the original behaviour.
+ */
+TEST(eventfd_check_ioctl_get_maximum_default)
+{
+	uint64_t max;
+	int fd, ret;
+
+	fd = sys_eventfd2(0, EFD_NONBLOCK);
+	ASSERT_GE(fd, 0);
+
+	ret = ioctl(fd, EFD_IOC_GET_MAXIMUM, &max);
+	EXPECT_EQ(ret, 0);
+	EXPECT_EQ(max, UINT64_MAX);
+
+	close(fd);
+}
+
+/*
+ * EFD_IOC_SET_MAXIMUM and EFD_IOC_GET_MAXIMUM round-trip.
+ */
+TEST(eventfd_check_ioctl_set_get_maximum)
+{
+	uint64_t max;
+	int fd, ret;
+
+	fd = sys_eventfd2(0, EFD_NONBLOCK);
+	ASSERT_GE(fd, 0);
+
+	max = 1000;
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max);
+	EXPECT_EQ(ret, 0);
+
+	max = 0;
+	ret = ioctl(fd, EFD_IOC_GET_MAXIMUM, &max);
+	EXPECT_EQ(ret, 0);
+	EXPECT_EQ(max, 1000);
+
+	close(fd);
+}
+
+/*
+ * Setting a maximum that is less than or equal to the current counter
+ * must fail with EINVAL.
+ */
+TEST(eventfd_check_ioctl_set_maximum_invalid)
+{
+	uint64_t value = 5, max;
+	ssize_t size;
+	int fd, ret;
+
+	fd = sys_eventfd2(0, EFD_NONBLOCK);
+	ASSERT_GE(fd, 0);
+
+	/* write 5 into the counter */
+	size = write(fd, &value, sizeof(value));
+	EXPECT_EQ(size, (ssize_t)sizeof(value));
+
+	/* setting maximum == count (5) must fail */
+	max = 5;
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/* setting maximum < count (3 < 5) must also fail */
+	max = 3;
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/* setting maximum > count (10 > 5) must succeed */
+	max = 10;
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max);
+	EXPECT_EQ(ret, 0);
+
+	close(fd);
+}
+
+/*
+ * Writes that would push the counter to or beyond maximum must return
+ * EAGAIN on a non-blocking fd. After a read drains the counter the
+ * write should succeed again.
+ */
+TEST(eventfd_check_ioctl_write_blocked_at_maximum)
+{
+	uint64_t value, max_val = 5;
+	ssize_t size;
+	int fd, ret;
+
+	fd = sys_eventfd2(0, EFD_NONBLOCK);
+	ASSERT_GE(fd, 0);
+
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max_val);
+	ASSERT_EQ(ret, 0);
+
+	/* write 4 - counter becomes 4, one slot before maximum */
+	value = 4;
+	size = write(fd, &value, sizeof(value));
+	EXPECT_EQ(size, (ssize_t)sizeof(value));
+
+	/*
+	 * Writing 1 more would reach maximum (4+1 == 5 == maximum), which
+	 * is the overflow level. The write must block, i.e. return EAGAIN
+	 * in non-blocking mode.
+	 */
+	value = 1;
+	size = write(fd, &value, sizeof(value));
+	EXPECT_EQ(size, -1);
+	EXPECT_EQ(errno, EAGAIN);
+
+	/* drain the counter */
+	size = read(fd, &value, sizeof(value));
+	EXPECT_EQ(size, (ssize_t)sizeof(value));
+	EXPECT_EQ(value, 4);
+
+	/* now the write must succeed (counter was reset to 0) */
+	value = 1;
+	size = write(fd, &value, sizeof(value));
+	EXPECT_EQ(size, (ssize_t)sizeof(value));
+
+	close(fd);
+}
+
+/*
+ * Verify that EPOLLOUT is correctly gated by the configured maximum:
+ * it should be clear when count >= maximum - 1, and set again after a read.
+ */
+TEST(eventfd_check_ioctl_poll_epollout)
+{
+	struct epoll_event ev, events[2];
+	uint64_t value, max_val = 5;
+	ssize_t sz;
+	int fd, epfd, nfds, ret;
+
+	fd = sys_eventfd2(0, EFD_NONBLOCK);
+	ASSERT_GE(fd, 0);
+
+	epfd = epoll_create1(0);
+	ASSERT_GE(epfd, 0);
+
+	ev.events = EPOLLIN | EPOLLOUT | EPOLLERR;
+	ev.data.fd = fd;
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
+	ASSERT_EQ(ret, 0);
+
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max_val);
+	ASSERT_EQ(ret, 0);
+
+	/* fresh fd: EPOLLOUT must be set (count=0 < maximum-1=4) */
+	nfds = epoll_wait(epfd, events, 2, 0);
+	EXPECT_EQ(nfds, 1);
+	EXPECT_TRUE(!!(events[0].events & EPOLLOUT));
+
+	/* write 4 - count reaches maximum-1=4, EPOLLOUT must clear */
+	value = 4;
+	sz = write(fd, &value, sizeof(value));
+	EXPECT_EQ(sz, (ssize_t)sizeof(value));
+
+	nfds = epoll_wait(epfd, events, 2, 0);
+	EXPECT_EQ(nfds, 1);
+	EXPECT_FALSE(!!(events[0].events & EPOLLOUT));
+	EXPECT_TRUE(!!(events[0].events & EPOLLIN));
+
+	/* drain counter - EPOLLOUT must reappear */
+	sz = read(fd, &value, sizeof(value));
+	EXPECT_EQ(sz, (ssize_t)sizeof(value));
+
+	nfds = epoll_wait(epfd, events, 2, 0);
+	EXPECT_EQ(nfds, 1);
+	EXPECT_TRUE(!!(events[0].events & EPOLLOUT));
+
+	close(epfd);
+	close(fd);
+}
+
+/*
+ * /proc/self/fdinfo must expose the configured maximum.
+ */
+TEST(eventfd_check_fdinfo_maximum)
+{
+	struct error err = {0};
+	uint64_t max_val = 12345;
+	int fd, ret;
+
+	fd = sys_eventfd2(0, 0);
+	ASSERT_GE(fd, 0);
+
+	/* before setting: default should be ULLONG_MAX */
+	ret = verify_fdinfo(fd, &err, "eventfd-maximum: ", 17,
+			    "%16llx\n", (unsigned long long)UINT64_MAX);
+	if (ret != 0)
+		ksft_print_msg("eventfd-maximum default check failed: %s\n",
+			       err.msg);
+	EXPECT_EQ(ret, 0);
+
+	ret = ioctl(fd, EFD_IOC_SET_MAXIMUM, &max_val);
+	ASSERT_EQ(ret, 0);
+
+	memset(&err, 0, sizeof(err));
+	ret = verify_fdinfo(fd, &err, "eventfd-maximum: ", 17,
+			    "%16llx\n", (unsigned long long)max_val);
+	if (ret != 0)
+		ksft_print_msg("eventfd-maximum after set check failed: %s\n",
+			       err.msg);
+	EXPECT_EQ(ret, 0);
+
+	close(fd);
+}
+
+/*
+ * An unrecognised ioctl must return ENOTTY (not EINVAL or ENOENT).
+ */
+TEST(eventfd_check_ioctl_unknown)
+{
+	int fd, ret;
+
+	fd = sys_eventfd2(0, 0);
+	ASSERT_GE(fd, 0);
+
+	ret = ioctl(fd, _IO('J', 0xff));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ENOTTY);
+
+	close(fd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.25.1
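
The fdinfo test above relies on the selftest's own verify_fdinfo() helper.
Outside kselftest, a field can be read back with something like the sketch
below. The names fdinfo_hex_field() and eventfd_count_via_fdinfo() are
hypothetical; the example looks up "eventfd-count:", which existing
kernels already print, and the same lookup with the key
"eventfd-maximum:" would only succeed on a kernel carrying this series.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: find the line starting with `key` in
 * /proc/self/fdinfo/<fd> and parse the hexadecimal value after it.
 * Returns 0 and stores the value in *out on success, -1 if the file
 * cannot be opened, the key is absent, or the value does not parse.
 */
int fdinfo_hex_field(int fd, const char *key, uint64_t *out)
{
	char path[64], line[256];
	size_t klen = strlen(key);
	int ret = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, klen) == 0) {
			/* Fields are printed with %16llx: space-padded hex. */
			if (sscanf(line + klen, "%" SCNx64, out) == 1)
				ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Hypothetical demo: create an eventfd and read its counter via fdinfo. */
int eventfd_count_via_fdinfo(unsigned int initval, uint64_t *out)
{
	int fd = eventfd(initval, 0);
	int ret;

	if (fd < 0)
		return -1;
	ret = fdinfo_hex_field(fd, "eventfd-count:", out);
	close(fd);
	return ret;
}
```

Returning -1 for a missing key lets a caller treat the absence of
"eventfd-maximum:" as "kernel does not support the ioctl" rather than as a
hard error.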