From: wen.yang@linux.dev
To: Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wen Yang <wen.yang@linux.dev>
Subject: [RFC PATCH v5 0/2] eventfd: add configurable maximum counter value for flow control
Date: Thu, 9 Apr 2026 01:24:47 +0800
Message-ID: <cover.1775668339.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
eventfd's counter is bounded only by ULLONG_MAX (~1.8x10^19). In
non-semaphore mode a fast producer can write continuously while a slow
consumer falls behind: the producer never stalls, the counter grows
without limit, both sides burn CPU at 100%, and consumer lag is
invisible. There is no mechanism to apply back-pressure.
Add EFD_IOC_SET_MAXIMUM and EFD_IOC_GET_MAXIMUM ioctl commands that
set and query a configurable overflow threshold. A write(2) that would
push the counter to or beyond maximum blocks, or fails with EAGAIN on
O_NONBLOCK fds. The kernel-internal eventfd_signal() path may still
raise the counter to maximum (reported as EPOLLERR), preserving the
original overflow semantics. The default is ULLONG_MAX, preserving
backward compatibility.
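
A minimal userspace sketch of the intended usage, under stated
assumptions: only the magic 'J' and the command numbers 0/1 are given in
this text, so the macro encodings below (SET = 0, GET = 1, a 64-bit value
passed by pointer) and the helper name eventfd_with_maximum() are
illustrative guesses. Real code should take the definitions from
include/uapi/linux/eventfd.h as modified by patch 1/2.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * ASSUMPTION: encodings guessed from "magic 'J', command numbers 0/1".
 * Which number is SET, and the argument type, are not stated here.
 */
#ifndef EFD_IOC_SET_MAXIMUM
#define EFD_IOC_SET_MAXIMUM _IOW('J', 0, uint64_t)
#define EFD_IOC_GET_MAXIMUM _IOR('J', 1, uint64_t)
#endif

/*
 * Hypothetical helper: create a non-blocking eventfd and try to cap its
 * counter. Returns the fd, or -1 if eventfd() fails. *capped is set to 1
 * when the kernel accepted the limit and to 0 when the ioctl failed.
 */
static int eventfd_with_maximum(uint64_t maximum, int *capped)
{
	int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

	if (fd < 0)
		return -1;
	*capped = (ioctl(fd, EFD_IOC_SET_MAXIMUM, &maximum) == 0);
	return fd;
}
```

On a kernel without this series the ioctl should simply fail (ENOTTY is
the expected errno), so *capped stays 0 and the fd keeps the ULLONG_MAX
behaviour.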
This follows the back-pressure pattern already established in the
kernel: pipe(2) writers block when the buffer is full, and the capacity
is tunable via fcntl(F_SETPIPE_SZ); mq_send(3) blocks when the queue
depth reaches mq_maxmsg. EFD_IOC_SET_MAXIMUM applies the same
pattern to eventfd.
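
For comparison, the existing pipe behaviour can be observed with a short
sketch like the one below. The helper names are hypothetical; the
fallback definition of F_SETPIPE_SZ is only there so the snippet builds
without _GNU_SOURCE.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031	/* Linux value, normally exposed by _GNU_SOURCE */
#endif

/* Write 4 KiB chunks until the pipe pushes back with EAGAIN.
 * Returns the number of bytes accepted, or -1 on any other error. */
static long fill_until_eagain(int wfd)
{
	static const char chunk[4096];
	long total = 0;

	for (;;) {
		ssize_t n = write(wfd, chunk, sizeof(chunk));

		if (n > 0)
			total += n;
		else if (n < 0 && errno == EAGAIN)
			return total;
		else if (n < 0 && errno == EINTR)
			continue;
		else
			return -1;
	}
}

/* Create a pipe, request a capacity, and measure how much data fits
 * before a non-blocking writer sees EAGAIN. *actual receives the
 * capacity the kernel really set (it may round the request up). */
static long probe_pipe_backpressure(int requested, int *actual)
{
	int fds[2];
	long filled;

	if (pipe(fds) < 0)
		return -1;
	*actual = fcntl(fds[1], F_SETPIPE_SZ, requested);
	if (*actual < 0 || fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	filled = fill_until_eagain(fds[1]);
	close(fds[0]);
	close(fds[1]);
	return filled;
}
```

The writer is stopped once the tunable capacity is used up, which is the
behaviour the new eventfd maximum is meant to mirror.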
Measured on a 4-core x86_64 machine with writer and reader pinned to
separate CPUs; the reader sleeps 1 ms between reads to simulate
processing time:

Bench 1 - burst/CPU (5 s, blocking write)

  maximum      wcpu_ms  rcpu_ms  EAGAIN   writes  reads
  --------------------------------------------------------------
  ULLONG_MAX      5002      132       0  6517388   4506
  10               133      150       0    40456   4496

(O_NONBLOCK+spin bypasses flow control; use O_NONBLOCK+poll(POLLOUT)
to avoid wasting CPU on EAGAIN retries while still multiplexing fds)
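
A sketch of the O_NONBLOCK+poll(POLLOUT) writer pattern mentioned above.
The helper name eventfd_write_polled() and the ETIMEDOUT convention are
illustrative choices. The code relies only on today's eventfd behaviour;
that POLLOUT tracks the configured maximum once this series is applied is
an assumption here.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Add `val` to a non-blocking eventfd. On EAGAIN, sleep in poll(2) until
 * the fd is writable instead of spinning on write(2).
 * Returns 0 on success, -1 on error or timeout (errno = ETIMEDOUT).
 * Note: POLLOUT only promises room for a write of 1, so callers adding
 * larger values may still see repeated EAGAIN.
 */
static int eventfd_write_polled(int efd, uint64_t val, int timeout_ms)
{
	for (;;) {
		struct pollfd pfd = { .fd = efd, .events = POLLOUT };
		ssize_t n = write(efd, &val, sizeof(val));
		int r;

		if (n == (ssize_t)sizeof(val))
			return 0;
		if (n < 0 && errno == EINTR)
			continue;
		if (n >= 0 || errno != EAGAIN)
			return -1;

		r = poll(&pfd, 1, timeout_ms);
		if (r == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (r < 0 && errno != EINTR)
			return -1;
		if (r > 0 && (pfd.revents & (POLLERR | POLLNVAL)))
			return -1;
	}
}
```

The same pollfd can sit in a larger poll set, so the writer keeps
multiplexing other fds while it waits for the consumer to catch up.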

Bench 2 - latency tail (EFD_SEMAPHORE, 10 K/s writer, ~8 K/s reader,
5000 events)

  maximum      p99_us  p999_us  max_us
  ----------------------------------------
  ULLONG_MAX   141218   142477  142588
  10             1719     2378    2381

Bench 3 - coalescing (non-EFD_SEMAPHORE, 10000 writes, 125 us/read
reader; each read drains the full counter)

  maximum      writes  reads  avg_batch
  -----------------------------------------
  ULLONG_MAX    10000     79      126.6
  10            10000   1121        8.9
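
The coalescing that Bench 3 measures is existing eventfd behaviour and
can be reproduced with a small sketch (the helper name reads_to_drain()
is hypothetical): in non-semaphore mode one read(2) returns the whole
accumulated count, while EFD_SEMAPHORE hands out one unit per read.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Post `writes` events of value 1, then count how many read(2) calls are
 * needed to drain the counter. Returns that count, or -1 on error.
 * `flags` is 0 for the default mode or EFD_SEMAPHORE.
 */
static int reads_to_drain(int flags, int writes)
{
	uint64_t one = 1, got;
	int reads = 0;
	int efd = eventfd(0, flags | EFD_NONBLOCK);

	if (efd < 0)
		return -1;
	for (int i = 0; i < writes; i++) {
		if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(efd);
			return -1;
		}
	}
	/* A non-blocking read fails with EAGAIN once the counter is 0. */
	while (read(efd, &got, sizeof(got)) == (ssize_t)sizeof(got))
		reads++;
	close(efd);
	return reads;
}
```

Five writes drain in a single read in the default mode and in five reads
with EFD_SEMAPHORE, which is why an unbounded counter lets the batch size
grow with consumer lag.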
With maximum=10: burst CPU drops >97% (5002 ms -> 133 ms); latency p999
drops ~60x (142 ms -> 2.4 ms); coalescing batch bounded to 9 vs 127,
so the consumer always knows the backlog is small.
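
The summary figures follow directly from the tables; a short sketch of
the arithmetic, with hypothetical helper names:

```c
/* Percentage reduction from `before` to `after`. */
static double percent_drop(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Plain ratio, used for the latency factor and the average batch size. */
static double ratio(double numerator, double denominator)
{
	return numerator / denominator;
}

/*
 * percent_drop(5002, 133)  : writer CPU reduction, Bench 1 (just over 97%)
 * ratio(142477, 2378)      : p999 latency factor, Bench 2 (about 60x)
 * ratio(10000, 79)         : avg_batch at ULLONG_MAX, Bench 3 (about 126.6)
 * ratio(10000, 1121)       : avg_batch at maximum=10, Bench 3 (about 8.9)
 */
```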

Notes:

- Magic 'J': 'E' conflicts with linux/input.h and xen/evtchn.h; 'J' was
  unregistered and is now added to ioctl-number.rst.
- Command numbers 0/1: explicit distinct numbers are clearer than
  relying solely on direction bits to disambiguate SET from GET.
- .compat_ioctl = compat_ptr_ioctl handles 32-bit user pointers.
- Writers woken on SET_MAXIMUM: a raised limit takes effect immediately
  without waiting for the next read(2).
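
To illustrate the point about command numbers versus direction bits, the
sketch below decodes two ioctl numbers with the standard _IOC_* macros.
The DEMO_* definitions are assumptions (SET = 0 as a write, GET = 1 as a
read, __u64 argument), not the encodings from the patch.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

/* ASSUMPTION: illustrative encodings only; see the uapi header in 1/2. */
#define DEMO_EFD_IOC_SET_MAXIMUM _IOW('J', 0, __u64)
#define DEMO_EFD_IOC_GET_MAXIMUM _IOR('J', 1, __u64)

struct ioctl_fields {
	unsigned int dir;	/* _IOC_READ / _IOC_WRITE direction bits */
	unsigned int type;	/* magic character */
	unsigned int nr;	/* command number */
	unsigned int size;	/* size of the argument in bytes */
};

/* Split an ioctl request number into its four encoded fields. */
static struct ioctl_fields decode_ioctl(unsigned long cmd)
{
	struct ioctl_fields f = {
		.dir  = _IOC_DIR(cmd),
		.type = _IOC_TYPE(cmd),
		.nr   = _IOC_NR(cmd),
		.size = _IOC_SIZE(cmd),
	};
	return f;
}
```

With distinct numbers the two commands differ in the nr field as well as
in the direction bits, so they stay distinguishable even if a caller or
tracer looks only at the magic and number.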
Changes since v4
(https://lore.kernel.org/all/20250310051832.5658-1-wen.yang@linux.dev/)
- Use ioctl magic 'J' instead of 'E' (conflict with input.h/xen).
- Add .compat_ioctl = compat_ptr_ioctl.
- Expose eventfd-maximum in /proc/self/fdinfo.
- Return -ENOTTY for unrecognised ioctl commands (was -ENOENT).
- Remove the unnecessary !argp guard in eventfd_ioctl().
- Register magic 'J' in Documentation/userspace-api/ioctl/ioctl-number.rst.
- Add kselftest correctness tests.

Wen Yang (2):
  eventfd: add configurable per-fd counter maximum for flow control
  selftests/eventfd: add EFD_IOC_{SET,GET}_MAXIMUM tests

 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 fs/eventfd.c                                  |  74 +++++-
 include/uapi/linux/eventfd.h                  |   6 +
 .../filesystems/eventfd/eventfd_test.c        | 238 +++++++++++++++++-
 4 files changed, 306 insertions(+), 13 deletions(-)

--
2.25.1