public inbox for linux-kernel@vger.kernel.org
From: wen.yang@linux.dev
To: Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wen Yang <wen.yang@linux.dev>
Subject: [RFC PATCH v5 0/2] eventfd: add configurable maximum counter value for flow control
Date: Thu,  9 Apr 2026 01:24:47 +0800
Message-ID: <cover.1775668339.git.wen.yang@linux.dev>

From: Wen Yang <wen.yang@linux.dev>

eventfd's counter is bounded only by ULLONG_MAX (~1.8x10^19). In
non-semaphore mode a fast producer can write continuously while a slow
consumer falls behind: the producer never stalls, the counter grows
without limit, both sides burn CPU at 100%, and consumer lag is
invisible. There is no mechanism to apply back-pressure.

Add EFD_IOC_SET_MAXIMUM and EFD_IOC_GET_MAXIMUM ioctl commands that
set a configurable overflow threshold. A write(2) that would push the
counter to or beyond maximum blocks (EAGAIN for O_NONBLOCK fds). The
kernel-internal eventfd_signal() path may still raise the counter to
maximum, which is reported as EPOLLERR, preserving the original
overflow semantics. The default maximum is ULLONG_MAX, preserving
backward compatibility.
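
A minimal userspace sketch of how the two ioctls might be used. This
cover letter does not quote the uapi definitions, so the encodings
below (magic 'J', SET as command 0, GET as command 1, a 64-bit value
passed by pointer) are assumptions for illustration only; the real
definitions are whatever patch 1/2 adds to include/uapi/linux/eventfd.h.
The helper names efd_set_maximum(), efd_get_maximum() and
efd_create_bounded() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * ASSUMPTION: these encodings are guessed from the notes in this cover
 * letter (magic 'J', command numbers 0 and 1). The direction bits, the
 * SET/GET numbering and the argument type are not confirmed here; use
 * the definitions from the patched <linux/eventfd.h> when available.
 */
#ifndef EFD_IOC_SET_MAXIMUM
#define EFD_IOC_SET_MAXIMUM _IOW('J', 0, uint64_t)
#define EFD_IOC_GET_MAXIMUM _IOR('J', 1, uint64_t)
#endif

/* Set the per-fd maximum. Returns 0 on success, -1 with errno set. */
static int efd_set_maximum(int fd, uint64_t max)
{
	return ioctl(fd, EFD_IOC_SET_MAXIMUM, &max);
}

/* Read back the per-fd maximum. Returns 0 on success, -1 with errno set. */
static int efd_get_maximum(int fd, uint64_t *max)
{
	return ioctl(fd, EFD_IOC_GET_MAXIMUM, max);
}

/*
 * Create a non-blocking eventfd and cap its counter at `max`.
 * Returns the fd, or -1 with errno set. On a kernel without this
 * series the ioctl is expected to fail and the fd is closed again.
 */
static int efd_create_bounded(uint64_t max)
{
	int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (efd_set_maximum(fd, max) < 0) {
		int saved = errno;	/* keep the ioctl error across close() */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A caller that must also run on unpatched kernels can treat a failure of
efd_create_bounded() as "flow control unavailable" and fall back to a
plain eventfd(2).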

This follows the back-pressure pattern already established in the
kernel: pipe(2) writers block when the buffer is full, and the capacity
is tunable via fcntl(F_SETPIPE_SZ); mq_send(3) blocks when the queue
depth reaches mq_maxmsg. EFD_IOC_SET_MAXIMUM applies the same pattern
to eventfd.
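
For comparison, the existing pipe(2) behavior referred to above can be
sketched as follows. It uses only current interfaces (pipe2(2),
fcntl(F_SETPIPE_SZ)); the helper names make_bounded_pipe() and
fill_pipe_until_eagain() are hypothetical, and the 4096-byte chunk size
is an arbitrary choice.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* pipe2(), F_SETPIPE_SZ */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create a pipe with both ends non-blocking and request `capacity`
 * bytes of buffer. Returns the capacity the kernel actually set
 * (it may round up), or -1 with errno set.
 */
static int make_bounded_pipe(int fds[2], int capacity)
{
	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;
	return fcntl(fds[1], F_SETPIPE_SZ, capacity);
}

/*
 * Write until the pipe pushes back with EAGAIN.
 * Returns the number of bytes accepted, or -1 on any other error.
 */
static long fill_pipe_until_eagain(int wfd)
{
	char buf[4096] = { 0 };
	long total = 0;

	for (;;) {
		ssize_t n = write(wfd, buf, sizeof(buf));

		if (n > 0) {
			total += n;
			continue;
		}
		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0 && errno == EAGAIN)
			return total;	/* buffer full: writer must wait */
		return -1;
	}
}
```

The writer is stopped by the kernel once the configured capacity is
used up and can proceed only after the reader drains data, which is the
behavior this series proposes for eventfd counters.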

Measured on a 4-core x86_64 machine with writer and reader pinned to
separate CPUs; the reader sleeps 1 ms between reads to simulate
processing time:

  Bench 1 - burst/CPU (5 s, blocking write)
  maximum      wcpu_ms  rcpu_ms      EAGAIN      writes    reads
  --------------------------------------------------------------
  ULLONG_MAX      5002      132           0     6517388     4506
  10               133      150           0       40456     4496
  (O_NONBLOCK+spin bypasses flow control; use O_NONBLOCK+poll(POLLOUT)
   to avoid wasting CPU on EAGAIN retries while still multiplexing fds)

  Bench 2 - latency tail (EFD_SEMAPHORE, 10 K/s writer, ~8 K/s reader,
            5000 events)
  maximum      p99_us   p999_us    max_us
  ----------------------------------------
  ULLONG_MAX   141218   142477    142588
  10             1719     2378      2381

  Bench 3 - coalescing (non-EFD_SEMAPHORE, 10000 writes, 125 us/read
            reader; each read drains the full counter)
  maximum      writes    reads   avg_batch
  -----------------------------------------
  ULLONG_MAX    10000       79       126.6
  10            10000     1121         8.9

With maximum=10: burst CPU drops >97% (5002 ms -> 133 ms); latency p999
drops ~60x (142 ms -> 2.4 ms); the average coalescing batch shrinks
from ~127 to ~9, so the consumer always knows the backlog is small.
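
The O_NONBLOCK+poll(POLLOUT) pattern recommended in the Bench 1 note
can be sketched as below. The sketch uses only existing eventfd(2) and
poll(2) behavior, so on current kernels it can only be exercised at the
built-in overflow limit; that POLLOUT tracks a lower configured maximum
in the same way is an assumption about this series, not something shown
here. The helper name efd_signal_wait() is hypothetical. It always adds
1, because POLLOUT on an eventfd only promises that a write of 1 will
not block.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Add 1 to a non-blocking eventfd. If the write would block, sleep in
 * poll(POLLOUT) instead of spinning on EAGAIN, then retry.
 * timeout_ms applies to each poll() call, not to the whole operation.
 * Returns 0 on success, -1 with errno set (ETIMEDOUT on timeout).
 */
static int efd_signal_wait(int fd, int timeout_ms)
{
	const uint64_t one = 1;

	for (;;) {
		ssize_t n = write(fd, &one, sizeof(one));

		if (n == (ssize_t)sizeof(one))
			return 0;
		if (n < 0 && errno == EINTR)
			continue;
		if (n >= 0 || errno != EAGAIN)
			return -1;

		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
			errno = EIO;	/* fd is in an error state; do not spin */
			return -1;
		}
		/* POLLOUT reported: loop around and retry the write. */
	}
}
```

Because the writer sleeps in poll(2), the same loop can be extended to
multiplex several fds without burning CPU on EAGAIN retries.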

Notes:
- Magic 'J': 'E' conflicts with linux/input.h and xen/evtchn.h; 'J' was
  unregistered and is now added to ioctl-number.rst.
- Command numbers 0/1: explicit distinct numbers are clearer than
  relying solely on direction bits to disambiguate SET from GET.
- .compat_ioctl = compat_ptr_ioctl handles 32-bit user pointers.
- Writers woken on SET_MAXIMUM: a raised limit takes effect immediately
  without waiting for the next read(2).

Changes since v4
  (https://lore.kernel.org/all/20250310051832.5658-1-wen.yang@linux.dev/)
- Use ioctl magic 'J' instead of 'E' (conflict with input.h/xen).
- Add .compat_ioctl = compat_ptr_ioctl.
- Expose eventfd-maximum in /proc/self/fdinfo.
- Return -ENOTTY for unrecognised ioctl commands (was -ENOENT).
- Remove the unnecessary !argp guard in eventfd_ioctl().
- Register magic 'J' in Documentation/userspace-api/ioctl/ioctl-number.rst.
- Add kselftest correctness tests.

Wen Yang (2):
  eventfd: add configurable per-fd counter maximum for flow control
  selftests/eventfd: add EFD_IOC_{SET,GET}_MAXIMUM tests

 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 fs/eventfd.c                                  |  74 +++++-
 include/uapi/linux/eventfd.h                  |   6 +
 .../filesystems/eventfd/eventfd_test.c        | 238 +++++++++++++++++-
 4 files changed, 306 insertions(+), 13 deletions(-)

-- 
2.25.1

