From: wen.yang@linux.dev
To: Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wen Yang <wen.yang@linux.dev>
Subject: [RFC PATCH v5 0/2] eventfd: add configurable maximum counter value for flow control
Date: Thu,  9 Apr 2026 01:24:47 +0800
Message-ID: <cover.1775668339.git.wen.yang@linux.dev>

From: Wen Yang <wen.yang@linux.dev>

eventfd's counter is bounded only by ULLONG_MAX (~1.8x10^19). In
non-semaphore mode a fast producer can write continuously while a slow
consumer falls behind: the producer never stalls, the counter grows
without limit, both sides burn CPU at 100%, and consumer lag is
invisible. There is no mechanism to apply back-pressure.

Add EFD_IOC_SET_MAXIMUM and EFD_IOC_GET_MAXIMUM ioctl commands that
set and query a configurable overflow threshold. A write(2) that would
push the counter to or beyond the maximum blocks, or fails with EAGAIN
on O_NONBLOCK fds. The kernel-internal eventfd_signal() path may still
raise the counter to the maximum, which is reported as EPOLLERR,
preserving the original overflow semantics. The default maximum is
ULLONG_MAX, which preserves backward compatibility.
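
For illustration, a minimal userspace sketch of how the two ioctls might
be called. The request encodings below are assumptions inferred from the
notes in this letter (magic 'J', command numbers 0 and 1, a 64-bit
argument); the authoritative definitions are the ones in the
include/uapi/linux/eventfd.h hunk of patch 1/2. The helper names
efd_set_maximum() and efd_get_maximum() are hypothetical. On a kernel
without this series the calls are expected to fail with ENOTTY, which the
helpers report separately from other errors.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * ASSUMPTION: these encodings are guessed from the cover letter
 * (magic 'J', nr 0 = SET, nr 1 = GET, 64-bit payload). Prefer the
 * definitions from the patched <linux/eventfd.h> when available.
 */
#ifndef EFD_IOC_SET_MAXIMUM
#define EFD_IOC_SET_MAXIMUM _IOW('J', 0, uint64_t)
#define EFD_IOC_GET_MAXIMUM _IOR('J', 1, uint64_t)
#endif

/* Returns 0 on success, 1 if the kernel lacks the ioctl, -1 otherwise. */
int efd_set_maximum(int fd, uint64_t maximum)
{
	if (ioctl(fd, EFD_IOC_SET_MAXIMUM, &maximum) == 0)
		return 0;
	return errno == ENOTTY ? 1 : -1;
}

/* Same return convention; *maximum is written only on success. */
int efd_get_maximum(int fd, uint64_t *maximum)
{
	assert(maximum != NULL);
	if (ioctl(fd, EFD_IOC_GET_MAXIMUM, maximum) == 0)
		return 0;
	return errno == ENOTTY ? 1 : -1;
}
```

A caller would typically create the fd with eventfd(0, EFD_NONBLOCK),
call efd_set_maximum(fd, 10), and fall back to unbounded behaviour when
the helper returns 1.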

This follows the back-pressure pattern already established in the
kernel: pipe(2) writers block when the buffer is full, and the capacity
is tunable via fcntl(F_SETPIPE_SZ); mq_send(3) blocks when the queue
depth reaches mq_maxmsg. EFD_IOC_SET_MAXIMUM applies the same
pattern to eventfd.
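
The pipe(2) precedent can be observed with existing interfaces. The
sketch below is not part of the series; the function name
pipe_fill_until_eagain() is hypothetical. It shrinks a pipe with
F_SETPIPE_SZ, makes the write end non-blocking, and writes until the
kernel pushes back with EAGAIN, returning the number of bytes accepted.

```c
#define _GNU_SOURCE		/* for F_SETPIPE_SZ */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the bytes accepted before EAGAIN, or -1 on error. */
long pipe_fill_until_eagain(int capacity_hint)
{
	char chunk[4096] = { 0 };
	long total = 0;
	int p[2];

	assert(capacity_hint > 0);
	if (pipe(p) < 0)
		return -1;

	/* Best effort: the kernel rounds the request up to whole pages. */
	(void)fcntl(p[1], F_SETPIPE_SZ, capacity_hint);

	if (fcntl(p[1], F_SETFL, O_NONBLOCK) < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	/* Nobody reads, so the pipe fills and the writer sees EAGAIN. */
	for (;;) {
		ssize_t n = write(p[1], chunk, sizeof(chunk));

		if (n < 0) {
			if (errno != EAGAIN)
				total = -1;
			break;
		}
		total += n;
	}

	close(p[0]);
	close(p[1]);
	return total;
}
```

A blocking writer would sleep at the same point instead of seeing
EAGAIN, which is the behaviour the series proposes for eventfd once the
counter reaches the configured maximum.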

Measured on a 4-core x86_64, writer and reader pinned to separate CPUs,
reader sleeps 1 ms between reads to simulate processing time:

  Bench 1 - burst/CPU (5 s, blocking write)
  maximum      wcpu_ms  rcpu_ms      EAGAIN      writes    reads
  --------------------------------------------------------------
  ULLONG_MAX      5002      132           0     6517388     4506
  10               133      150           0       40456     4496
  (O_NONBLOCK+spin bypasses flow control; use O_NONBLOCK+poll(POLLOUT)
   to avoid wasting CPU on EAGAIN retries while still multiplexing fds)
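
A minimal sketch of the O_NONBLOCK+poll(POLLOUT) writer pattern
recommended in the note above, using only existing eventfd and poll(2)
interfaces. The helper name efd_write_poll() is hypothetical. On a
kernel without this series the same code path is exercised only when
the counter is about to overflow; with a configured maximum it would be
exercised whenever the consumer falls behind.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Add val to a non-blocking eventfd, sleeping in poll() instead of
 * spinning on EAGAIN. Returns 0 on success, 1 on timeout, -1 on error.
 * The timeout applies to each poll() call, not to the whole operation.
 * Sketch only: intended for small val; a large val that does not fit
 * while POLLOUT is still reported would retry without sleeping.
 */
int efd_write_poll(int fd, uint64_t val, int timeout_ms)
{
	/* eventfd rejects the all-ones value with EINVAL. */
	assert(val != UINT64_MAX);

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n = write(fd, &val, sizeof(val));
		int ready;

		if (n == (ssize_t)sizeof(val))
			return 0;
		if (n < 0 && errno != EAGAIN && errno != EINTR)
			return -1;

		/* Counter is full: wait until a reader makes room. */
		ready = poll(&pfd, 1, timeout_ms);
		if (ready == 0)
			return 1;
		if (ready < 0 && errno != EINTR)
			return -1;
	}
}
```

Because the fd stays non-blocking, the same pollfd can sit in a larger
poll set alongside other descriptors.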

  Bench 2 - latency tail (EFD_SEMAPHORE, 10 K/s writer, ~8 K/s reader,
            5000 events)
  maximum      p99_us   p999_us    max_us
  ----------------------------------------
  ULLONG_MAX   141218   142477    142588
  10             1719     2378      2381

  Bench 3 - coalescing (non-EFD_SEMAPHORE, 10000 writes, 125 us/read
            reader; each read drains the full counter)
  maximum      writes    reads   avg_batch
  -----------------------------------------
  ULLONG_MAX    10000       79       126.6
  10            10000     1121         8.9
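
The coalescing measured in Bench 3 follows from existing eventfd read
semantics: without EFD_SEMAPHORE a single read(2) returns and clears the
whole counter, while with EFD_SEMAPHORE each read returns 1. A small
sketch (the helper name efd_burst_then_read() is hypothetical, and it is
not the benchmark used above):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Post n single-unit events, then perform one read(2).
 * Returns the value delivered by that read, or 0 on error.
 */
uint64_t efd_burst_then_read(int flags, unsigned int n)
{
	uint64_t one = 1, got = 0;
	unsigned int i;
	int fd;

	/* With n == 0 the blocking read below would never return. */
	assert(n > 0);

	fd = eventfd(0, flags);
	if (fd < 0)
		return 0;

	for (i = 0; i < n; i++) {
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return 0;
		}
	}

	if (read(fd, &got, sizeof(got)) != (ssize_t)sizeof(got))
		got = 0;

	close(fd);
	return got;
}
```

efd_burst_then_read(0, 127) delivers the whole backlog of 127 in one
read, whereas efd_burst_then_read(EFD_SEMAPHORE, 127) delivers 1 and
leaves the remaining events queued.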

With maximum=10: writer burst CPU drops >97% (5002 ms -> 133 ms);
latency p999 drops ~60x (142 ms -> 2.4 ms); the average coalescing
batch drops from 127 to 9, so the consumer always knows the backlog is
small.
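
The summary figures can be re-derived from the tables above. The
helpers below are only an arithmetic check with hypothetical names; the
inputs are the values printed in Bench 1 to 3.

```c
#include <assert.h>

/* Percentage reduction going from `before` to `after`. */
double pct_drop(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}

/* How many times larger `before` is than `after`. */
double ratio(double before, double after)
{
	assert(after > 0.0);
	return before / after;
}
```

pct_drop(5002, 133) gives about 97.3 (writer CPU, Bench 1),
ratio(142477, 2378) gives about 59.9 (p999 latency, Bench 2), and
ratio(10000, 79) and ratio(10000, 1121) reproduce the avg_batch column
of Bench 3 (about 126.6 and 8.9).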

Notes:
- Magic 'J': 'E' conflicts with linux/input.h and xen/evtchn.h; 'J' was
  unregistered and is now added to ioctl-number.rst.
- Command numbers 0/1: explicit distinct numbers are clearer than
  relying solely on direction bits to disambiguate SET from GET.
- .compat_ioctl = compat_ptr_ioctl handles 32-bit user pointers.
- Writers woken on SET_MAXIMUM: a raised limit takes effect immediately
  without waiting for the next read(2).

Changes since v4
  (https://lore.kernel.org/all/20250310051832.5658-1-wen.yang@linux.dev/)
- Use ioctl magic 'J' instead of 'E' (conflict with input.h/xen).
- Add .compat_ioctl = compat_ptr_ioctl.
- Expose eventfd-maximum in /proc/self/fdinfo.
- Return -ENOTTY for unrecognised ioctl commands (was -ENOENT).
- Remove the unnecessary !argp guard in eventfd_ioctl().
- Register magic 'J' in Documentation/userspace-api/ioctl/ioctl-number.rst.
- Add kselftest correctness tests.

Wen Yang (2):
  eventfd: add configurable per-fd counter maximum for flow control
  selftests/eventfd: add EFD_IOC_{SET,GET}_MAXIMUM tests

 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 fs/eventfd.c                                  |  74 +++++-
 include/uapi/linux/eventfd.h                  |   6 +
 .../filesystems/eventfd/eventfd_test.c        | 238 +++++++++++++++++-
 4 files changed, 306 insertions(+), 13 deletions(-)

-- 
2.25.1

