linux-fsdevel.vger.kernel.org archive mirror
* [PATCH vfs/vfs.fixes v2] eventpoll: Set epoll timeout if it's in the future
@ 2025-04-16 18:58 Joe Damato
  2025-04-17  7:56 ` Christian Brauner
  2025-04-26 12:29 ` Christian Brauner
  0 siblings, 2 replies; 11+ messages in thread
From: Joe Damato @ 2025-04-16 18:58 UTC
  To: linux-fsdevel
  Cc: jack, brauner, Joe Damato, Alexander Viro, David S. Miller,
	Eric Dumazet, Sridhar Samudrala, Alexander Duyck, open list

Avoid an edge case where epoll_wait arms a timer and calls schedule()
even if the timer will expire immediately.

For example: if the user has specified an epoll busy poll usecs value
that is equal to or larger than the epoll_wait/epoll_pwait2 timeout,
the busy poll has already consumed the entire timeout duration. Calling
schedule_hrtimeout_range at that point is unnecessary and only induces
scheduling latency via schedule().

This can be measured using a simple bpftrace script:

tracepoint:sched:sched_switch
/ args->prev_pid == $1 /
{
  print(kstack());
  print(ustack());
}

Before this patch is applied:

  Testing an epoll_wait app with busy poll usecs set to 1000, and
  epoll_wait timeout set to 1ms using the script above shows:

     __traceiter_sched_switch+69
     __schedule+1495
     schedule+32
     schedule_hrtimeout_range+159
     do_epoll_wait+1424
     __x64_sys_epoll_wait+97
     do_syscall_64+95
     entry_SYSCALL_64_after_hwframe+118

     epoll_wait+82

  This is unexpected; the busy poll usecs should have consumed the
  entire timeout and there should be no reason to arm a timer.

After this patch is applied: the same test scenario does not generate a
call to schedule() in the above edge case. If the busy poll usecs are
reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is
armed as expected.
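
The two scenarios above can be modeled in userspace as a rough sketch.
This is not kernel code: should_schedule_timeout() and would_arm_timer()
are hypothetical names, times are plain nanosecond integers rather than
ktime_t, and the model assumes busy polling runs for its full budget
without finding any events.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the check this patch adds.
 * A NULL deadline stands for "no timeout" (wait indefinitely), so
 * sleeping is always appropriate. Otherwise sleep only if the deadline
 * is still strictly in the future.
 */
static bool should_schedule_timeout(const int64_t *deadline_ns, int64_t now_ns)
{
	if (deadline_ns)
		return *deadline_ns > now_ns;
	return true;
}

/*
 * Model a single epoll_wait call that starts at t = 0.
 * Assumption: the busy poll phase consumes exactly busy_poll_us
 * microseconds before the sleep decision is made.
 */
static bool would_arm_timer(int64_t timeout_us, int64_t busy_poll_us)
{
	int64_t deadline_ns = timeout_us * 1000;   /* absolute deadline */
	int64_t now_ns = busy_poll_us * 1000;      /* time after busy poll */

	return should_schedule_timeout(&deadline_ns, now_ns);
}
```

Under these assumptions, would_arm_timer(1000, 1000) is false (busy poll
of 1000 usecs against a 1ms timeout, so no timer is armed), while
would_arm_timer(1000, 100) is true (900 usecs remain, so the timer is
armed).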

Fixes: bf3b9f6372c4 ("epoll: Add busy poll support to epoll with socket fds.")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 v2: 
   - No longer an RFC and rebased on vfs/vfs.fixes
   - Added Jan's Reviewed-by
   - Added Fixes tag
   - No functional changes from the RFC

 rfcv1: https://lore.kernel.org/linux-fsdevel/20250415184346.39229-1-jdamato@fastly.com/

 fs/eventpoll.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 100376863a44..4bc264b854c4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1996,6 +1996,14 @@ static int ep_try_send_events(struct eventpoll *ep,
 	return res;
 }
 
+static int ep_schedule_timeout(ktime_t *to)
+{
+	if (to)
+		return ktime_after(*to, ktime_get());
+	else
+		return 1;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2103,7 +2111,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 		write_unlock_irq(&ep->lock);
 
-		if (!eavail)
+		if (!eavail && ep_schedule_timeout(to))
 			timed_out = !schedule_hrtimeout_range(to, slack,
 							      HRTIMER_MODE_ABS);
 		__set_current_state(TASK_RUNNING);

base-commit: a681b7c17dd21d5aa0da391ceb27a2007ba970a4
-- 
2.43.0



end of thread, other threads:[~2025-04-29 11:08 UTC | newest]

Thread overview: 11+ messages
2025-04-16 18:58 [PATCH vfs/vfs.fixes v2] eventpoll: Set epoll timeout if it's in the future Joe Damato
2025-04-17  7:56 ` Christian Brauner
2025-04-26 12:29 ` Christian Brauner
2025-04-28 12:14   ` Jan Kara
2025-04-28 13:18     ` Tudor Ambarus
2025-04-28 13:32       ` Tudor Ambarus
2025-04-28 16:50     ` Joe Damato
2025-04-28 22:32       ` Carlos Llamas
2025-04-28 22:41         ` Joe Damato
2025-04-29 10:19           ` Jan Kara
2025-04-29 11:08             ` Christian Brauner
