From: Stanislav Fomichev <sdf@google.com>
To: Joe Damato <jdamato@fastly.com>
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	 chuck.lever@oracle.com, jlayton@kernel.org,
	linux-api@vger.kernel.org,  brauner@kernel.org,
	edumazet@google.com, davem@davemloft.net,
	 alexander.duyck@gmail.com, sridhar.samudrala@intel.com,
	kuba@kernel.org,  willemdebruijn.kernel@gmail.com,
	weiwan@google.com, David.Laight@aculab.com,  arnd@arndb.de,
	amritha.nambiar@intel.com, Albert Ou <aou@eecs.berkeley.edu>,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Waterman <waterman@eecs.berkeley.edu>,
	 Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jan Kara <jack@suse.cz>,  Jiri Slaby <jirislaby@kernel.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 Julien Panis <jpanis@baylibre.com>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	 "open list:FILESYSTEMS (VFS and infrastructure)"
	<linux-fsdevel@vger.kernel.org>,
	Maik Broemme <mbroemme@libmpq.org>,
	 Michael Ellerman <mpe@ellerman.id.au>,
	Namjae Jeon <linkinjeon@kernel.org>,
	 Nathan Lynch <nathanl@linux.ibm.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	 Steve French <stfrench@microsoft.com>,
	Thomas Huth <thuth@redhat.com>,
	 Thomas Zimmermann <tzimmermann@suse.de>
Subject: Re: [PATCH net-next v6 0/4] Per epoll context busy poll support
Date: Tue, 6 Feb 2024 10:51:44 -0800
Message-ID: <ZcJ_wGG_f8wi_rkG@google.com>
In-Reply-To: <20240205210453.11301-1-jdamato@fastly.com>

On 02/05, Joe Damato wrote:
> Greetings:
> 
> Welcome to v6.
> 
> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> epoll with socket fds.") by allowing user applications to enable
> epoll-based busy polling, set a busy poll packet budget, and enable or
> disable prefer busy poll on a per epoll context basis.
> 
> This makes epoll-based busy polling much more usable for user
> applications than the current system-wide sysctl and hardcoded budget.
> 
> To allow for this, two ioctls have been added for epoll contexts for
> getting and setting a new struct, struct epoll_params.
> 
> ioctl was chosen vs a new syscall after reviewing a suggestion by Willem
> de Bruijn [1]. I am open to using a new syscall instead of an ioctl, but it
> seemed that: 
>   - Busy poll affects all existing epoll_wait and epoll_pwait variants in
>     the same way, so new versions of many syscalls might be needed. It
>     seems much simpler for users to use the correct
>     epoll_wait/epoll_pwait for their app and add a call to ioctl to enable
>     or disable busy poll as needed. This also probably means less work to
>     get an existing epoll app using busy poll.
> 
>   - previously added epoll_pwait2 helped to bring epoll closer to
>     existing syscalls (like pselect and ppoll) and this busy poll change
>     reflected as a new syscall would not have the same effect.
> 
> Note: patch 1/4 as of v4 uses an or (||) instead of an xor. I thought about
> it some more and I realized that if the user enables both the per-epoll
> context setting and the system wide sysctl, then busy poll should be
> enabled and not disabled. Using xor doesn't seem to make much sense after
> thinking through this a bit.
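
To make the or-vs-xor point above concrete, here is a small userspace model. It is only a sketch and not the kernel's ep_busy_loop_on(); the function and parameter names are made up for illustration, with sysctl_usecs standing in for the system-wide sysctl and ep_usecs for the per-epoll-context setting.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only, not kernel code. Both parameter names are
 * hypothetical: sysctl_usecs models the system-wide sysctl and
 * ep_usecs models the per-epoll-context value.
 */
static bool model_busy_loop_or(unsigned int sysctl_usecs, unsigned int ep_usecs)
{
	/* Busy poll if either knob is non-zero. */
	return sysctl_usecs != 0 || ep_usecs != 0;
}

static bool model_busy_loop_xor(unsigned int sysctl_usecs, unsigned int ep_usecs)
{
	/* Busy poll only if exactly one knob is non-zero. */
	return (sysctl_usecs != 0) ^ (ep_usecs != 0);
}

/* Self-check of the one case where the two variants disagree. */
static void model_check_both_enabled(void)
{
	assert(model_busy_loop_or(50, 200));	/* both set: stays enabled */
	assert(!model_busy_loop_xor(50, 200));	/* both set: xor disables it */
}
```

The two variants agree whenever at most one knob is set; they differ only when both are set, which is the case the note describes.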
> 
> Longer explanation:
> 
> Presently epoll has support for a very useful form of busy poll based on
> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [2]).
> 
> This form of busy poll allows epoll_wait to drive NAPI packet processing
> which allows for a few interesting user application designs which can
> reduce latency and also potentially improve L2/L3 cache hit rates by
> deferring NAPI until userland has finished its work.
> 
> The documentation available on this is, IMHO, a bit confusing so please
> allow me to explain how one might use this:
> 
> 1. Ensure each application thread has its own epoll instance mapping
> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> direct connections with specific dest ports to these queues.
> 
> 2. Optionally: Set up IRQ coalescing for the NIC RX queues where busy
> polling will occur. This can help prevent the userland app from being
> pre-empted by a hard IRQ while userland is running. Note this means that
> userland must take care to call epoll_wait and not take too long in
> userland since it now drives NAPI via epoll_wait.
> 
> 3. Optionally: Consider using napi_defer_hard_irqs and gro_flush_timeout to
> further restrict IRQ generation from the NIC. These settings are
> system-wide so their impact must be carefully weighed against the running
> applications.
> 
> 4. Ensure that all incoming connections added to an epoll instance
> have the same NAPI ID. This can be done with a BPF filter when
> SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> accept thread is used which dispatches incoming connections to threads.
> 
> 5. Lastly, busy poll must be enabled via a sysctl
> (/proc/sys/net/core/busy_poll).
> 
> Please see Eric Dumazet's paper about busy polling [3] and a recent
> academic paper about measured performance improvements of busy polling [4]
> (albeit with a modification that is not currently present in the kernel)
> for additional context.
> 
> The unfortunate part about step 5 above is that this enables busy poll
> system-wide which affects all user applications on the system,
> including epoll-based network applications which were not intended to
> be used this way or applications where increased CPU usage for lower
> latency network processing is unnecessary or not desirable.
> 
> If the user wants to run one low latency epoll-based server application
> with epoll-based busy poll, but would like to run the rest of the
> applications on the system (which may also use epoll) without busy poll,
> this system-wide sysctl presents a significant problem.
> 
> This change preserves the system-wide sysctl, but adds a mechanism (via
> ioctl) to enable or disable busy poll for epoll contexts as needed by
> individual applications, making epoll-based busy poll more usable.
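
From the application side, enabling busy poll on one epoll context might look roughly like the sketch below. It assumes a libc whose <sys/epoll.h> exposes EPIOCSPARAMS and struct epoll_params; that is an assumption about the build environment and not something this series provides, since it only touches include/uapi/linux/eventpoll.h. The helper name and the fallback error are made up for illustration, and the field names are taken from the changelog and the reply in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: set per-epoll-context busy poll parameters.
 * Returns 0 on success, -errno on failure. When the headers do not
 * define EPIOCSPARAMS this sketch simply reports -EOPNOTSUPP.
 */
static int sketch_set_busy_poll(int epfd, unsigned int usecs,
				unsigned short budget, int prefer)
{
#ifdef EPIOCSPARAMS
	struct epoll_params p;

	static_assert(sizeof(p) % 8 == 0, "padded to a multiple of 64 bits");
	memset(&p, 0, sizeof(p));	/* also zeroes the padding */
	p.busy_poll_usecs = usecs;
	p.busy_poll_budget = budget;
	p.prefer_busy_poll = prefer ? 1 : 0;	/* limited to 0 or 1 */
	if (ioctl(epfd, EPIOCSPARAMS, &p) == -1)
		return -errno;
	return 0;
#else
	(void)epfd; (void)usecs; (void)budget; (void)prefer;
	return -EOPNOTSUPP;
#endif
}
```

The application would call this once after creating the epoll fd and then keep using whichever epoll_wait/epoll_pwait variant it already uses; other epoll users on the system are unaffected.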
> 
> Note that this change includes an or (as of v4) instead of an xor. If the
> user has enabled both the system-wide sysctl and also the per epoll-context
> busy poll settings, then epoll should probably busy poll (vs being
> disabled). 
> 
> Thanks,
> Joe
> 
> v5 -> v6:
>   - patch 1/4 no functional change, but commit message corrected to explain
>     that an or (||) is being used instead of xor.
> 
>   - patch 3/4 is a new patch which adds support for per epoll context
>     prefer busy poll setting.
> 
>   - patch 4/4 updated to allow getting/setting per epoll context prefer
>     busy poll setting; this setting is limited to either 0 or 1.
> 
> v4 -> v5:
>   - patch 3/3 updated to use memchr_inv to ensure that __pad is zero for
>     the EPIOCSPARAMS ioctl. Recommended by Greg K-H [5], Dave Chinner [6],
>     and Jiri Slaby [7].
> 
> v3 -> v4:
>   - patch 1/3 was updated to include an important functional change:
>     ep_busy_loop_on was updated to use or (||) instead of xor (^). After
>     thinking about it a bit more, I thought xor didn't make much sense.
>     Enabling both the per-epoll context and the system-wide sysctl should
>     probably enable busy poll, not disable it. So, or (||) makes more
>     sense, I think.
> 
>   - patch 3/3 was updated:
>     - to change the epoll_params fields to be __u64, __u16, and __u8 and
>       to pad the struct to a multiple of 64bits. Suggested by Greg K-H [8]
>       and Arnd Bergmann [9].
>     - remove an unused pr_fmt, left over from the previous revision.
>     - ioctl now returns -EINVAL when epoll_params.busy_poll_usecs >
>       U32_MAX.
> 
> v2 -> v3:
>   - cover letter updated to mention why ioctl seems (to me) like a better
>     choice vs a new syscall.
> 
>   - patch 3/4 was modified in 3 ways:
>     - when an unknown ioctl is received, -ENOIOCTLCMD is returned instead
>       of -EINVAL as the ioctl documentation requires.
>     - epoll_params.busy_poll_budget can only be set to a value larger than
>       NAPI_POLL_WEIGHT if code is run by privileged (CAP_NET_ADMIN) users.
>       Otherwise, -EPERM is returned.
>     - busy poll specific ioctl code moved out to its own function. On
>       kernels without busy poll support, -EOPNOTSUPP is returned. This also
>       makes the kernel build robot happier without littering the code with
>       more #ifdefs.
> 
>   - dropped patch 4/4 after Eric Dumazet's review of it when it was sent
>     independently to the list [10].
> 
> v1 -> v2:
>   - cover letter updated to make a mention of napi_defer_hard_irqs and
>     gro_flush_timeout as an added step 3 and to cite both Eric Dumazet's
>     busy polling paper and a paper from University of Waterloo for
>     additional context. Specifically calling out the xor in patch 1/4
>     in case it is missed by reviewers.
> 
>   - Patch 2/4 has its commit message updated, but no functional changes.
>     Commit message now describes that allowing for a settable budget helps
>     to improve throughput and is more consistent with other busy poll
>     mechanisms that allow a settable budget via SO_BUSY_POLL_BUDGET.
> 
>   - Patch 3/4 was modified to check if the epoll_params.busy_poll_budget
>     exceeds NAPI_POLL_WEIGHT. The larger value is allowed, but an error is
>     printed. This was done for consistency with netif_napi_add_weight,
>     which does the same.
> 
>   - Patch 3/4: the struct epoll_params was updated to fix the type of the
>     data field; it was uint8_t and was changed to u8.
> 
>   - Patch 4/4 added to check if SO_BUSY_POLL_BUDGET exceeds
>     NAPI_POLL_WEIGHT. The larger value is allowed, but an error is
>     printed. This was done for consistency with netif_napi_add_weight,
>     which does the same.
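
Pulling the parameter checks from the changelogs above into one place, a userspace model of what EPIOCSPARAMS appears to enforce as of v6 could look like this. It is a sketch built only from the changelog text: the struct layout (including the pad size), the order of the checks, and the -EINVAL for non-zero padding or an out-of-range prefer value are my assumptions, and all names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NAPI_POLL_WEIGHT 64	/* NAPI_POLL_WEIGHT in the kernel */

/* Hypothetical mirror of epoll_params as the changelog describes it. */
struct model_params {
	uint64_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t  prefer_busy_poll;
	uint8_t  pad[5];	/* assumed size; pads the struct to 64 bits */
};

static_assert(sizeof(struct model_params) % 8 == 0,
	      "struct must be a multiple of 64 bits");

/* Returns 0 if the parameters would be accepted, else a negative errno. */
static int model_check_params(const struct model_params *p, bool net_admin)
{
	for (size_t i = 0; i < sizeof(p->pad); i++)
		if (p->pad[i] != 0)
			return -EINVAL;		/* padding must be zero (v5) */
	if (p->busy_poll_usecs > UINT32_MAX)
		return -EINVAL;			/* v4 */
	if (p->prefer_busy_poll > 1)
		return -EINVAL;			/* limited to 0 or 1 (v6) */
	if (p->busy_poll_budget > MODEL_NAPI_POLL_WEIGHT && !net_admin)
		return -EPERM;			/* needs CAP_NET_ADMIN (v3) */
	return 0;
}
```

Note that the v2 behaviour (allow a budget above NAPI_POLL_WEIGHT and print an error) is not modelled, since v3 replaced it with the CAP_NET_ADMIN check.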
> 
> [1]: https://lore.kernel.org/lkml/65b1cb7f73a6a_250560294bd@willemb.c.googlers.com.notmuch/
> [2]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/
> [3]: https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf
> [4]: https://dl.acm.org/doi/pdf/10.1145/3626780
> [5]: https://lore.kernel.org/lkml/2024013001-prison-strum-899d@gregkh/
> [6]: https://lore.kernel.org/lkml/Zbm3AXgcwL9D6TNM@dread.disaster.area/
> [7]: https://lore.kernel.org/lkml/efee9789-4f05-4202-9a95-21d88f6307b0@kernel.org/
> [8]: https://lore.kernel.org/lkml/2024012551-anyone-demeaning-867b@gregkh/
> [9]: https://lore.kernel.org/lkml/57b62135-2159-493d-a6bb-47d5be55154a@app.fastmail.com/
> [10]: https://lore.kernel.org/lkml/CANn89i+uXsdSVFiQT9fDfGw+h_5QOcuHwPdWi9J=5U6oLXkQTA@mail.gmail.com/
> 
> Joe Damato (4):
>   eventpoll: support busy poll per epoll instance
>   eventpoll: Add per-epoll busy poll packet budget
>   eventpoll: Add per-epoll prefer busy poll option
>   eventpoll: Add epoll ioctl for epoll_params
> 
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  fs/eventpoll.c                                | 136 +++++++++++++++++-
>  include/uapi/linux/eventpoll.h                |  13 ++
>  3 files changed, 144 insertions(+), 6 deletions(-)

Coincidentally, we were looking into the same area and your patches are
super useful :-) Thank you for plumbing in prefer_busy_poll. 

Acked-by: Stanislav Fomichev <sdf@google.com>

