From: Bagas Sanjaya <bagasdotme@gmail.com>
To: Joe Damato <jdamato@fastly.com>,
Linux Networking <netdev@vger.kernel.org>
Cc: namangulati@google.com, edumazet@google.com,
amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
sdf@fomichev.me, peter@typeblog.net, m2shafiei@uwaterloo.ca,
bjorn@rivosinc.com, hch@infradead.org, willy@infradead.org,
willemdebruijn.kernel@gmail.com, skhawaja@google.com,
kuba@kernel.org, Martin Karsten <mkarsten@uwaterloo.ca>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
Linux Documentation <linux-doc@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux BPF <bpf@vger.kernel.org>
Subject: Re: [PATCH net-next v2 6/6] docs: networking: Describe irq suspension
Date: Mon, 21 Oct 2024 17:49:14 +0700
Message-ID: <ZxYxqhj7cesDO8-j@archie.me>
In-Reply-To: <20241021015311.95468-7-jdamato@fastly.com>
On Mon, Oct 21, 2024 at 01:53:01AM +0000, Joe Damato wrote:
> diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
> index dfa5d549be9c..3b43477a52ce 100644
> --- a/Documentation/networking/napi.rst
> +++ b/Documentation/networking/napi.rst
> @@ -192,6 +192,28 @@ is reused to control the delay of the timer, while
> ``napi_defer_hard_irqs`` controls the number of consecutive empty polls
> before NAPI gives up and goes back to using hardware IRQs.
>
> +The above parameters can also be set on a per-NAPI basis using netlink via
> +netdev-genl. This can be done programmatically in a user application or by
> +using a script included in the kernel source tree: ``tools/net/ynl/cli.py``.
> +
> +For example, using the script:
> +
> +.. code-block:: bash
> +
> + $ kernel-source/tools/net/ynl/cli.py \
> + --spec Documentation/netlink/specs/netdev.yaml \
> + --do napi-set \
> + --json='{"id": 345,
> + "defer-hard-irqs": 111,
> + "gro-flush-timeout": 11111}'
> +
> +Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
> +via netdev-genl. There is no global sysfs parameter for this value.
In the JSON, the gro-flush-timeout and irq-suspend-timeout parameter names
are written with hyphens, but the rest of the doc uses underscores
(that is, gro_flush_timeout and irq_suspend_timeout), right?
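To make the naming question concrete, here is a minimal sketch of how the underscore names used in the prose correspond to the hyphenated attribute names in the JSON above. The helper name ``napi_set_payload`` and the ``ATTR_NAMES`` table are hypothetical, and the ``irq-suspend-timeout`` spelling is taken from this patch rather than verified against the netlink spec.

```python
import json

# Prose/sysfs spelling -> netlink attribute spelling, as they appear in this patch.
# Note that napi_defer_hard_irqs loses its "napi_" prefix, so this is not a
# plain underscore-to-hyphen substitution.
ATTR_NAMES = {
    "napi_defer_hard_irqs": "defer-hard-irqs",
    "gro_flush_timeout": "gro-flush-timeout",
    "irq_suspend_timeout": "irq-suspend-timeout",
}


def napi_set_payload(napi_id, **params):
    """Build the JSON string that would be passed to cli.py via --json=..."""
    payload = {"id": napi_id}
    for name, value in params.items():
        payload[ATTR_NAMES[name]] = value  # KeyError for unknown names
    return json.dumps(payload)


# Reproduces the payload from the quoted example.
print(napi_set_payload(345, napi_defer_hard_irqs=111, gro_flush_timeout=11111))
```

If the doc keeps both spellings, a short sentence stating this correspondence would probably be enough.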
> +
> +``irq_suspend_timeout`` is used to determine how long an application can
> +completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
> +which can be set on a per-epoll context basis with ``EPIOCSPARAMS`` ioctl.
> +
> .. _poll:
>
> Busy polling
> @@ -207,6 +229,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
> ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
> also exists.
>
> +epoll-based busy polling
> +------------------------
> +
> +It is possible to trigger packet processing directly from calls to
> +``epoll_wait``. In order to use this feature, a user application must ensure
> +all file descriptors which are added to an epoll context have the same NAPI ID.
> +
> +If the application uses a dedicated acceptor thread, the application can obtain
> +the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
> +distribute that file descriptor to a worker thread. The worker thread would add
> +the file descriptor to its epoll context. This would ensure each worker thread
> +has an epoll context with FDs that have the same NAPI ID.
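The acceptor-thread scheme above could be sketched roughly as follows. This is only an illustration: ``incoming_napi_id`` and ``pick_worker`` are hypothetical helper names, and the fallback value 56 for ``SO_INCOMING_NAPI_ID`` is an assumption taken from the asm-generic socket header (some architectures use a different number).

```python
import socket

# Assumed asm-generic value; used only if the socket module does not export it.
SO_INCOMING_NAPI_ID = getattr(socket, "SO_INCOMING_NAPI_ID", 56)


def incoming_napi_id(conn):
    """Return the NAPI ID recorded for an accepted connection."""
    return conn.getsockopt(socket.SOL_SOCKET, SO_INCOMING_NAPI_ID)


def pick_worker(napi_id, workers_by_napi):
    """Map a NAPI ID to a worker index, assigning new IDs in arrival order."""
    return workers_by_napi.setdefault(napi_id, len(workers_by_napi))


# Connections seen on two NAPI IDs end up on two distinct workers, and a
# repeated NAPI ID is routed to the worker that already owns it.
table = {}
print([pick_worker(n, table) for n in (8193, 8194, 8193)])
```

In a real acceptor loop, the value returned by ``incoming_napi_id(conn)`` would be fed to ``pick_worker`` before handing the file descriptor to that worker's epoll context.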
> +
> +Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can be
> +inserted to distribute incoming connections to threads such that each thread is
> +only given incoming connections with the same NAPI ID. Care must be taken to
> +handle cases where a system may have multiple NICs.
> +
> +In order to enable busy polling, there are two choices:
> +
> +1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to busy
> + loop waiting for events. This is a system-wide setting and will cause all
> + epoll-based applications to busy poll when they call epoll_wait. This may
> + not be desirable as many applications may not have the need to busy poll.
> +
> +2. Applications using recent kernels can issue an ioctl on the epoll context
> + file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
> + epoll_params``, which user programs can define as follows:
> +
> +.. code-block:: c
> +
> + struct epoll_params {
> + uint32_t busy_poll_usecs;
> + uint16_t busy_poll_budget;
> + uint8_t prefer_busy_poll;
> +
> + /* pad the struct to a multiple of 64bits */
> + uint8_t __pad;
> + };
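For readers who want to see the ioctl in use, here is a minimal sketch that packs the struct above and derives the two request numbers. The ioctl type ``0x8A``, the command numbers ``0x01``/``0x02``, and the asm-generic ``_IOC`` bit layout are assumptions based on the uapi eventpoll header on common architectures; ``set_epoll_params`` and ``get_epoll_params`` are hypothetical helper names.

```python
import fcntl
import select
import struct

# struct epoll_params: u32 busy_poll_usecs, u16 busy_poll_budget,
# u8 prefer_busy_poll, u8 __pad -> 8 bytes with no implicit padding.
EPOLL_PARAMS = struct.Struct("=IHBB")


def _ioc(direction, nr, size, ioc_type=0x8A):
    # asm-generic _IOC() encoding (assumed); some architectures differ.
    return (direction << 30) | (size << 16) | (ioc_type << 8) | nr


EPIOCSPARAMS = _ioc(1, 0x01, EPOLL_PARAMS.size)  # _IOW
EPIOCGPARAMS = _ioc(2, 0x02, EPOLL_PARAMS.size)  # _IOR


def set_epoll_params(ep, usecs, budget, prefer):
    """Apply busy poll settings to one epoll context (needs kernel support)."""
    arg = EPOLL_PARAMS.pack(usecs, budget, int(prefer), 0)
    fcntl.ioctl(ep.fileno(), EPIOCSPARAMS, arg)


def get_epoll_params(ep):
    """Read back (busy_poll_usecs, busy_poll_budget, prefer_busy_poll)."""
    raw = fcntl.ioctl(ep.fileno(), EPIOCGPARAMS, bytes(EPOLL_PARAMS.size))
    usecs, budget, prefer, _pad = EPOLL_PARAMS.unpack(raw)
    return usecs, budget, bool(prefer)


# Only the derived constants are printed here; no ioctl is issued.
print(hex(EPIOCSPARAMS), hex(EPIOCGPARAMS), EPOLL_PARAMS.size)
```

A caller would create the context with ``select.epoll()`` and pass it to ``set_epoll_params``, for example with ``usecs=64, budget=64, prefer=True``; those values are placeholders, not recommendations.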
> +
> IRQ mitigation
> ---------------
>
> @@ -222,12 +284,78 @@ Such applications can pledge to the kernel that they will perform a busy
> polling operation periodically, and the driver should keep the device IRQs
> permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
> socket option. To avoid system misbehavior the pledge is revoked
> -if ``gro_flush_timeout`` passes without any busy poll call.
> +if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
> +busy polling applications, the ``prefer_busy_poll`` field of ``struct
> +epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
> +enable this mode. See the above section for more details.
>
> The NAPI budget for busy polling is lower than the default (which makes
> sense given the low latency intention of normal busy polling). This is
> not the case with IRQ mitigation, however, so the budget can be adjusted
> -with the ``SO_BUSY_POLL_BUDGET`` socket option.
> +with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
> +applications, the ``busy_poll_budget`` field can be adjusted to the desired value
> +in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
> +ioctl. See the above section for more details.
> +
> +It is important to note that choosing a large value for ``gro_flush_timeout``
> +will defer IRQs to allow for better batch processing, but will induce latency
> +when the system is not fully loaded. Choosing a small value for
> +``gro_flush_timeout`` can cause device IRQs and softirq processing to
> +interfere with the user application which is attempting to busy poll. This value
> +should be chosen carefully with these tradeoffs in mind. epoll-based busy
> +polling applications may be able to mitigate how much user processing happens
> +by choosing an appropriate value for ``maxevents``.
> +
> +Users may want to consider an alternate approach, IRQ suspension, to help deal
> +with these tradeoffs.
> +
> +IRQ suspension
> +--------------
> +
> +IRQ suspension is a mechanism wherein device IRQs are masked while epoll
> +triggers NAPI packet processing.
> +
> +While application calls to epoll_wait successfully retrieve events, the kernel will
> +defer the IRQ suspension timer. If the kernel does not retrieve any events
> +while busy polling (for example, because network traffic levels subsided), IRQ
> +suspension is disabled and the IRQ mitigation strategies described above are
> +engaged.
> +
> +This allows users to balance CPU consumption with network processing
> +efficiency.
> +
> +To use this mechanism:
> +
> + 1. The per-NAPI config parameter ``irq_suspend_timeout`` should be set to the
> + maximum time (in nanoseconds) the application can have its IRQs
> + suspended. This is done using netlink, as described above. This timeout
> + serves as a safety mechanism to restart IRQ driver interrupt processing if
> + the application has stalled. This value should be chosen so that it covers
> + the amount of time the user application needs to process data from its
> + call to epoll_wait, noting that applications can control how much data
> + they retrieve by setting ``maxevents`` when calling epoll_wait.
> +
> + 2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
> + and ``napi_defer_hard_irqs`` can be set to low values. They will be used
> + to defer IRQs after busy poll has found no data.
> +
> + 3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
> + the ``EPIOCSPARAMS`` ioctl as described above.
> +
> + 4. The application uses epoll as described above to trigger NAPI packet
> + processing.
> +
> +As mentioned above, as long as subsequent calls to epoll_wait return events to
> +userland, the ``irq_suspend_timeout`` is deferred and IRQs are disabled. This
> +allows the application to process data without interference.
> +
> +Once a call to epoll_wait results in no events being found, IRQ suspension is
> +automatically disabled and the ``gro_flush_timeout`` and
> +``napi_defer_hard_irqs`` mitigation mechanisms take over.
> +
> +It is expected that ``irq_suspend_timeout`` will be set to a value much larger
> +than ``gro_flush_timeout`` as ``irq_suspend_timeout`` should suspend IRQs for
> +the duration of one userland processing cycle.
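The sizing guidance above can be turned into a back-of-the-envelope calculation. The sketch below is illustrative only: the function names, the headroom factor, and the sample numbers (64 events, 50 microseconds per event, a 20 microsecond ``gro_flush_timeout``) are all hypothetical and not taken from the patch.

```python
def min_irq_suspend_timeout_ns(maxevents, per_event_ns, headroom=2.0):
    """Estimate a lower bound covering one userland processing cycle.

    The worst case assumed here is that epoll_wait returns maxevents events
    and each takes per_event_ns to handle; headroom adds a safety margin.
    """
    return int(maxevents * per_event_ns * headroom)


def check_timeouts(irq_suspend_timeout_ns, gro_flush_timeout_ns):
    """Check only the ordering: irq_suspend_timeout above gro_flush_timeout."""
    return irq_suspend_timeout_ns > gro_flush_timeout_ns


suspend_ns = min_irq_suspend_timeout_ns(maxevents=64, per_event_ns=50_000)
print(suspend_ns, check_timeouts(suspend_ns, gro_flush_timeout_ns=20_000))
```

With these sample inputs the estimate is 6.4 ms, which is far above the 20 microsecond ``gro_flush_timeout`` used for comparison, in line with the expectation stated in the last paragraph.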
>
> .. _threaded:
>
The rest LGTM, thanks!
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
--
An old man doll... just what I always wanted! - Clara