From: Simon Horman <horms@kernel.org>
To: darinzon@amazon.com
Cc: "Nelson, Shannon" <shannon.nelson@amd.com>,
David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
netdev@vger.kernel.org, "Woodhouse, David" <dwmw@amazon.com>,
"Machulsky, Zorik" <zorik@amazon.com>,
"Matushevsky, Alexander" <matua@amazon.com>,
Saeed Bshara <saeedb@amazon.com>, "Wilson, Matt" <msw@amazon.com>,
"Liguori, Anthony" <aliguori@amazon.com>,
"Bshara, Nafea" <nafea@amazon.com>,
"Belgazal, Netanel" <netanel@amazon.com>,
"Saidi, Ali" <alisaidi@amazon.com>,
"Herrenschmidt, Benjamin" <benh@amazon.com>,
"Kiyanovski, Arthur" <akiyano@amazon.com>,
"Dagan, Noam" <ndagan@amazon.com>,
"Agroskin, Shay" <shayagr@amazon.com>,
"Itzko, Shahar" <itzko@amazon.com>,
"Abboud, Osama" <osamaabb@amazon.com>,
"Ostrovsky, Evgeny" <evostrov@amazon.com>,
"Tabachnik, Ofir" <ofirt@amazon.com>,
"Koler, Nati" <nkoler@amazon.com>
Subject: Re: [PATCH v2 net-next 07/11] net: ena: Add more information on TX timeouts
Date: Thu, 1 Feb 2024 13:27:05 +0100 [thread overview]
Message-ID: <20240201122705.GA530335@kernel.org> (raw)
In-Reply-To: <20240130095353.2881-8-darinzon@amazon.com>
On Tue, Jan 30, 2024 at 09:53:49AM +0000, darinzon@amazon.com wrote:
> From: David Arinzon <darinzon@amazon.com>
>
> The function responsible for polling TX completions might not receive
> the CPU resources it needs due to higher priority tasks running on the
> requested core.
>
> The driver might not be able to recognize such cases, but it can use its
> state to suspect that they happened. If both conditions are met:
>
> - napi hasn't been executed more than the TX completion timeout value
> - napi is scheduled (meaning that we've received an interrupt)
>
> Then it's more likely that the napi handler isn't scheduled because of
> an overloaded CPU.
> It was decided that for this case, the driver would wait twice as long
> as the regular timeout before scheduling a reset.
> The driver uses ENA_REGS_RESET_SUSPECTED_POLL_STARVATION reset reason to
> indicate this case to the device.
>
> This patch also adds more information to the ena_tx_timeout() callback.
> This function is called by the kernel when it detects that a specific TX
> queue has been closed for too long.
>
> Signed-off-by: Shay Agroskin <shayagr@amazon.com>
> Signed-off-by: David Arinzon <darinzon@amazon.com>
> ---
> drivers/net/ethernet/amazon/ena/ena_netdev.c | 77 +++++++++++++++----
> .../net/ethernet/amazon/ena/ena_regs_defs.h | 1 +
> 2 files changed, 64 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 18acb76..ae9291b 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -47,19 +47,44 @@ static int ena_restore_device(struct ena_adapter *adapter);
>
> static void ena_tx_timeout(struct net_device *dev, unsigned int txqueue)
> {
> + enum ena_regs_reset_reason_types reset_reason = ENA_REGS_RESET_OS_NETDEV_WD;
> struct ena_adapter *adapter = netdev_priv(dev);
> + unsigned int time_since_last_napi, threshold;
> + struct ena_ring *tx_ring;
> + int napi_scheduled;
> +
> + if (txqueue >= adapter->num_io_queues) {
> + netdev_err(dev, "TX timeout on invalid queue %u\n", txqueue);
> + goto schedule_reset;
> + }
> +
> + threshold = jiffies_to_usecs(dev->watchdog_timeo);
> + tx_ring = &adapter->tx_ring[txqueue];
> +
> + time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> + napi_scheduled = !!(tx_ring->napi->state & NAPIF_STATE_SCHED);
>
> + netdev_err(dev,
> + "TX q %d is paused for too long (threshold %u). Time since last napi %u usec. napi scheduled: %d\n",
> + txqueue,
> + threshold,
> + time_since_last_napi,
> + napi_scheduled);
> +
> + if (threshold < time_since_last_napi && napi_scheduled) {
> + netdev_err(dev,
> + "napi handler hasn't been called for a long time but is scheduled\n");
> + reset_reason = ENA_REGS_RESET_SUSPECTED_POLL_STARVATION;
Hi David,
a nit from my side: the line above is indented one tab-stop too many.
No need to respin just for this AFAIC.
> + }
> +schedule_reset:
> /* Change the state of the device to trigger reset
> * Check that we are not in the middle or a trigger already
> */
> -
> if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
> return;
>
> - ena_reset_device(adapter, ENA_REGS_RESET_OS_NETDEV_WD);
> + ena_reset_device(adapter, reset_reason);
> ena_increase_stat(&adapter->dev_stats.tx_timeout, 1, &adapter->syncp);
> -
> - netif_err(adapter, tx_err, dev, "Transmit time out\n");
> }
>
...
next prev parent reply other threads:[~2024-02-01 12:27 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-30 9:53 [PATCH v2 net-next 00/11] ENA driver changes darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 01/11] net: ena: Remove an unused field darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 02/11] net: ena: Add more documentation for RX copybreak darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 03/11] net: ena: Minor cosmetic changes darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 04/11] net: ena: Enable DIM by default darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 05/11] net: ena: Remove CQ tail pointer update darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 06/11] net: ena: Change error print during ena_device_init() darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 07/11] net: ena: Add more information on TX timeouts darinzon
2024-02-01 12:17 ` Paolo Abeni
2024-02-01 13:21 ` Arinzon, David
2024-02-01 14:36 ` Paolo Abeni
2024-02-01 15:43 ` Arinzon, David
2024-02-01 12:27 ` Simon Horman [this message]
2024-02-01 12:53 ` Arinzon, David
2024-01-30 9:53 ` [PATCH v2 net-next 08/11] net: ena: Relocate skb_tx_timestamp() to improve time stamping accuracy darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 09/11] net: ena: Change default print level for netif_ prints darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 10/11] net: ena: handle ena_calc_io_queue_size() possible errors darinzon
2024-01-30 9:53 ` [PATCH v2 net-next 11/11] net: ena: Reduce lines with longer column width boundary darinzon
2024-02-01 16:12 ` Jakub Kicinski
2024-02-01 17:40 ` Arinzon, David
2024-02-01 12:30 ` [PATCH v2 net-next 00/11] ENA driver changes patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240201122705.GA530335@kernel.org \
--to=horms@kernel.org \
--cc=akiyano@amazon.com \
--cc=aliguori@amazon.com \
--cc=alisaidi@amazon.com \
--cc=benh@amazon.com \
--cc=darinzon@amazon.com \
--cc=davem@davemloft.net \
--cc=dwmw@amazon.com \
--cc=evostrov@amazon.com \
--cc=itzko@amazon.com \
--cc=kuba@kernel.org \
--cc=matua@amazon.com \
--cc=msw@amazon.com \
--cc=nafea@amazon.com \
--cc=ndagan@amazon.com \
--cc=netanel@amazon.com \
--cc=netdev@vger.kernel.org \
--cc=nkoler@amazon.com \
--cc=ofirt@amazon.com \
--cc=osamaabb@amazon.com \
--cc=saeedb@amazon.com \
--cc=shannon.nelson@amd.com \
--cc=shayagr@amazon.com \
--cc=zorik@amazon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.