Re: [PATCH net] gve: fix Rx queue stall on alloc failure

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Eddie Phillips <eddiephillips@google.com>
Cc: Harshitha Ramamurthy <hramamurthy@google.com>,
	<netdev@vger.kernel.org>, <joshwash@google.com>,
	<andrew+netdev@lunn.ch>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <pabeni@redhat.com>,
	<ast@kernel.org>, <daniel@iogearbox.net>, <hawk@kernel.org>,
	<john.fastabend@gmail.com>, <bpf@vger.kernel.org>,
	<sdf@fomichev.me>, <willemb@google.com>, <jordanrhee@google.com>,
	<nktgrg@google.com>, <maolson@google.com>,
	<jacob.e.keller@intel.com>, <thostet@google.com>,
	<csully@google.com>, <bcf@google.com>,
	<linux-kernel@vger.kernel.org>, <stable@vger.kernel.org>
Subject: Re: [PATCH net] gve: fix Rx queue stall on alloc failure
Date: Fri, 3 Jul 2026 18:05:57 +0200	[thread overview]
Message-ID: <akfd5abdGbxFl5o9@boxer> (raw)
In-Reply-To: <CAPBb8HmE6q0VPa5PooFP3VFF27GU3B4622Xww6MHRT-9i4zTxA@mail.gmail.com>

On Fri, Jul 03, 2026 at 01:03:20AM -0700, Eddie Phillips wrote:
> > I think this deserves to be pulled out of the timer logic?
> 
> If by this you mean pull the stats into a separate patch, I agree.

Hi Eddie,

instead of forming a response at the top of the mail, please have your
answers inlined; it is preferred way of communication on mailing lists.

> 
> > - couldn't you detect this case within napi poll loop?
> 
> It can only be detected after attempting to refill the queue and finding
> that we are still below the critical threshold.
> 
> > - if not, does it have to be per-q timer? wouldn't one global per pf timer
> >   satisfy your needs?
> 
> There are a few ways a global timer could be implemented,
>  - The global timer could queue napi for *all* queues, which would
> result in a lot of unnecessary work.
>  - The global timer could iterate over each queue and try to detect
> the critical low buffer condition, however this would require
> introducing synchronization between the timer and the napis, which
> would introduce expensive locking into the hot path.
>  - The global timer could be paired with a bitmap that stores which
> queues need to be serviced.

bitmap would probably do the job but i won't insist here tho.

One more question/idea:
Before arming the starvation timer, could we first try to make a smaller batch
of already-posted buffers visible to HW?

It seems the HW can accept RX buffer tail doorbell updates at a granularity
lower than the normal `GVE_RX_BUF_THRESH_DQO` batching threshold, apparently as
low as 8 descriptors. If that is the case, could we first use this as an
emergency low-watermark path: when refill posts at least 8 descriptors but does
not reach the normal 32-descriptor threshold, ring the doorbell immediately and
only arm the starvation timer if even that lower threshold cannot be reached?

> 
> A `struct timer_list` is only 40 bytes, so the current implemention is
> not expensive. Though a global timer is valid, it's not strictly better.
> 
> That said, I agree that we can clean up the structure—I will move the
> timer state from the individual RX rings to the `gve_priv` structure.
> 
> On Wed, Jul 1, 2026 at 6:22 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, Jul 01, 2026 at 12:53:41AM +0000, Harshitha Ramamurthy wrote:
> > > From: Eddie Phillips <eddiephillips@google.com>
> > >
> > > When the system is under extreme memory pressure, page allocations can
> > > fail during the Rx buffer refill loop. If the number of buffers posted
> > > to hardware falls below a critical low threshold and the refill loop
> > > exits due to allocation failures, the queue can stall:
> > >
> > > 1. The device drops incoming packets because there are no descriptors.
> > > 2. Since no packets are processed, no Rx completions are generated.
> > > 3. Because no completions occur, NAPI is never scheduled, preventing
> > >    the refill loop from running again even after memory is freed.
> > >
> > > This results in a permanent queue stall.
> > >
> > > Resolve this by introducing a starvation recovery timer for each Rx queue.
> > > If the number of buffers posted to hardware falls below a critical low
> > > threshold, start a timer to periodically reschedule NAPI. Once NAPI runs
> > > and successfully refills the queue above the threshold, the timer is
> > > not rescheduled.
> > >
> > > Also add a new ethtool statistic "rx_critical_low_bufs" to track the
> > > number of times the starvation recovery timer is triggered.
> >
> > I think this deserves to be pulled out of the timer logic?
> >
> > Two questions tho:
> > - couldn't you detect this case within napi poll loop?
> > - if not, does it have to be per-q timer? wouldn't one global per pf timer
> >   satisfy your needs?
> >
> > >
> > > Cc: stable@vger.kernel.org
> > > Fixes: 9b8dd5e5ea48 ("gve: DQO: Add RX path")
> > > Reviewed-by: Jordan Rhee <jordanrhee@google.com>
> > > Signed-off-by: Eddie Phillips <eddiephillips@google.com>
> > > Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
> > > ---
> > >  drivers/net/ethernet/google/gve/gve.h         |  4 ++++
> > >  drivers/net/ethernet/google/gve/gve_ethtool.c | 14 +++++++++++++-
> > >  drivers/net/ethernet/google/gve/gve_rx_dqo.c  | 32 ++++++++++++++++++++++++++++++++
> > >  3 files changed, 49 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > > index 2f7bd330..8378bef2 100644
> > > --- a/drivers/net/ethernet/google/gve/gve.h
> > > +++ b/drivers/net/ethernet/google/gve/gve.h
> > > @@ -13,6 +13,7 @@
> > >  #include <linux/netdevice.h>
> > >  #include <linux/net_tstamp.h>
> > >  #include <linux/pci.h>
> > > +#include <linux/timer.h>
> > >  #include <linux/ptp_clock_kernel.h>
> > >  #include <linux/u64_stats_sync.h>
> > >  #include <net/page_pool/helpers.h>
> > > @@ -41,6 +42,7 @@
> > >
> > >  /* Interval to schedule a stats report update, 20000ms. */
> > >  #define GVE_STATS_REPORT_TIMER_PERIOD        20000
> > > +#define GVE_RX_NAPI_RESCHED_MS 20 /* msecs */
> > >
> > >  /* Numbers of NIC tx/rx stats in stats report. */
> > >  #define NIC_TX_STATS_REPORT_NUM      0
> > > @@ -318,6 +320,7 @@ struct gve_rx_ring {
> > >       u64 rx_copied_pkt; /* free-running total number of copied packets */
> > >       u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
> > >       u64 rx_buf_alloc_fail; /* free-running count of buffer alloc fails */
> > > +     u64 rx_critical_low_bufs; /* count of critical low buffer events */
> > >       u64 rx_desc_err_dropped_pkt; /* free-running count of packets dropped by descriptor error */
> > >       /* free-running count of unsplit packets due to header buffer overflow or hdr_len is 0 */
> > >       u64 rx_hsplit_unsplit_pkt;
> > > @@ -334,6 +337,7 @@ struct gve_rx_ring {
> > >       struct gve_queue_resources *q_resources; /* head and tail pointer idx */
> > >       dma_addr_t q_resources_bus; /* dma address for the queue resources */
> > >       struct u64_stats_sync statss; /* sync stats for 32bit archs */
> > > +     struct timer_list starvation_timer; /* for queue starvation recovery */
> > >
> > >       struct gve_rx_ctx ctx; /* Info for packet currently being processed in this ring. */
> > >
> > > diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
> > > index a0e0472b..71b6efbf 100644
> > > --- a/drivers/net/ethernet/google/gve/gve_ethtool.c
> > > +++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
> > > @@ -46,6 +46,7 @@ static const char gve_gstrings_main_stats[][ETH_GSTRING_LEN] = {
> > >       "rx_hsplit_unsplit_pkt",
> > >       "interface_up_cnt", "interface_down_cnt", "reset_cnt",
> > >       "page_alloc_fail", "dma_mapping_error", "stats_report_trigger_cnt",
> > > +     "rx_critical_low_bufs",
> > >  };
> > >
> > >  static const char gve_gstrings_rx_stats[][ETH_GSTRING_LEN] = {
> > > @@ -58,6 +59,7 @@ static const char gve_gstrings_rx_stats[][ETH_GSTRING_LEN] = {
> > >       "rx_xdp_aborted[%u]", "rx_xdp_drop[%u]", "rx_xdp_pass[%u]",
> > >       "rx_xdp_tx[%u]", "rx_xdp_redirect[%u]",
> > >       "rx_xdp_tx_errors[%u]", "rx_xdp_redirect_errors[%u]", "rx_xdp_alloc_fails[%u]",
> > > +     "rx_critical_low_bufs[%u]",
> > >  };
> > >
> > >  static const char gve_gstrings_tx_stats[][ETH_GSTRING_LEN] = {
> > > @@ -151,12 +153,14 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >  {
> > >       u64 tmp_rx_pkts, tmp_rx_hsplit_pkt, tmp_rx_bytes, tmp_rx_hsplit_bytes,
> > >               tmp_rx_skb_alloc_fail, tmp_rx_buf_alloc_fail,
> > > +             tmp_rx_critical_low_bufs,
> > >               tmp_rx_desc_err_dropped_pkt, tmp_rx_hsplit_unsplit_pkt,
> > >               tmp_tx_pkts, tmp_tx_bytes,
> > >               tmp_xdp_tx_errors, tmp_xdp_redirect_errors;
> > >       u64 rx_buf_alloc_fail, rx_desc_err_dropped_pkt, rx_hsplit_unsplit_pkt,
> > >               rx_pkts, rx_hsplit_pkt, rx_skb_alloc_fail, rx_bytes, tx_pkts, tx_bytes,
> > > -             tx_dropped, xdp_tx_errors, xdp_redirect_errors;
> > > +             rx_critical_low_bufs, tx_dropped, xdp_tx_errors,
> > > +             xdp_redirect_errors;
> > >       int rx_base_stats_idx, max_rx_stats_idx, max_tx_stats_idx;
> > >       int stats_idx, stats_region_len, nic_stats_len;
> > >       struct stats *report_stats;
> > > @@ -197,6 +201,7 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >
> > >       for (rx_pkts = 0, rx_bytes = 0, rx_hsplit_pkt = 0,
> > >            rx_skb_alloc_fail = 0, rx_buf_alloc_fail = 0,
> > > +          rx_critical_low_bufs = 0,
> > >            rx_desc_err_dropped_pkt = 0, rx_hsplit_unsplit_pkt = 0,
> > >            xdp_tx_errors = 0, xdp_redirect_errors = 0,
> > >            ring = 0;
> > > @@ -212,6 +217,8 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >                               tmp_rx_bytes = rx->rbytes;
> > >                               tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
> > >                               tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
> > > +                             tmp_rx_critical_low_bufs =
> > > +                                     rx->rx_critical_low_bufs;
> > >                               tmp_rx_desc_err_dropped_pkt =
> > >                                       rx->rx_desc_err_dropped_pkt;
> > >                               tmp_rx_hsplit_unsplit_pkt =
> > > @@ -226,6 +233,7 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >                       rx_bytes += tmp_rx_bytes;
> > >                       rx_skb_alloc_fail += tmp_rx_skb_alloc_fail;
> > >                       rx_buf_alloc_fail += tmp_rx_buf_alloc_fail;
> > > +                     rx_critical_low_bufs += tmp_rx_critical_low_bufs;
> > >                       rx_desc_err_dropped_pkt += tmp_rx_desc_err_dropped_pkt;
> > >                       rx_hsplit_unsplit_pkt += tmp_rx_hsplit_unsplit_pkt;
> > >                       xdp_tx_errors += tmp_xdp_tx_errors;
> > > @@ -269,6 +277,7 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >       data[i++] = priv->page_alloc_fail;
> > >       data[i++] = priv->dma_mapping_error;
> > >       data[i++] = priv->stats_report_trigger_cnt;
> > > +     data[i++] = rx_critical_low_bufs;
> > >       i = GVE_MAIN_STATS_LEN;
> > >
> > >       rx_base_stats_idx = 0;
> > > @@ -337,6 +346,8 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >                               tmp_rx_hsplit_bytes = rx->rx_hsplit_bytes;
> > >                               tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
> > >                               tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
> > > +                             tmp_rx_critical_low_bufs =
> > > +                                     rx->rx_critical_low_bufs;
> > >                               tmp_rx_desc_err_dropped_pkt =
> > >                                       rx->rx_desc_err_dropped_pkt;
> > >                               tmp_xdp_tx_errors = rx->xdp_tx_errors;
> > > @@ -381,6 +392,7 @@ gve_get_ethtool_stats(struct net_device *netdev,
> > >                       } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
> > >                                                      start));
> > >                       i += GVE_XDP_ACTIONS + 3; /* XDP rx counters */
> > > +                     data[i++] = tmp_rx_critical_low_bufs;
> > >               }
> > >       } else {
> > >               i += priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS;
> > > diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
> > > index 02cba280..303db4fa 100644
> > > --- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c
> > > +++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
> > > @@ -18,6 +18,16 @@
> > >  #include <net/tcp.h>
> > >  #include <net/xdp_sock_drv.h>
> > >
> > > +static void gve_rx_starvation_timer(struct timer_list *t)
> > > +{
> > > +     struct gve_rx_ring *rx = timer_container_of(rx, t, starvation_timer);
> > > +     struct gve_priv *priv = rx->gve;
> > > +     struct gve_notify_block *block;
> > > +
> > > +     block = &priv->ntfy_blocks[rx->ntfy_id];
> > > +     napi_schedule(&block->napi);
> > > +}
> > > +
> > >  static void gve_rx_free_hdr_bufs(struct gve_priv *priv, struct gve_rx_ring *rx)
> > >  {
> > >       struct device *hdev = &priv->pdev->dev;
> > > @@ -120,6 +130,7 @@ void gve_rx_stop_ring_dqo(struct gve_priv *priv, int idx)
> > >
> > >       if (rx->dqo.page_pool)
> > >               page_pool_disable_direct_recycling(rx->dqo.page_pool);
> > > +     timer_delete_sync(&rx->starvation_timer);
> > >       gve_remove_napi(priv, ntfy_idx);
> > >       gve_rx_remove_from_block(priv, idx);
> > >       gve_rx_reset_ring_dqo(priv, idx);
> > > @@ -136,6 +147,8 @@ void gve_rx_free_ring_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
> > >       u32 qpl_id;
> > >       int i;
> > >
> > > +     timer_shutdown_sync(&rx->starvation_timer);
> > > +
> > >       completion_queue_slots = rx->dqo.complq.mask + 1;
> > >       buffer_queue_slots = rx->dqo.bufq.mask + 1;
> > >
> > > @@ -232,6 +245,7 @@ int gve_rx_alloc_ring_dqo(struct gve_priv *priv,
> > >       rx->gve = priv;
> > >       rx->q_num = idx;
> > >       rx->packet_buffer_size = cfg->packet_buffer_size;
> > > +     timer_setup(&rx->starvation_timer, gve_rx_starvation_timer, 0);
> > >
> > >       if (cfg->xdp) {
> > >               rx->packet_buffer_truesize = GVE_XDP_RX_BUFFER_SIZE_DQO;
> > > @@ -365,6 +379,7 @@ void gve_rx_post_buffers_dqo(struct gve_rx_ring *rx)
> > >       struct gve_rx_compl_queue_dqo *complq = &rx->dqo.complq;
> > >       struct gve_rx_buf_queue_dqo *bufq = &rx->dqo.bufq;
> > >       struct gve_priv *priv = rx->gve;
> > > +     u32 num_bufs_avail_to_hw;
> > >       u32 num_avail_slots;
> > >       u32 num_full_slots;
> > >       u32 num_posted = 0;
> > > @@ -400,6 +415,23 @@ void gve_rx_post_buffers_dqo(struct gve_rx_ring *rx)
> > >       }
> > >
> > >       rx->fill_cnt += num_posted;
> > > +
> > > +     /* If the queue has fewer than GVE_RX_BUF_THRESH_DQO descriptors
> > > +      * visible to the hardware, and no doorbell was written, the hardware
> > > +      * is in danger of starving and cannot trigger interrupts. Start the
> > > +      * timer to periodically reschedule NAPI and recover from starvation.
> > > +      */
> > > +     num_bufs_avail_to_hw =
> > > +             ((bufq->tail & ~(GVE_RX_BUF_THRESH_DQO - 1)) -
> > > +              bufq->head) & bufq->mask;
> > > +
> > > +     if (num_bufs_avail_to_hw < GVE_RX_BUF_THRESH_DQO) {
> > > +             u64_stats_update_begin(&rx->statss);
> > > +             rx->rx_critical_low_bufs++;
> > > +             u64_stats_update_end(&rx->statss);
> > > +             mod_timer(&rx->starvation_timer,
> > > +                       jiffies + msecs_to_jiffies(GVE_RX_NAPI_RESCHED_MS));
> > > +     }
> > >  }
> > >
> > >  static void gve_rx_skb_csum(struct sk_buff *skb,
> > > --
> > > 2.55.0.rc2.803.g1fd1e6609c-goog
> > >
> > >

     prev parent reply	other threads:[~2026-07-03 16:06 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-07-01  0:53 [PATCH net] gve: fix Rx queue stall on alloc failure Harshitha Ramamurthy
2026-07-01 13:21 ` Maciej Fijalkowski
2026-07-03  8:03   ` Eddie Phillips
2026-07-03 16:05     ` Maciej Fijalkowski [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=akfd5abdGbxFl5o9@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bcf@google.com \
    --cc=bpf@vger.kernel.org \
    --cc=csully@google.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=eddiephillips@google.com \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=hramamurthy@google.com \
    --cc=jacob.e.keller@intel.com \
    --cc=john.fastabend@gmail.com \
    --cc=jordanrhee@google.com \
    --cc=joshwash@google.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maolson@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=nktgrg@google.com \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=stable@vger.kernel.org \
    --cc=thostet@google.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox