From: Krister Johansen <kjlx@templeofstupid.com>
To: Shay Agroskin <shayagr@amazon.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Arthur Kiyanovski <akiyano@amazon.com>,
	David Arinzon <darinzon@amazon.com>,
	Noam Dagan <ndagan@amazon.com>, Saeed Bishara <saeedb@amazon.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff
Date: Tue, 11 Jul 2023 15:52:10 -0700
Message-ID: <20230711225210.GA2088@templeofstupid.com>
In-Reply-To: <pj41zllefmpbw7.fsf@u95c7fd9b18a35b.ant.amazon.com>

On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote:
> 
> Krister Johansen <kjlx@templeofstupid.com> writes:
> 
> > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
> > b/drivers/net/ethernet/amazon/ena/ena_com.c
> > index 451c3a1b6255..633b321d7fdd 100644
> > --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> > @@ -35,6 +35,8 @@
> >  #define ENA_REGS_ADMIN_INTR_MASK 1
> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> > +
> >  #define ENA_MIN_ADMIN_POLL_US 100
> >  #define ENA_MAX_ADMIN_POLL_US 5000
> > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct
> > ena_com_admin_queue *admin_queue,
> >    static void ena_delay_exponential_backoff_us(u32 exp, u32  delay_us)
> >  {
> > +	exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
> >  	delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
> >  	delay_us = min_t(u32, delay_us * (1U << exp),  ENA_MAX_ADMIN_POLL_US);
> >  	usleep_range(delay_us, 2 * delay_us);
> 
> Hi, thanks for submitting this patch (:

Absolutely; thanks for the review!

> Going over the logic here, the driver sleeps for `delay_us` micro-seconds in
> each iteration that this function gets called.
> 
> For an exp = 14 it'd sleep (I added units notation)
> delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / (1000000) s
> = 1.6 s
> 
> For an exp = 15 it'd sleep
> (10 * (2 ^ 15)) / (1000000) = 3.2s
> 
> To even get close to an overflow value, say exp=29 the driver would sleep in
> a single iteration
> 53687 s = 14.9 hours.
> 
> The driver should stop trying to get a response from the device after a
> timeout period received from the device which is 3 seconds by default.
> 
> The point being, it seems very unlikely to hit this overflow. Did you
> experience it or was the issue discovered by a static analyzer ?

No, no use of fuzzing or static analysis.  This was hit on a production
instance that was having ENA trouble.

I'm apparently reading the code differently.  I thought this line:

> >  	delay_us = min_t(u32, delay_us * (1U << exp),  ENA_MAX_ADMIN_POLL_US);

was going to cap delay_us at (delay_us * (1U << exp)) or 5000us,
whichever is smaller.  By that measure, if delay_us is 100 and
ENA_MAX_ADMIN_POLL_US is 5000, the delay should start getting capped at
exp = 6, correct?  By my estimate, that puts it at between 160ms and
320ms of sleeping before one could hit this problem.
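
As a rough check of that reading, here is a userspace sketch of the
unpatched arithmetic.  It is a model, not the driver code: the function
names are made up for illustration, delay_us = 100 is assumed, and only
the lower bound passed to usleep_range() is summed (the upper bound is
twice that).

```c
#include <stdint.h>

#define ENA_MIN_ADMIN_POLL_US 100U
#define ENA_MAX_ADMIN_POLL_US 5000U

/*
 * Hypothetical model of the unpatched delay computation.  Valid only
 * for exp < 32; shifting a 32-bit value by 32 or more is undefined,
 * which is what UBSAN reported.
 */
static uint32_t model_backoff_us(uint32_t exp, uint32_t delay_us)
{
	uint32_t scaled;

	if (delay_us < ENA_MIN_ADMIN_POLL_US)
		delay_us = ENA_MIN_ADMIN_POLL_US;

	/*
	 * 32-bit multiply, as in the driver's u32 math.  With
	 * delay_us = 100 the product wraps to 0 at exp = 30 and 31.
	 */
	scaled = delay_us * (1U << exp);

	return scaled < ENA_MAX_ADMIN_POLL_US ? scaled : ENA_MAX_ADMIN_POLL_US;
}

/* Minimum total sleep across exp = 0..31, i.e. before exp reaches 32. */
static uint64_t total_sleep_before_shift_overflow_us(uint32_t delay_us)
{
	uint64_t total = 0;
	uint32_t exp;

	for (exp = 0; exp < 32; exp++)
		total += model_backoff_us(exp, delay_us);

	return total;
}
```

In this model the cap first applies at exp = 6 (6400us clamped to
5000us), and total_sleep_before_shift_overflow_us(100) comes to 126300us,
or about 253ms if every sleep runs to the usleep_range() upper bound.
That is the same order of magnitude as the estimate above.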

I went and pulled the logs out of the archive and have the following
timeline.  This is seconds from boot as reported by dmesg:

   11244.226583 - ena warns TX not completed on time, 10112000 usecs since
    last napi execution, missing tx timeout val of 5000 msec
   
   11245.190453 - netdev watchdog fires
   
   11245.190781 - ena records Transmit timeout
   11245.250739 - ena records Trigger reset on
   
   11246.812620 - UBSAN message to console
   
   11248.590441 - ena reports Reset indication didn't turn off
   11250.633545 - ena reports failure to reset device
   12013.529338 - last logline before new boot

While the difference between the panic and the trigger reset is more
than 320ms, it is definitely on the order of seconds instead of hours.
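
For reference, the gap between the "Trigger reset" line and the UBSAN
message can be worked out from the two timestamps above.  This is just
the subtraction written out in integer microseconds; the helper names
are hypothetical.

```c
#include <stdint.h>

/* Convert a dmesg timestamp, split as seconds and microseconds, to us. */
static uint64_t dmesg_to_us(uint64_t sec, uint32_t usec)
{
	return sec * 1000000ULL + usec;
}

/* Time from "Trigger reset" (11245.250739) to UBSAN (11246.812620). */
static uint64_t reset_to_ubsan_gap_us(void)
{
	uint64_t trigger_reset = dmesg_to_us(11245, 250739);
	uint64_t ubsan = dmesg_to_us(11246, 812620);

	return ubsan - trigger_reset;
}
```

That works out to 1561881us, roughly 1.56 seconds.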

> Regarding the patch itself, I don't mind adding it since exp=16 limit should
> be more than enough to wait for the device's response.
> Reviewed-by: Shay Agroskin <shayagr@amazon.com>

Thanks,

-K

Thread overview: 8+ messages
2023-07-11  1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen
2023-07-11  7:26 ` Leon Romanovsky
2023-07-13 15:34   ` David Laight
2023-07-11 17:47 ` Shay Agroskin
2023-07-11 22:52   ` Krister Johansen [this message]
2023-07-13  7:46     ` Shay Agroskin
2023-07-14  0:05       ` Krister Johansen
2023-07-12 23:00 ` patchwork-bot+netdevbpf
