All of lore.kernel.org
 help / color / mirror / Atom feed
From: Krister Johansen <kjlx@templeofstupid.com>
To: Shay Agroskin <shayagr@amazon.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Arthur Kiyanovski <akiyano@amazon.com>,
	David Arinzon <darinzon@amazon.com>,
	Noam Dagan <ndagan@amazon.com>, Saeed Bishara <saeedb@amazon.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff
Date: Thu, 13 Jul 2023 17:05:30 -0700	[thread overview]
Message-ID: <20230714000530.GA1982@templeofstupid.com> (raw)
In-Reply-To: <pj41zlilao9rqt.fsf@u95c7fd9b18a35b.ant.amazon.com>

On Thu, Jul 13, 2023 at 10:46:55AM +0300, Shay Agroskin wrote:
> 
> Krister Johansen <kjlx@templeofstupid.com> writes:
> 
> > 
> > On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote:
> > > 
> > > Krister Johansen <kjlx@templeofstupid.com> writes:
> > > 
> > > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > b/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > index 451c3a1b6255..633b321d7fdd 100644
> > > > --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > @@ -35,6 +35,8 @@
> > > >  #define ENA_REGS_ADMIN_INTR_MASK 1
> > > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> > > > +
> > > >  #define ENA_MIN_ADMIN_POLL_US 100
> > > >  #define ENA_MAX_ADMIN_POLL_US 5000
> > > > @@ -536,6 +538,7 @@ static int >
> > > ena_com_comp_status_to_errno(struct
> > > > ena_com_admin_queue *admin_queue,
> > > >    static void ena_delay_exponential_backoff_us(u32 exp, u32 >
> > > delay_us)
> > > >  {
> > > > +   exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
> > > >     delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
> > > >     delay_us = min_t(u32, delay_us * (1U << exp), >
> > > ENA_MAX_ADMIN_POLL_US);
> > > >     usleep_range(delay_us, 2 * delay_us);
> > > 
> > > Hi, thanks for submitting this patch (:
> > 
> > Absolutely; thanks for the review!
> > 
> > > Going over the logic here, the driver sleeps for `delay_us`
> > > micro-seconds in
> > > each iteration that this function gets called.
> > > 
> > > For an exp = 14 it'd sleep (I added units notation)
> > > delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) /
> > > (1000000) s
> > > = 1.6 s
> > > 
> > > For an exp = 15 it'd sleep
> > > (10 * (2 ^ 15)) / (1000000) = 3.2s
> > > 
> > > To even get close to an overflow value, say exp=29 the driver would
> > > sleep in
> > > a single iteration
> > > 53687 s = 14.9 hours.
> > > 
> > > The driver should stop trying to get a response from the device
> > > after a
> > > timeout period received from the device which is 3 seconds by
> > > default.
> > > 
> > > The point being, it seems very unlikely to hit this overflow. Did
> > > you
> > > experience it or was the issue discovered by a static analyzer ?
> > 
> > No, no use of fuzzing or static analysis.  This was hit on a production
> > instance that was having ENA trouble.
> > 
> > I'm apparently reading the code differently.  I thought this line:
> > 
> > > >     delay_us = min_t(u32, delay_us * (1U << exp), >
> > > ENA_MAX_ADMIN_POLL_US);
> > 
> > Was going to cap that delay_us at (delay_us * (1U << exp)) or
> > 5000us, whichever is smaller.  By that measure, if delay_us is 100 and
> > ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped after
> > exp = 6, correct?  By my estimate, that puts it at between 160ms and
> > 320ms of sleeping before one could hit this problem.
> > 
> > I went and pulled the logs out of the archive and have the following
> > timeline.  This is seconds from boot as reported by dmesg:
> > 
> >    11244.226583 - ena warns TX not completed on time, 10112000    usecs
> > since
> >     last napi execution, missing tx timeout val of 5000 msec
> > 
> >    11245.190453 - netdev watchdog fires
> > 
> >    11245.190781 - ena records Transmit timeout
> >    11245.250739 - ena records Trigger reset on
> > 
> >    11246.812620 - UBSAN message to console
> > 
> >    11248.590441 - ena reports Reset inidication didn't turn off
> >    11250.633545 - ena reports failure to reset device
> >    12013.529338 - last logline before new boot
> > 
> > While the difference between the panic and the trigger reset is more
> > than 320ms, it is definitely on the order of seconds instead of hours.
> > 
> 
> Yup you're right. I was so entangled in my exponent calculations that I
> completely missed the min_t expression there.
> 
> That's quite an awkward design to be honest, I hope to submit a re-write for
> it in one of the releases.
> 
> Thanks again for the work you've put into it

Totally welcome. I appreciate the speedy reviews from you and Leon.

Thanks again,

-K

  reply	other threads:[~2023-07-14  0:05 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-11  1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen
2023-07-11  7:26 ` Leon Romanovsky
2023-07-13 15:34   ` David Laight
2023-07-11 17:47 ` Shay Agroskin
2023-07-11 22:52   ` Krister Johansen
2023-07-13  7:46     ` Shay Agroskin
2023-07-14  0:05       ` Krister Johansen [this message]
2023-07-12 23:00 ` patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230714000530.GA1982@templeofstupid.com \
    --to=kjlx@templeofstupid.com \
    --cc=akiyano@amazon.com \
    --cc=darinzon@amazon.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ndagan@amazon.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeedb@amazon.com \
    --cc=shayagr@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.