* [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff
@ 2023-07-11 1:36 Krister Johansen
2023-07-11 7:26 ` Leon Romanovsky
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Krister Johansen @ 2023-07-11 1:36 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, Shay Agroskin, Arthur Kiyanovski, David Arinzon,
Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
The ENA adapters on our instances occasionally reset. Once recently
logged a UBSAN failure to console in the process:
UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117
Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
Workqueue: ena ena_fw_reset_device [ena]
Call Trace:
<TASK>
dump_stack_lvl+0x4a/0x63
dump_stack+0x10/0x16
ubsan_epilogue+0x9/0x36
__ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
? __const_udelay+0x43/0x50
ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
wait_for_reset_state+0x54/0xa0 [ena]
ena_com_dev_reset+0xc8/0x110 [ena]
ena_down+0x3fe/0x480 [ena]
ena_destroy_device+0xeb/0xf0 [ena]
ena_fw_reset_device+0x30/0x50 [ena]
process_one_work+0x22b/0x3d0
worker_thread+0x4d/0x3f0
? process_one_work+0x3d0/0x3d0
kthread+0x12a/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x22/0x30
</TASK>
Apparently, the reset delays are getting so large they can trigger a
UBSAN panic.
Looking at the code, the current timeout is capped at 5000us. Using a
base value of 100us, the current code will overflow after (1<<29). Even
at values before 32, this function wraps around, perhaps
unintentionally.
Cap the value of the exponent used for this backoff at (1<<16) which is
larger than currently necessary, but large enough to support bigger
values in the future.
Cc: stable@vger.kernel.org
Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
index 451c3a1b6255..633b321d7fdd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -35,6 +35,8 @@
#define ENA_REGS_ADMIN_INTR_MASK 1
+#define ENA_MAX_BACKOFF_DELAY_EXP 16U
+
#define ENA_MIN_ADMIN_POLL_US 100
#define ENA_MAX_ADMIN_POLL_US 5000
@@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue,
static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us)
{
+ exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US);
usleep_range(delay_us, 2 * delay_us);
--
2.25.1
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen @ 2023-07-11 7:26 ` Leon Romanovsky 2023-07-13 15:34 ` David Laight 2023-07-11 17:47 ` Shay Agroskin 2023-07-12 23:00 ` patchwork-bot+netdevbpf 2 siblings, 1 reply; 8+ messages in thread From: Leon Romanovsky @ 2023-07-11 7:26 UTC (permalink / raw) To: Krister Johansen Cc: netdev, linux-kernel, Shay Agroskin, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni On Mon, Jul 10, 2023 at 06:36:21PM -0700, Krister Johansen wrote: > The ENA adapters on our instances occasionally reset. Once recently > logged a UBSAN failure to console in the process: > > UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13 > shift exponent 32 is too large for 32-bit type 'unsigned int' > CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117 > Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017 > Workqueue: ena ena_fw_reset_device [ena] > Call Trace: > <TASK> > dump_stack_lvl+0x4a/0x63 > dump_stack+0x10/0x16 > ubsan_epilogue+0x9/0x36 > __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e > ? __const_udelay+0x43/0x50 > ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena] > wait_for_reset_state+0x54/0xa0 [ena] > ena_com_dev_reset+0xc8/0x110 [ena] > ena_down+0x3fe/0x480 [ena] > ena_destroy_device+0xeb/0xf0 [ena] > ena_fw_reset_device+0x30/0x50 [ena] > process_one_work+0x22b/0x3d0 > worker_thread+0x4d/0x3f0 > ? process_one_work+0x3d0/0x3d0 > kthread+0x12a/0x150 > ? set_kthread_struct+0x50/0x50 > ret_from_fork+0x22/0x30 > </TASK> > > Apparently, the reset delays are getting so large they can trigger a > UBSAN panic. > > Looking at the code, the current timeout is capped at 5000us. Using a > base value of 100us, the current code will overflow after (1<<29). Even > at values before 32, this function wraps around, perhaps > unintentionally. > > Cap the value of the exponent used for this backoff at (1<<16) which is > larger than currently necessary, but large enough to support bigger > values in the future. > > Cc: stable@vger.kernel.org > Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time") > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> > --- > drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c > index 451c3a1b6255..633b321d7fdd 100644 > --- a/drivers/net/ethernet/amazon/ena/ena_com.c > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c > @@ -35,6 +35,8 @@ > > #define ENA_REGS_ADMIN_INTR_MASK 1 > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U > + > #define ENA_MIN_ADMIN_POLL_US 100 > > #define ENA_MAX_ADMIN_POLL_US 5000 > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue, > > static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us) > { > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); Thanks, Reviewed-by: Leon Romanovsky <leonro@nvidia.com> ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 7:26 ` Leon Romanovsky @ 2023-07-13 15:34 ` David Laight 0 siblings, 0 replies; 8+ messages in thread From: David Laight @ 2023-07-13 15:34 UTC (permalink / raw) To: 'Leon Romanovsky', Krister Johansen Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Shay Agroskin, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni From: Leon Romanovsky > Sent: 11 July 2023 08:26 ... > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U > > + > > #define ENA_MIN_ADMIN_POLL_US 100 > > > > #define ENA_MAX_ADMIN_POLL_US 5000 > > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue, > > > > static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us) > > { > > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); This shouldn't need to be a min_t() David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen 2023-07-11 7:26 ` Leon Romanovsky @ 2023-07-11 17:47 ` Shay Agroskin 2023-07-11 22:52 ` Krister Johansen 2023-07-12 23:00 ` patchwork-bot+netdevbpf 2 siblings, 1 reply; 8+ messages in thread From: Shay Agroskin @ 2023-07-11 17:47 UTC (permalink / raw) To: Krister Johansen Cc: netdev, linux-kernel, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni Krister Johansen <kjlx@templeofstupid.com> writes: > The ENA adapters on our instances occasionally reset. Once > recently > logged a UBSAN failure to console in the process: > > UBSAN: shift-out-of-bounds in > build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13 > shift exponent 32 is too large for 32-bit type 'unsigned int' > CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not > tainted 5.15.117 > Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017 > Workqueue: ena ena_fw_reset_device [ena] > Call Trace: > <TASK> > dump_stack_lvl+0x4a/0x63 > dump_stack+0x10/0x16 > ubsan_epilogue+0x9/0x36 > __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e > ? __const_udelay+0x43/0x50 > ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena] > wait_for_reset_state+0x54/0xa0 [ena] > ena_com_dev_reset+0xc8/0x110 [ena] > ena_down+0x3fe/0x480 [ena] > ena_destroy_device+0xeb/0xf0 [ena] > ena_fw_reset_device+0x30/0x50 [ena] > process_one_work+0x22b/0x3d0 > worker_thread+0x4d/0x3f0 > ? process_one_work+0x3d0/0x3d0 > kthread+0x12a/0x150 > ? set_kthread_struct+0x50/0x50 > ret_from_fork+0x22/0x30 > </TASK> > > Apparently, the reset delays are getting so large they can > trigger a > UBSAN panic. > > Looking at the code, the current timeout is capped at 5000us. > Using a > base value of 100us, the current code will overflow after > (1<<29). Even > at values before 32, this function wraps around, perhaps > unintentionally. > > Cap the value of the exponent used for this backoff at (1<<16) > which is > larger than currently necessary, but large enough to support > bigger > values in the future. > > Cc: stable@vger.kernel.org > Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time") > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> > --- > drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c > b/drivers/net/ethernet/amazon/ena/ena_com.c > index 451c3a1b6255..633b321d7fdd 100644 > --- a/drivers/net/ethernet/amazon/ena/ena_com.c > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c > @@ -35,6 +35,8 @@ > > #define ENA_REGS_ADMIN_INTR_MASK 1 > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U > + > #define ENA_MIN_ADMIN_POLL_US 100 > > #define ENA_MAX_ADMIN_POLL_US 5000 > @@ -536,6 +538,7 @@ static int > ena_com_comp_status_to_errno(struct ena_com_admin_queue > *admin_queue, > > static void ena_delay_exponential_backoff_us(u32 exp, u32 > delay_us) > { > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us); > delay_us = min_t(u32, delay_us * (1U << exp), > ENA_MAX_ADMIN_POLL_US); > usleep_range(delay_us, 2 * delay_us); Hi, thanks for submitting this patch (: Going over the logic here, the driver sleeps for `delay_us` micro-seconds in each iteration that this function gets called. For an exp = 14 it'd sleep (I added units notation) delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / (1000000) s = 1.6 s For an exp = 15 it'd sleep (10 * (2 ^ 15)) / (1000000) = 3.2s To even get close to an overflow value, say exp=29 the driver would sleep in a single iteration 53687 s = 14.9 hours. The driver should stop trying to get a response from the device after a timeout period received from the device which is 3 seconds by default. The point being, it seems very unlikely to hit this overflow. Did you experience it or was the issue discovered by a static analyzer ? Regarding the patch itself, I don't mind adding it since exp=16 limit should be more than enough to wait for the device's response. Reviewed-by: Shay Agroskin <shayagr@amazon.com> Thanks, Shay ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 17:47 ` Shay Agroskin @ 2023-07-11 22:52 ` Krister Johansen 2023-07-13 7:46 ` Shay Agroskin 0 siblings, 1 reply; 8+ messages in thread From: Krister Johansen @ 2023-07-11 22:52 UTC (permalink / raw) To: Shay Agroskin Cc: netdev, linux-kernel, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote: > > Krister Johansen <kjlx@templeofstupid.com> writes: > > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c > > b/drivers/net/ethernet/amazon/ena/ena_com.c > > index 451c3a1b6255..633b321d7fdd 100644 > > --- a/drivers/net/ethernet/amazon/ena/ena_com.c > > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c > > @@ -35,6 +35,8 @@ > > #define ENA_REGS_ADMIN_INTR_MASK 1 > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U > > + > > #define ENA_MIN_ADMIN_POLL_US 100 > > #define ENA_MAX_ADMIN_POLL_US 5000 > > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct > > ena_com_admin_queue *admin_queue, > > static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us) > > { > > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); > > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us); > > delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US); > > usleep_range(delay_us, 2 * delay_us); > > Hi, thanks for submitting this patch (: Absolutely; thanks for the review! > Going over the logic here, the driver sleeps for `delay_us` micro-seconds in > each iteration that this function gets called. > > For an exp = 14 it'd sleep (I added units notation) > delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / (1000000) s > = 1.6 s > > For an exp = 15 it'd sleep > (10 * (2 ^ 15)) / (1000000) = 3.2s > > To even get close to an overflow value, say exp=29 the driver would sleep in > a single iteration > 53687 s = 14.9 hours. > > The driver should stop trying to get a response from the device after a > timeout period received from the device which is 3 seconds by default. > > The point being, it seems very unlikely to hit this overflow. Did you > experience it or was the issue discovered by a static analyzer ? No, no use of fuzzing or static analysis. This was hit on a production instance that was having ENA trouble. I'm apparently reading the code differently. I thought this line: > > delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US); Was going to cap that delay_us at (delay_us * (1U << exp)) or 5000us, whichever is smaller. By that measure, if delay_us is 100 and ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped after exp = 6, correct? By my estimate, that puts it at between 160ms and 320ms of sleeping before one could hit this problem. I went and pulled the logs out of the archive and have the following timeline. This is seconds from boot as reported by dmesg: 11244.226583 - ena warns TX not completed on time, 10112000 usecs since last napi execution, missing tx timeout val of 5000 msec 11245.190453 - netdev watchdog fires 11245.190781 - ena records Transmit timeout 11245.250739 - ena records Trigger reset on 11246.812620 - UBSAN message to console 11248.590441 - ena reports Reset inidication didn't turn off 11250.633545 - ena reports failure to reset device 12013.529338 - last logline before new boot While the difference between the panic and the trigger reset is more than 320ms, it is definitely on the order of seconds instead of hours. > Regarding the patch itself, I don't mind adding it since exp=16 limit should > be more than enough to wait for the device's response. > Reviewed-by: Shay Agroskin <shayagr@amazon.com> Thanks, -K ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 22:52 ` Krister Johansen @ 2023-07-13 7:46 ` Shay Agroskin 2023-07-14 0:05 ` Krister Johansen 0 siblings, 1 reply; 8+ messages in thread From: Shay Agroskin @ 2023-07-13 7:46 UTC (permalink / raw) To: Krister Johansen Cc: netdev, linux-kernel, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni Krister Johansen <kjlx@templeofstupid.com> writes: > > On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote: >> >> Krister Johansen <kjlx@templeofstupid.com> writes: >> >> > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c >> > b/drivers/net/ethernet/amazon/ena/ena_com.c >> > index 451c3a1b6255..633b321d7fdd 100644 >> > --- a/drivers/net/ethernet/amazon/ena/ena_com.c >> > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c >> > @@ -35,6 +35,8 @@ >> > #define ENA_REGS_ADMIN_INTR_MASK 1 >> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U >> > + >> > #define ENA_MIN_ADMIN_POLL_US 100 >> > #define ENA_MAX_ADMIN_POLL_US 5000 >> > @@ -536,6 +538,7 @@ static int >> > ena_com_comp_status_to_errno(struct >> > ena_com_admin_queue *admin_queue, >> > static void ena_delay_exponential_backoff_us(u32 exp, u32 >> > delay_us) >> > { >> > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); >> > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us); >> > delay_us = min_t(u32, delay_us * (1U << exp), >> > ENA_MAX_ADMIN_POLL_US); >> > usleep_range(delay_us, 2 * delay_us); >> >> Hi, thanks for submitting this patch (: > > Absolutely; thanks for the review! > >> Going over the logic here, the driver sleeps for `delay_us` >> micro-seconds in >> each iteration that this function gets called. >> >> For an exp = 14 it'd sleep (I added units notation) >> delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / >> (1000000) s >> = 1.6 s >> >> For an exp = 15 it'd sleep >> (10 * (2 ^ 15)) / (1000000) = 3.2s >> >> To even get close to an overflow value, say exp=29 the driver >> would sleep in >> a single iteration >> 53687 s = 14.9 hours. >> >> The driver should stop trying to get a response from the device >> after a >> timeout period received from the device which is 3 seconds by >> default. >> >> The point being, it seems very unlikely to hit this >> overflow. Did you >> experience it or was the issue discovered by a static analyzer >> ? > > No, no use of fuzzing or static analysis. This was hit on a > production > instance that was having ENA trouble. > > I'm apparently reading the code differently. I thought this > line: > >> > delay_us = min_t(u32, delay_us * (1U << exp), >> > ENA_MAX_ADMIN_POLL_US); > > Was going to cap that delay_us at (delay_us * (1U << exp)) or > 5000us, whichever is smaller. By that measure, if delay_us is > 100 and > ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped > after > exp = 6, correct? By my estimate, that puts it at between 160ms > and > 320ms of sleeping before one could hit this problem. > > I went and pulled the logs out of the archive and have the > following > timeline. This is seconds from boot as reported by dmesg: > > 11244.226583 - ena warns TX not completed on time, 10112000 > usecs since > last napi execution, missing tx timeout val of 5000 msec > > 11245.190453 - netdev watchdog fires > > 11245.190781 - ena records Transmit timeout > 11245.250739 - ena records Trigger reset on > > 11246.812620 - UBSAN message to console > > 11248.590441 - ena reports Reset inidication didn't turn off > 11250.633545 - ena reports failure to reset device > 12013.529338 - last logline before new boot > > While the difference between the panic and the trigger reset is > more > than 320ms, it is definitely on the order of seconds instead of > hours. > Yup you're right. I was so entangled in my exponent calculations that I completely missed the min_t expression there. That's quite an awkward design to be honest, I hope to submit a re-write for it in one of the releases. Thanks again for the work you've put into it ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-13 7:46 ` Shay Agroskin @ 2023-07-14 0:05 ` Krister Johansen 0 siblings, 0 replies; 8+ messages in thread From: Krister Johansen @ 2023-07-14 0:05 UTC (permalink / raw) To: Shay Agroskin Cc: Krister Johansen, netdev, linux-kernel, Arthur Kiyanovski, David Arinzon, Noam Dagan, Saeed Bishara, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Leon Romanovsky On Thu, Jul 13, 2023 at 10:46:55AM +0300, Shay Agroskin wrote: > > Krister Johansen <kjlx@templeofstupid.com> writes: > > > > > On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote: > > > > > > Krister Johansen <kjlx@templeofstupid.com> writes: > > > > > > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c > > > > b/drivers/net/ethernet/amazon/ena/ena_com.c > > > > index 451c3a1b6255..633b321d7fdd 100644 > > > > --- a/drivers/net/ethernet/amazon/ena/ena_com.c > > > > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c > > > > @@ -35,6 +35,8 @@ > > > > #define ENA_REGS_ADMIN_INTR_MASK 1 > > > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U > > > > + > > > > #define ENA_MIN_ADMIN_POLL_US 100 > > > > #define ENA_MAX_ADMIN_POLL_US 5000 > > > > @@ -536,6 +538,7 @@ static int > > > > ena_com_comp_status_to_errno(struct > > > > ena_com_admin_queue *admin_queue, > > > > static void ena_delay_exponential_backoff_us(u32 exp, u32 > > > > delay_us) > > > > { > > > > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); > > > > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us); > > > > delay_us = min_t(u32, delay_us * (1U << exp), > > > > ENA_MAX_ADMIN_POLL_US); > > > > usleep_range(delay_us, 2 * delay_us); > > > > > > Hi, thanks for submitting this patch (: > > > > Absolutely; thanks for the review! > > > > > Going over the logic here, the driver sleeps for `delay_us` > > > micro-seconds in > > > each iteration that this function gets called. > > > > > > For an exp = 14 it'd sleep (I added units notation) > > > delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / > > > (1000000) s > > > = 1.6 s > > > > > > For an exp = 15 it'd sleep > > > (10 * (2 ^ 15)) / (1000000) = 3.2s > > > > > > To even get close to an overflow value, say exp=29 the driver would > > > sleep in > > > a single iteration > > > 53687 s = 14.9 hours. > > > > > > The driver should stop trying to get a response from the device > > > after a > > > timeout period received from the device which is 3 seconds by > > > default. > > > > > > The point being, it seems very unlikely to hit this overflow. Did > > > you > > > experience it or was the issue discovered by a static analyzer ? > > > > No, no use of fuzzing or static analysis. This was hit on a production > > instance that was having ENA trouble. > > > > I'm apparently reading the code differently. I thought this line: > > > > > > delay_us = min_t(u32, delay_us * (1U << exp), > > > > ENA_MAX_ADMIN_POLL_US); > > > > Was going to cap that delay_us at (delay_us * (1U << exp)) or > > 5000us, whichever is smaller. By that measure, if delay_us is 100 and > > ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped after > > exp = 6, correct? By my estimate, that puts it at between 160ms and > > 320ms of sleeping before one could hit this problem. > > > > I went and pulled the logs out of the archive and have the following > > timeline. This is seconds from boot as reported by dmesg: > > > > 11244.226583 - ena warns TX not completed on time, 10112000 usecs > > since > > last napi execution, missing tx timeout val of 5000 msec > > > > 11245.190453 - netdev watchdog fires > > > > 11245.190781 - ena records Transmit timeout > > 11245.250739 - ena records Trigger reset on > > > > 11246.812620 - UBSAN message to console > > > > 11248.590441 - ena reports Reset inidication didn't turn off > > 11250.633545 - ena reports failure to reset device > > 12013.529338 - last logline before new boot > > > > While the difference between the panic and the trigger reset is more > > than 320ms, it is definitely on the order of seconds instead of hours. > > > > Yup you're right. I was so entangled in my exponent calculations that I > completely missed the min_t expression there. > > That's quite an awkward design to be honest, I hope to submit a re-write for > it in one of the releases. > > Thanks again for the work you've put into it Totally welcome. I appreciate the speedy reviews from you and Leon. Thanks again, -K ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff 2023-07-11 1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen 2023-07-11 7:26 ` Leon Romanovsky 2023-07-11 17:47 ` Shay Agroskin @ 2023-07-12 23:00 ` patchwork-bot+netdevbpf 2 siblings, 0 replies; 8+ messages in thread From: patchwork-bot+netdevbpf @ 2023-07-12 23:00 UTC (permalink / raw) To: Krister Johansen Cc: netdev, linux-kernel, shayagr, akiyano, darinzon, ndagan, saeedb, davem, edumazet, kuba, pabeni Hello: This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Mon, 10 Jul 2023 18:36:21 -0700 you wrote: > The ENA adapters on our instances occasionally reset. Once recently > logged a UBSAN failure to console in the process: > > UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13 > shift exponent 32 is too large for 32-bit type 'unsigned int' > CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117 > Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017 > Workqueue: ena ena_fw_reset_device [ena] > Call Trace: > <TASK> > dump_stack_lvl+0x4a/0x63 > dump_stack+0x10/0x16 > ubsan_epilogue+0x9/0x36 > __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e > ? __const_udelay+0x43/0x50 > ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena] > wait_for_reset_state+0x54/0xa0 [ena] > ena_com_dev_reset+0xc8/0x110 [ena] > ena_down+0x3fe/0x480 [ena] > ena_destroy_device+0xeb/0xf0 [ena] > ena_fw_reset_device+0x30/0x50 [ena] > process_one_work+0x22b/0x3d0 > worker_thread+0x4d/0x3f0 > ? process_one_work+0x3d0/0x3d0 > kthread+0x12a/0x150 > ? set_kthread_struct+0x50/0x50 > ret_from_fork+0x22/0x30 > </TASK> > > [...] Here is the summary with links: - [net] net: ena: fix shift-out-of-bounds in exponential backoff https://git.kernel.org/netdev/net/c/1e9cb763e9ba You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-07-14 0:05 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-07-11 1:36 [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Krister Johansen 2023-07-11 7:26 ` Leon Romanovsky 2023-07-13 15:34 ` David Laight 2023-07-11 17:47 ` Shay Agroskin 2023-07-11 22:52 ` Krister Johansen 2023-07-13 7:46 ` Shay Agroskin 2023-07-14 0:05 ` Krister Johansen 2023-07-12 23:00 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).