* netxen: box stuck in netxen_napi_disable()
@ 2015-01-22 4:43 Mike Galbraith
2015-01-22 5:57 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22 4:43 UTC (permalink / raw)
To: netdev
Greetings network wizards,
After doing some generic NO_HZ_FULL isolated core perturbation
measurements with a 64 core DL980G7 running 3.19-rc5, everything seeming
just peachy, I came back later to check on the box only to find that I
could no longer ssh into the thing. NO_HZ_FULL doesn't seem to be
involved in any obvious way, but I thought I should mention it.
No idea how repeatable this is, the box has other work to do atm. File
under 'noted', or if you want me to peek at something, holler.
rtnl_mutex was holding up the show, was held by the kworker below, who
was stuck in napi_synchronize() waiting for NAPI_STATE_SCHED to go away,
but whoever was supposed to make that happen, didn't.
crash> ps | grep UN
405 2 2 ffff880273958000 UN 0.0 0 0 [kworker/2:1]
419 2 16 ffff880273bf0000 UN 0.0 0 0 [kworker/16:1]
4259 1 21 ffff88026f3cbaa0 UN 0.0 14636 1908 dhcpcd
6007 1 3 ffff8802736d1d50 UN 0.0 32292 3200 ntpd
6048 1 0 ffff880272521d50 UN 0.0 59568 3460 ypbind
13650 2 2 ffff8802749b0000 UN 0.0 0 0 [kworker/2:2]
crash> bt ffff880273958000
PID: 405 TASK: ffff880273958000 CPU: 2 COMMAND: "kworker/2:1"
#0 [ffff880273957c10] __schedule at ffffffff81588c59
#1 [ffff880273957c80] schedule at ffffffff81589119
#2 [ffff880273957c90] schedule_timeout at ffffffff8158bbe6
#3 [ffff880273957d30] msleep at ffffffff810c5aa7
#4 [ffff880273957d50] netxen_napi_disable at ffffffffa032892a [netxen_nic]
#5 [ffff880273957d80] __netxen_nic_down at ffffffffa032c6fc [netxen_nic]
#6 [ffff880273957dc0] netxen_nic_reset_context at ffffffffa032d56b [netxen_nic]
#7 [ffff880273957de0] netxen_tx_timeout_task at ffffffffa032d63d [netxen_nic]
#8 [ffff880273957e00] process_one_work at ffffffff81077b7a
#9 [ffff880273957e50] worker_thread at ffffffff81078231
#10 [ffff880273957ec0] kthread at ffffffff8107d139
#11 [ffff880273957f50] ret_from_fork at ffffffff8158cf7c
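
The wait described above can be modeled in user space. This is a minimal sketch, not kernel source: `model_napi`, `model_tick()` and `model_napi_synchronize()` are hypothetical names, and the `max_sleeps` bound exists only so the demo terminates. The kernel's `napi_synchronize()` loop has no such bound, which would be consistent with the kworker sitting in `msleep()` indefinitely while holding rtnl_mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's NAPI_STATE_SCHED bit. */
#define MODEL_STATE_SCHED (1u << 0)

struct model_napi {
    unsigned int state;        /* MODEL_STATE_SCHED set: a poll is still owed */
    int ticks_until_complete;  /* ticks before completion is modeled; < 0 means never */
};

/* One msleep(1) worth of time: the poller may or may not finish meanwhile. */
static void model_tick(struct model_napi *n)
{
    if (n->ticks_until_complete < 0)
        return;                          /* nobody will ever clear the bit */
    if (n->ticks_until_complete > 0)
        n->ticks_until_complete--;
    if (n->ticks_until_complete == 0)
        n->state &= ~MODEL_STATE_SCHED;  /* models napi_complete() */
}

/*
 * Sleep until the scheduled bit clears. Returns the number of sleeps
 * taken, or -1 if the bit is still set after max_sleeps (demo-only bound).
 */
static int model_napi_synchronize(struct model_napi *n, int max_sleeps)
{
    int sleeps = 0;

    assert(max_sleeps >= 0);
    while (n->state & MODEL_STATE_SCHED) {
        if (sleeps == max_sleeps)
            return -1;
        model_tick(n);                   /* stands in for msleep(1) */
        sleeps++;
    }
    return sleeps;
}
```

With `ticks_until_complete = 3` the wait ends after three sleeps; with a negative value it never ends, which is the situation the backtrace appears to show.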
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: netxen: box stuck in netxen_napi_disable()
2015-01-22 4:43 netxen: box stuck in netxen_napi_disable() Mike Galbraith
@ 2015-01-22 5:57 ` Eric Dumazet
2015-01-22 6:15 ` Mike Galbraith
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22 5:57 UTC (permalink / raw)
To: Mike Galbraith; +Cc: netdev
On Thu, 2015-01-22 at 05:43 +0100, Mike Galbraith wrote:
> Greetings network wizards,
>
> After doing some generic NO_HZ_FULL isolated core perturbation
> measurements with a 64 core DL980G7 running 3.19-rc5, everything seeming
> just peachy, I came back later to check on the box only to find that I
> could no longer ssh into the thing. NO_HZ_FULL doesn't seem to be
> involved in any obvious way, but I thought I should mention it.
>
> No idea how repeatable this is, the box has other work to do atm. File
> under 'noted', or if you want me to peek at something, holler.
>
> rtnl_mutex was holding up the show, was held by the kworker below, who
> was stuck in napi_synchronize() waiting for NAPI_STATE_SCHED to go away,
> but whoever was supposed to make that happen, didn't.
>
> crash> ps | grep UN
> 405 2 2 ffff880273958000 UN 0.0 0 0 [kworker/2:1]
> 419 2 16 ffff880273bf0000 UN 0.0 0 0 [kworker/16:1]
> 4259 1 21 ffff88026f3cbaa0 UN 0.0 14636 1908 dhcpcd
> 6007 1 3 ffff8802736d1d50 UN 0.0 32292 3200 ntpd
> 6048 1 0 ffff880272521d50 UN 0.0 59568 3460 ypbind
> 13650 2 2 ffff8802749b0000 UN 0.0 0 0 [kworker/2:2]
> crash> bt ffff880273958000
> PID: 405 TASK: ffff880273958000 CPU: 2 COMMAND: "kworker/2:1"
> #0 [ffff880273957c10] __schedule at ffffffff81588c59
> #1 [ffff880273957c80] schedule at ffffffff81589119
> #2 [ffff880273957c90] schedule_timeout at ffffffff8158bbe6
> #3 [ffff880273957d30] msleep at ffffffff810c5aa7
> #4 [ffff880273957d50] netxen_napi_disable at ffffffffa032892a [netxen_nic]
> #5 [ffff880273957d80] __netxen_nic_down at ffffffffa032c6fc [netxen_nic]
> #6 [ffff880273957dc0] netxen_nic_reset_context at ffffffffa032d56b [netxen_nic]
> #7 [ffff880273957de0] netxen_tx_timeout_task at ffffffffa032d63d [netxen_nic]
> #8 [ffff880273957e00] process_one_work at ffffffff81077b7a
> #9 [ffff880273957e50] worker_thread at ffffffff81078231
> #10 [ffff880273957ec0] kthread at ffffffff8107d139
> #11 [ffff880273957f50] ret_from_fork at ffffffff8158cf7c
Hi Mike
This driver doesn't follow the NAPI model correctly.
Please try the following fix:
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 613037584d08..c531c8ae1be4 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
work_done = netxen_process_rcv_ring(sds_ring, budget);
- if ((work_done < budget) && tx_complete) {
+ if (!tx_complete)
+ work_done = budget;
+
+ if (work_done < budget) {
napi_complete(&sds_ring->napi);
if (test_bit(__NX_DEV_UP, &adapter->state))
netxen_nic_enable_int(sds_ring);
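
The difference between the old and new condition can be checked in isolation. The sketch below is a user-space model under stated assumptions, not the driver code: `model_result`, `model_poll_old()`, `model_poll_fixed()` and `model_breaks_contract()` are hypothetical names, and it assumes the poll function returns `work_done`, as the hunk suggests. The "contract" here is the rule that a poll returning less than the budget must have called `napi_complete()`.

```c
#include <assert.h>
#include <stdbool.h>

/* What one poll invocation reports in this model. */
struct model_result {
    int ret;         /* value returned to the NAPI core */
    bool completed;  /* whether napi_complete() was called */
};

/* Pre-patch logic: complete only if RX is under budget AND TX is done. */
static struct model_result model_poll_old(int work_done, bool tx_complete,
                                          int budget)
{
    struct model_result r;

    assert(work_done >= 0 && work_done <= budget);
    r.completed = (work_done < budget) && tx_complete;
    r.ret = work_done;
    return r;
}

/* Patched logic: unfinished TX work is reported as a full budget. */
static struct model_result model_poll_fixed(int work_done, bool tx_complete,
                                            int budget)
{
    struct model_result r;

    assert(work_done >= 0 && work_done <= budget);
    if (!tx_complete)
        work_done = budget;
    r.completed = work_done < budget;
    r.ret = work_done;
    return r;
}

/* True when the poll returned less than budget without completing. */
static bool model_breaks_contract(struct model_result r, int budget)
{
    return r.ret < budget && !r.completed;
}
```

For example, 3 received packets with TX work left over and a budget of 64 gives `ret == 3` without completion under the old logic, and `ret == 64` under the patched logic.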
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: netxen: box stuck in netxen_napi_disable()
2015-01-22 5:57 ` Eric Dumazet
@ 2015-01-22 6:15 ` Mike Galbraith
2015-01-22 6:52 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22 6:15 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:
> This driver doesn't follow the NAPI model correctly.
>
> Please try the following fix:
Thanks Eric, I'll plug it in in a bit and poke at it. No news is good
news, as good as news gets for unknown repeatability bugs that is ;-)
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> index 613037584d08..c531c8ae1be4 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> @@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
>
> work_done = netxen_process_rcv_ring(sds_ring, budget);
>
> - if ((work_done < budget) && tx_complete) {
> + if (!tx_complete)
> + work_done = budget;
> +
> + if (work_done < budget) {
> napi_complete(&sds_ring->napi);
> if (test_bit(__NX_DEV_UP, &adapter->state))
> netxen_nic_enable_int(sds_ring);
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: netxen: box stuck in netxen_napi_disable()
2015-01-22 6:15 ` Mike Galbraith
@ 2015-01-22 6:52 ` Eric Dumazet
2015-01-22 8:37 ` Mike Galbraith
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22 6:52 UTC (permalink / raw)
To: Mike Galbraith; +Cc: netdev
On Thu, 2015-01-22 at 07:15 +0100, Mike Galbraith wrote:
> On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:
>
> > This driver doesn't follow the NAPI model correctly.
> >
> > Please try the following fix:
>
> Thanks Eric, I'll plug it in in a bit and poke at it. No news is good
> news, as good as news gets for unknown repeatability bugs that is ;-)
To trigger the bug, all you had to do was to stress the transmit side.
You could reduce MAX_STATUS_HANDLE from 64 to 4 to trigger it even
faster.
I'll send an official patch tomorrow morning.
Thanks !
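
Why a smaller MAX_STATUS_HANDLE would expose the problem sooner can be sketched as follows. This is an assumption-laden model, not the driver's TX completion routine: `model_process_cmd_ring()` is a hypothetical name, `pending` stands in for the number of TX descriptors awaiting reclaim, and `cap` plays the role of MAX_STATUS_HANDLE. It assumes the driver stops reclaiming after `cap` descriptors per poll and reports whether the ring was fully drained.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Reclaim at most `cap` TX descriptors from `*pending`.
 * Returns true only if nothing is left to reclaim afterwards,
 * i.e. the value the poll function would see as tx_complete.
 */
static bool model_process_cmd_ring(int *pending, int cap)
{
    int count = 0;

    assert(*pending >= 0 && cap > 0);
    while (*pending > 0) {
        (*pending)--;            /* one descriptor reclaimed */
        if (++count >= cap)
            break;               /* per-poll limit reached */
    }
    return *pending == 0;
}
```

With 10 descriptors pending, a cap of 64 drains the ring in one call and reports completion, while a cap of 4 leaves 6 behind and reports incomplete TX work, so even light traffic would reach the problematic branch of the poll function.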
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: netxen: box stuck in netxen_napi_disable()
2015-01-22 6:52 ` Eric Dumazet
@ 2015-01-22 8:37 ` Mike Galbraith
2015-01-22 15:56 ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22 8:37 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Wed, 2015-01-21 at 22:52 -0800, Eric Dumazet wrote:
> On Thu, 2015-01-22 at 07:15 +0100, Mike Galbraith wrote:
> > On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:
> >
> > > This driver doesn't follow the NAPI model correctly.
> > >
> > > Please try the following fix:
> >
> > Thanks Eric, I'll plug it in in a bit and poke at it. No news is good
> > news, as good as news gets for unknown repeatability bugs that is ;-)
>
> To trigger the bug, all you had to do was to stress the transmit side.
Nope, it's easier than that...
> You could reduce MAX_STATUS_HANDLE from 64 to 4 to trigger it even
> faster.
Wow, that made it amazingly easy to verify.
With MAX_STATUS_HANDLE=4 and without your patch applied, the box hangs as
soon as I bring the network up. Add your patch on top, and all is peachy.
Fresh boot, or rmmod/modprobe and network restart, doesn't matter: the box
works with your patch and is dead without it.
Here's an ornament for the fix if you think it'll look prettier :)
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
-Mike
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH net] netxen: fix netxen_nic_poll() logic
2015-01-22 8:37 ` Mike Galbraith
@ 2015-01-22 15:56 ` Eric Dumazet
2015-01-23 10:59 ` Manish Chopra
2015-01-25 8:22 ` David Miller
0 siblings, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22 15:56 UTC (permalink / raw)
To: Mike Galbraith, David Miller; +Cc: netdev, Manish Chopra
From: Eric Dumazet <edumazet@google.com>
NAPI poll logic now enforces that a poller returns exactly the budget
when it wants to be called again.
If a driver limits TX completion, it has to return budget as well when
the limit is hit, not the number of received packets.
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")
Cc: Manish Chopra <manish.chopra@qlogic.com>
---
drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 613037584d08..c531c8ae1be4 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
work_done = netxen_process_rcv_ring(sds_ring, budget);
- if ((work_done < budget) && tx_complete) {
+ if (!tx_complete)
+ work_done = budget;
+
+ if (work_done < budget) {
napi_complete(&sds_ring->napi);
if (test_bit(__NX_DEV_UP, &adapter->state))
netxen_nic_enable_int(sds_ring);
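
How the core side reacts to the return value can be modeled roughly as below. This is a simplified user-space sketch of the behaviour the changelog describes, not the actual `net_rx_action()` code: `model_ctx`, `model_poll`, `model_core_handle()` and `model_is_stranded()` are hypothetical names. It assumes the core takes a context off its poll list before polling, re-queues it only when the full budget is returned, and otherwise trusts that the driver called `napi_complete()`.

```c
#include <assert.h>
#include <stdbool.h>

struct model_ctx {
    bool sched;     /* models NAPI_STATE_SCHED */
    bool on_list;   /* queued with the core for another poll */
};

/* What the driver's poll callback did in this model. */
struct model_poll {
    int work;        /* value returned to the core */
    bool completed;  /* whether it called napi_complete() */
};

/* Core-side handling of a single poll result. */
static void model_core_handle(struct model_ctx *c, struct model_poll p,
                              int budget)
{
    assert(c->sched && p.work >= 0 && p.work <= budget);
    assert(!(p.completed && p.work == budget));  /* not modeled here */

    c->on_list = false;       /* taken off the list before polling */
    if (p.completed)
        c->sched = false;     /* napi_complete() clears the scheduled state */
    if (p.work == budget)
        c->on_list = true;    /* full budget: poll again later */
    /* work < budget: the core assumes completion and does nothing more */
}

/* Scheduled but queued nowhere: nothing will ever clear the state. */
static bool model_is_stranded(const struct model_ctx *c)
{
    return c->sched && !c->on_list;
}
```

A result of `{3, false}` with a budget of 64 leaves the context stranded, while `{64, false}` gets it re-queued and `{3, true}` retires it cleanly. A stranded context is what a later `napi_synchronize()` would wait on forever.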
^ permalink raw reply related [flat|nested] 8+ messages in thread
* RE: [PATCH net] netxen: fix netxen_nic_poll() logic
2015-01-22 15:56 ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
@ 2015-01-23 10:59 ` Manish Chopra
2015-01-25 8:22 ` David Miller
1 sibling, 0 replies; 8+ messages in thread
From: Manish Chopra @ 2015-01-23 10:59 UTC (permalink / raw)
To: Eric Dumazet, Mike Galbraith, David Miller; +Cc: netdev
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Thursday, January 22, 2015 9:26 PM
> To: Mike Galbraith; David Miller
> Cc: netdev; Manish Chopra
> Subject: [PATCH net] netxen: fix netxen_nic_poll() logic
>
> From: Eric Dumazet <edumazet@google.com>
>
> NAPI poll logic now enforces that a poller returns exactly the budget when it
> wants to be called again.
>
> If a driver limits TX completion, it has to return budget as well when the limit is
> hit, not the number of received packets.
>
> Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")
> Cc: Manish Chopra <manish.chopra@qlogic.com>
> ---
> drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> index 613037584d08..c531c8ae1be4 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> @@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
>
> work_done = netxen_process_rcv_ring(sds_ring, budget);
>
> - if ((work_done < budget) && tx_complete) {
> + if (!tx_complete)
> + work_done = budget;
> +
> + if (work_done < budget) {
> napi_complete(&sds_ring->napi);
> if (test_bit(__NX_DEV_UP, &adapter->state))
> netxen_nic_enable_int(sds_ring);
>
Thanks Eric.
Acked-by: Manish Chopra <manish.chopra@qlogic.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net] netxen: fix netxen_nic_poll() logic
2015-01-22 15:56 ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
2015-01-23 10:59 ` Manish Chopra
@ 2015-01-25 8:22 ` David Miller
1 sibling, 0 replies; 8+ messages in thread
From: David Miller @ 2015-01-25 8:22 UTC (permalink / raw)
To: eric.dumazet; +Cc: umgwanakikbuti, netdev, manish.chopra
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Jan 2015 07:56:18 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> NAPI poll logic now enforces that a poller returns exactly the budget
> when it wants to be called again.
>
> If a driver limits TX completion, it has to return budget as well when
> the limit is hit, not the number of received packets.
>
> Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")
Applied and queued up for -stable, thanks Eric.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2015-01-25 8:22 UTC | newest]
Thread overview: 8+ messages
2015-01-22 4:43 netxen: box stuck in netxen_napi_disable() Mike Galbraith
2015-01-22 5:57 ` Eric Dumazet
2015-01-22 6:15 ` Mike Galbraith
2015-01-22 6:52 ` Eric Dumazet
2015-01-22 8:37 ` Mike Galbraith
2015-01-22 15:56 ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
2015-01-23 10:59 ` Manish Chopra
2015-01-25 8:22 ` David Miller