public inbox for netdev@vger.kernel.org
* netxen: box stuck in netxen_napi_disable()
@ 2015-01-22  4:43 Mike Galbraith
  2015-01-22  5:57 ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22  4:43 UTC
  To: netdev

Greetings network wizards,

After doing some generic NO_HZ_FULL isolated core perturbation
measurements with a 64 core DL980G7 running 3.19-rc5, everything seeming
just peachy, I came back later to check on the box only to find that I
could no longer ssh into the thing.  NO_HZ_FULL doesn't seem to be
involved in any obvious way, but I thought I should mention it.

No idea how repeatable this is, the box has other work to do atm.  File
under 'noted', or if you want me to peek at something, holler.

rtnl_mutex was holding up the show, was held by the kworker below, who
was stuck in napi_synchronize() waiting for NAPI_STATE_SCHED to go away,
but whoever was supposed to make that happen, didn't.

crash> ps | grep UN
    405      2   2  ffff880273958000  UN   0.0       0      0  [kworker/2:1]
    419      2  16  ffff880273bf0000  UN   0.0       0      0  [kworker/16:1]
   4259      1  21  ffff88026f3cbaa0  UN   0.0   14636   1908  dhcpcd
   6007      1   3  ffff8802736d1d50  UN   0.0   32292   3200  ntpd
   6048      1   0  ffff880272521d50  UN   0.0   59568   3460  ypbind
  13650      2   2  ffff8802749b0000  UN   0.0       0      0  [kworker/2:2]
crash> bt ffff880273958000
PID: 405    TASK: ffff880273958000  CPU: 2   COMMAND: "kworker/2:1"
 #0 [ffff880273957c10] __schedule at ffffffff81588c59
 #1 [ffff880273957c80] schedule at ffffffff81589119
 #2 [ffff880273957c90] schedule_timeout at ffffffff8158bbe6
 #3 [ffff880273957d30] msleep at ffffffff810c5aa7
 #4 [ffff880273957d50] netxen_napi_disable at ffffffffa032892a [netxen_nic]
 #5 [ffff880273957d80] __netxen_nic_down at ffffffffa032c6fc [netxen_nic]
 #6 [ffff880273957dc0] netxen_nic_reset_context at ffffffffa032d56b [netxen_nic]
 #7 [ffff880273957de0] netxen_tx_timeout_task at ffffffffa032d63d [netxen_nic]
 #8 [ffff880273957e00] process_one_work at ffffffff81077b7a
 #9 [ffff880273957e50] worker_thread at ffffffff81078231
#10 [ffff880273957ec0] kthread at ffffffff8107d139
#11 [ffff880273957f50] ret_from_fork at ffffffff8158cf7c
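
For readers following the backtrace: the msleep() in frame #3 is consistent
with a wait loop that checks NAPI_STATE_SCHED and sleeps between checks.
The following is a minimal userspace sketch of such a loop, not the kernel
code. All toy_* names are hypothetical, and the max_ticks bound exists only
so that the sketch terminates; the wait described above has no such bound.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a NAPI instance; only the SCHED bit is modeled. */
struct toy_napi_state {
	bool sched;               /* models NAPI_STATE_SCHED */
	int polls_until_complete; /* 0 means nobody will ever clear the bit */
};

/* One simulated sleep interval, during which the poller may complete. */
static void toy_tick(struct toy_napi_state *n)
{
	if (n->polls_until_complete > 0 && --n->polls_until_complete == 0)
		n->sched = false;
}

/*
 * Wait for the SCHED bit to clear. Returns the number of ticks waited,
 * or -1 if the bit is still set after max_ticks (the "stuck" case).
 */
static int toy_napi_synchronize(struct toy_napi_state *n, int max_ticks)
{
	int waited = 0;

	assert(max_ticks >= 0);
	while (n->sched) {
		if (waited == max_ticks)
			return -1;
		toy_tick(n);
		waited++;
	}
	return waited;
}
```

With polls_until_complete = 3 the wait ends after three ticks. With 0, the
bit is never cleared and the sketch gives up at the bound, which corresponds
to the hang reported here.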


* Re: netxen: box stuck in netxen_napi_disable()
  2015-01-22  4:43 netxen: box stuck in netxen_napi_disable() Mike Galbraith
@ 2015-01-22  5:57 ` Eric Dumazet
  2015-01-22  6:15   ` Mike Galbraith
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22  5:57 UTC
  To: Mike Galbraith; +Cc: netdev

On Thu, 2015-01-22 at 05:43 +0100, Mike Galbraith wrote:
> Greetings network wizards,
> 
> After doing some generic NO_HZ_FULL isolated core perturbation
> measurements with a 64 core DL980G7 running 3.19-rc5, everything seeming
> just peachy, I came back later to check on the box only to find that I
> could no longer ssh into the thing.  NO_HZ_FULL doesn't seem to be
> involved in any obvious way, but I thought I should mention it.
> 
> No idea how repeatable this is, the box has other work to do atm.  File
> under 'noted', or if you want me to peek at something, holler.
> 
> rtnl_mutex was holding up the show, was held by the kworker below, who
> was stuck in napi_synchronize() waiting for NAPI_STATE_SCHED to go away,
> but whoever was supposed to make that happen, didn't.
> 
> crash> ps | grep UN
>     405      2   2  ffff880273958000  UN   0.0       0      0  [kworker/2:1]
>     419      2  16  ffff880273bf0000  UN   0.0       0      0  [kworker/16:1]
>    4259      1  21  ffff88026f3cbaa0  UN   0.0   14636   1908  dhcpcd
>    6007      1   3  ffff8802736d1d50  UN   0.0   32292   3200  ntpd
>    6048      1   0  ffff880272521d50  UN   0.0   59568   3460  ypbind
>   13650      2   2  ffff8802749b0000  UN   0.0       0      0  [kworker/2:2]
> crash> bt ffff880273958000
> PID: 405    TASK: ffff880273958000  CPU: 2   COMMAND: "kworker/2:1"
>  #0 [ffff880273957c10] __schedule at ffffffff81588c59
>  #1 [ffff880273957c80] schedule at ffffffff81589119
>  #2 [ffff880273957c90] schedule_timeout at ffffffff8158bbe6
>  #3 [ffff880273957d30] msleep at ffffffff810c5aa7
>  #4 [ffff880273957d50] netxen_napi_disable at ffffffffa032892a [netxen_nic]
>  #5 [ffff880273957d80] __netxen_nic_down at ffffffffa032c6fc [netxen_nic]
>  #6 [ffff880273957dc0] netxen_nic_reset_context at ffffffffa032d56b [netxen_nic]
>  #7 [ffff880273957de0] netxen_tx_timeout_task at ffffffffa032d63d [netxen_nic]
>  #8 [ffff880273957e00] process_one_work at ffffffff81077b7a
>  #9 [ffff880273957e50] worker_thread at ffffffff81078231
> #10 [ffff880273957ec0] kthread at ffffffff8107d139
> #11 [ffff880273957f50] ret_from_fork at ffffffff8158cf7c

Hi Mike

This driver doesn't follow the NAPI model correctly.

Please try the following fix:

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 613037584d08..c531c8ae1be4 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
 
 	work_done = netxen_process_rcv_ring(sds_ring, budget);
 
-	if ((work_done < budget) && tx_complete) {
+	if (!tx_complete)
+		work_done = budget;
+
+	if (work_done < budget) {
 		napi_complete(&sds_ring->napi);
 		if (test_bit(__NX_DEV_UP, &adapter->state))
 			netxen_nic_enable_int(sds_ring);


* Re: netxen: box stuck in netxen_napi_disable()
  2015-01-22  5:57 ` Eric Dumazet
@ 2015-01-22  6:15   ` Mike Galbraith
  2015-01-22  6:52     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22  6:15 UTC
  To: Eric Dumazet; +Cc: netdev

On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:

> This driver doesn't follow the NAPI model correctly.
> 
> Please try the following fix:

Thanks Eric, I'll plug it in in a bit and poke at it.  No news is good
news, as good as news gets for unknown repeatability bugs that is ;-)

> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> index 613037584d08..c531c8ae1be4 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> @@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
>  
>  	work_done = netxen_process_rcv_ring(sds_ring, budget);
>  
> -	if ((work_done < budget) && tx_complete) {
> +	if (!tx_complete)
> +		work_done = budget;
> +
> +	if (work_done < budget) {
>  		napi_complete(&sds_ring->napi);
>  		if (test_bit(__NX_DEV_UP, &adapter->state))
>  			netxen_nic_enable_int(sds_ring);
> 
> 


* Re: netxen: box stuck in netxen_napi_disable()
  2015-01-22  6:15   ` Mike Galbraith
@ 2015-01-22  6:52     ` Eric Dumazet
  2015-01-22  8:37       ` Mike Galbraith
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22  6:52 UTC
  To: Mike Galbraith; +Cc: netdev

On Thu, 2015-01-22 at 07:15 +0100, Mike Galbraith wrote:
> On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:
> 
> > This driver doesn't follow the NAPI model correctly.
> > 
> > Please try the following fix:
> 
> Thanks Eric, I'll plug it in in a bit and poke at it.  No news is good
> news, as good as news gets for unknown repeatability bugs that is ;-)

To trigger the bug, all you had to do was to stress the transmit side.

You could reduce MAX_STATUS_HANDLE from 64 to 4 to trigger it even
faster.

I'll send an official patch tomorrow morning.

Thanks !
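
One hedged way to read the MAX_STATUS_HANDLE hint: assume the driver's TX
completion pass stops after at most that many descriptors and reports
whether it finished (the tx_complete flag in the diff). The model below is
hypothetical; toy_tx_clean() and its parameters are illustration only, not
driver code. It shows why a smaller cap makes "not complete" far more
likely on any given poll.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy TX completion pass: reclaim up to 'cap' pending descriptors.
 * Returns true only when nothing is left pending afterwards, which
 * plays the role of tx_complete in the patch.
 */
static bool toy_tx_clean(int *pending, int cap)
{
	int count = 0;

	assert(cap > 0 && *pending >= 0);
	while (*pending > 0) {
		(*pending)--;
		if (++count >= cap)
			break; /* cap reached, leave the rest for the next poll */
	}
	return *pending == 0;
}
```

With 10 descriptors pending, a cap of 64 finishes in one pass, while a cap
of 4 leaves 6 behind and reports "not complete". Under this assumption,
that is the condition the unpatched poll function handled incorrectly.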


* Re: netxen: box stuck in netxen_napi_disable()
  2015-01-22  6:52     ` Eric Dumazet
@ 2015-01-22  8:37       ` Mike Galbraith
  2015-01-22 15:56         ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2015-01-22  8:37 UTC
  To: Eric Dumazet; +Cc: netdev

On Wed, 2015-01-21 at 22:52 -0800, Eric Dumazet wrote: 
> On Thu, 2015-01-22 at 07:15 +0100, Mike Galbraith wrote:
> > On Wed, 2015-01-21 at 21:57 -0800, Eric Dumazet wrote:
> > 
> > > This driver doesn't follow the NAPI model correctly.
> > > 
> > > Please try the following fix:
> > 
> > Thanks Eric, I'll plug it in in a bit and poke at it.  No news is good
> > news, as good as news gets for unknown repeatability bugs that is ;-)
> 
> To trigger the bug, all you had to do was to stress the transmit side.

Nope, it's easier than that...

> You could reduce MAX_STATUS_HANDLE from 64 to 4 to trigger it even
> faster.

Wow, that made it amazingly easy to verify.

With MAX_STATUS_HANDLE=4, without your patch applied box hangs as soon
as I bring the network up.  Add your patch on top, all is peachy.  Fresh
boot, or rmmod/modprobe and network restart, doesn't matter, box works
with your patch, is dead without it.

Here's an ornament for the fix if you think it'll look prettier :)

Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>

-Mike


* [PATCH net] netxen: fix netxen_nic_poll() logic
  2015-01-22  8:37       ` Mike Galbraith
@ 2015-01-22 15:56         ` Eric Dumazet
  2015-01-23 10:59           ` Manish Chopra
  2015-01-25  8:22           ` David Miller
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-01-22 15:56 UTC
  To: Mike Galbraith, David Miller; +Cc: netdev, Manish Chopra

From: Eric Dumazet <edumazet@google.com>

NAPI poll logic now enforces that a poller returns exactly the budget
when it wants to be called again.

If a driver limits TX completion, it has to return budget as well when
the limit is hit, not the number of received packets.

Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")
Cc: Manish Chopra <manish.chopra@qlogic.com>
---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 613037584d08..c531c8ae1be4 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
 
 	work_done = netxen_process_rcv_ring(sds_ring, budget);
 
-	if ((work_done < budget) && tx_complete) {
+	if (!tx_complete)
+		work_done = budget;
+
+	if (work_done < budget) {
 		napi_complete(&sds_ring->napi);
 		if (test_bit(__NX_DEV_UP, &adapter->state))
 			netxen_nic_enable_int(sds_ring);
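
The contract described in the changelog can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not kernel
code: all toy_* names are hypothetical, and the core is assumed to poll an
instance again only when the poller returns exactly the budget, as the
changelog says. The old condition can return less than the budget without
completing, which leaves the SCHED bit set with nobody left to clear it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy NAPI instance; only the SCHED bit is modeled. */
struct toy_napi {
	bool sched; /* models NAPI_STATE_SCHED */
};

static void toy_napi_complete(struct toy_napi *n)
{
	n->sched = false;
}

/* Poll logic before the patch: completes only if RX and TX are both done. */
static int toy_poll_old(struct toy_napi *n, int rx_done, bool tx_complete,
			int budget)
{
	int work_done = rx_done;

	if ((work_done < budget) && tx_complete)
		toy_napi_complete(n);
	return work_done;
}

/* Poll logic after the patch: unfinished TX work is reported as budget. */
static int toy_poll_fixed(struct toy_napi *n, int rx_done, bool tx_complete,
			  int budget)
{
	int work_done = rx_done;

	if (!tx_complete)
		work_done = budget;
	if (work_done < budget)
		toy_napi_complete(n);
	return work_done;
}

/* Assumed core behavior: poll again only when the full budget is returned. */
static bool toy_core_requeues(int work, int budget)
{
	assert(work <= budget);
	return work == budget;
}

/* Stuck: still marked scheduled, yet the core will not poll it again. */
static bool toy_is_stuck(const struct toy_napi *n, int work, int budget)
{
	return n->sched && !toy_core_requeues(work, budget);
}
```

For 3 received packets, unfinished TX work and a budget of 64, the old
logic returns 3 and ends up stuck, while the fixed logic returns 64 and is
polled again.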


* RE: [PATCH net] netxen: fix netxen_nic_poll() logic
  2015-01-22 15:56         ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
@ 2015-01-23 10:59           ` Manish Chopra
  2015-01-25  8:22           ` David Miller
  1 sibling, 0 replies; 8+ messages in thread
From: Manish Chopra @ 2015-01-23 10:59 UTC
  To: Eric Dumazet, Mike Galbraith, David Miller; +Cc: netdev

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Thursday, January 22, 2015 9:26 PM
> To: Mike Galbraith; David Miller
> Cc: netdev; Manish Chopra
> Subject: [PATCH net] netxen: fix netxen_nic_poll() logic
> 
> From: Eric Dumazet <edumazet@google.com>
> 
> NAPI poll logic now enforces that a poller returns exactly the budget when it
> wants to be called again.
> 
> If a driver limits TX completion, it has to return budget as well when the limit is
> hit, not the number of received packets.
> 
> Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")
> Cc: Manish Chopra <manish.chopra@qlogic.com>
> ---
>  drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> index 613037584d08..c531c8ae1be4 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
> @@ -2388,7 +2388,10 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
> 
>  	work_done = netxen_process_rcv_ring(sds_ring, budget);
> 
> -	if ((work_done < budget) && tx_complete) {
> +	if (!tx_complete)
> +		work_done = budget;
> +
> +	if (work_done < budget) {
>  		napi_complete(&sds_ring->napi);
>  		if (test_bit(__NX_DEV_UP, &adapter->state))
>  			netxen_nic_enable_int(sds_ring);
> 
Thanks Eric.
Acked-by: Manish Chopra <manish.chopra@qlogic.com>



* Re: [PATCH net] netxen: fix netxen_nic_poll() logic
  2015-01-22 15:56         ` [PATCH net] netxen: fix netxen_nic_poll() logic Eric Dumazet
  2015-01-23 10:59           ` Manish Chopra
@ 2015-01-25  8:22           ` David Miller
  1 sibling, 0 replies; 8+ messages in thread
From: David Miller @ 2015-01-25  8:22 UTC
  To: eric.dumazet; +Cc: umgwanakikbuti, netdev, manish.chopra

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Jan 2015 07:56:18 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> NAPI poll logic now enforces that a poller returns exactly the budget
> when it wants to be called again.
> 
> If a driver limits TX completion, it has to return budget as well when
> the limit is hit, not the number of received packets.
> 
> Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: d75b1ade567f ("net: less interrupt masking in NAPI")

Applied and queued up for -stable, thanks Eric.

