From mboxrd@z Thu Jan 1 00:00:00 1970
From: Laurence Oberman
Subject: Re: [PATCH] bnx2fc: Fix hung task messages when a cleanup response is not received during abort.
Date: Wed, 15 Nov 2017 10:21:32 -0500
Message-ID: <1510759292.5626.1.camel@redhat.com>
References: <20171115150606.10994-1-chad.dupuis@cavium.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Return-path:
Received: from mail-qt0-f194.google.com ([209.85.216.194]:56748 "EHLO mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932593AbdKOPVe (ORCPT ); Wed, 15 Nov 2017 10:21:34 -0500
Received: by mail-qt0-f194.google.com with SMTP id r39so18806594qtr.13 for ; Wed, 15 Nov 2017 07:21:34 -0800 (PST)
In-Reply-To: <20171115150606.10994-1-chad.dupuis@cavium.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Chad Dupuis , martin.petersen@oracle.com
Cc: linux-scsi@vger.kernel.org, james.bottomley@hansenpartnership.com, QLogic-Storage-Upstream@cavium.com

On Wed, 2017-11-15 at 07:06 -0800, Chad Dupuis wrote:
> If a cleanup task is not responded to while we are in
> bnx2fc_abts_cleanup, it
> will hang the SCSI error handler since we use wait_for_completion
> instead of
> wait_for_completion_timeout.  So, use wait_for_completion_timeout so
> that we
> don't hang the SCSI error handler thread forever.
>
> Fixes the call trace:
>
> [183373.131468] INFO: task scsi_eh_16:110146 blocked for more than
> 120 seconds.
> [183373.131469] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [183373.131470] scsi_eh_16      D ffff88103f2fca14     0
> 110146      2 0x00000080
> [183373.131472]  ffff880855e77cb0 0000000000000046 ffff881050654e70
> ffff880855e77fd8
> [183373.131474]  ffff880855e77fd8 ffff880855e77fd8 ffff881050654e70
> ffff88103f2fcb48
> [183373.131475]  ffff88103f2fcb50 7fffffffffffffff ffff881050654e70
> ffff88103f2fca14
> [183373.131477] Call Trace:
> [183373.131479]  [] schedule+0x29/0x70
> [183373.131481]  [] schedule_timeout+0x239/0x2d0
> [183373.131486]  [] ? __dev_printk+0x3e/0x90
> [183373.131487]  [] ? dev_printk+0x5d/0x80
> [183373.131490]  [] wait_for_completion+0x116/0x170
> [183373.131492]  [] ? wake_up_state+0x20/0x20
> [183373.131494]  [] bnx2fc_abts_cleanup+0x3d/0x62
> [bnx2fc]
> [183373.131497]  [] bnx2fc_eh_abort+0x470/0x580
> [bnx2fc]
> [183373.131500]  [] scsi_error_handler+0x59f/0x8b0
> [183373.131501]  [] ? scsi_eh_get_sense+0x250/0x250
> [183373.131503]  [] kthread+0xcf/0xe0
> [183373.131505]  [] ?
> kthread_create_on_node+0x140/0x140
> [183373.131507]  [] ret_from_fork+0x58/0x90
> [183373.131509]  [] ?
> kthread_create_on_node+0x140/0x140
>
> Signed-off-by: Chad Dupuis
> ---
>  drivers/scsi/bnx2fc/bnx2fc_io.c | 40
> ++++++++++++++++++++++++++++++++--------
>  1 file changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/scsi/bnx2fc/bnx2fc_io.c
> b/drivers/scsi/bnx2fc/bnx2fc_io.c
> index 5b6153f23f01..2cbb5be98ecb 100644
> --- a/drivers/scsi/bnx2fc/bnx2fc_io.c
> +++ b/drivers/scsi/bnx2fc/bnx2fc_io.c
> @@ -1084,24 +1084,35 @@ static int bnx2fc_abts_cleanup(struct
> bnx2fc_cmd *io_req)
>  {
>  	struct bnx2fc_rport *tgt = io_req->tgt;
>  	int rc = SUCCESS;
> +	unsigned int time_left;
>  
>  	io_req->wait_for_comp = 1;
>  	bnx2fc_initiate_cleanup(io_req);
>  
>  	spin_unlock_bh(&tgt->tgt_lock);
>  
> -	wait_for_completion(&io_req->tm_done);
> -
> +	/*
> +	 * Can't wait forever on cleanup response lest we let the
> SCSI error
> +	 * handler wait forever
> +	 */
> +	time_left = wait_for_completion_timeout(&io_req->tm_done,
> +				    BNX2FC_FW_TIMEOUT);
>  	io_req->wait_for_comp = 0;
> +	if (!time_left)
> +		BNX2FC_IO_DBG(io_req, "%s(): Wait for cleanup timed
> out.\n",
> +			      __func__);
> +
>  	/*
> -	 * release the reference taken in eh_abort to allow the
> -	 * target to re-login after flushing IOs
> +	 * Release reference held by SCSI command the cleanup
> completion
> +	 * hits the BNX2FC_CLEANUP case in bnx2fc_process_cq_compl()
> and
> +	 * thus the SCSI command is not returned by
> bnx2fc_scsi_done().
>  	 */
>  	kref_put(&io_req->refcount, bnx2fc_cmd_release);
>  
>  	spin_lock_bh(&tgt->tgt_lock);
>  	return rc;
>  }
> +
>  /**
>   * bnx2fc_eh_abort - eh_abort_handler api to abort an outstanding
>   *			SCSI command
> @@ -1118,6 +1129,7 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  	struct fc_lport *lport;
>  	struct bnx2fc_rport *tgt;
>  	int rc;
> +	unsigned int time_left;
>  
>  	rc = fc_block_scsi_eh(sc_cmd);
>  	if (rc)
> @@ -1194,6 +1206,11 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  		if (cancel_delayed_work(&io_req->timeout_work))
>  			kref_put(&io_req->refcount,
>  				 bnx2fc_cmd_release); /* drop timer
> hold */
> +		/*
> +		 * We don't want to hold off the upper layer timer
> so simply
> +		 * cleanup the command and return that I/O was
> successfully
> +		 * aborted.
> +		 */
>  		rc = bnx2fc_abts_cleanup(io_req);
>  		/* This only occurs when an task abort was requested
> while ABTS
>  		   is in progress.  Setting the IO_CLEANUP flag will
> skip the
> @@ -1201,7 +1218,7 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  		   was a result from the ABTS request rather than
> the CLEANUP
>  		   request */
>  		set_bit(BNX2FC_FLAG_IO_CLEANUP, &io_req-
> >req_flags);
> -		goto out;
> +		goto done;
>  	}
>  
>  	/* Cancel the current timer running on this io_req */
> @@ -1221,7 +1238,11 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  	}
>  	spin_unlock_bh(&tgt->tgt_lock);
>  
> -	wait_for_completion(&io_req->tm_done);
> +	/* Wait 2 * RA_TOV + 1 to be sure timeout function hasn't
> fired */
> +	time_left = wait_for_completion_timeout(&io_req->tm_done,
> +	    (2 * rp->r_a_tov + 1) * HZ);
> +	if (!time_left)
> +		BNX2FC_IO_DBG(io_req, "Timed out in eh_abort waiting
> for tm_done");
>  
>  	spin_lock_bh(&tgt->tgt_lock);
>  	io_req->wait_for_comp = 0;
> @@ -1233,8 +1254,12 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  		/* Let the scsi-ml try to recover this command */
>  		printk(KERN_ERR PFX "abort failed, xid = 0x%x\n",
>  		       io_req->xid);
> +		/*
> +		 * Cleanup firmware residuals before returning
> control back
> +		 * to SCSI ML.
> +		 */
>  		rc = bnx2fc_abts_cleanup(io_req);
> -		goto out;
> +		goto done;
>  	} else {
>  		/*
>  		 * We come here even when there was a race condition
> @@ -1249,7 +1274,6 @@ int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd)
>  done:
>  	/* release the reference taken in eh_abort */
>  	kref_put(&io_req->refcount, bnx2fc_cmd_release);
> -out:
>  	spin_unlock_bh(&tgt->tgt_lock);
>  	return rc;
>  }

We hit this at a major customer, worked with Chad on this patch, and
backported it to RHEL as a test kernel.
It has been stable on that test kernel, and the changes look good, at
least to me.

Reviewed-by: Laurence Oberman
Tested-by: Laurence Oberman

Thanks
Laurence