public inbox for stable@vger.kernel.org
* [PATCH] Avoid that ATA error handling hangs
@ 2018-02-21 17:23 Bart Van Assche
  2018-02-22  2:23 ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2018-02-21 17:23 UTC (permalink / raw)
  To: Martin K . Petersen, James E . J . Bottomley
  Cc: linux-scsi, Bart Van Assche, Natanael Copa, Damien Le Moal,
	Pavel Tikhomirov, Hannes Reinecke, Johannes Thumshirn, stable

Avoid that the recently introduced call_rcu() call in the SCSI core
causes the RCU core to complain about double call_rcu() calls.

Reported-by: Natanael Copa <ncopa@alpinelinux.org>
Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=198861
Fixes: 3bd6f43f5cb3 ("scsi: core: Ensure that the SCSI error handler gets woken up")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Natanael Copa <ncopa@alpinelinux.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
---
 drivers/scsi/scsi_error.c | 5 +++--
 include/scsi/scsi_cmnd.h  | 3 +++
 include/scsi/scsi_host.h  | 2 --
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index ae325985eac1..ac9ce099530e 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -229,7 +229,8 @@ static void scsi_eh_reset(struct scsi_cmnd *scmd)
 
 static void scsi_eh_inc_host_failed(struct rcu_head *head)
 {
-	struct Scsi_Host *shost = container_of(head, typeof(*shost), rcu);
+	struct scsi_cmnd *scmd = container_of(head, typeof(*scmd), rcu);
+	struct Scsi_Host *shost = scmd->device->host;
 	unsigned long flags;
 
 	spin_lock_irqsave(shost->host_lock, flags);
@@ -265,7 +266,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 	 * Ensure that all tasks observe the host state change before the
 	 * host_failed change.
 	 */
-	call_rcu(&shost->rcu, scsi_eh_inc_host_failed);
+	call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
 }
 
 /**
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index d8d4a902a88d..2280b2351739 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -68,6 +68,9 @@ struct scsi_cmnd {
 	struct list_head list;  /* scsi_cmnd participates in queue lists */
 	struct list_head eh_entry; /* entry for the host eh_cmd_q */
 	struct delayed_work abort_work;
+
+	struct rcu_head rcu;
+
 	int eh_eflags;		/* Used by error handlr */
 
 	/*
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 1a1df0d21ee3..a8b7bf879ced 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -571,8 +571,6 @@ struct Scsi_Host {
 		struct blk_mq_tag_set	tag_set;
 	};
 
-	struct rcu_head rcu;
-
 	atomic_t host_busy;		   /* commands actually active on low-level */
 	atomic_t host_blocked;
 
-- 
2.16.2


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-21 17:23 [PATCH] Avoid that ATA error handling hangs Bart Van Assche
@ 2018-02-22  2:23 ` Damien Le Moal
  2018-02-22  3:53   ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2018-02-22  2:23 UTC (permalink / raw)
  To: jejb@linux.vnet.ibm.com, Bart Van Assche,
	martin.petersen@oracle.com
  Cc: linux-scsi@vger.kernel.org, hare@suse.com, jthumshirn@suse.de,
	ptikhomirov@virtuozzo.com, ncopa@alpinelinux.org,
	stable@vger.kernel.org

Bart,

On Wed, 2018-02-21 at 09:23 -0800, Bart Van Assche wrote:
> Avoid that the recently introduced call_rcu() call in the SCSI core
> causes the RCU core to complain about double call_rcu() calls.
> 
> Reported-by: Natanael Copa <ncopa@alpinelinux.org>
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> References: https://bugzilla.kernel.org/show_bug.cgi?id=198861
> Fixes: 3bd6f43f5cb3 ("scsi: core: Ensure that the SCSI error handler gets woken up")
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Natanael Copa <ncopa@alpinelinux.org>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/scsi/scsi_error.c | 5 +++--
>  include/scsi/scsi_cmnd.h  | 3 +++
>  include/scsi/scsi_host.h  | 2 --
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index ae325985eac1..ac9ce099530e 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -229,7 +229,8 @@ static void scsi_eh_reset(struct scsi_cmnd *scmd)
>  
>  static void scsi_eh_inc_host_failed(struct rcu_head *head)
>  {
> -	struct Scsi_Host *shost = container_of(head, typeof(*shost), rcu);
> +	struct scsi_cmnd *scmd = container_of(head, typeof(*scmd), rcu);
> +	struct Scsi_Host *shost = scmd->device->host;
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(shost->host_lock, flags);
> @@ -265,7 +266,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
>  	 * Ensure that all tasks observe the host state change before the
>  	 * host_failed change.
>  	 */
> -	call_rcu(&shost->rcu, scsi_eh_inc_host_failed);
> +	call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
>  }
>  
>  /**
> diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
> index d8d4a902a88d..2280b2351739 100644
> --- a/include/scsi/scsi_cmnd.h
> +++ b/include/scsi/scsi_cmnd.h
> @@ -68,6 +68,9 @@ struct scsi_cmnd {
>  	struct list_head list;  /* scsi_cmnd participates in queue lists */
>  	struct list_head eh_entry; /* entry for the host eh_cmd_q */
>  	struct delayed_work abort_work;
> +
> +	struct rcu_head rcu;
> +
>  	int eh_eflags;		/* Used by error handlr */
>  
>  	/*
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 1a1df0d21ee3..a8b7bf879ced 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -571,8 +571,6 @@ struct Scsi_Host {
>  		struct blk_mq_tag_set	tag_set;
>  	};
>  
> -	struct rcu_head rcu;
> -
>  	atomic_t host_busy;		   /* commands actually active on low-level */
>  	atomic_t host_blocked;

This does not compile. You missed the init_rcu_head() and destroy_rcu_head()
changes. Adding this:

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 57bf43e34863..dd9464920456 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -328,8 +328,6 @@ static void scsi_host_dev_release(struct device *dev)
        if (shost->work_q)
                destroy_workqueue(shost->work_q);
 
-       destroy_rcu_head(&shost->rcu);
-
        if (shost->shost_state == SHOST_CREATED) {
                /*
                 * Free the shost_dev device name here if scsi_host_alloc()
@@ -404,7 +402,6 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
        INIT_LIST_HEAD(&shost->starved_list);
        init_waitqueue_head(&shost->host_wait);
        mutex_init(&shost->scan_mutex);
-       init_rcu_head(&shost->rcu);
 
        index = ida_simple_get(&host_index_ida, 0, 0, GFP_KERNEL);
        if (index < 0)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a86df9ca7d1c..488e5c9acedf 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -590,6 +590,8 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
                if (drv->uninit_command)
                        drv->uninit_command(cmd);
        }
+
+       destroy_rcu_head(&cmd->rcu);
 }
 
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
@@ -1153,6 +1155,7 @@ static void scsi_initialize_rq(struct request *rq)
        scsi_req_init(&cmd->req);
        cmd->jiffies_at_alloc = jiffies;
        cmd->retries = 0;
+       init_rcu_head(&cmd->rcu);
 }
 
 /* Add a command to the list used by the aacraid and dpt_i2o drivers */

And it compiles.

Testing this, the rcu hang is now gone.
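
As a rough userspace illustration of the "double call_rcu()" problem named in
the patch description, the sketch below queues a toy callback head twice
before any callback has run. Everything in it (toy_call_rcu(),
toy_run_callbacks(), struct toy_host, struct toy_cmd) is a hypothetical
stand-in written for this example and not the kernel's RCU implementation; in
particular, the toy version simply rejects a second queueing, whereas the
patch description only says the RCU core complains.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins only; none of these are the kernel's definitions. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
	int queued;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct rcu_head *pending;	/* callbacks waiting for a "grace period" */

/* Toy call_rcu(): returns -1 if this head is already queued. */
static int toy_call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
	if (head->queued)
		return -1;
	head->queued = 1;
	head->func = func;
	head->next = pending;
	pending = head;
	return 0;
}

/* Toy end of grace period: run and dequeue every pending callback. */
static void toy_run_callbacks(void)
{
	while (pending) {
		struct rcu_head *head = pending;

		pending = head->next;
		head->queued = 0;
		head->func(head);
	}
}

struct toy_host { int host_failed; };
struct toy_cmd { struct toy_host *host; struct rcu_head rcu; };

/* Same shape as the patched callback: find the command, then its host. */
static void toy_inc_host_failed(struct rcu_head *head)
{
	struct toy_cmd *cmd = container_of(head, struct toy_cmd, rcu);

	cmd->host->host_failed++;
}

/* One head per command: two failures on one host queue two distinct heads. */
static int toy_fail_two_commands(void)
{
	struct toy_host host = { 0 };
	struct toy_cmd a = { .host = &host }, b = { .host = &host };

	if (toy_call_rcu(&a.rcu, toy_inc_host_failed) ||
	    toy_call_rcu(&b.rcu, toy_inc_host_failed))
		return -1;
	toy_run_callbacks();
	return host.host_failed;
}

/* One shared head (standing in for the old shost->rcu) queued twice. */
static int toy_shared_head_second_queue(void)
{
	struct toy_host host = { 0 };
	struct toy_cmd shared = { .host = &host };
	int first = toy_call_rcu(&shared.rcu, toy_inc_host_failed);
	int second = toy_call_rcu(&shared.rcu, toy_inc_host_failed);

	toy_run_callbacks();
	return first == 0 ? second : 1;
}
```

In this sketch, toy_fail_two_commands() counts both failures because each
command carries its own head, while toy_shared_head_second_queue() shows the
second queueing of a single shared head being refused.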

However, the behavior of the error recovery is still different from what I
see in 4.15 and 4.14. For my test case, an unaligned write to a sequential
zone on a ZAC drive connected to an AHCI port, the report zone command issued
during the disk revalidation after the write error fails with a timeout, which
causes a capacity change to 0, a port reset, and recovery again. Eventually,
everything comes back up OK, but it takes some time.

I am investigating whether I am hitting a device FW bug, to confirm whether
this is a kernel problem.

Best regards.

-- 
Damien Le Moal
Western Digital


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  2:23 ` Damien Le Moal
@ 2018-02-22  3:53   ` Bart Van Assche
  2018-02-22  4:06     ` Martin K. Petersen
  2018-02-22 17:15     ` Natanael Copa
  0 siblings, 2 replies; 9+ messages in thread
From: Bart Van Assche @ 2018-02-22  3:53 UTC (permalink / raw)
  To: jejb@linux.vnet.ibm.com, Damien Le Moal,
	martin.petersen@oracle.com
  Cc: linux-scsi@vger.kernel.org, hare@suse.com, jthumshirn@suse.de,
	ptikhomirov@virtuozzo.com, ncopa@alpinelinux.org,
	stable@vger.kernel.org

On Thu, 2018-02-22 at 02:23 +0000, Damien Le Moal wrote:
> On Wed, 2018-02-21 at 09:23 -0800, Bart Van Assche wrote:
> > [ ... ]
> This does not compile.

This patch depends on another patch that is not yet in Martin's tree. See also
https://marc.info/?l=linux-scsi&m=151675130615597. I should have mentioned this
in the patch description.

> Testing this, the rcu hang is now gone.

Thanks for the testing :-)

> However, the behavior of the error recovery is still different from what I
> see in 4.15 and 4.14. For my test case, an unaligned write to a sequential
> zone on a ZAC drive connected to an AHCI port, the report zone command issued
> during the disk revalidation after the write error fails with a timeout, which
> causes a capacity change to 0, a port reset, and recovery again. Eventually,
> everything comes back up OK, but it takes some time.
> 
> I am investigating whether I am hitting a device FW bug, to confirm whether
> this is a kernel problem.

This patch was tested with the SRP protocol. I'm not an ATA expert but I hope
that someone who is more familiar with ATA than I can chime in.

Bart.


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  3:53   ` Bart Van Assche
@ 2018-02-22  4:06     ` Martin K. Petersen
  2018-02-22  4:19       ` Damien Le Moal
  2018-02-22 17:15     ` Natanael Copa
  1 sibling, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2018-02-22  4:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jejb@linux.vnet.ibm.com, Damien Le Moal,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	ncopa@alpinelinux.org, stable@vger.kernel.org


Bart,

> This patch depends on another patch that is not yet in Martin's
> tree.

Nobody reviewed it. Same goes for your queuecommand tweak :/

I'm pretty picky about getting at least one other person than me to look
over core changes.

Reviewers: Fame and fortune await!

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  4:06     ` Martin K. Petersen
@ 2018-02-22  4:19       ` Damien Le Moal
  2018-02-22  4:39         ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2018-02-22  4:19 UTC (permalink / raw)
  To: Martin K. Petersen, Bart Van Assche
  Cc: jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	ncopa@alpinelinux.org, stable@vger.kernel.org

Martin,

On 2/22/18 13:06, Martin K. Petersen wrote:
> 
> Bart,
> 
>> This patch depends on another patch that is not yet in Martin's
>> tree.
> 
> Nobody reviewed it. Same goes for your queuecommand tweak :/
> 
> I'm pretty picky about getting at least one other person than me to look
> over core changes.
> 
> Reviewers: Fame and fortune await!

It looks OK to me, at least if CONFIG_DEBUG_OBJECTS_RCU_HEAD is turned
off since the init_rcu_head() and destroy_rcu_head() functions only care
about that.

With rcu head debug turned on, I am not so sure. The object debug code
will have references to unused rcu heads left behind for unused scsi
cmds, which are indeed dynamically allocated for a device together with
requests when the device is initialized, but they are never freed until
the device is removed. So "dynamically allocated object", yes, but that
does not match the use of the object done in scsi (i.e. alloc before use
+ free after use).

Because of this doubt, no Reviewed-by from me. I will miss fame and
fortune this time :)

Best regards.

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  4:19       ` Damien Le Moal
@ 2018-02-22  4:39         ` Bart Van Assche
  2018-02-22  4:39           ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2018-02-22  4:39 UTC (permalink / raw)
  To: Damien Le Moal, martin.petersen@oracle.com
  Cc: jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	ncopa@alpinelinux.org, stable@vger.kernel.org

On Thu, 2018-02-22 at 04:19 +0000, Damien Le Moal wrote:
> It looks OK to me, at least if CONFIG_DEBUG_OBJECTS_RCU_HEAD is turned
> off since the init_rcu_head() and destroy_rcu_head() functions only care
> about that.
> 
> With rcu head debug turned on, I am not so sure. The object debug code
> will have references to unused rcu heads left behind for unused scsi
> cmds, which are indeed dynamically allocated for a device together with
> requests when the device is initialized, but they are never freed until
> the device is removed. So "dynamically allocated object", yes, but that
> does not match the use of the object done in scsi (i.e. alloc before use
> + free after use).

Hello Damien,

Please have a look at the following part of
Documentation/RCU/Design/Requirements/Requirements.html:

	Similarly, statically allocated non-stack <tt>rcu_head</tt>
	structures must be initialized with <tt>init_rcu_head()</tt>
	and cleaned up with <tt>destroy_rcu_head()</tt>.

Thanks,

Bart.



* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  4:39         ` Bart Van Assche
@ 2018-02-22  4:39           ` Bart Van Assche
  2018-02-22  4:55             ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2018-02-22  4:39 UTC (permalink / raw)
  To: Damien Le Moal, martin.petersen@oracle.com
  Cc: jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	ncopa@alpinelinux.org, stable@vger.kernel.org

On Thu, 2018-02-22 at 04:39 +0000, Bart Van Assche wrote:
> On Thu, 2018-02-22 at 04:19 +0000, Damien Le Moal wrote:
> > It looks OK to me, at least if CONFIG_DEBUG_OBJECTS_RCU_HEAD is turned
> > off since the init_rcu_head() and destroy_rcu_head() functions only care
> > about that.
> > 
> > With rcu head debug turned on, I am not so sure. The object debug code
> > will have references to unused rcu heads left behind for unused scsi
> > cmds, which are indeed dynamically allocated for a device together with
> > requests when the device is initialized, but they are never freed until
> > the device is removed. So "dynamically allocated object", yes, but that
> > does not match the use of the object done in scsi (i.e. alloc before use
> > + free after use).
> 
> Hello Damien,
> 
> Please have a look at the following part of
> Documentation/RCU/Design/Requirements/Requirements.html:
> 
> 	Similarly, statically allocated non-stack <tt>rcu_head</tt>
> 	structures must be initialized with <tt>init_rcu_head()</tt>
> 	and cleaned up with <tt>destroy_rcu_head()</tt>.

And from <linux/rcupdate.h>:

 * rcu_head structures
 * allocated dynamically in the heap or defined statically don't need any
 * initialization.

Bart.


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  4:39           ` Bart Van Assche
@ 2018-02-22  4:55             ` Damien Le Moal
  0 siblings, 0 replies; 9+ messages in thread
From: Damien Le Moal @ 2018-02-22  4:55 UTC (permalink / raw)
  To: Bart Van Assche, martin.petersen@oracle.com
  Cc: jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	ncopa@alpinelinux.org, stable@vger.kernel.org

Bart,

On 2/22/18 13:39, Bart Van Assche wrote:
> On Thu, 2018-02-22 at 04:39 +0000, Bart Van Assche wrote:
>> On Thu, 2018-02-22 at 04:19 +0000, Damien Le Moal wrote:
>>> It looks OK to me, at least if CONFIG_DEBUG_OBJECTS_RCU_HEAD is turned
>>> off since the init_rcu_head() and destroy_rcu_head() functions only care
>>> about that.
>>>
>>> With rcu head debug turned on, I am not so sure. The object debug code
>>> will have references to unused rcu heads left behind for unused scsi
>>> cmds, which are indeed dynamically allocated for a device together with
>>> requests when the device is initialized, but they are never freed until
>>> the device is removed. So "dynamically allocated object", yes, but that
>>> does not match the use of the object done in scsi (i.e. alloc before use
>>> + free after use).
>>
>> Hello Damien,
>>
>> Please have a look at the following part of
>> Documentation/RCU/Design/Requirements/Requirements.html:
>>
>> 	Similarly, statically allocated non-stack <tt>rcu_head</tt>
>> 	structures must be initialized with <tt>init_rcu_head()</tt>
>> 	and cleaned up with <tt>destroy_rcu_head()</tt>.
> 
> And from <linux/rcupdate.h>:
> 
>  * rcu_head structures
>  * allocated dynamically in the heap or defined statically don't need any
>  * initialization.

Yes, I understood that. But my guess is this comment implies that the
objects are freed after use, which clears any reference to it from the
memory object debug hash automatically. That is not the case with scsi
command structs: they are allocated dynamically with the device, but
they are not freed after use. And here by use, I mean the normal use
cycle of a request+cmd: get unused request -> issue command -> command
completed -> return request in free state.
That is not an alloc+free cycle, so the memory object debug code will
never be involved and the scsi command rcu head never destroyed.

Considering that, I am not sure if it is really safe to remove the
init/destroy rcu head functions. At the very least, that will make the
memory object debug table grow larger with the first use of any scsi
command.
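
The concern above can be pictured with a toy tracker, sketched here under
loud assumptions: toy_track(), toy_untrack() and toy_reuse_pool() are
hypothetical names invented for this example, and the table is only a
simplified picture of the argument, not the kernel's object debug code. It
models a fixed pool of commands that is reused without ever being freed, and
compares what is left in the tracker with and without a destroy step after
each use.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_SIZE 4

/* Userspace stand-ins only; not the kernel's structures. */
struct toy_rcu_head { int unused; };
struct toy_cmd { struct toy_rcu_head rcu; };

/* Toy tracker: remembers the address of every head it has seen in use. */
static const struct toy_rcu_head *tracked[TOY_POOL_SIZE];
static size_t tracked_count;

/* Record a head on its first use; later uses of the same head add nothing. */
static void toy_track(const struct toy_rcu_head *head)
{
	for (size_t i = 0; i < tracked_count; i++)
		if (tracked[i] == head)
			return;
	assert(tracked_count < TOY_POOL_SIZE);
	tracked[tracked_count++] = head;
}

/* Forget a head, as an explicit destroy (or a free) would. */
static void toy_untrack(const struct toy_rcu_head *head)
{
	for (size_t i = 0; i < tracked_count; i++) {
		if (tracked[i] == head) {
			tracked[i] = tracked[--tracked_count];
			return;
		}
	}
}

/*
 * Run `cycles` rounds of "get unused command -> use -> return to pool" over
 * a pool that is never freed, and return how many heads are still tracked.
 */
static size_t toy_reuse_pool(int cycles, int destroy_after_use)
{
	static struct toy_cmd pool[TOY_POOL_SIZE];

	tracked_count = 0;
	for (int i = 0; i < cycles; i++) {
		struct toy_cmd *cmd = &pool[i % TOY_POOL_SIZE];

		toy_track(&cmd->rcu);		/* head is used by the command */
		if (destroy_after_use)
			toy_untrack(&cmd->rcu);	/* explicit destroy step */
	}
	return tracked_count;
}
```

Under these assumptions, toy_reuse_pool(10, 0) leaves one entry per pool
command behind, since nothing is ever freed, while toy_reuse_pool(10, 1)
leaves none. The growth is bounded by the pool size, which matches the
"first use of any scsi command" wording above.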

Cheers.

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH] Avoid that ATA error handling hangs
  2018-02-22  3:53   ` Bart Van Assche
  2018-02-22  4:06     ` Martin K. Petersen
@ 2018-02-22 17:15     ` Natanael Copa
  1 sibling, 0 replies; 9+ messages in thread
From: Natanael Copa @ 2018-02-22 17:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jejb@linux.vnet.ibm.com, Damien Le Moal,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	hare@suse.com, jthumshirn@suse.de, ptikhomirov@virtuozzo.com,
	stable@vger.kernel.org

On Thu, 22 Feb 2018 03:53:19 +0000
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-02-22 at 02:23 +0000, Damien Le Moal wrote:
> > On Wed, 2018-02-21 at 09:23 -0800, Bart Van Assche wrote:  
> > > [ ... ]  
> > This does not compile.  
> 
> This patch depends on another patch that is not yet in Martin's tree. See also
> https://marc.info/?l=linux-scsi&m=151675130615597. I should have mentioned this
> in the patch description.

I applied the two patches on top of 4.14.20 and it solves my problem[1]. I
think those two should be included in the 4.14.21 release.

Thanks!

-nc

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=198861

