linux-scsi.vger.kernel.org archive mirror
* deadlock during fc_remove_host
@ 2011-04-21  0:24 Bhanu Prakash Gollapudi
  2011-04-21  2:53 ` Mike Christie
  0 siblings, 1 reply; 5+ messages in thread
From: Bhanu Prakash Gollapudi @ 2011-04-21  0:24 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, devel@open-fcoe.org
  Cc: Mike Christie, Joe Eykholt

Hi,

We are seeing a similar issue to what Joe has observed a while back - 
http://www.mail-archive.com/devel@open-fcoe.org/msg02993.html.

This happens in a very narrow corner-case scenario, hit by creating and 
destroying the fcoe interface in a tight loop (fcoeadm -c followed by 
fcoeadm -d). The system had a simple configuration with a single local 
port and 2 remote ports.

Reason for the deadlock:

1. destroy (fcoeadm -d) thread hangs in fc_remove_host().
2. fc_remove_host() is trying to flush the shost->work_q, via 
scsi_flush_work(), but the operation never completes.
3. There are two works scheduled to run in this work_q, one belonging 
to rport A and the other to rport B.
4. The thread is currently executing rport_delete_work 
(fc_rport_final_delete) for rport A. It calls fc_terminate_rport_io(), 
which unblocks the sdev->request_queue so that __blk_run_queue() can be 
called. So, IO for rport A is ready to run, but stuck at the async layer.
5. Meanwhile, the async layer is serializing all the IOs belonging to 
both rport A and rport B. At this point, it is waiting for IO belonging 
to rport B to complete.
6. However, the request_queue for rport B is stopped, and 
fc_terminate_rport_io() has not yet been called on rport B to unblock 
the device; it will only be called after rport A completes. rport A 
does not complete because the async layer is still stuck with IO 
belonging to rport B. Hence the deadlock.

Because the async layer doesn't distinguish IOs belonging to different 
rports, it can process them in any order. If it happens to complete the 
IOs belonging to rport A before those of rport B, there is no problem. 
The other way around causes the deadlock.
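The serialization described above comes from the kernel's async 
infrastructure (kernel/async.c), which sd.c uses to defer part of disk 
probing. The following is an illustrative sketch of the ordering, not 
the actual driver code:

```c
/*
 * Illustrative sketch (not the actual driver code) of how the async
 * layer serializes completion.  sd_probe() defers part of each disk's
 * probe via async_schedule(), which hands out cookies in order:
 */
async_schedule(sd_probe_async, sdkp);   /* one cookie per sdev */

/*
 * A later wait such as async_synchronize_full() completes entries in
 * cookie order, so it cannot get past the entry for rport B's sdev
 * while B's request_queue is still stopped -- even though rport A's
 * entry is ready to run.
 */
async_synchronize_full();
```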

Experiment:

To verify the above, we tried first calling fc_terminate_rport_io() for 
all the rports before actually queuing the rport_delete_work, so that 
the sdev->request_queue is unblocked for all the rports, thus avoiding 
the deadlock.

One possible way of doing it is by having a separate work item that 
calls fc_terminate_rport_io:

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 2941d2d..514fa2b 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2405,6 +2405,9 @@ fc_remove_host(struct Scsi_Host *shost)
                 fc_queue_work(shost, &vport->vport_delete_work);

         /* Remove any remote ports */
+       list_for_each_entry_safe(rport, next_rport, &fc_host->rports, peers)
+               fc_queue_work(shost, &rport->rport_terminate_io_work);
+
         list_for_each_entry_safe(rport, next_rport,
                         &fc_host->rports, peers) {
                 list_del(&rport->peers);
@@ -2413,6 +2416,10 @@ fc_remove_host(struct Scsi_Host *shost)
         }

         list_for_each_entry_safe(rport, next_rport,
+                       &fc_host->rport_bindings, peers)
+               fc_queue_work(shost, &rport->rport_terminate_io_work);
+
+       list_for_each_entry_safe(rport, next_rport,
                         &fc_host->rport_bindings, peers) {
                 list_del(&rport->peers);
                 rport->port_state = FC_PORTSTATE_DELETED;
@@ -2457,6 +2464,16 @@ static void fc_terminate_rport_io(struct fc_rport *rport)
         scsi_target_unblock(&rport->dev);
  }
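The last hunk above is truncated. For discussion, a hypothetical 
handler for the new work item could look like the following; both 
rport_terminate_io_work and the function name are assumptions for 
illustration, not taken from the actual patch:

```c
/* Hypothetical work function for the proposed rport_terminate_io_work
 * item; both names are assumptions, not from the actual patch. */
static void fc_rport_terminate_io_fn(struct work_struct *work)
{
	struct fc_rport *rport =
		container_of(work, struct fc_rport, rport_terminate_io_work);

	/* Unblock sdev->request_queue before any rport_delete_work runs,
	 * so the async layer is never left waiting on a stopped queue. */
	fc_terminate_rport_io(rport);
}
```

The work item would also need an INIT_WORK(&rport->rport_terminate_io_work, 
fc_rport_terminate_io_fn) alongside the other rport work items, e.g. in 
fc_rport_create().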

This may not be the ideal solution, but we would like to discuss it 
with folks here to converge on an appropriate solution.

Thanks,
Bhanu



* Re: deadlock during fc_remove_host
  2011-04-21  0:24 deadlock during fc_remove_host Bhanu Prakash Gollapudi
@ 2011-04-21  2:53 ` Mike Christie
  2011-04-21  3:21   ` [Open-FCoE] " Mike Christie
  2011-04-21  5:32   ` Bhanu Prakash Gollapudi
  0 siblings, 2 replies; 5+ messages in thread
From: Mike Christie @ 2011-04-21  2:53 UTC (permalink / raw)
  To: Bhanu Prakash Gollapudi
  Cc: linux-scsi@vger.kernel.org, devel@open-fcoe.org, Joe Eykholt

On 04/20/2011 07:24 PM, Bhanu Prakash Gollapudi wrote:
> Hi,
>
> We are seeing a similar issue to what Joe has observed a while back -
> http://www.mail-archive.com/devel@open-fcoe.org/msg02993.html.
>
> This happens in a very corner case scenario by creating and destroying
> fcoe interface in a tight loop. (fcoeadm -c followed by fcoeadm -d). The
> system had a simple configuration with a single local port 2 remote ports.
>
> Reason for the deadlock:
>
> 1. destroy (fcoeadm -d) thread hangs in fc_remove_host().
> 2. fc_remove_host() is trying to flush the shost->work_q, via
> scsi_flush_work(), but the operation never completes.
> 3. There are two works scheduled to be run in this work_q, one belonging
> to rport A, and other rport B.
> 4. The thread is currently executing rport_delete_work (fc_rport_final
> _delete) for rport A. It calls fc_terminate_rport_io() that unblocks the
> sdev->request_queue, so that __blk_run_queue() can be called. So, IO for
> rport A is ready to run, but stuck at the async layer.
> 5. Meanwhile, async layer is serializing all the IOs belonging to both
> rport A and rport B. At this point, it is waiting for IO belonging to
> rport B to complete.
> 6. However, the request_queue for rport B is stopped and
> fc_terminate_rport_io on rport B is not called yet to unblock the
> device, which will only be called after rport A completes. rport A does

Is the reason that rport b's terminate_rport_io has not been called, 
because that workqueue is queued behind rport a's workqueue and rport 
b's workqueue function is not called? If so, have you tested this with 
the current upstream kernel?


* Re: [Open-FCoE] deadlock during fc_remove_host
  2011-04-21  2:53 ` Mike Christie
@ 2011-04-21  3:21   ` Mike Christie
  2011-04-22  5:47     ` Bhanu Prakash Gollapudi
  2011-04-21  5:32   ` Bhanu Prakash Gollapudi
  1 sibling, 1 reply; 5+ messages in thread
From: Mike Christie @ 2011-04-21  3:21 UTC (permalink / raw)
  To: Bhanu Prakash Gollapudi
  Cc: Joe Eykholt, devel@open-fcoe.org, linux-scsi@vger.kernel.org

On 04/20/2011 09:53 PM, Mike Christie wrote:
> Is the reason that rport b's terminate_rport_io has not been called,
> because that workqueue is queued behind rport a's workqueue and rport
> b's workqueue function is not called? If so, have you tested this with
> the current upstream kernel?

Oh wait, I think you also need to change the fc class to use 
alloc_workqueue instead of create_singlethread_workqueue.
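For reference, the fc class currently sets up its per-host queue as a 
single-threaded workqueue. A sketch of the suggested direction (the 
flags and max_active shown are placeholders, not a tested choice):

```c
/* Current setup in fc_host_setup(): a single-threaded queue, so rport
 * B's work cannot run while rport A's delete work is blocked. */
fc_host->work_q = create_singlethread_workqueue(fc_host->work_q_name);

/* Suggested direction: a concurrency-managed queue, which can start
 * another work item when one blocks.  Placeholder flags/max_active: */
fc_host->work_q = alloc_workqueue(fc_host->work_q_name, 0, 0);
```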


* Re: deadlock during fc_remove_host
  2011-04-21  2:53 ` Mike Christie
  2011-04-21  3:21   ` [Open-FCoE] " Mike Christie
@ 2011-04-21  5:32   ` Bhanu Prakash Gollapudi
  1 sibling, 0 replies; 5+ messages in thread
From: Bhanu Prakash Gollapudi @ 2011-04-21  5:32 UTC (permalink / raw)
  To: Mike Christie; +Cc: linux-scsi@vger.kernel.org, devel@open-fcoe.org

On 4/20/2011 7:53 PM, Mike Christie wrote:
> On 04/20/2011 07:24 PM, Bhanu Prakash Gollapudi wrote:
>> Hi,
>>
>> We are seeing a similar issue to what Joe has observed a while back -
>> http://www.mail-archive.com/devel@open-fcoe.org/msg02993.html.
>>
>> This happens in a very corner case scenario by creating and destroying
>> fcoe interface in a tight loop. (fcoeadm -c followed by fcoeadm -d). The
>> system had a simple configuration with a single local port 2 remote ports.
>>
>> Reason for the deadlock:
>>
>> 1. destroy (fcoeadm -d) thread hangs in fc_remove_host().
>> 2. fc_remove_host() is trying to flush the shost->work_q, via
>> scsi_flush_work(), but the operation never completes.
>> 3. There are two works scheduled to be run in this work_q, one belonging
>> to rport A, and other rport B.
>> 4. The thread is currently executing rport_delete_work (fc_rport_final
>> _delete) for rport A. It calls fc_terminate_rport_io() that unblocks the
>> sdev->request_queue, so that __blk_run_queue() can be called. So, IO for
>> rport A is ready to run, but stuck at the async layer.
>> 5. Meanwhile, async layer is serializing all the IOs belonging to both
>> rport A and rport B. At this point, it is waiting for IO belonging to
>> rport B to complete.
>> 6. However, the request_queue for rport B is stopped and
>> fc_terminate_rport_io on rport B is not called yet to unblock the
>> device, which will only be called after rport A completes. rport A does
>
> Is the reason that rport b's terminate_rport_io has not been called,
> because that workqueue is queued behind rport a's workqueue and rport
> b's workqueue function is not called? If so, have you tested this with
> the current upstream kernel?
>
Yes, this has been tested with upstream kernel.



* Re: [Open-FCoE] deadlock during fc_remove_host
  2011-04-21  3:21   ` [Open-FCoE] " Mike Christie
@ 2011-04-22  5:47     ` Bhanu Prakash Gollapudi
  0 siblings, 0 replies; 5+ messages in thread
From: Bhanu Prakash Gollapudi @ 2011-04-22  5:47 UTC (permalink / raw)
  To: Mike Christie; +Cc: devel@open-fcoe.org, linux-scsi@vger.kernel.org

On 4/20/2011 8:21 PM, Mike Christie wrote:
> On 04/20/2011 09:53 PM, Mike Christie wrote:
>> Is the reason that rport b's terminate_rport_io has not been called,
>> because that workqueue is queued behind rport a's workqueue and rport
>> b's workqueue function is not called? If so, have you tested this with
>> the current upstream kernel?
>
> Oh wait, I think you also need to change the fc class to use
> alloc_workqueue instead of create_singlethread_workqueue.
>
This seems to work, except that I had to use alloc_workqueue with flags 
WQ_UNBOUND and max_active set to WQ_UNBOUND_MAX_ACTIVE.  It didn't help 
with flags 0 and max_active set to 0. The test has been running for the 
last 4 hours.  I'll let it run overnight.
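In other words, the combination that held up in testing looks roughly 
like this (sketch against fc_host_setup(); any other fc class work 
queues would presumably need the same treatment):

```c
/* The combination reported to work: an unbound workqueue with the
 * maximum allowed concurrency, so no rport's work item waits behind
 * another rport's. */
fc_host->work_q = alloc_workqueue(fc_host->work_q_name,
				  WQ_UNBOUND, WQ_UNBOUND_MAX_ACTIVE);
```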

Thanks,
Bhanu



end of thread, other threads:[~2011-04-22  5:47 UTC | newest]

Thread overview: 5+ messages
2011-04-21  0:24 deadlock during fc_remove_host Bhanu Prakash Gollapudi
2011-04-21  2:53 ` Mike Christie
2011-04-21  3:21   ` [Open-FCoE] " Mike Christie
2011-04-22  5:47     ` Bhanu Prakash Gollapudi
2011-04-21  5:32   ` Bhanu Prakash Gollapudi
