Re: [ofa-general] [Bug 14235] New: SRP initiator lockup

public inbox for linux-rdma@vger.kernel.org

* Re: [ofa-general] [Bug 14235] New: SRP initiator lockup
       [not found] <bug-14235-11804@http.bugzilla.kernel.org/>
@ 2009-09-28 16:27 ` Roland Dreier
       [not found]   ` <ada63b3xcdt.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Roland Dreier @ 2009-09-28 16:27 UTC (permalink / raw)
  To: bart.vanassche; +Cc: linux-rdma, general

 > If an SRP target processes SRP I/O slowly enough, the SRP initiator locks up.

 > INFO: task fio:6389 blocked for more than 120 seconds.
 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 > fio           D 0000000000000000     0  6389   6388 0x00000000
 >  ffff880071dc5bd8 0000000000000046 ffff880071dc5b08 000000018107764d
 >  0000000000012cc0 000000000000de20 0000000000000001 ffff880070cd8000
 >  ffff880070cd83b0 0000000100000000 000000010001193e ffff88007fb99050
 > Call Trace:
 >  [<ffffffff812ec5e5>] ? _spin_unlock_irqrestore+0x65/0x80
 >  [<ffffffff812e9b37>] io_schedule+0x37/0x50
 >  [<ffffffff8110cff2>] __blockdev_direct_IO+0x692/0xd80
 >  [<ffffffff810e0357>] ? get_super+0x27/0xc0
 >  [<ffffffff8110b169>] blkdev_direct_IO+0x49/0x50
 >  [<ffffffff8110a1f0>] ? blkdev_get_blocks+0x0/0xc0
 >  [<ffffffff810a1799>] generic_file_aio_read+0x679/0x690
 >  [<ffffffff810dc35a>] ? __dentry_open+0x13a/0x340
 >  [<ffffffff810de091>] do_sync_read+0xf1/0x140
 >  [<ffffffff810775ed>] ? trace_hardirqs_on_caller+0x14d/0x1a0
 >  [<ffffffff810662f0>] ? autoremove_wake_function+0x0/0x40
 >  [<ffffffff810775ed>] ? trace_hardirqs_on_caller+0x14d/0x1a0
 >  [<ffffffff8107764d>] ? trace_hardirqs_on+0xd/0x10
 >  [<ffffffff810ded28>] vfs_read+0xc8/0x180
 >  [<ffffffff810deed0>] sys_read+0x50/0x90
 >  [<ffffffff8100be6b>] system_call_fastpath+0x16/0x1b
 > no locks held by fio/6389.

Unfortunately, it will probably be a while before I can find the time to
build an scst test setup to reproduce this.  So we'll have to debug
this with your setup for the moment.

I don't have a good idea of where in the SRP initiator the problem could
be... the non-error path for ordinary SCSI commands is pretty trivial.
Presumably slowing down the target means that the queue of outstanding
commands fills up, but they should complete and let things make
progress.  I guess the possibilities are a bug higher up in the block or
SCSI stack, or some accounting problem in SRP.

You could try adding printks to srp_queuecommand() to see that all SCSI
commands are sent on the SRP connection and also add tracing to
srp_process_rsp() to make sure there's a matching call to ->scsi_done
for each SCSI command.  We should also make sure there are no
disconnections, task management commands, or anything like that
confusing things ... there is definitely more room for bugs in the parts
of the SRP driver that handle exceptions.
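
The accounting check suggested above can be sketched as a small userspace
C model. This is only an illustration of the bookkeeping, not code from
the ib_srp driver: struct cmd_tracker, track_queue(), track_complete(),
track_in_flight() and MAX_TAGS are all hypothetical names, and the idea
of identifying commands by a small integer tag is an assumption made for
the example. In a real debugging session the two hooks would correspond
to the printk/tracing points in srp_queuecommand() and srp_process_rsp().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in ib_srp. */
#define MAX_TAGS 64

struct cmd_tracker {
    bool outstanding[MAX_TAGS]; /* true while a command awaits its response */
    unsigned long queued;       /* commands handed to the transport */
    unsigned long completed;    /* responses matched to a queued command */
    unsigned long unmatched;    /* responses with no matching command */
};

/* Record a command at the point where it is sent (queuecommand side). */
static bool track_queue(struct cmd_tracker *t, unsigned int tag)
{
    if (tag >= MAX_TAGS || t->outstanding[tag])
        return false; /* out-of-range or duplicate tag */
    t->outstanding[tag] = true;
    t->queued++;
    return true;
}

/* Record a response at the point where ->scsi_done would be called. */
static bool track_complete(struct cmd_tracker *t, unsigned int tag)
{
    if (tag >= MAX_TAGS || !t->outstanding[tag]) {
        t->unmatched++; /* response without a matching queued command */
        return false;
    }
    t->outstanding[tag] = false;
    t->completed++;
    return true;
}

/* Commands that were sent but have not completed yet. */
static unsigned long track_in_flight(const struct cmd_tracker *t)
{
    assert(t->queued >= t->completed);
    return t->queued - t->completed;
}
```

If the in-flight count stays above zero while the target is idle, a
command was sent without a matching completion; a nonzero unmatched
count would instead point at a response the initiator could not pair
with any command.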

 - R.


* Re: [ofa-general] [Bug 14235] New: SRP initiator lockup
       [not found]   ` <ada63b3xcdt.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2009-12-05 17:44     ` Bart Van Assche
       [not found]       ` <e2e108260912050944k1228e964ta7a70dde493ba010-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Bart Van Assche @ 2009-12-05 17:44 UTC (permalink / raw)
  To: Roland Dreier; +Cc: OFED mailing list

On Mon, Sep 28, 2009 at 5:27 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
>
> I don't have a good idea of where in the SRP initiator the problem could
> be... the non-error path for ordinary SCSI commands is pretty trivial.
> Presumably slowing down the target means that the queue of outstanding
> commands fills up, but they should complete and let things make
> progress.  I guess the possibilities are a bug higher up in the block or
> SCSI stack, or some accounting problem in SRP.
>
> You could try adding printks to srp_queuecommand() to see that all SCSI
> commands are sent on the SRP connection and also add tracing to
> srp_process_rsp() to make sure there's a matching call to ->scsi_done
> for each SCSI command.  We should also make sure there are no
> disconnections, task management commands, or anything like that
> confusing things ... there is definitely more room for bugs in the parts
> of the SRP driver that handle exceptions.

(replying to an e-mail from two months ago; I finally found the time to
take a closer look at the SRP initiator source code)

I'm not sure that the non-error path for ordinary SCSI commands is
that trivial. If my interpretation of the SRP initiator source code is
correct, the statements complete(&target->done) and
init_completion(&target->done) can be executed concurrently. Although
I do not know what the exact consequences are, and although I do not
know whether this is related to the issue I reported, this is a race
condition. I'm not sure that allowing such races is good kernel
programming practice.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [ofa-general] [Bug 14235] New: SRP initiator lockup
       [not found]       ` <e2e108260912050944k1228e964ta7a70dde493ba010-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-12-05 21:49         ` Roland Dreier
       [not found]           ` <adatyw5t7jt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Roland Dreier @ 2009-12-05 21:49 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: OFED mailing list


 > I'm not sure that the non-error path for ordinary SCSI commands is
 > that trivial. If my interpretation of the SRP initiator source code is
 > correct, the statements complete(&target->done) and
 > init_completion(&target->done) can be executed concurrently. Although
 > I do not know what the exact consequences are, and although I do not
 > know whether this is related to the issue I reported, this is a race
 > condition. I'm not sure that allowing such races is good kernel
 > programming practice.

target->done is only used in connection setup I think, so not related to
hangs during IO processing.  However I would like to know more details
of where you see this race, since yes we would want to fix that.

 - R.

* Re: [ofa-general] [Bug 14235] New: SRP initiator lockup
       [not found]           ` <adatyw5t7jt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2009-12-06 11:02             ` Bart Van Assche
  0 siblings, 0 replies; 4+ messages in thread
From: Bart Van Assche @ 2009-12-06 11:02 UTC (permalink / raw)
  To: Roland Dreier; +Cc: OFED mailing list

On Sat, Dec 5, 2009 at 10:49 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
>
>  > I'm not sure that the non-error path for ordinary SCSI commands is
>  > that trivial. If my interpretation of the SRP initiator source code is
>  > correct, the statements complete(&target->done) and
>  > init_completion(&target->done) can be executed concurrently. Although
>  > I do not know what the exact consequences are, and although I do not
>  > know whether this is related to the issue I reported, this is a race
>  > condition. I'm not sure that allowing such races is good kernel
>  > programming practice.
>
> target->done is only used in connection setup I think, so not related to
> hangs during IO processing.  However I would like to know more details
> of where you see this race, since yes we would want to fix that.

What I wrote above is based on reading the source code, so I'm not yet
sure that concurrent calls of complete(&target->done) and
init_completion(&target->done) really happen. Is it possible that, for
example, one CPU calls complete(&target->done) while processing the
IB_CM_TIMEWAIT_EXIT event while another CPU is calling
init_completion(&target->done) from inside srp_disconnect_target()?
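
Why such an interleaving would matter can be illustrated with a
simplified, single-threaded C model of a completion. This is a sketch
under stated assumptions, not the kernel implementation and not a
confirmed reproduction of the reported hang: the model_* names are
hypothetical, the model keeps only the "done" counter, and the two CPUs
are replaced by a fixed call order so the outcome is deterministic. The
real init_completion() also re-initializes the wait queue, which this
model leaves out.

```c
#include <stdbool.h>

/* Simplified model of a kernel completion: only the "done" counter. */
struct model_completion {
    unsigned int done; /* complete() calls not yet consumed by a waiter */
};

/* Like init_completion(): unconditionally resets the counter. */
static void model_init_completion(struct model_completion *c)
{
    c->done = 0;
}

/* Like complete(): records one signal for a waiter to consume. */
static void model_complete(struct model_completion *c)
{
    c->done++;
}

/* Non-blocking stand-in for wait_for_completion(): returns true if a
 * waiter would proceed, false if it would have to sleep. */
static bool model_try_wait(struct model_completion *c)
{
    if (c->done == 0)
        return false;
    c->done--;
    return true;
}

/* Intended order: initialize, signal, then wait. */
static bool waiter_proceeds_in_order(void)
{
    struct model_completion c;

    model_init_completion(&c);
    model_complete(&c);
    return model_try_wait(&c);
}

/* Hypothesized interleaving: the signal lands before the re-init. */
static bool waiter_proceeds_after_reinit(void)
{
    struct model_completion c;

    model_init_completion(&c);
    model_complete(&c);        /* e.g. the CM event handler on one CPU */
    model_init_completion(&c); /* e.g. the disconnect path on another CPU */
    return model_try_wait(&c); /* the earlier signal has been discarded */
}
```

In this model waiter_proceeds_in_order() returns true, while
waiter_proceeds_after_reinit() returns false: the re-initialization
throws away the pending signal, so a waiter that relied on it would
sleep until some later complete() call, if one ever came.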

Bart.
