From: James Bottomley <jejb@linux.vnet.ibm.com>
To: Wei Fang <fangwei1@huawei.com>, Christoph Hellwig <hch@infradead.org>
Cc: tj@kernel.org, martin.petersen@oracle.com,
linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: [PATCH] scsi: fix race between simultaneous decrements of ->host_failed
Date: Mon, 30 May 2016 09:04:58 -0700
Message-ID: <1464624298.2287.54.camel@linux.vnet.ibm.com>
In-Reply-To: <574BEB5F.5090509@huawei.com>
On Mon, 2016-05-30 at 15:27 +0800, Wei Fang wrote:
> Hi James, Christoph,
>
> On 2016/5/29 23:41, James Bottomley wrote:
> > On Sat, 2016-05-28 at 23:54 -0700, Christoph Hellwig wrote:
> > > On Sat, May 28, 2016 at 11:51:11AM +0800, Wei Fang wrote:
> > > > async_sas_ata_eh(), which will call scsi_eh_finish_cmd() in
> > > > some case, would be performed simultaneously in
> > > > sas_ata_strategy_handler(). In this case, ->host_failed may be
> > > > decreased simultaneously in scsi_eh_finish_cmd() on different
> > > > CPUs, and become abnormal.
> > > >
> > > > It will lead to permanently inequal between ->host_failed and
> > > > ->host_busy. Then SCSI error handler thread won't become
> > > > running, SCSI errors after that won't be handled forever.
> > > >
> > > > Use atomic type for ->host_failed to fix this race.
> > >
> > > Looks fine,
> >
> > Actually, it doesn't look fine at all. The same mechanism that's
> > supposed to protect the host_failed decrement is also supposed to
> > protect the list_move_tail(). If there's a problem with the former
> > then we're also in danger of corrupting the list.
>
> Scmd is moved to local eh_done_q list here, and I checked that the
> list won't be touched concurrently.
>
> > Can we go back to the theory of what the problem is, since it's not
> > spelled out very clearly in the change log. Our usual reason for
> > not requiring locking in eh routines is that the eh is single
> > threaded on the eh thread per host, so any host manipulations can't
> > have concurrency problems. In this case, the sas_ata routines are
> > trying to be clever and use asynchronous workqueues for the port
> > error handler and you theorise that these can execute concurrently
> > on two CPUs, thus causing the problem?
>
> Yes, it's the case. The works of the port error handler are added to
> system_unbound_wq, and will be performed concurrently on different
> CPUs. We have already met that problem on our machine.
OK, add that to the changelog and also that this fixes

commit 50824d6c5657ce340e3911171865a8d99fdd8eba
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Sun Dec 4 01:06:24 2011 -0800

    [SCSI] libsas: async ata-eh

Because that's where the concurrency rules weren't verified when this
async threading was added.

One final thing is that we don't need this replaced by atomics. The
only atomic check we need is the up count, which is already serialised
by the host lock. Nothing actually ever bothers with the down count,
so it can just be eliminated and host_failed set to zero after the
strategy handler is complete (but before scsi_restart_operations) in the
eh_thread.

Once this change is made, scsi_eh_finish_cmd() and
scsi_eh_flush_done_q() are safe provided the done_q list is not
modifiable by any other thread.
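
To make the suggestion above concrete, here is a rough userspace sketch of
the idea. It is not the kernel code and not the eventual patch: every
model_* name is hypothetical, the singly linked list merely stands in for
the list_move_tail() onto the local done_q, and the loop that "finishes"
commands stands in for the strategy handler. It only shows that the finish
path no longer touches host_failed and that the counter is zeroed once,
after the strategy handler and before the restart/flush step.

```c
#include <stddef.h>

/* Hypothetical stand-in for a failed command; not struct scsi_cmnd. */
struct model_cmd {
	struct model_cmd *next;
};

/* Hypothetical stand-in for the host; not struct Scsi_Host. */
struct model_host {
	unsigned int host_failed;	/* only counted up elsewhere, under the host lock */
};

/*
 * Models scsi_eh_finish_cmd() after the proposed change: there is no
 * host_failed decrement, the command is only moved to the caller's
 * private done list.
 */
static void model_eh_finish_cmd(struct model_cmd *cmd, struct model_cmd **done_q)
{
	cmd->next = *done_q;
	*done_q = cmd;
}

/*
 * Models one pass of the eh thread. Returns the number of commands
 * found on the done list when it is flushed.
 */
static unsigned int model_eh_pass(struct model_host *host,
				  struct model_cmd *failed, size_t n)
{
	struct model_cmd *done_q = NULL;
	unsigned int flushed = 0;
	size_t i;

	/* Stand-in for the strategy handler finishing every command. */
	for (i = 0; i < n; i++)
		model_eh_finish_cmd(&failed[i], &done_q);

	/* The down count is gone: reset once, before restarting operations. */
	host->host_failed = 0;

	/* Stand-in for flushing the done list. */
	for (; done_q; done_q = done_q->next)
		flushed++;

	return flushed;
}
```

Because nothing in the finish path writes to the host any more, the only
remaining requirement is the one stated above: each done list must be
private to the thread that is filling it.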

As Christoph said, the documentation needs updating to reflect these
new concurrency rules.

James