From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: Race condition between "read CFQ stats" and "block device shutdown" Date: Thu, 26 Sep 2013 09:54:43 -0400 Message-ID: <20130926135443.GC2480@htj.dyndns.org> References: <5226D661.7070301@suse.de> <20130904160723.GC26609@mtj.dyndns.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=/WynM0R9Q8Sb0JunGp6d7BzzagwDVbk+OTJQuEfaocg=; b=ge427A/3KGzsgkcyB+BEfTUC3QtAoBYeJQ/rqx0KsDFTLOucIwwOWKagZaWJPedcw6 P0xbXJdYkArOKRxeUG5DRBNUECo2TUU+VnbCWiSJlJPXOJwrtwfOeClbFNiKPsvQam71 EuTR2JM96DEmu+kaPe/NCOJy29WHQ8i6q6chUbr33UdP721yfLc380mZo0g1ZrgBfuoH a4BkCMfBsCX4VD50CLsxx3Bs5IRA+ow9YQ5eeBO3Q2WTzqXVkscA0V+SrKFSi8Kr/FI+ A4Oe40YggThcOf7SmDdHq1cMMgUOMXUx7e6+5yE1a/HS3CnJTAMLryZdNlojftD3SS+W zDew== Content-Disposition: inline In-Reply-To: Sender: linux-scsi-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Anatol Pomozov Cc: Hannes Reinecke , Cgroups , Jens Axboe , linux-scsi@vger.kernel.org Hello, (cc'ing linux-scsi) On Wed, Sep 25, 2013 at 01:37:51PM -0700, Anatol Pomozov wrote: > Hi > > On Wed, Sep 4, 2013 at 9:07 AM, Tejun Heo wrote: > > Hello, > > > > On Wed, Sep 04, 2013 at 08:45:33AM -0700, Anatol Pomozov wrote: > >> I am not an expect in block code, so I have a few questions here: > >> > >> - are we sure that this operation is atomic? What if blkg->q becomes > >> dead right after we checked it, and blkg->q->queue_lock got invalid so > >> we have the same crash as before? > > > > request_queue lock switching is something inherently broken in block > > layer. It's unsalvageable. > > Fully agree. The problem that request_queue->queue_lock is a shared > resource that concurrently modified/accessed. In this case (when one > thread changes, another thread access it) we need synchronization to > prevent race conditions. So we need a spin_lock to access queue_lock > spin_lock, otherwise we have a crash like one above... > > > Maybe we can drop lock switching once blk-mq is fully merged. > > Could you please provide more information about it? What is the timeline? I have no idea. Hopefully, not too far out. Jens would have better idea. > If there is an easy way to fix the race condition I would like to > help. Please give me some pointer what direction I should move. The first step would be identifying who are actually making use of lock switching, why and how much difference it would make for them to not do that. > PS Just a little bit of context why I care about this bug. We test a > large farm that actively uses iscsi. We are going to have a lot of > iscsi device startup/shutdown. I am testing whether this codepath has > race conditions and I found one above. Thanks. -- tejun