Date: Fri, 21 May 2010 09:34:56 -0400
From: Mike Snitzer
To: Kiyoshi Ueda
Cc: Nikanth Karthikesan, linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	Alasdair Kergon, Jens Axboe, "Jun'ichi Nomura", Vivek Goyal
Subject: Re: [RFC PATCH 2/2] dm: only initialize full request_queue for
	request-based device
Message-ID: <20100521133456.GA14338@redhat.com>
In-Reply-To: <4BF64538.6010305@ct.jp.nec.com>

On Fri, May 21 2010 at 4:32am -0400, Kiyoshi Ueda wrote:

> Hi Mike,
>
> On 05/21/2010 02:07 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda wrote:
> >> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> >>> Thanks again for pointing this out; I'll work to arrive at an
> >>> alternative locking scheme.  Likely I'll introduce a lock local to
> >>> the mapped_device (effectively the 'queue_lock' I had before).  But
> >>> the difference is I'd take that lock before taking _hash_lock.
> >>
> >> You have to look at do_resume(), too.
> >> do_resume() gets hc->new_map with _hash_lock held but without your
> >> 'queue_lock' held.  This could cause an inconsistent state:
> >>
> >>   table_load() for bio-based table      do_resume() for rq-based table
> >>   --------------------------------------------------------------------
> >>   dm_lock_md_queue(md)
> >>   dm_table_setup_md_queue(t) (<= bio-based)
> >>     dm_clear_request_based_queue(md)
> >>       md->queue->request_fn = NULL
> >>                                         down_write(_hash_lock)
> >>                                         new_map = hc->new_map (<= rq-based)
> >>                                         up_write(_hash_lock)
> >>                                         dm_swap_table(md, new_map)
> >>                                           __bind(md, new_map) (<= rq-based)
> >>   down_write(_hash_lock)
> >>   hc->new_map = t
> >>   up_write(_hash_lock)
> >>   dm_unlock_md_queue(md)
> snip
> >>>>> Also, your patch changes the queue configuration even when a table
> >>>>> is already active and used.  (e.g. loading a bio-based table to a
> >>>>> mapped_device which is already active/used as request-based sets
> >>>>> q->request_fn to NULL.)
> >>>>> That could cause some critical problems.
> >>>>
> >>>> Yes, that is possible and I can add additional checks to prevent this.
> >>>> But this speaks to a more general problem with the existing DM code.
> >>>>
> >>>> dm_swap_table() has a negative check to prevent such table loads, e.g.:
> >>>>   /* cannot change the device type, once a table is bound */
> >>>>
> >>>> This check should come during table_load, as part of
> >>>> dm_table_set_type(), rather than during table resume.
> >>>
> >>> I'll look to address this issue in v6 too.
> >>
> >> It can also race between table_load() and do_resume():
> >>
> >>   table_load() for bio-based            do_resume() for rq-based
> >>   ---------------------------------------------------------------------
> >>   dm_table_set_type(t) (<= bio-based)
> >>   live_table = dm_get_live_table(md)
> >>   (assume no live_table yet)
> >>                                         new_map = hc->new_map (<= rq-based)
> >>                                         dm_swap_table(md, new_map)
> >>                                           __bind(md, new_map)
> >>   dm_table_setup_md_queue(t)
> >>     dm_clear_request_based_queue(md)
> >>       md->queue->request_fn = NULL
> >
> > Understood.
> >
> > I've addressed the above races by:
> > 1) properly protecting md->queue during do_resume().  Both do_resume()
> >    and table_load() first take md->queue_lock and then _hash_lock.
> > 2) adding a negative check for an existing live table to
> >    dm_table_setup_md_queue().
> >
> > I will mail out v7 after I've given it additional review.
>
> dm_table_setup_md_queue() may do blocking memory allocation while
> holding md->queue_lock, which is also needed to resume the md.
> That could cause a deadlock which doesn't happen in the current model;
> e.g.:
>   - Assume a bio-based table has already been loaded into hc->new_map
>     and there is no live table yet in the mapped_device.
>   - Load a request-based table, and an out-of-memory situation happens
>     in dm_table_setup_md_queue().
>   - Then, the mapped_device can't be resumed, even though resuming it
>     with the preloaded bio-based table could free some memory via
>     writeback.
>     => deadlock

My patch introduces a new mutex that is local to the mapped_device
being loaded.  And that same mutex is taken during do_resume() (just
like you implicitly said was needed to avoid races).

But I'm _very_ skeptical of the validity of this deadlock scenario.
You're leaving out some important context for why we might expect a DM
device to service (queue) IO during the table load _before_ the device
has _ever_ been resumed.

When I issue IO to a DM device that only has an inactive table it
returns immediately:

# dmsetup table --inactive
bio-based: 0 8388608 linear 254:16 384
# dd if=/dev/mapper/bio-based of=/dev/null bs=1024k count=10
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00016418 s, 0.0 kB/s

As we discussed earlier in this thread, a table load can block
indefinitely (waiting for memory) as long as _hash_lock is not held (so
other DM ioctls are still possible).

It is a serious reach to suggest that your new deadlock scenario is a
legitimate concern.  It would mean that a user:

1) mistakenly loads a bio-based table when they meant rq-based
2) loads rq-based -- but blocks waiting for elevator_alloc()
3) attempts to resume the device with bio-based anyway -- to make
   forward progress with writeback destined to the device being resumed
   for the first time?  But again, why would dirty pages be destined
   for this inactive device already?
4) even if the user could resume, they'd be left with a bio-based
   device, which isn't what they wanted.

Do you see how obscure the scenario you've presented is?  Please help
me understand what I am missing.

Mike
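
p.s. to make the locking scheme concrete, here is roughly what the
ordering amounts to.  This is just a sketch, not the actual v7 code:
dm_lock_md_queue()/dm_unlock_md_queue() are the helper names from the
diagrams above, and everything else (error handling, the surrounding
ioctl plumbing) is simplified away:

    /*
     * table_load() side: take the device-local queue lock first,
     * then _hash_lock.
     */
    dm_lock_md_queue(md);
    r = dm_table_setup_md_queue(t);     /* may modify md->queue */
    down_write(&_hash_lock);
    hc->new_map = t;
    up_write(&_hash_lock);
    dm_unlock_md_queue(md);

    /*
     * do_resume() side: same order, so the bio-based queue teardown
     * in table_load() can no longer interleave with __bind() of an
     * rq-based table.
     */
    dm_lock_md_queue(md);
    down_write(&_hash_lock);
    new_map = hc->new_map;
    up_write(&_hash_lock);
    r = dm_swap_table(md, new_map);     /* __bind() the new table */
    dm_unlock_md_queue(md);

The key point is simply that both ioctl paths acquire md->queue_lock
before _hash_lock, so the lock ordering is consistent and neither of
the interleavings in your diagrams is possible.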