From: Mike Snitzer <snitzer@redhat.com>
To: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>,
linux-kernel@vger.kernel.org, dm-devel@redhat.com,
Alasdair Kergon <agk@redhat.com>,
Jens Axboe <jens.axboe@oracle.com>,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com>,
Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
Date: Fri, 21 May 2010 09:34:56 -0400 [thread overview]
Message-ID: <20100521133456.GA14338@redhat.com> (raw)
In-Reply-To: <4BF64538.6010305@ct.jp.nec.com>
On Fri, May 21 2010 at 4:32am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> Hi Mike,
>
> On 05/21/2010 02:07 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> >>> Thanks again for pointing this out; I'll work to arrive at an
> >>> alternative locking scheme. Likely introduce a lock local to the
> >>> multiple_device (effectively the 'queue_lock' I had before). But
> >>> difference is I'd take that lock before taking _hash_lock.
> >>
> >> You have to see do_resume(), too.
> >> do_resume() gets hc->new_map with _hash_lock held but without your
> >> 'queue_lock' held. This could cause inconsistent status;
> >> table_load() for bio-based table do_resume() for rq-based table
> >> --------------------------------------------------------------------
> >> dm_lock_md_queue(md)
> >> dm_table_setup_md_queue(t) (<= bio-based)
> >> dm_clear_request_based_queue(md)
> >> md->queue->request_fn = NULL
> >> down_write(_hash_lock)
> >> new_map = hc->new_map (<= rq-based)
> >> up_write(_hash_lock)
> >> dm_swap_table(md, new_map)
> >> __bind(md, new_map) (<= rq-based)
> >> down_write(_hash_lock)
> >> hc->new_map = t
> >> up_write(_hash_lock)
> >> dm_unlock_md_queue(md)
> snip
> >>>>> Also, your patch changes the queue configuration even when a table is
> >>>>> already active and used. (e.g. Loading bio-based table to a mapped_device
> >>>>> which is already active/used as request-based sets q->requst_fn in NULL.)
> >>>>> That could cause some critical problems.
> >>>>
> >>>> Yes, that is possible and I can add additional checks to prevent this.
> >>>> But this speaks to a more general problem with the existing DM code.
> >>>>
> >>>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
> >>>> /* cannot change the device type, once a table is bound */
> >>>>
> >>>> This check should come during table_load, as part of
> >>>> dm_table_set_type(), rather than during table resume.
> >>>
> >>> I'll look to address this issue in v6 too.
> >>
> >> It can race between table_load() and do_resume() as well;
> >> table_load() for bio-based do_resume() for rq-based
> >> ---------------------------------------------------------------------
> >> dm_table_set_type(t) (<= bio-based)
> >> live_table = dm_get_live_table(md)
> >> (assume no live_table yet)
> >> new_map = hc->new_map (<= rq-based)
> >> dm_swap_table(md, new_map)
> >> __bind(md, new_map)
> >> dm_table_setup_md_queue(t)
> >> dm_clear_request_based_queue(md)
> >> md->queue->request_fn = NULL
> >
> > Understood.
> >
> > I've addressed the above races by:
> > 1) properly protecting md->queue during do_resume(). Both do_resume()
> > and table_load() first take md->queue_lock and then _hash_lock.
> > 2) adding a negative check for an existing live table to
> > dm_table_setup_md_queue().
> >
> > I will mail out v7 after I've given it additional review.
>
> dm_table_setup_md_queue() may allocate memory with blocking mode
> inside md->queue_lock which is needed to resume the md.
> That could cause a deadlock which doesn't happen in the current model;
> e.g:
> - Assume a bio-based table has been already loaded in hc->new_map
> and no live table yet in a mapped_device.
> - Load a request-based table and then no memory situation happens
> in dm_table_setup_md_queue().
> - Then, the mapped_device can't be resumed even in the case that
> resuming the mapped_device with the preloaded bio-based table
> can make some memory by writeback.
> => deadlock
My patch introduces a new mutex that is local to the mapped_device being
loaded. And that same mutex is taken during do_resume (just like you
implicitly said was needed to avoid races).
But I'm _very_ skeptical of the validity of this deadlock scenario.
You're leaving out some important context for why we might expect a DM
device to service (queue) IO during the table load _before_ the device
has _ever_ been resumed. When I issue IO to a DM device that only has
an inactive table it returns immediately:
# dmsetup table --inactive
bio-based: 0 8388608 linear 254:16 384
# dd if=/dev/mapper/bio-based of=/dev/null bs=1024k count=10
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00016418 s, 0.0 kB/s
As we discussed earlier in this thread, a table load can block
indefinitely (waiting for memory) as long as _hash_lock is not held (so
other DM ioctls are still possible).
It is a serious reach to suggest that your new deadlock scenario is a
legitimate concern. This would mean that a user would:
1) mistakenly load a bio-based table when they meant rq-based
2) load rq-based -- but block waiting for elevator_alloc()
3) attempt to resume device with bio-based anyway -- to make forward
progress with writeback destined to the device being resumed for the
first time? -- but again, why would dirty pages be destined for this
inactive device already?
4) even if the user could resume they'd be left with a bio-based device;
which isn't what they wanted..
Do you see the obscure nature of the scenario you've presented?
Please help me understand what I am missing?
Mike
next prev parent reply other threads:[~2010-05-21 13:34 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-05-10 22:55 [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Mike Snitzer
2010-05-10 22:55 ` Mike Snitzer
2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
2010-05-10 22:55 ` Mike Snitzer
2010-05-11 4:23 ` Kiyoshi Ueda
2010-05-11 13:15 ` Mike Snitzer
2010-05-12 8:23 ` Kiyoshi Ueda
2010-05-13 3:57 ` Mike Snitzer
2010-05-14 8:06 ` Kiyoshi Ueda
2010-05-14 14:08 ` Mike Snitzer
2010-05-17 9:27 ` Kiyoshi Ueda
2010-05-17 17:27 ` Mike Snitzer
2010-05-18 8:32 ` Kiyoshi Ueda
2010-05-18 13:46 ` Mike Snitzer
2010-05-18 13:46 ` Mike Snitzer
2010-05-19 5:57 ` Kiyoshi Ueda
2010-05-19 12:01 ` Mike Snitzer
2010-05-19 12:01 ` Mike Snitzer
2010-05-19 14:39 ` Mike Snitzer
2010-05-19 14:45 ` Mike Snitzer
2010-05-20 11:21 ` Kiyoshi Ueda
2010-05-20 17:07 ` Mike Snitzer
2010-05-21 8:32 ` Kiyoshi Ueda
2010-05-21 13:34 ` Mike Snitzer [this message]
2010-05-24 9:58 ` Kiyoshi Ueda
2010-05-19 21:51 ` Mike Snitzer
2010-05-13 4:31 ` [PATCH 2/2 v2] " Mike Snitzer
2010-05-13 5:02 ` [RFC PATCH 3/2] dm: bio-based device must not register elevator in sysfs Mike Snitzer
2010-05-13 22:14 ` [PATCH 3/2 v2] " Mike Snitzer
2010-05-11 6:55 ` [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Jens Axboe
2010-05-11 13:18 ` Mike Snitzer
2010-05-11 13:21 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100521133456.GA14338@redhat.com \
--to=snitzer@redhat.com \
--cc=agk@redhat.com \
--cc=dm-devel@redhat.com \
--cc=j-nomura@ce.jp.nec.com \
--cc=jens.axboe@oracle.com \
--cc=k-ueda@ct.jp.nec.com \
--cc=knikanth@suse.de \
--cc=linux-kernel@vger.kernel.org \
--cc=vgoyal@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.