Re: [Qemu-devel] [PATCH VERSION 3] Disk image exclusive and shared locks.

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Anthony Liguori <anthony@codemonkey.ws>
To: "Richard W.M. Jones" <rjones@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH VERSION 3] Disk image exclusive and shared locks.
Date: Tue, 15 Dec 2009 12:45:13 -0600	[thread overview]
Message-ID: <4B27D939.5040005@codemonkey.ws> (raw)
In-Reply-To: <20091215180925.GM3406@amd.home.annexia.org>

Richard W.M. Jones wrote:
> Thanks for looking at this.
>   

Thanks for following up :-)

> The use case is "cluster filesystem with an admin tool that must be
> run exclusively".  Cluster nodes open the block device for write, but
> with a shared lock.  The admin tool needs exclusive access (no nodes
> must be writing), so it tries to open the device with lock=exclusive.
> This only succeeds if the normal cluster nodes have backed off.
>   

This seems like a reasonable uncommon scenario and my concern is that it 
really hurts the interface.  lock=on|off I think qualifies as "The 
obvious use is the correct one" whereas I think 
lock=exclusive|shared|none would qualify as "Read the implementation and 
you'll get it right."

And that's mainly due to the number of corner cases that have to be 
considered in order for exclusive to degrade into shared.

> However, the patch would be a bit simpler if we just had lock=on|off
> at the moment, and it wouldn't stop us from adding the shared case in
> future.  (For my needs I don't care about the shared case).
>   

Let's stick with lock=on|off.  If anyone comes along and actually wants 
shared|exclusive, we can revisit the thread :-)

>> I really dislike this as an interface.  I think we need to make a  
>> decision about whether we delay open until after migration has  
>> completed.  I think unless there's a really compelling argument against  
>> it, this is probably what we should do.
>>     
>
> I'm guessing this needs some quite major surgery to qemu so that block
> device opening is delayed in the case of migration.  I can take a look
> at this.
>   

I think it's actually easier than I expected it to be.

We really just need to delay lock acquisition until it's really needed 
(read/write op).  For probing, we can get away without having a lock 
held since it's read only and we only use it to decide which bdrv to use.

So what I'd expect is that we would to acquire a lock until we actually 
ran the vm.  We would basically have a bdrv_lock_all() that could fail 
if some lock couldn't be acquired.  We would execute that right before 
running a VM (which means it happens after incoming migration completes).

After we completed migration (and did an fsck), we would call 
bdrv_unlock_all() to drop the lock.  Before we did 'cont' again, we 
would issue bdrv_lock_all().

As long as we do a bdrv_unlock_all() before writing the last byte of the 
migration stream, it's not at all racy.  We also don't have to close the 
fd on the source which means that we maintain the same level of robustness.

>> As it stands, we cannot make lock=!none the default if it takes an extra  
>> monitor command to allow for live migration.  I think if we're going to  
>> introduce this functionality, we probably should be enabling it by 
>> default.
>>     
>
> Xen has a similar feature, and it is enabled by default.
>   

And I really, really dislike it :-)  There are a lot of corner cases 
with requiring the use of '!' to the point where we ended up just using 
it for everything in our management tools.  IIRC, they don't use 
advisory locking and instead try to be clever with something similar to 
lsof.

Regards,

Anthony Liguori

next prev parent reply	other threads:[~2009-12-15 18:45 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-15 16:42 [Qemu-devel] [PATCH VERSION 3] Disk image exclusive and shared locks Richard W.M. Jones
2009-12-15 18:02 ` Anthony Liguori
2009-12-15 18:09   ` Richard W.M. Jones
2009-12-15 18:45     ` Anthony Liguori [this message]
2009-12-15 18:33 ` Jamie Lokier
2009-12-15 23:26   ` Jamie Lokier
2009-12-16 10:37   ` Kevin Wolf
2009-12-17 13:26     ` Jamie Lokier
2009-12-17 10:53 ` Christoph Hellwig
2009-12-17 11:06   ` Richard W.M. Jones
2009-12-17 15:38   ` Jamie Lokier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B27D939.5040005@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=qemu-devel@nongnu.org \
    --cc=rjones@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).