From: ronnie sahlberg <ronniesahlberg@gmail.com>
To: Peter Lieven <pl@kamp.de>
Cc: "kwolf@redhat.com" <kwolf@redhat.com>,
	"famz@redhat.com" <famz@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Orit Wasserman <owasserm@redhat.com>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCHv2] block: add native support for NFS
Date: Wed, 18 Dec 2013 09:50:35 -0800
Message-ID: <CAN05THTK5JFiHB7-VXgYj-c0ZmGUASkxoQ1D+g0LymRgwNFWcQ@mail.gmail.com>
In-Reply-To: <7E6420C5-311F-437E-A0C4-BC0621F0CC47@kamp.de>

On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven <pl@kamp.de> wrote:
>
> On 18.12.2013 at 18:33, ronnie sahlberg <ronniesahlberg@gmail.com> wrote:
>
>> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 18.12.2013 at 15:42, ronnie sahlberg <ronniesahlberg@gmail.com> wrote:
>>>
>>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 17.12.2013 at 18:32, "Daniel P. Berrange"
>>>>>>> <berrange@redhat.com> wrote:
>>>>>>>
>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>>> without
>>>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>>>
>>>>>>>> NFS Images can simply be specified by an url of the form:
>>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>>>
>>>>>>>
>>>>>>> Does it support other config tunables, eg specifying which
>>>>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>>>>> URI parameters in the obvious manner ?
>>>>>>
>>>>>>
>>>>>> currently only v3 is supported by libnfs. what other tunables would you
>>>>>> like to see?
>>>>>>
>>>>>
>>>>> For live migration we need the sync option (async ignores O_SYNC and
>>>>> O_DIRECT sadly),
>>>>> will it be supported? or will it be the default?
>>>>>
>>>>
>>>> If you use the high-level API that provides POSIX-like functions,
>>>> such as nfs_open(), then libnfs does support this.
>>>> nfs_open()/nfs_open_async() take a mode parameter, and libnfs checks
>>>> it for the O_SYNC flag.
>>>>
>>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>>> NFS/WRITE3+UNSTABLE that allows the server to just write to
>>>> cache/memory.
>>>>
>>>> If you specify O_SYNC in the mode argument to nfs_open()/nfs_open_async(),
>>>> then libnfs will flag this handle as sync, and any calls to
>>>> nfs_write()/nfs_pwrite() will translate to NFS/WRITE3+FILE_SYNC.
>>>>
>>>> Calls to nfs_fsync() are translated to NFS/COMMIT3.
>>>
>>> If this NFS/COMMIT3 would issue a sync on the server that would be all we
>>> actually need.
>>
>> You have that guarantee in NFS/COMMIT3
>> NFS/COMMIT3 will not return until the server has flushed the specified
>> range to disk.
>>
>> However, while the NFS protocol allows you to specify a range for the
>> COMMIT3 call so that you can do things like
>> WRITE3 Offset:foo Length:bar
>> COMMIT3 Offset:foo Length:bar
>> many/most NFS servers will ignore the offset/length arguments to the
>> COMMIT3 call and always unconditionally do an fsync() for the whole
>> file.
>>
>> This can make the COMMIT3 call very expensive for large files.
>>
>>
>> NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
>> specify O_SYNC to nfs_open*().
>> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode,
>> which means that the server guarantees to write the data to stable
>> storage before responding to the client.
>> In this mode there is no real need to do anything at all, or even call
>> COMMIT3, since there is never any writeback data on the server that
>> needs to be destaged.
>>
>>
>> Since many servers treat COMMIT3 as "unconditionally walk all blocks
>> for the whole file and make sure they are destaged", it is not clear
>> how
>>
>> WRITE3-normal Offset:foo Length:bar
>> COMMIT3 Offset:foo Length:bar
>>
>> will compare to
>>
>> WRITE3+O_SYNC Offset:foo Length:bar
>>
>> I would not be surprised if the second mode had higher (potentially
>> significantly higher) performance than the first.
>
> The qemu block layer currently is designed to send a bdrv_flush after every single
> write if the write cache is not enabled. This means that the unwritten data is just
> the data of the single write operation.

I understand that there is only a single WRITE3 worth of data to
actually destage each time.

What I meant is that on a lot of servers, for large files, the
server might need to spend a non-trivial amount of time
crunching file metadata and checking every single page of the file in
order to discover that "I only need to destage pages x,y,z".

On many NFS servers this "figure out which blocks to flush" step can
take a lot of time and hurt performance significantly.
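
To put the comparison in concrete terms, here is a minimal sketch in C of
the mapping described earlier in this thread. It is not libnfs code: the
helper names (stable_how_for_flags, flush_needs_commit, rpcs_for_writes)
are made up for illustration, and only the stable_how values are taken
from the NFSv3 specification (RFC 1813). It assumes one flush after every
write, as the qemu block layer does when the write cache is disabled.

```c
#include <fcntl.h>   /* O_SYNC, O_WRONLY */

/* stable_how values for WRITE3, as defined in RFC 1813 (NFSv3). */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/*
 * Hypothetical helper: pick the WRITE3 stability level for a handle,
 * following the rule described in this thread (O_SYNC in the open flags
 * means FILE_SYNC, anything else means UNSTABLE).
 */
static enum stable_how stable_how_for_flags(int open_flags)
{
    return ((open_flags & O_SYNC) == O_SYNC) ? FILE_SYNC : UNSTABLE;
}

/*
 * Hypothetical helper: a flush only has to send COMMIT3 when writes were
 * sent UNSTABLE. With FILE_SYNC the data is already on stable storage
 * when the WRITE3 reply arrives.
 */
static int flush_needs_commit(int open_flags)
{
    return stable_how_for_flags(open_flags) == UNSTABLE;
}

/*
 * Hypothetical helper: number of RPCs sent for n_writes writes when each
 * write is followed by a flush.
 */
static unsigned rpcs_for_writes(int open_flags, unsigned n_writes)
{
    return flush_needs_commit(open_flags) ? 2 * n_writes : n_writes;
}
```

So the unstable path costs a WRITE3 plus a COMMIT3 per guest write, while
the O_SYNC path costs a single WRITE3. The sketch counts RPCs only; it
says nothing about how long the server spends inside each COMMIT3.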



> However, changing this to issue a sync
> write call would require changing the whole API. The major problem is that
> the write cache setting can be changed while the device is open; otherwise
> we could just ignore all calls to bdrv_flush if the device was opened
> without the write cache enabled.
>
> In the very popular case of using virtio as the driver, the device
> is always opened with the write cache disabled, and the write cache is only
> enabled after the host has negotiated with the guest that the guest is
> able to send flushes.
>
> We can keep in mind for a later version of the driver that we manually craft
> a write call with O_SYNC if the write cache is disabled and ignore bdrv_flush,
> and that we use async write + commit via bdrv_flush in the case of an enabled
> write cache.
>
> Peter
>
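
A rough sketch of that later-version idea, in C. This is not the qemu NFS
driver and not libnfs; the struct and function names (nfs_wb_state,
plan_write, plan_flush) are hypothetical. It picks the stability level
per write from the current write cache setting, and it remembers whether
any UNSTABLE write is still outstanding, so that a flush arriving after
the cache has been switched off at runtime still results in a COMMIT3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* stable_how values for WRITE3, as defined in RFC 1813 (NFSv3). */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Hypothetical per-device state; the field names are illustrative only. */
struct nfs_wb_state {
    bool write_cache_enabled; /* may change while the device is open */
    bool unstable_pending;    /* an UNSTABLE write has not been committed */
};

/* Decide how the next write should be sent. */
static enum stable_how plan_write(struct nfs_wb_state *s)
{
    assert(s != NULL);
    if (s->write_cache_enabled) {
        /* Cached mode: let the server buffer, commit on the next flush. */
        s->unstable_pending = true;
        return UNSTABLE;
    }
    /* Write cache disabled: the write itself must reach stable storage. */
    return FILE_SYNC;
}

/* Returns true if the flush has to send a COMMIT3, false if it is a no-op. */
static bool plan_flush(struct nfs_wb_state *s)
{
    assert(s != NULL);
    bool need_commit = s->unstable_pending;
    s->unstable_pending = false;
    return need_commit;
}
```

Tracking unstable_pending instead of looking only at the current cache
setting is what deals with the setting changing while the device is open.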
