From: Badari Pulavarty <pbadari@us.ibm.com>
To: Jeff Layton <jlayton@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>,
linux-nfs@vger.kernel.org, khoa@us.ibm.com
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client
Date: Wed, 13 Apr 2011 06:43:53 -0700
Message-ID: <4DA5A899.3040202@us.ibm.com>
In-Reply-To: <20110413083656.12e54a91@tlielax.poochiereds.net>
On 4/13/2011 5:36 AM, Jeff Layton wrote:
> On Tue, 12 Apr 2011 10:46:00 -0700
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
>
>> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
>>
>>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>>>
>>>
>>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>>>>
>>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We recently ran into a serious performance issue with the NFS client.
>>>>>> It turned out to be due to the lack of readv/writev support in the
>>>>>> NFS (O_DIRECT) client.
>>>>>>
>>>>>> Here is our use-case:
>>>>>>
>>>>>> In our cloud environment, our storage is over NFS. Files
>>>>>> on NFS are passed as block devices to the guest (using
>>>>>> O_DIRECT). When the guest does IO on these block devices,
>>>>>> it ends up as O_DIRECT writes to NFS (on the KVM host).
>>>>>>
>>>>>> QEMU (on the host) gets a vector from the virtio ring and
>>>>>> submits it. Old versions of QEMU linearized the vector they
>>>>>> got from KVM (copied it into a single buffer) and submitted
>>>>>> that buffer, so the NFS client always received a single buffer.
>>>>>>
>>>>>> Later versions of QEMU eliminated this copy and submit
>>>>>> the vector directly using preadv()/pwritev().
>>>>>>
>>>>>> The NFS client loops through the vector and submits each
>>>>>> segment as a separate request whenever each IO is < wsize. In our
>>>>>> case (negotiated wsize=1MB), a 256K IO arrives as 64 vectors
>>>>>> of 4K each, so we end up submitting 64 4K FILE_SYNC IOs.
>>>>>> The server ends up doing each 4K write synchronously, which causes
>>>>>> serious performance degradation. We are trying to see if the
>>>>>> performance improves if we convert the IOs to ASYNC, but
>>>>>> our initial results don't look good.
>>>>>>
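For illustration, here is a rough userspace-style sketch of the per-segment
behaviour described above. The helper is hypothetical (it is not the actual
fs/nfs/direct.c loop); it only shows how 64 iovecs of 4K each turn into 64
separate WRITE RPCs even though wsize (1MB) could cover the whole 256K request:

#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical illustration: the DIO write path issues one set of WRITE
 * RPCs per iovec segment, splitting a segment only if it exceeds wsize. */
static size_t count_write_rpcs(const struct iovec *iov, unsigned long nr_segs,
			       size_t wsize)
{
	size_t rpcs = 0;
	unsigned long i;

	for (i = 0; i < nr_segs; i++)
		rpcs += (iov[i].iov_len + wsize - 1) / wsize;

	return rpcs;	/* 64 segments of 4K with wsize = 1MB -> 64 RPCs */
}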
>>>>>> Adding readv/writev support to the NFS client for all possible
>>>>>> cases is hard. However, if all vectors are page-aligned and all
>>>>>> IO sizes are page multiples, it fits the current code easily.
>>>>>> Luckily, the QEMU use case meets these requirements.
>>>>>>
>>>>>> Here is a patch to add this support. Comments?
>>>>>>
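As a minimal sketch of the restriction described above (a hypothetical
helper, not code taken from the posted patch), the vector qualifies for the
single-request fast path only if every segment starts on a page boundary and
covers a whole number of pages:

#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical check: all buffers page-aligned, all lengths page multiples. */
static int iov_fits_fast_path(const struct iovec *iov, unsigned long nr_segs)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	unsigned long i;

	for (i = 0; i < nr_segs; i++) {
		if ((uintptr_t)iov[i].iov_base % page_size)
			return 0;	/* segment does not start on a page boundary */
		if (iov[i].iov_len % page_size)
			return 0;	/* length is not a whole number of pages */
	}
	return 1;
}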
>>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
>>>>>
>>>> I am not suggesting an API change at all. All I am doing is: if all
>>>> the IOs are aligned, we can take a fast path and handle them as a
>>>> single IO request (as if we had received a single buffer). Otherwise,
>>>> we do it the hard way, as today: loop through each segment and submit
>>>> them individually.
>>>>
>>> Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
>>>
>>>
>>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
>>>>>
>>>> We are trying that patch. It improves performance a little,
>>>> but not anywhere close to doing it as a single vector/buffer.
>>>>
>>>> Khoa, can you share your performance data for all the
>>>> suggestions/patches you have tried so far?
>>>>
>>> The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?
>>>
>> We haven't tried both fixes together (the RPC slot table increase and
>> turning the writes into ASYNC). Each one individually didn't help much.
>> We will try them together.
>>
>>
> I must have missed some of these emails, but here's the patch that
> Chuck mentioned, based on his suggestion. It may be reasonable as a
> stopgap solution until Trond's overhaul of the DIO code is complete.
>
> With this + a larger slot table size, I would think you'd have a
> substantial write performance improvement. Probably not as good as if
> the writes were coalesced, but it should help.
>
> Badari, if you can let us know whether this plus increasing the slot
> table size helps, then I'll plan to "officially" submit it. This one is
> against RHEL6 but it should apply to mainline kernels with a small
> offset.
>
> -----------------[snip]-----------------
>
> BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
>
> Currently, the client uses FILE_SYNC whenever it's writing an amount of
> data less than or equal to the wsize. This is a problem though if we have
> a bunch of small iovecs batched up in a single writev call. The client
> will iterate over them and do a separate FILE_SYNC WRITE for each.
>
> Instead, change the code to do unstable writes when we'll need to do
> multiple WRITE RPCs in order to satisfy the request. While we're at
> it, optimize away the allocation of commit_data when we aren't going
> to use it anyway.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>
Khoa is running tests with this + the RPC slot table change. Single-thread
performance improved, but multi-threaded tests don't scale (I guess we are
running out of RPC slot table entries even with 128?).
He will share the results shortly.
Thanks,
Badari
> ---
> fs/nfs/direct.c | 13 +++++++++++--
> 1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 398f8ed..d2ed659 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
>  	dreq = nfs_direct_req_alloc();
>  	if (!dreq)
>  		goto out;
> -	nfs_alloc_commit_data(dreq);
>
> -	if (dreq->commit_data == NULL || count <= wsize)
> +	if (count > wsize || nr_segs > 1)
> +		nfs_alloc_commit_data(dreq);
> +	else
> +		dreq->commit_data = NULL;
> +
> +	/*
> +	 * If we couldn't allocate commit data, or we'll just be doing a
> +	 * single write, then make this a NFS_FILE_SYNC write and do away
> +	 * with the commit.
> +	 */
> +	if (dreq->commit_data == NULL)
>  		sync = NFS_FILE_SYNC;
>
>  	dreq->inode = inode;
>
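To make the patched decision concrete, here is a small standalone sketch
(the helper name is hypothetical) of the condition the diff introduces,
with worked examples assuming the wsize of 1MB negotiated in this thread:

#include <stddef.h>

/* Hypothetical standalone mirror of the patched condition: allocate commit
 * data (and use UNSTABLE writes) whenever more than one WRITE RPC will be
 * needed to satisfy the request. */
static int dio_write_needs_commit(size_t count, unsigned long nr_segs,
				  size_t wsize)
{
	return count > wsize || nr_segs > 1;
}

/*
 * With wsize = 1MB:
 *   count = 256K, nr_segs = 1   -> 0: single FILE_SYNC WRITE
 *   count = 256K, nr_segs = 64  -> 1: UNSTABLE WRITEs + COMMIT
 *   count = 4MB,  nr_segs = 1   -> 1: UNSTABLE WRITEs + COMMIT
 */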
Thread overview: 24+ messages
2011-04-12 15:32 [RFC][PATCH] Vector read/write support for NFS (DIO) client Badari Pulavarty
2011-04-12 15:36 ` Chuck Lever
2011-04-12 16:15 ` Badari Pulavarty
2011-04-12 16:42 ` Chuck Lever
2011-04-12 17:46 ` Badari Pulavarty
2011-04-13 12:36 ` Jeff Layton
2011-04-13 13:43 ` Badari Pulavarty [this message]
2011-04-13 14:02 ` Jeff Layton
2011-04-13 14:22 ` Trond Myklebust
2011-04-13 14:27 ` Andy Adamson
2011-04-13 17:20 ` Jeff Layton
2011-04-13 17:35 ` Trond Myklebust
2011-04-13 17:56 ` Andy Adamson
2011-04-13 18:14 ` Trond Myklebust
2011-04-13 18:47 ` Chuck Lever
2011-04-13 19:04 ` Jeff Layton
2011-04-14 0:21 ` Dean
2011-04-14 0:42 ` Trond Myklebust
2011-04-14 6:39 ` Dean
2011-04-12 15:49 ` Trond Myklebust
[not found] ` <1302623369.4801.28.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
2011-04-12 16:17 ` Badari Pulavarty
2011-04-12 16:26 ` Trond Myklebust
2011-04-15 17:33 ` Christoph Hellwig
2011-04-15 18:00 ` Trond Myklebust