linux-nfs.vger.kernel.org archive mirror
From: Badari Pulavarty <pbadari@us.ibm.com>
To: Jeff Layton <jlayton@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, khoa@us.ibm.com
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client
Date: Wed, 13 Apr 2011 06:43:53 -0700	[thread overview]
Message-ID: <4DA5A899.3040202@us.ibm.com> (raw)
In-Reply-To: <20110413083656.12e54a91@tlielax.poochiereds.net>

On 4/13/2011 5:36 AM, Jeff Layton wrote:
> On Tue, 12 Apr 2011 10:46:00 -0700
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
>    
>> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
>>      
>>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>>>
>>>        
>>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>>>>          
>>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>>>>
>>>>>            
>>>>>> Hi,
>>>>>>
>>>>>> We recently ran into a serious performance issue with the NFS
>>>>>> client. It turned out to be due to the lack of readv/writev
>>>>>> support in the NFS (O_DIRECT) client.
>>>>>>
>>>>>> Here is our use-case:
>>>>>>
>>>>>> In our cloud environment, our storage is over NFS. Files
>>>>>> on NFS are passed as block devices to the guest (using
>>>>>> O_DIRECT). When the guest does IO on these block devices,
>>>>>> it ends up as O_DIRECT writes to NFS (on the KVM host).
>>>>>>
>>>>>> QEMU (on the host) gets a vector from the virtio ring and
>>>>>> submits it. Old versions of QEMU linearized the vector they
>>>>>> got from KVM (copied the segments into one buffer) and
>>>>>> submitted that buffer, so the NFS client always received a
>>>>>> single buffer.
>>>>>>
>>>>>> Later versions of QEMU eliminate this copy and submit the
>>>>>> vector directly using preadv()/pwritev().
>>>>>>
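
For illustration, here is a minimal userspace sketch of the I/O pattern
described above (this is not QEMU code; the path is made up): 64
page-aligned 4K buffers handed to one pwritev() call on an O_DIRECT fd,
so the kernel sees a single 256K vectored request.

#define _GNU_SOURCE		/* O_DIRECT, pwritev() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEGS	64
#define SEGSIZE	4096

int main(void)
{
	struct iovec iov[SEGS];
	int fd = open("/mnt/nfs/disk.img", O_WRONLY | O_DIRECT);

	if (fd < 0)
		return 1;

	for (int i = 0; i < SEGS; i++) {
		/* O_DIRECT wants aligned buffers; align each 4K segment. */
		if (posix_memalign(&iov[i].iov_base, SEGSIZE, SEGSIZE))
			return 1;
		memset(iov[i].iov_base, 0, SEGSIZE);
		iov[i].iov_len = SEGSIZE;
	}

	/* One 256K vectored write at offset 0. */
	ssize_t ret = pwritev(fd, iov, SEGS, 0);

	close(fd);
	return ret == SEGS * SEGSIZE ? 0 : 1;
}
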
>>>>>> The NFS client loops through the vector and submits each
>>>>>> segment as a separate request whenever the IO is < wsize. In
>>>>>> our case (negotiated wsize=1MB), a 256K IO arrives as 64
>>>>>> segments of 4K each, so we end up submitting 64 4K FILE_SYNC
>>>>>> IOs. The server ends up doing each 4K write synchronously,
>>>>>> which causes serious performance degradation. We are trying
>>>>>> to see if performance improves if we convert the IOs to
>>>>>> ASYNC - but our initial results don't look good.
>>>>>>
>>>>>> Adding readv/writev support to the NFS client for all possible
>>>>>> cases is hard. However, if all segments are page-aligned and
>>>>>> the IO sizes are page multiples, it fits the current code
>>>>>> easily. Luckily, the QEMU use case meets these requirements.
>>>>>>
>>>>>> Here is the patch to add this support. Comments ?
>>>>>>              
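
A minimal sketch of the alignment check such a fast path would need, as
it might look inside fs/nfs/direct.c (the helper name is hypothetical;
it is not taken from any patch posted in this thread):

/*
 * Return true only if every segment starts on a page boundary and is a
 * whole number of pages long, i.e. the cases the proposed fast path
 * would accept; everything else falls back to the existing per-segment
 * loop.
 */
static bool nfs_iov_fits_fast_path(const struct iovec *iov,
				   unsigned long nr_segs)
{
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if ((unsigned long)iov[seg].iov_base & ~PAGE_MASK)
			return false;	/* not page-aligned */
		if (iov[seg].iov_len & ~PAGE_MASK)
			return false;	/* not a whole number of pages */
	}
	return true;
}
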
>>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
>>>>>            
>>>> I am not suggesting an API change at all. All I am doing is: if all
>>>> the IOs are aligned, we can take a fast path and submit them as a
>>>> single IO request (as if we got a single buffer). Otherwise, we do
>>>> it the hard way as today - loop through each segment and submit
>>>> them individually.
>>>>          
>>> Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
>>>
>>>        
>>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
>>>>>            
>>>> We are trying that patch. It does improve performance a little,
>>>> but not anywhere close to submitting it as a single vector/buffer.
>>>>
>>>> Khoa, can you share your performance data for all the
>>>> suggestions/patches you tried so far ?
>>>>          
>>> The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?
>>>        
>> We haven't tried both fixes together (the RPC slot increase and
>> turning the writes into ASYNC). Each one individually didn't help
>> much. We will try them together.
>>
>>      
> I must have missed some of these emails, but here's the patch that
> Chuck mentioned, based on his suggestion. It may be reasonable as a
> stopgap solution until Trond's overhaul of the DIO code is complete.
>
> With this + a larger slot table size, I would think you'd have a
> substantial write performance improvement. Probably not as good as if
> the writes were coalesced, but it should help.
>
> Badari, if you can let us know whether this plus increasing the slot
> table size helps, then I'll plan to "officially" submit it. This one is
> against RHEL6 but it should apply to mainline kernels with a small
> offset.
>
> -----------------[snip]-----------------
>
> BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
>
> Currently, the client uses FILE_SYNC whenever it's writing an amount
> of data less than or equal to the wsize. This is a problem though if
> we have a bunch of small iovecs batched up in a single writev call.
> The client will iterate over them and do a separate FILE_SYNC WRITE
> for each.
>
> Instead, change the code to do unstable writes when we'll need to do
> multiple WRITE RPCs in order to satisfy the request. While we're at
> it, optimize away the allocation of commit_data when we aren't going
> to use it anyway.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>    

Khoa is running tests with this + the RPC slot table change. Single-thread
performance improved, but the multi-thread tests don't scale (I guess we
are running out of RPC slot table entries even with 128?). He will share
the results shortly.
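
For reference, the limit in question is the sunrpc tcp_slot_table_entries
parameter; a typical way to raise it looks like this (128 is just the
value we have been testing with, and it generally needs to be set before
the mount is established):

  echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries   # runtime sysctl
  # or persistently, e.g. in /etc/modprobe.d/sunrpc.conf:
  options sunrpc tcp_slot_table_entries=128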

Thanks,
Badari

> ---
>   fs/nfs/direct.c |   13 +++++++++++--
>   1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 398f8ed..d2ed659 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
>   	dreq = nfs_direct_req_alloc();
>   	if (!dreq)
>   		goto out;
> -	nfs_alloc_commit_data(dreq);
>
> -	if (dreq->commit_data == NULL || count <= wsize)
> +	if (count > wsize || nr_segs > 1)
> +		nfs_alloc_commit_data(dreq);
> +	else
> +		dreq->commit_data = NULL;
> +
> +	/*
> +	 * If we couldn't allocate commit data, or we'll just be doing a
> +	 * single write, then make this a NFS_FILE_SYNC write and do away
> +	 * with the commit.
> +	 */
> +	if (dreq->commit_data == NULL)
>   		sync = NFS_FILE_SYNC;
>
>   	dreq->inode = inode;
>    
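
As a rough way to confirm the behaviour change from userspace, one can
compare the client's per-operation counts before and after a test run
(assuming NFSv3 here): with unstable writes the WRITE calls should be
accompanied by COMMIT calls, whereas the FILE_SYNC path issues no COMMITs.

  nfsstat -c -3                                  # client per-op counts
  grep -E 'WRITE|COMMIT' /proc/self/mountstats   # per-mount op statistics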



Thread overview: 24+ messages
2011-04-12 15:32 [RFC][PATCH] Vector read/write support for NFS (DIO) client Badari Pulavarty
2011-04-12 15:36 ` Chuck Lever
2011-04-12 16:15   ` Badari Pulavarty
2011-04-12 16:42     ` Chuck Lever
2011-04-12 17:46       ` Badari Pulavarty
2011-04-13 12:36         ` Jeff Layton
2011-04-13 13:43           ` Badari Pulavarty [this message]
2011-04-13 14:02             ` Jeff Layton
2011-04-13 14:22               ` Trond Myklebust
2011-04-13 14:27                 ` Andy Adamson
2011-04-13 17:20                 ` Jeff Layton
2011-04-13 17:35                   ` Trond Myklebust
2011-04-13 17:56                   ` Andy Adamson
2011-04-13 18:14                     ` Trond Myklebust
2011-04-13 18:47                       ` Chuck Lever
2011-04-13 19:04                         ` Jeff Layton
2011-04-14  0:21                     ` Dean
2011-04-14  0:42                       ` Trond Myklebust
2011-04-14  6:39                         ` Dean
2011-04-12 15:49 ` Trond Myklebust
     [not found]   ` <1302623369.4801.28.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
2011-04-12 16:17     ` Badari Pulavarty
2011-04-12 16:26       ` Trond Myklebust
2011-04-15 17:33   ` Christoph Hellwig
2011-04-15 18:00     ` Trond Myklebust
