From: Jeff Layton <jlayton@redhat.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond.Myklebust@netapp.com, Badari Pulavarty <pbadari@us.ibm.com>,
linux-nfs@vger.kernel.org
Subject: Re: NFS DIO overhaul
Date: Tue, 17 May 2011 10:54:15 -0400
Message-ID: <20110517105415.7b26152f@tlielax.poochiereds.net>
In-Reply-To: <4DDFE8CB-7308-44A4-9DF4-14E78A50E466@oracle.com>
On Tue, 17 May 2011 10:50:23 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Jeff, Badari-
>
> On May 17, 2011, at 10:45 AM, Jeff Layton wrote:
>
> > Hi Trond,
> >
> > You recently mentioned on the list that you were working on an overhaul
> > of the DIO code that would merge large parts of it with the buffered IO
> > routines.
> >
> > I also started some work on an overhaul of the DIO code to make it work
> > better with arbitrarily sized arrays of iovecs. I've only really started
> > it, but I have the following routine so far. The idea here is to always
> > try to max out the size of a direct I/O READ or WRITE regardless of the
> > layout of the iovec array. Once we have the count, we can then just
> > marshal up a single read or write and send it, advance the iov_iter and
> > repeat until we're done.
> >
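To make the intended calling pattern concrete, the driver loop would look
roughly like the following. This is only a sketch: nfs_direct_send_rw() is a
placeholder for whatever helper ends up marshalling and sending a single
READ or WRITE, and the error handling is hand-waved:

static ssize_t
nfs_direct_schedule(struct nfs_direct_req *dreq, struct iov_iter *iter,
		    unsigned int rwsize)
{
	ssize_t total = 0;

	while (iov_iter_count(iter)) {
		unsigned int len;
		int ret;

		/* largest chunk we can cover with one READ/WRITE */
		len = nfs_direct_count(iter, rwsize);
		if (!len)
			break;

		/* placeholder: marshal and send one request of "len" bytes */
		ret = nfs_direct_send_rw(dreq, iter, len);
		if (ret < 0)
			return total ? total : ret;

		iov_iter_advance(iter, len);
		total += len;
	}

	return total;
}
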
> > Obviously, it's rather complicated logic, but I think this is doable...
> >
> > My question is -- is this something you would consider taking if I were
> > to finish it up in the near future? Or should I not bother since you're
> > working on an overhaul of this code?
> >
> > My main concern is that we have some existing shipping code where we
> > need to fix this, and the changes you're planning to make may be too
> > invasive to backport. If that's the case, then I may need to do
> > something along these lines anyway (or just take Badari's latest patch).
>
> Do we have detailed analysis now on why simply sending these as separate UNSTABLE writes followed by a COMMIT is not sufficient to address performance issues?
>
They've done some benchmarking that shows performance plateauing earlier
with that approach. It also does absolutely nothing for the read codepath,
which has similar problems.
> > Thanks...My "starter" patch follows:
> >
> > ------------------------[snip]------------------------
> >
> > [PATCH] nfs: add a direct I/O count routine
> >
> > ...to count the number of bytes that we can stick in a single r/w req.
> >
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/nfs/direct.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 90 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> > index c8add8d..15658fe 100644
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -227,6 +227,96 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
> > }
> >
> > /*
> > + * Walk an iovec array, and count bytes until it hits "maxsize" or until it
> > + * hits a "discontinuity" in the array.
> > + */
> > +unsigned int
> > +nfs_direct_count(const struct iov_iter *i_orig, unsigned int maxsize)
> > +{
> > +	struct iov_iter i;
> > +	unsigned int count = 0, end_of_page, end_of_iov, end_of_seg;
> > +	unsigned long base;
> > +
> > +	/* copy iov_iter so original is preserved */
> > +	memcpy(&i, i_orig, sizeof(i));
> > +
> > +	/*
> > +	 * Determine how many bytes to end of page. Advance to end of page or
> > +	 * end of the current iovec, or "maxsize" bytes, whichever is closest.
> > +	 *
> > +	 * If end of a page, then continue to next page and repeat.
> > +	 *
> > +	 * If end of iovec, then see whether the current iov ends on a
> > +	 * page boundary and the next one starts on one. If so, then
> > +	 * repeat.
> > +	 *
> > +	 * If not, then see if the iovs are contiguous: are the previous end
> > +	 * and the current beginning on the same page? If so, see if the
> > +	 * current end is just before the next beginning. If so, repeat.
> > +	 *
> > +	 * If none of the conditions above are true, then break and
> > +	 * return the current count of bytes.
> > +	 */
> > +	for (;;) {
> > +		base = (unsigned long)i.iov->iov_base + i.iov_offset;
> > +
> > +		/* bytes to end of page */
> > +		end_of_page = PAGE_SIZE - (base & (PAGE_SIZE - 1));
> > +
> > +		/* bytes to end of iov */
> > +		end_of_iov = i.iov->iov_len - i.iov_offset;
> > +
> > +		/* find min of end of page, end of iov, and r/wsize */
> > +		end_of_seg = min(end_of_page, end_of_iov);
> > +		end_of_seg = min(end_of_seg, maxsize);
> > +
> > +		iov_iter_advance(&i, end_of_seg);
> > +		maxsize -= end_of_seg;
> > +		count += end_of_seg;
> > +
> > +		/* Did we hit the end of the array or maxsize? */
> > +		if (maxsize == 0 || i.count == 0)
> > +			break;
> > +
> > +		/*
> > +		 * Else end_of_seg == end_of_page or end_of_iov, and
> > +		 * there is more data to examine.
> > +		 */
> > +
> > +		/* Did we hit a page boundary, but the next segment is in the
> > +		 * same iov? If so, then carry on.
> > +		 */
> > +		if (end_of_page != end_of_iov && end_of_seg == end_of_page)
> > +			goto advance_ii;
> > +
> > +		/* Did iov end on a page boundary? Ensure next starts on one. */
> > +		if (end_of_page == end_of_iov &&
> > +		    (unsigned long)i.iov[1].iov_base & (PAGE_SIZE - 1))
> > +			break;
> > +
> > +		/* Make sure the next iovec starts on the same page */
> > +		if ((base & PAGE_MASK) !=
> > +		    ((unsigned long)i.iov[1].iov_base & PAGE_MASK))
> > +			break;
> > +
> > +		/* ...and begins right where the current one ends */
> > +		if (base + end_of_iov != (unsigned long)i.iov[1].iov_base)
> > +			break;
> > +
> > +advance_ii:
> > +		/* Everything checks out. Advance into next iov and continue */
> > +		iov_iter_advance(&i, 1);
> > +		--maxsize;
> > +		++count;
> > +		/* Did we hit the end of the array or maxsize? */
> > +		if (maxsize == 0 || i.count == 0)
> > +			break;
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +/*
> > * We must hold a reference to all the pages in this direct read request
> > * until the RPCs complete. This could be long *after* we are woken up in
> > * nfs_direct_wait (for instance, if someone hits ^C on a slow server).
> > --
> > 1.7.1
> >
> >
> >
> > --
> > Jeff Layton <jlayton@redhat.com>
>
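For anyone who wants to play with the counting behaviour outside the
kernel, here's a rough userspace model of the same idea. It's deliberately
simplified (a plain iovec array instead of an iov_iter, a hardcoded 4k page
size, made-up names), so it won't match the kernel routine exactly, but it
shows the kind of coalescing I'm after:

#include <stdio.h>
#include <stdint.h>
#include <sys/uio.h>

#define MODEL_PAGE_SIZE 4096UL

/*
 * Count how many bytes, starting at iov[0], could go into one request of
 * at most "maxsize" bytes without hitting a discontinuity: consecutive
 * iovecs must either meet at page boundaries or be directly adjacent.
 */
static size_t model_count(const struct iovec *iov, int nr, size_t maxsize)
{
	size_t count = 0;

	for (int n = 0; n < nr && maxsize > 0; n++) {
		uintptr_t base = (uintptr_t)iov[n].iov_base;
		size_t len = iov[n].iov_len;

		if (len > maxsize)
			len = maxsize;
		count += len;
		maxsize -= len;

		if (len < iov[n].iov_len || n + 1 == nr || maxsize == 0)
			break;

		uintptr_t end = base + iov[n].iov_len;
		uintptr_t next = (uintptr_t)iov[n + 1].iov_base;

		if (end % MODEL_PAGE_SIZE == 0) {
			/* ended on a page boundary: next must start on one */
			if (next % MODEL_PAGE_SIZE != 0)
				break;
		} else {
			/* ended mid-page: next must be directly adjacent */
			if (next != end)
				break;
		}
	}
	return count;
}

int main(void)
{
	static char buf[4 * 4096] __attribute__((aligned(4096)));

	/* two adjacent segments within one buffer */
	struct iovec contig[2] = {
		{ buf,            4096 },
		{ buf + 4096,     4096 },
	};
	/* a gap between the segments */
	struct iovec gap[2] = {
		{ buf,            4000 },
		{ buf + 2 * 4096, 4096 },
	};

	printf("contig: %zu\n", model_count(contig, 2, 1024 * 1024));
	printf("gap:    %zu\n", model_count(gap, 2, 1024 * 1024));
	return 0;
}

With a 1MB maxsize, the model coalesces both contiguous segments (8192),
while the gap in the second array stops the count at 4000.
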
--
Jeff Layton <jlayton@redhat.com>