From: Jeff Layton <jlayton@redhat.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond.Myklebust@netapp.com, Badari Pulavarty <pbadari@us.ibm.com>,
linux-nfs@vger.kernel.org
Subject: Re: NFS DIO overhaul
Date: Tue, 17 May 2011 10:54:15 -0400
Message-ID: <20110517105415.7b26152f@tlielax.poochiereds.net>
In-Reply-To: <4DDFE8CB-7308-44A4-9DF4-14E78A50E466@oracle.com>
On Tue, 17 May 2011 10:50:23 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Jeff, Badari-
>
> On May 17, 2011, at 10:45 AM, Jeff Layton wrote:
>
> > Hi Trond,
> >
> > You recently mentioned on the list that you were working on an overhaul
> > of the DIO code that would merge large parts of it with the buffered IO
> > routines.
> >
> > I also started some work on an overhaul of the DIO code to make it work
> > better with arbitrarily sized arrays of iovecs. I've only really started
> > it, but I have the following routine so far. The idea here is to always
> > try to max out the size of a direct I/O READ or WRITE regardless of the
> > layout of the iovec array. Once we have the count, we can then just
> > marshal up a single read or write and send it, advance the iov_iter and
> > repeat until we're done.
> >
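To make the intended calling pattern concrete, the driver loop would look
roughly like the following. This is only a sketch: nfs_direct_send_rw() is a
placeholder for whatever helper ends up marshalling and sending a single
READ or WRITE, and the error handling is hand-waved:

static ssize_t
nfs_direct_schedule(struct nfs_direct_req *dreq, struct iov_iter *iter,
		    unsigned int rwsize)
{
	ssize_t total = 0;

	while (iov_iter_count(iter)) {
		unsigned int len;
		int ret;

		/* largest chunk we can cover with one READ/WRITE */
		len = nfs_direct_count(iter, rwsize);
		if (!len)
			break;

		/* placeholder: marshal and send one request of "len" bytes */
		ret = nfs_direct_send_rw(dreq, iter, len);
		if (ret < 0)
			return total ? total : ret;

		iov_iter_advance(iter, len);
		total += len;
	}

	return total;
}
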
> > Obviously, it's rather complicated logic, but I think this is doable...
> >
> > My question is -- is this something you would consider taking if I were
> > to finish it up in the near future? Or should I not bother since you're
> > working on an overhaul of this code?
> >
> > My main concern is that we have some existing shipping code where we
> > need to fix this, and the changes you're planning to make may be too
> > invasive to backport. If that's the case, then I may need to do
> > something along these lines anyway (or just take Badari's latest patch).
>
> Do we have detailed analysis now on why simply sending these as separate UNSTABLE writes followed by a COMMIT is not sufficient to address performance issues?
>
They've done some benchmarking that shows performance plateauing earlier
with that approach. It also does absolutely nothing for the read codepath,
which has similar problems.
> > Thanks...My "starter" patch follows:
> >
> > ------------------------[snip]------------------------
> >
> > [PATCH] nfs: add a direct I/O count routine
> >
> > ...to count the number of bytes that we can stick in a single r/w req.
> >
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/nfs/direct.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 90 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> > index c8add8d..15658fe 100644
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -227,6 +227,96 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
> > }
> >
> > /*
> > + * Walk an iovec array, and count bytes until it hits "maxsize" or until it
> > + * hits a "discontinuity" in the array.
> > + */
> > +unsigned int
> > +nfs_direct_count(const struct iov_iter *i_orig, unsigned int maxsize)
> > +{
> > +	struct iov_iter i;
> > +	unsigned int count = 0, end_of_page, end_of_iov, end_of_seg;
> > +	unsigned long base;
> > +
> > +	/* copy iov_iter so original is preserved */
> > +	memcpy(&i, i_orig, sizeof(i));
> > +
> > +	/*
> > +	 * Determine how many bytes to end of page. Advance to end of page or
> > +	 * end of the current iovec, or "maxsize" bytes, whichever is closest.
> > +	 *
> > +	 * If end of a page, then continue to next page and repeat.
> > +	 *
> > +	 * If end of iovec, then see whether the current iov ends on a
> > +	 * page boundary and the next one starts on one. If so, then
> > +	 * repeat.
> > +	 *
> > +	 * If not, then see if the iovs are contiguous: are the previous end
> > +	 * and the current beginning on the same page? If so, see if the
> > +	 * current end is just before the next beginning. If so, repeat.
> > +	 *
> > +	 * If none of the conditions above are true, then break and
> > +	 * return the current count of bytes.
> > +	 */
> > +	for (;;) {
> > +		base = (unsigned long)i.iov->iov_base + i.iov_offset;
> > +
> > +		/* bytes to end of page */
> > +		end_of_page = PAGE_SIZE - (base & (PAGE_SIZE - 1));
> > +
> > +		/* bytes to end of iov */
> > +		end_of_iov = i.iov->iov_len - i.iov_offset;
> > +
> > +		/* find min of end of page, end of iov, and r/wsize */
> > +		end_of_seg = min(end_of_page, end_of_iov);
> > +		end_of_seg = min(end_of_seg, maxsize);
> > +
> > +		iov_iter_advance(&i, end_of_seg);
> > +		maxsize -= end_of_seg;
> > +		count += end_of_seg;
> > +
> > +		/* Did we hit the end of the array or maxsize? */
> > +		if (maxsize == 0 || i.count == 0)
> > +			break;
> > +
> > +		/*
> > +		 * Else end_of_seg == end_of_page or end_of_iov, and
> > +		 * there is more data to examine.
> > +		 */
> > +
> > +		/* Did we hit a page boundary, but the next segment is in the
> > +		 * same iov? If so, then carry on.
> > +		 */
> > +		if (end_of_page != end_of_iov && end_of_seg == end_of_page)
> > +			goto advance_ii;
> > +
> > +		/* Did iov end on a page boundary? Ensure next starts on one. */
> > +		if (end_of_page == end_of_iov &&
> > +		    (unsigned long)i.iov[1].iov_base & (PAGE_SIZE - 1))
> > +			break;
> > +
> > +		/* Make sure the next iovec starts on the same page */
> > +		if ((base & PAGE_MASK) !=
> > +		    ((unsigned long)i.iov[1].iov_base & PAGE_MASK))
> > +			break;
> > +
> > +		/* ...and begins right where the current one ends */
> > +		if (base + end_of_iov != (unsigned long)i.iov[1].iov_base)
> > +			break;
> > +
> > +advance_ii:
> > +		/* Everything checks out. Advance into next iov and continue */
> > +		iov_iter_advance(&i, 1);
> > +		--maxsize;
> > +		++count;
> > +		/* Did we hit the end of the array or maxsize? */
> > +		if (maxsize == 0 || i.count == 0)
> > +			break;
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +/*
> > * We must hold a reference to all the pages in this direct read request
> > * until the RPCs complete. This could be long *after* we are woken up in
> > * nfs_direct_wait (for instance, if someone hits ^C on a slow server).
> > --
> > 1.7.1
> >
> >
> >
> > --
> > Jeff Layton <jlayton@redhat.com>
>
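For anyone who wants to play with the counting behaviour outside the
kernel, here's a rough userspace model of the same idea. It's deliberately
simplified (a plain iovec array instead of an iov_iter, a hardcoded 4k page
size, made-up names), so it won't match the kernel routine exactly, but it
shows the kind of coalescing I'm after:

#include <stdio.h>
#include <stdint.h>
#include <sys/uio.h>

#define MODEL_PAGE_SIZE 4096UL

/*
 * Count how many bytes, starting at iov[0], could go into one request of
 * at most "maxsize" bytes without hitting a discontinuity: consecutive
 * iovecs must either meet at page boundaries or be directly adjacent.
 */
static size_t model_count(const struct iovec *iov, int nr, size_t maxsize)
{
	size_t count = 0;

	for (int n = 0; n < nr && maxsize > 0; n++) {
		uintptr_t base = (uintptr_t)iov[n].iov_base;
		size_t len = iov[n].iov_len;

		if (len > maxsize)
			len = maxsize;
		count += len;
		maxsize -= len;

		if (len < iov[n].iov_len || n + 1 == nr || maxsize == 0)
			break;

		uintptr_t end = base + iov[n].iov_len;
		uintptr_t next = (uintptr_t)iov[n + 1].iov_base;

		if (end % MODEL_PAGE_SIZE == 0) {
			/* ended on a page boundary: next must start on one */
			if (next % MODEL_PAGE_SIZE != 0)
				break;
		} else {
			/* ended mid-page: next must be directly adjacent */
			if (next != end)
				break;
		}
	}
	return count;
}

int main(void)
{
	static char buf[4 * 4096] __attribute__((aligned(4096)));

	/* two adjacent segments within one buffer */
	struct iovec contig[2] = {
		{ buf,            4096 },
		{ buf + 4096,     4096 },
	};
	/* a gap between the segments */
	struct iovec gap[2] = {
		{ buf,            4000 },
		{ buf + 2 * 4096, 4096 },
	};

	printf("contig: %zu\n", model_count(contig, 2, 1024 * 1024));
	printf("gap:    %zu\n", model_count(gap, 2, 1024 * 1024));
	return 0;
}

With a 1MB maxsize, the model coalesces both contiguous segments (8192),
while the gap in the second array stops the count at 4000.
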
--
Jeff Layton <jlayton@redhat.com>