From: Abhijith Das <adas@redhat.com>
To: Boaz Harrosh <bharrosh@panasas.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>,
Steve Dickson <steved@redhat.com>,
Jeff Layton <jlayton@redhat.com>,
lsf-pc@lists.linux-foundation.org,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Ganesha NFS List <nfs-ganesha-devel@lists.sourceforge.net>,
Frank S Filz <ffilz@us.ibm.com>,
"J. Bruce Fields" <bfields@redhat.com>,
Jim Lieb <jlieb@panasas.com>,
Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>,
DENIEL Philippe <philippe.deniel@cea.fr>,
Dave Chinner <dchinner@redhat.com>
Subject: Re: [1/8] readdir-plus system call - LSF/MM follow up
Date: Fri, 24 May 2013 12:14:59 -0400 (EDT)
Message-ID: <426133125.28744581.1369412099242.JavaMail.root@redhat.com>
In-Reply-To: <1784100361.2097625.1365447760580.JavaMail.root@redhat.com>
Hi all,
Hoping to revive the discussion for $SUBJECT since we ran out of time when Boaz brought it up at LSF.
Summary of what was discussed:
- The readdirplus syscall can be modeled on NFS's internal readdirplus implementation.
- We need a directory version counter (change count).
- Each entry needs an opaque resume key; linux_dirent.d_off in getdents(2) does this to some extent.
- A header at the top of the returned data, with bits to signify what's inside.
- What data to return? Entries + stat + xattrs/ACLs?
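To make the header and resume-key points a bit more concrete, here is a rough user-space sketch in C of what such a buffer layout could look like. Every name in it (xdp_header, xdp_entry, the XDP_HAS_* bits, xdp_append(), xdp_name()) is made up for illustration; none of it is taken from the patches or from any agreed interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits saying which optional data each entry carries. */
#define XDP_HAS_STAT  (1u << 0)
#define XDP_HAS_XATTR (1u << 1)

/* Hypothetical header, written once at the top of the returned buffer. */
struct xdp_header {
	uint32_t flags;       /* XDP_HAS_* bits */
	uint32_t nr_entries;  /* records that follow the header */
	uint64_t dir_version; /* directory change count when filled */
};

/* Hypothetical fixed part of a record; the NUL-terminated name follows. */
struct xdp_entry {
	uint64_t resume_key;  /* opaque cookie, in the spirit of d_off */
	uint64_t ino;
	uint16_t reclen;      /* whole record, rounded up to 8 bytes */
	uint16_t namelen;
	char name[];
};

static size_t align8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Append one record at offset 'used'; return the new offset, 0 if no room. */
size_t xdp_append(unsigned char *buf, size_t used, size_t cap,
		  uint64_t key, uint64_t ino, const char *name)
{
	size_t namelen = strlen(name);
	size_t fixed = offsetof(struct xdp_entry, name);
	size_t reclen = align8(fixed + namelen + 1);
	struct xdp_entry e;

	assert(buf != NULL);
	if (reclen > UINT16_MAX || used > cap || reclen > cap - used)
		return 0;

	memset(&e, 0, sizeof(e));
	e.resume_key = key;
	e.ino = ino;
	e.reclen = (uint16_t)reclen;
	e.namelen = (uint16_t)namelen;

	memset(buf + used, 0, reclen);          /* zero the padding too */
	memcpy(buf + used, &e, fixed);          /* fixed fields */
	memcpy(buf + used + fixed, name, namelen + 1);
	return used + reclen;
}

/* Name of the record that starts at byte offset 'off'. */
const char *xdp_name(const unsigned char *buf, size_t off)
{
	return (const char *)buf + off + offsetof(struct xdp_entry, name);
}
```

A caller would walk the records by adding each reclen to the current offset, much like walking getdents() output, and hand the last resume_key back to continue where it stopped.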
The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend to agree, after having worked on a draft readdirplus syscall on GFS2 that does xattrs in addition to stat.
The potentially large amount of variable-length data to handle, and the alloc/realloc/dealloc of that data, make the code quite complicated and hence difficult to maintain. I had to write a new page-backed resizable buffer to make this worthwhile (performance was actually worse with kmalloc & friends and kmap/kunmap than with simply doing getdents()+stat()+getxattr()).
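For readers who have not looked at the patches, the general idea of a buffer backed by a vector of pages can be sketched in user space roughly as follows. This is only an analogy using malloc(); the names (pagebuf, pb_append(), pb_read(), pb_free()) and the 4096-byte page size are assumptions for illustration and do not mirror the GFS2 code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PB_PAGE_SIZE 4096u

/* Growable byte buffer made of fixed-size pages. Growing adds a page;
 * data already stored is never copied or moved. */
struct pagebuf {
	unsigned char **pages; /* vector of page pointers */
	size_t nr_pages;
	size_t len;            /* bytes stored so far */
};

static int pb_grow(struct pagebuf *pb)
{
	unsigned char **v;

	v = realloc(pb->pages, (pb->nr_pages + 1) * sizeof(*pb->pages));
	if (!v)
		return -1;
	pb->pages = v;
	v[pb->nr_pages] = malloc(PB_PAGE_SIZE);
	if (!v[pb->nr_pages])
		return -1;
	pb->nr_pages++;
	return 0;
}

/* Append n bytes, spilling across page boundaries as needed. */
int pb_append(struct pagebuf *pb, const void *data, size_t n)
{
	const unsigned char *src = data;

	while (n > 0) {
		size_t page = pb->len / PB_PAGE_SIZE;
		size_t off = pb->len % PB_PAGE_SIZE;
		size_t chunk = PB_PAGE_SIZE - off;

		if (page == pb->nr_pages && pb_grow(pb) != 0)
			return -1;
		if (chunk > n)
			chunk = n;
		memcpy(pb->pages[page] + off, src, chunk);
		src += chunk;
		pb->len += chunk;
		n -= chunk;
	}
	return 0;
}

/* Copy n bytes starting at logical offset pos into out. */
int pb_read(const struct pagebuf *pb, size_t pos, void *out, size_t n)
{
	unsigned char *dst = out;

	if (pos > pb->len || n > pb->len - pos)
		return -1;
	while (n > 0) {
		size_t off = pos % PB_PAGE_SIZE;
		size_t chunk = PB_PAGE_SIZE - off;

		if (chunk > n)
			chunk = n;
		memcpy(dst, pb->pages[pos / PB_PAGE_SIZE] + off, chunk);
		dst += chunk;
		pos += chunk;
		n -= chunk;
	}
	return 0;
}

void pb_free(struct pagebuf *pb)
{
	for (size_t i = 0; i < pb->nr_pages; i++)
		free(pb->pages[i]);
	free(pb->pages);
	memset(pb, 0, sizeof(*pb));
}
```

Only the small vector of page pointers is ever reallocated, which is the property that matters when the amount of variable-length data is not known up front.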
For those who are interested, here are the patches (description in previous email below): https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
There's an interesting seekwatcher graph on there too that compares the two cases. With a cold cache, almost all of the speedup from readdirplus comes from being able to order the disk reads. I've seen a 2x speedup (cold cache) with my test directories, but not much more. When the relevant disk blocks are in cache, readdirplus is about 3x faster; I attribute that to the minimal allocation and user/kernel mode switching that goes on.
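As a rough user-space illustration of the ordering effect, one can read all the names first, sort them by inode number, and only then stat them. Inode order is just a crude proxy for on-disk locality and is an assumption of this sketch; the GFS2 patch orders block accesses inside the kernel based on the actual on-disk layout. The names ent, by_ino() and stat_dir_sorted() are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
	ino_t ino;
	char name[256]; /* NAME_MAX (255) plus NUL on Linux */
};

/* qsort() comparator: ascending inode number. */
static int by_ino(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;

	return (x->ino > y->ino) - (x->ino < y->ino);
}

/* stat every entry of 'path' in inode order. Returns the number of
 * entries stat-ed, or -1 on error; *total receives the summed st_size. */
long stat_dir_sorted(const char *path, long long *total)
{
	DIR *d = opendir(path);
	struct ent *v = NULL, *tmp;
	struct dirent *de;
	struct stat st;
	size_t n = 0, cap = 0, i;
	long ok = 0;

	if (!d)
		return -1;
	*total = 0;

	/* Pass 1: collect names and inode numbers only. */
	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		if (n == cap) {
			cap = cap ? cap * 2 : 64;
			tmp = realloc(v, cap * sizeof(*v));
			if (!tmp) {
				free(v);
				closedir(d);
				return -1;
			}
			v = tmp;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
		n++;
	}

	/* Pass 2: stat in inode order instead of directory order. */
	if (n)
		qsort(v, n, sizeof(*v), by_ino);
	for (i = 0; i < n; i++) {
		if (fstatat(dirfd(d), v[i].name, &st, AT_SYMLINK_NOFOLLOW) == 0) {
			*total += st.st_size;
			ok++;
		}
	}
	free(v);
	closedir(d);
	return ok;
}
```

This still pays one syscall per entry, so at best it approximates the cold-cache gain; it does nothing for the warm-cache case where the mode switches dominate.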
We might also get decent performance by simply having a system call that takes the directory as an argument and goes off and pre-fetches all the blocks required to do the subsequent getdents()+stat()+getxattr() efficiently.
Thoughts?
Cheers!
--Abhi
----- Original Message -----
> From: "Abhijith Das" <adas@redhat.com>
> To: "Boaz Harrosh" <bharrosh@panasas.com>
> Cc: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson" <steved@redhat.com>, "Jeff Layton"
> <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "Ganesha
> NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz" <ffilz@us.ibm.com>, "J. Bruce Fields"
> <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> Philippe" <philippe.deniel@cea.fr>
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
>
> Hi Boaz/All,
>
> ----- Original Message -----
> > From: "Boaz Harrosh" <bharrosh@panasas.com>
> > To: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson"
> > <steved@redhat.com>, "Jeff Layton"
> > <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel"
> > <linux-fsdevel@vger.kernel.org>, "Ganesha
> > NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz"
> > <ffilz@us.ibm.com>, "J. Bruce Fields"
> > <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao
> > Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> > Philippe" <philippe.deniel@cea.fr>
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> >
> > By: Steven Whitehouse <swhiteho@redhat.com>
> >
> > I repeat below Steve's original mail. Steve, you said you have
> > some experimental code; could you post a header and a git URL
> > so we can have a look?
>
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
>
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered samba over GFS2. The system
> call simply collects dirents along with stat and extended attributes and
> copies the info out to the user buffer. This patchset is a first-attempt at
> tackling this problem from a GFS2 perspective and is mainly a way to get us
> talking about possible implementations.
>
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
>
> For obtaining stat info, I'm making use of the VFS bits of the xstat and
> fxstat system calls that David Howells proposed here:
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
>
> There are 4 parts to my readdirplus (xgetdents()) patches:
>
> Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the
> linux_xdirent structure that specifies how the collected data is packaged to
> the user. From the caller's perspective, it behaves very much like the
> getdents() syscall except for the -EAGAIN return code. This would require
> the caller to re-issue the syscall with the same parameters.
>
> Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
>
> Patch 3of4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort(). Only difference is that it takes an additional (void *) opaque
> context pointer and passes it to the compare() and swap() functions. I
> needed this to be able to sort pointers stored in the vector of pages
> buffer.
>
> Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
>
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
>
> Cheers!
> --Abhi