linux-fsdevel.vger.kernel.org archive mirror
From: Abhijith Das <adas@redhat.com>
To: Boaz Harrosh <bharrosh@panasas.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>,
	Steve Dickson <steved@redhat.com>,
	Jeff Layton <jlayton@redhat.com>,
	lsf-pc@lists.linux-foundation.org,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Ganesha NFS List <nfs-ganesha-devel@lists.sourceforge.net>,
	Frank S Filz <ffilz@us.ibm.com>,
	"J. Bruce Fields" <bfields@redhat.com>,
	Jim Lieb <jlieb@panasas.com>,
	Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>,
	DENIEL Philippe <philippe.deniel@cea.fr>,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: [1/8] readdir-plus system call - LSF/MM follow up
Date: Fri, 24 May 2013 12:14:59 -0400 (EDT)
Message-ID: <426133125.28744581.1369412099242.JavaMail.root@redhat.com>
In-Reply-To: <1784100361.2097625.1365447760580.JavaMail.root@redhat.com>

Hi all,

Hoping to revive the discussion for $SUBJECT since we ran out of time when Boaz brought it up at LSF.
Summary of what was discussed:

- readdirplus syscall can be modeled after NFS' internal readdirplus implementation.
- Need for a directory version counter (change count)
- Need for each entry to have an opaque resume key; the linux_dirent.d_off in getdents(2) does this somewhat.
- Header at top of the returned data with bits to signify what's inside.
- What data to return? entries + stat + xattrs/acls?
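
To make the list above concrete, here is a rough userspace sketch of what such a buffer layout could look like: a header with "what's inside" bits and a directory change count, followed by variable-length records that each carry an opaque resume key. Every name here (rdp_header, rdp_entry, RDP_HAS_*, rdp_put_entry, rdp_get_entry) is made up for illustration; none of it comes from the patches or from any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "what's inside" bits for the buffer header. */
#define RDP_HAS_STAT  (1u << 0)
#define RDP_HAS_XATTR (1u << 1)

/* Hypothetical header placed once at the top of the returned data. */
struct rdp_header {
	uint32_t flags;       /* RDP_HAS_* bits */
	uint32_t nr_entries;  /* number of records that follow */
	uint64_t dir_version; /* directory change count seen by this call */
};

/* Hypothetical fixed part of each record; the NUL-terminated name follows. */
struct rdp_entry {
	uint64_t ino;
	uint64_t resume_key;  /* opaque cookie, in the spirit of d_off */
	uint16_t reclen;      /* total record length, padded to 8 bytes */
	uint16_t namelen;
	uint32_t reserved;
};

static_assert(sizeof(struct rdp_entry) == 24, "fixed part must be 24 bytes");

/* Record length for a name: fixed part + name + NUL, rounded up to 8. */
size_t rdp_reclen(size_t namelen)
{
	return (sizeof(struct rdp_entry) + namelen + 1 + 7) & ~(size_t)7;
}

/* Pack one record at buf; returns bytes used, or 0 if it does not fit. */
size_t rdp_put_entry(unsigned char *buf, size_t space, uint64_t ino,
		     uint64_t resume_key, const char *name)
{
	size_t namelen = strlen(name);
	size_t reclen = rdp_reclen(namelen);
	struct rdp_entry e;

	if (reclen > UINT16_MAX || reclen > space)
		return 0;

	memset(buf, 0, reclen); /* zero the padding and the trailing NUL */
	e.ino = ino;
	e.resume_key = resume_key;
	e.reclen = (uint16_t)reclen;
	e.namelen = (uint16_t)namelen;
	e.reserved = 0;
	memcpy(buf, &e, sizeof(e));
	memcpy(buf + sizeof(e), name, namelen);
	return reclen;
}

/* Copy out the fixed part of the record at buf; returns its name. */
const char *rdp_get_entry(const unsigned char *buf, struct rdp_entry *out)
{
	memcpy(out, buf, sizeof(*out));
	return (const char *)buf + sizeof(*out);
}
```

A caller would walk the records by adding reclen to the current offset, and would hand the last resume_key (plus dir_version) back on the next call to continue. Optional stat or xattr payloads, if the header bits say they are present, would have to hang off each record in some agreed way; the sketch leaves that out.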

The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend to agree, after having worked on a draft readdirplus syscall on GFS2 that does xattrs in addition to stat.

The potentially large amount of variable-length data to handle, and the alloc/realloc/dealloc of said data, make the code quite complicated and hence difficult to maintain. I had to write a new page-backed resizeable buffer to make this worthwhile: with kmalloc & friends and kmap/kunmap, performance was actually worse than simply doing getdents()+stat()+getxattr().

For those who are interested, here are the patches (description in previous email below): https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
There's an interesting seekwatcher graph on there too that compares the two cases. With a cold cache, almost all of the speedup from readdirplus comes from being able to order all the disk reads. I've seen a 2x speedup (cold cache) with my test directories, but not much more. When the relevant disk blocks are in cache, readdirplus is about 3x faster; I attribute that to the minimal allocing and user/kernel mode switching that goes on.
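
For reference, the getdents()+stat() side of that comparison looks roughly like the following from userspace. This is only a sketch and not the test program behind the numbers above: it goes through readdir(3) rather than calling getdents() directly, uses fstatat(2) for the per-entry stat, and leaves out the getxattr() step. scan_dir() is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Walk one directory and stat every entry: one syscall per entry on top
 * of the directory reads, which is the cost a readdirplus call would fold
 * into a single pass.
 *
 * Returns the number of entries stat'ed, or -1 if the directory cannot
 * be opened. *total_bytes receives the sum of st_size.
 */
long scan_dir(const char *path, long long *total_bytes)
{
	DIR *dir;
	struct dirent *de;
	struct stat st;
	long count = 0;

	assert(path != NULL && total_bytes != NULL);
	*total_bytes = 0;

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		/* Skip entries that vanish between readdir and stat. */
		if (fstatat(dirfd(dir), de->d_name, &st,
			    AT_SYMLINK_NOFOLLOW) != 0)
			continue;
		*total_bytes += (long long)st.st_size;
		count++;
	}

	closedir(dir);
	return count;
}
```

Note that the stats here happen in whatever order the directory returns names, which is exactly where the cold-cache case loses out to a call that can sort its disk reads first.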

We might also get decent performance by simply having a system call that takes the directory as argument and goes off and pre-fetches all the relevant blocks required to do subsequent getdents()+stat()+getxattr() efficiently.

Thoughts?

Cheers!
--Abhi

----- Original Message -----
> From: "Abhijith Das" <adas@redhat.com>
> To: "Boaz Harrosh" <bharrosh@panasas.com>
> Cc: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson" <steved@redhat.com>, "Jeff Layton"
> <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "Ganesha
> NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz" <ffilz@us.ibm.com>, "J. Bruce Fields"
> <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> Philippe" <philippe.deniel@cea.fr>
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
> 
> Hi Boaz/All,
> 
> ----- Original Message -----
> > From: "Boaz Harrosh" <bharrosh@panasas.com>
> > To: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson"
> > <steved@redhat.com>, "Jeff Layton"
> > <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel"
> > <linux-fsdevel@vger.kernel.org>, "Ganesha
> > NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz"
> > <ffilz@us.ibm.com>, "J. Bruce Fields"
> > <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao
> > Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> > Philippe" <philippe.deniel@cea.fr>
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> > 
> > By: Steven Whitehouse <swhiteho@redhat.com>
> > 
> > I repeat below Steve's original mail. Steve, you said you have
> > some experimental code; could you post a header and a git URL
> > so we can have a look?
> 
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
> 
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered Samba over GFS2. The system
> call simply collects dirents along with stat and extended attributes and
> copies the info out to the user buffer. This patchset is a first attempt at
> tackling this problem from a GFS2 perspective and is mainly a way to get us
> talking about possible implementations.
> 
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
> 
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat
> system calls that David Howells proposed here:
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
> 
> There are 4 parts to my readdirplus (xgetdents()) patches:
> 
> Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the
> linux_xdirent structure that specifies how the collected data is packaged to
> the user. From the caller's perspective, it behaves very much like the
> getdents() syscall, except that it can return -EAGAIN, in which case the
> caller must re-issue the syscall with the same parameters.
> 
> Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
> 
> Patch 3of4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort(). Only difference is that it takes an additional (void *) opaque
> context pointer and passes it to the compare() and swap() functions. I
> needed this to be able to sort pointers stored in the vector of pages
> buffer.
> 
> Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
> 
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
> 
> Cheers!
> --Abhi

