From mboxrd@z Thu Jan 1 00:00:00 1970
From: Abhijith Das
Subject: Re: [1/8] readdir-plus system call - LSF/MM follow up
Date: Fri, 24 May 2013 12:14:59 -0400 (EDT)
Message-ID: <426133125.28744581.1369412099242.JavaMail.root@redhat.com>
References: <516299A5.8030109@panasas.com> <51629A76.1020609@panasas.com> <1784100361.2097625.1365447760580.JavaMail.root@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc@lists.linux-foundation.org, linux-fsdevel, Ganesha NFS List, Frank S Filz, "J. Bruce Fields", Jim Lieb, Venkateswararao Jujjuri, DENIEL Philippe, Dave Chinner
To: Boaz Harrosh
Return-path:
Received: from mx3-phx2.redhat.com ([209.132.183.24]:42565 "EHLO mx3-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754710Ab3EXQTH (ORCPT ); Fri, 24 May 2013 12:19:07 -0400
In-Reply-To: <1784100361.2097625.1365447760580.JavaMail.root@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Hi all,

Hoping to revive the discussion for $SUBJECT since we ran out of time when Boaz brought it up at LSF.

Summary of what was discussed:
- readdirplus syscall can be modeled after NFS' internal readdirplus implementation.
- Need for a directory version counter (change count)
- Need for each entry to have an opaque resume key
  - The linux_dirent.d_off in getdents(2) does this somewhat.
- Header at top of the returned data with bits to signify what's inside.
- What data to return? entries + stat + xattrs/acls?

The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend to agree, after having worked on a draft readdirplus syscall on GFS2 that does xattrs in addition to stat. The potentially large amount of variable length data to handle and the alloc/realloc/dealloc of said data makes the code quite complicated and hence, difficult to maintain.
I had to write a new page-backed resizeable buffer to make this worthwhile (performance was actually worse with kmalloc & friends and kmap/kunmap compared to simply doing getdents()+stat()+getxattr()).

For those who are interested, here are the patches (description in previous email below):
https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14

There's an interesting seekwatcher graph on there too that compares the two cases. With a cold cache, almost all the speedup obtained by readdirplus is by being able to order all the disk reads. I've seen a 2x speedup (cold cache) with my test directories, but not much more. When the relevant disk blocks are in cache, readdirplus is about 3x faster - I attribute it to the minimal allocing and user/kernel mode switching that goes on.

We might also get decent performance by simply having a system call that takes the directory as argument and goes off and pre-fetches all the relevant blocks required to do subsequent getdents()+stat()+getxattr() efficiently.

Thoughts?

Cheers!
--Abhi

----- Original Message -----
> From: "Abhijith Das"
> To: "Boaz Harrosh"
> Cc: "Steven Whitehouse" , "Steve Dickson" , "Jeff Layton"
> , lsf-pc@lists.linux-foundation.org, "linux-fsdevel" , "Ganesha
> NFS List" , "Frank S Filz" , "J. Bruce Fields"
> , "Jim Lieb" , "Venkateswararao Jujjuri" , "DENIEL
> Philippe"
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
>
> Hi Boaz/All,
>
> ----- Original Message -----
> > From: "Boaz Harrosh"
> > To: "Steven Whitehouse" , "Steve Dickson"
> > , "Jeff Layton"
> > , lsf-pc@lists.linux-foundation.org, "linux-fsdevel"
> > , "Ganesha
> > NFS List" , "Frank S Filz"
> > , "J. Bruce Fields"
> > , "Jim Lieb" , "Venkateswararao
> > Jujjuri" , "DENIEL
> > Philippe"
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> >
> > By: Steven Whitehouse )
> >
> > I repeat below Steve's original mail.
> > Steve you said you have
> > some experimental code, could you post an header and a git URL
> > so we can have a look?
>
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
>
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered samba over GFS2. The system
> call simply collects dirents along with stat and extended attributes and
> copies the info out to the user buffer. This patchset is a first-attempt at
> tackling this problem from a GFS2 perspective and is mainly a way to get us
> talking about possible implementations.
>
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
>
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat
> system calls that David Howells proposed here :
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
>
> There are 4 parts to my readdirplus (xgetdents()) patches:
>
> Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the
> linux_xdirent structure that specifies how the collected data is packaged to
> the user. From the caller's perspective, it behaves very much like the
> getdents() syscall except for the -EAGAIN return code. This would require
> the caller to re-issue the syscall with the same parameters.
>
> Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
>
> Patch 3of4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort().
> Only difference is that it takes an additional (void *) opaque
> context pointer and passes it to the compare() and swap() functions. I
> needed this to be able to sort pointers stored in the vector of pages
> buffer.
>
> Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
>
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
>
> Cheers!
> --Abhi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
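
For anyone who wants a concrete picture of the Patch 3of4 idea quoted above (a sort whose compare() and swap() callbacks also receive an opaque context pointer), here is a rough userspace sketch. It is not the code from the patchset: the ctx_sort() signature, the callback typedefs, the default swap and the cmp_by_block() comparator are all assumptions made for illustration. It uses a plain heapsort, since lib/sort.c is heapsort-based. The comparator shows the kind of use the context enables: the array being sorted holds only indices, and the values they refer to (disk block numbers here) live in a separate table reached through ctx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types: both receive the caller's opaque context. */
typedef int (*ctx_cmp_fn)(const void *a, const void *b, void *ctx);
typedef void (*ctx_swap_fn)(void *a, void *b, size_t size, void *ctx);

/* Byte-wise swap used when the caller passes swap == NULL (assumption). */
static void ctx_default_swap(void *a, void *b, size_t size, void *ctx)
{
	unsigned char *pa = a, *pb = b;

	(void)ctx;
	while (size--) {
		unsigned char t = *pa;
		*pa++ = *pb;
		*pb++ = t;
	}
}

/* Restore the max-heap property for the subtree rooted at 'root'. */
static void ctx_sift_down(unsigned char *b, size_t root, size_t n, size_t size,
			  ctx_cmp_fn cmp, ctx_swap_fn swap, void *ctx)
{
	size_t child;

	for (; (child = 2 * root + 1) < n; root = child) {
		/* Pick the larger of the two children. */
		if (child + 1 < n &&
		    cmp(b + child * size, b + (child + 1) * size, ctx) < 0)
			child++;
		if (cmp(b + root * size, b + child * size, ctx) >= 0)
			break;
		swap(b + root * size, b + child * size, size, ctx);
	}
}

/* In-place heapsort of 'num' elements of 'size' bytes, ascending by cmp(). */
static void ctx_sort(void *base, size_t num, size_t size,
		     ctx_cmp_fn cmp, ctx_swap_fn swap, void *ctx)
{
	unsigned char *b = base;
	size_t i, n;

	assert(cmp != NULL);
	if (num < 2)
		return;
	if (!swap)
		swap = ctx_default_swap;

	/* Build the heap, starting from the last internal node. */
	for (i = num / 2; i-- > 0; )
		ctx_sift_down(b, i, num, size, cmp, swap, ctx);

	/* Move the current maximum to the end, then shrink the heap. */
	for (n = num - 1; n > 0; n--) {
		swap(b, b + n * size, size, ctx);
		ctx_sift_down(b, 0, n, size, cmp, swap, ctx);
	}
}

/* Example comparator: elements are indices into a table passed via ctx. */
static int cmp_by_block(const void *a, const void *b, void *ctx)
{
	const uint64_t *blocks = ctx;
	uint64_t ba = blocks[*(const size_t *)a];
	uint64_t bb = blocks[*(const size_t *)b];

	return (ba > bb) - (ba < bb);
}
```

A caller would fill an array of size_t indices and call ctx_sort(idx, n, sizeof(idx[0]), cmp_by_block, NULL, blocks), which orders the indices by ascending block number without moving the block table itself. In the patch the context would presumably be the page-vector buffer rather than a flat array.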