public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] readdir mess
Date: Wed, 13 Aug 2008 02:19:09 +0100	[thread overview]
Message-ID: <20080813011909.GA28946@ZenIV.linux.org.uk> (raw)
In-Reply-To: <alpine.LFD.1.10.0808121708160.3462@nehalem.linux-foundation.org>

On Tue, Aug 12, 2008 at 05:28:28PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 13 Aug 2008, Al Viro wrote:
> > 
> > As for whether we want to bother...  I've looked through about half of the
> > ->readdir instances.  We _do_ want to touch them, with cattle prod if nothing
> > else.
> 
> The really sad part is that readdir() really is also the thing that should 
> make us change locking. That i_mutex thing is fine and dandy for 
> everything else, but for readdir() we really would be much better off with 
> a rwsem - and only take it for reading.

*cringe*

I agree, but... there's another shitpile in the same area and there we are
_really_ buggered - take a look at what ncpfs/smbfs/cifs/<parts of> procfs
are pulling off with very odd attempts to dump stuff into dcache.

That means manual equivalent of lookup and you _do_ want some exclusion for
that.  If nothing else, you want to prevent multiple dentries for the same
subdirectory...

And rwsem for reading wouldn't help at all - readdir/readdir contention
might be annoying, but it's readdir/lookup that really hurts.

> Right now, readdir() is one of the most serialized parts of the whole 
> kernel. Sad. And while it's a per-directory lock, there are directories 
> that get much more reading than others, and this has been a small 
> scalability issue (for samba and apache) for years.

I know.  And things like exportfs use of vfs_readdir() are also rather
painful.
 
> The reason ext2 is ok is that you long long ago fixed it to use the page 
> cache. That got rid of a _lot_ of the crap, and made all the IO look like 
> regular files (including read-ahead etc). Ext2 _used_ to be the same crap 
> that ext3 is.

I remember ;-)  However, f_version/i_version mess at that place simply an
ancient crap - we *are* holding i_mutex and we are using generic_file_lseek().
IOW, it simply hadn't been reviewed since then - or reviewers had been
stunned into glazed-eyes mode by the entire thing.
 
> I so wish that ext3 could do the same thing, but no. I still think it 
> should be possible, but the whole journalling is designed for buffer 
> heads.

Don't.  Get.  Me.  Started.   Hell, at least by now somebody had cleaned
htree code into nearly readable form; maybe it might be doable...

> > freevxfs: AFAICS simply bogus (grep for nblocks there).
> > hfs:	at least missing checks for hfs_bnode_read() failure.  And I'm not
> > 	at all sure that hfs_mac2asc() use is safe there.  BTW, open_dir_list
> > 	handling appears to be odd - how the hell does decrementing ->f_pos
> > 	help anything?  And hfs_dir_release() removes from list without any
> > 	locks, so that's almost certainly racy as well.
> > hfsplus: ditto
> 
> I don't dispute at all that the readdir() thing is one of the weakest 
> points of the whole VFS layer. And part of it is that there is no good 
> caching helper for it at the VFS level, so we always end up having to do 
> everything at the low-level filesystem level, and that invariably ends up 
> being sh*t compared to the shared VFS routines.

What _can_ a common helper do, anyway, when we are busy parsing an arseload of
possibly corrupt data in whatever weird format fs insists upon?

As for the locking...  I toyed with one idea for a while: instead of passing
a callback and one of many completely different structs, how about a common
*beginning* of a struct, with callback stored in it along with several
common fields?  Namely,
	* count of filldir calls already made
	* pointer to file in question
	* "are we done" flag
And instead of calling filldir(buf, ...) ->readdir() would call one of several
helpers:
	* readdir_emit(buf, ...) - obvious
	* readdir_relax(buf) - called in a spot convenient for dropping
and retaking lock; returns whether we need to do revalidation.
	* readdir_eof(buf) - end of directory
	* maybe readdir_dot() and readdir_dotdot() - those are certainly
common enough

That's the obvious flagday stuff, but since we need to give serious beating
to most of the instances anyway...  Might as well consider something in
that direction.

  reply	other threads:[~2008-08-13  1:19 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-08-12  6:22 [RFC] readdir mess Al Viro
2008-08-12 17:02 ` OGAWA Hirofumi
2008-08-12 17:18   ` Linus Torvalds
2008-08-12 18:10     ` Al Viro
2008-08-12 18:22       ` Al Viro
2008-08-12 18:37         ` Al Viro
2008-08-12 19:24           ` Al Viro
2008-08-12 20:02       ` Linus Torvalds
2008-08-12 20:21       ` Linus Torvalds
2008-08-12 20:38         ` Al Viro
2008-08-12 21:04           ` Linus Torvalds
2008-08-13  0:04             ` Al Viro
2008-08-13  0:28               ` Linus Torvalds
2008-08-13  1:19                 ` Al Viro [this message]
2008-08-13  1:51                   ` Linus Torvalds
2008-08-13  8:36               ` Brad Boyer
2008-08-13 16:19                 ` Al Viro
2008-08-15  5:06               ` Jan Harkes
2008-08-15  5:34                 ` Al Viro
2008-08-15 16:58                 ` Linus Torvalds
2008-08-24 10:10                   ` Al Viro
2008-08-24 11:03                     ` Al Viro
2008-08-25 16:16                       ` J. Bruce Fields
2008-08-24 17:20                     ` Linus Torvalds
2008-08-24 19:59                       ` Al Viro
2008-08-24 23:51                         ` Linus Torvalds
2008-08-25  1:33                           ` Al Viro
2008-08-25  1:44                             ` Al Viro
2008-08-12 19:45     ` OGAWA Hirofumi
2008-08-12 20:05       ` Linus Torvalds
2008-08-12 20:59         ` Al Viro
2008-08-12 21:24           ` Linus Torvalds
2008-08-12 21:54             ` Al Viro
2008-08-12 22:04               ` Linus Torvalds
2008-08-13 16:20                 ` J. Bruce Fields
2008-08-12 21:47         ` Alan Cox
2008-08-12 22:20           ` Linus Torvalds
2008-08-12 22:10             ` Alan Cox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080813011909.GA28946@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox