public inbox for linux-kernel@vger.kernel.org
From: Jamie Lokier <jamie@shareable.org>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@osdl.org>, Tridge <tridge@samba.org>,
	Al Viro <viro@parcelfarce.linux.theplanet.co.uk>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
Date: Sat, 21 Feb 2004 01:44:01 +0000
Message-ID: <20040221014401.GD10928@mail.shareable.org>
In-Reply-To: <20040220184822.GA23460@elte.hu>

Ingo Molnar wrote:
> > > there's another class of problems: is it an issue that directory renames
> > > that move this directory (higher up in the directory hierarchy of this
> > > directory) do not invalidate the cache? In that case there's no dnotify
> > > event either.
> > 
> > This is one of the reasons why I worry about user-space caching. It's
> > just damn hard to get right.
> 
> this particular problem could be solved by walking down to the root
> dentry for every sys_manage_dir_cache() lookup and check that each
> dentry is still cache-valid. This involves some overhead, but it's still
> faster than doing the same from userspace. (ie. validating each previous
> path component at lookup time.)

All that's required is that Samba has a dcache entry for each path
component.  In Samba's case, every path component from the share root
is case-insensitive, so that'll always be true.  For the path from the
filesystem root to the share root, Samba can either keep a dcache
entry for each of those components (just single entries; readdir isn't
required), or it can do fstat() on each request.  The former is
faster.

When you do an O_CLEAN operation, that'll check the clean bits of
every component during the kernel side path walk, so that validates
Samba's dcache for the whole path.
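
A toy model of that kernel-side check may make the point concrete. It is only a sketch: struct toy_dir, toy_clean_walk() and the numeric value of ENOTCLEAN are invented for illustration and are not kernel interfaces. Each directory on the path carries a clean bit, and a single dirty component is enough to fail the whole operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error code named in this thread; not a real errno value. */
#define ENOTCLEAN 1000

/* Toy stand-in for one directory on the path. In the proposal the kernel
 * keeps the clean bit and clears it whenever the directory changes. */
struct toy_dir {
    const char *name;
    bool clean;
};

/* Model of the walk done by an O_CLEAN operation: every component must
 * still be clean, otherwise the call fails with -ENOTCLEAN. */
static int toy_clean_walk(const struct toy_dir *path, size_t depth)
{
    assert(path != NULL || depth == 0);
    for (size_t i = 0; i < depth; i++) {
        if (!path[i].clean)
            return -ENOTCLEAN;
    }
    return 0;
}
```

Because the one walk covers every component, a successful return vouches for the userspace cache entries of the whole path at once.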

Samba doesn't need to call sys_mark_dir_clean(), or my preferred
dnotify equivalent, for each step in its userspace dcache walk.  It's
fine to just do the kernel O_CLEAN operation after verifying that
every path component is in Samba's dcache, without validating each
component.

That means Samba does the path walk in userspace, but with no system
calls, and then it calls the O_CLEAN operation.

(In other words, all those atomic_open, atomic_rename etc. operations
in my previous mail are fine, but they can be optimised much further.)

Example of Samba code:

    atomic_open(name, flags, mode) {
        /* Fast path: userspace cache lookup only, no system calls.
         * found is set if the cache has an entry, positive or negative. */
        ci_name = soft_cache_lookup(name, &found);
        if (found) {
            fd = clean_open(ci_name ? ci_name : name, flags, mode);
            if (fd != -ENOTCLEAN)
                return fd;
        }
        /* Slow path: revalidate the cache, then retry until clean. */
        do {
            ci_name = hard_cache_lookup(name);
            if (ci_name && (flags & O_EXCL))
                return -EEXIST;
            if (!ci_name && !(flags & O_CREAT))
                return -ENOENT;
            fd = clean_open(ci_name ? ci_name : name, flags, mode);
        } while (fd == -ENOTCLEAN);
        return fd;
    }

Remember, if Samba's dcache has an entry, whether positive or
negative, one of these is true:

    - the dcache entry matches what the clean_*() operation will
      find in the kernel; or
    - the clean_*() operation will return -ENOTCLEAN

(If you use the slightly fancier method of a dcache that doesn't care
about deletions, using dnotify with DN_CREATE|DN_RENAME only (not
DN_DELETE), then -ENOENT can be returned instead).
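
The invariant can be exercised with a small single-threaded simulation. Everything here is a hypothetical stand-in: struct toy_kernel, struct toy_cache and the toy_* functions model one directory holding one file, and ENOTCLEAN is given an arbitrary value. The sketch assumes that any change clears the clean bit and that a hard lookup re-marks the directory clean before rescanning it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define ENOTCLEAN 1000  /* hypothetical error code, not a real errno */

struct toy_kernel { bool file_exists; bool dir_clean; };
struct toy_cache  { bool valid; bool file_exists; };

/* Any change to the directory clears its clean bit. */
static void toy_kernel_unlink(struct toy_kernel *k)
{
    k->file_exists = false;
    k->dir_clean = false;
}

/* Slow path: mark the directory clean first, then rescan it, so that a
 * change racing with the scan would leave the bit cleared again. */
static void toy_hard_lookup(struct toy_kernel *k, struct toy_cache *c)
{
    k->dir_clean = true;             /* models marking the dir clean */
    c->file_exists = k->file_exists; /* models the readdir() rescan  */
    c->valid = true;
}

/* Models clean_open(): refuses to act on a directory that is not clean. */
static int toy_clean_open(const struct toy_kernel *k)
{
    if (!k->dir_clean)
        return -ENOTCLEAN;
    return k->file_exists ? 3 : -ENOENT; /* 3 is a pretend descriptor */
}

/* Same shape as atomic_open: trust the cache, let the kernel veto it. */
static int toy_atomic_open(struct toy_kernel *k, struct toy_cache *c)
{
    int fd;

    assert(k != NULL && c != NULL);
    if (c->valid) {
        fd = toy_clean_open(k);
        if (fd != -ENOTCLEAN)
            return fd;
    }
    do {
        toy_hard_lookup(k, c);
        if (!c->file_exists)
            return -ENOENT;
        fd = toy_clean_open(k);
    } while (fd == -ENOTCLEAN);
    return fd;
}
```

After toy_kernel_unlink() the cache still holds a stale positive entry, but toy_clean_open() answers -ENOTCLEAN, so the stale entry is never acted on. The retry refreshes the cache and reports -ENOENT.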

> Since this doesn't change the dcache it ought to be doable via the
> rcu-read path and would thus still have pretty good SMP
> properties. [except when traversing mountpoints :-( ].

Mount changes need to count as changes anyway.  I'd like DN_MOUNT
added, if DN_MODIFY doesn't already get sent for mount changes.

> What if there are two instances of fileservers both using the
> same fileset and also trying to do caching this way?

I'm fairly sure the scheme described in my long mail (the one with the
atomic_open etc. explanation) works just fine with different
fileservers accessing the same tree.

> perhaps using a simple 64-bit generation counter would be better.

I think that isn't needed.

> Samba would get a new syscall to get the sum of each generation
> counter down to the root dentry - a total validation of the
> pathname.

You can't just sum per-directory counters along the path, because the
path may be rearranged by renaming directories, and different path
components could easily sum to the same generation value.

So that's going to have to be a careful strong hash of (a) the
generation counters of individual directories, (b) _plus_ the path
sequence e.g. as inode numbers, (c) _plus_ something to handle mount
changes.
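
A sketch of the difference, under loud assumptions: struct toy_component, sum_generations(), path_fingerprint() and the mount_epoch parameter are all hypothetical, and 64-bit FNV-1a stands in for a strong hash purely to keep the example short (it is not collision-resistant against an adversary). The plain sum cannot tell which directories contributed which counters. The fingerprint mixes the inode number and generation of each component in path order, plus a value standing in for mount state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-component data: inode number and change counter. */
struct toy_component {
    uint64_t ino;
    uint64_t generation;
};

/* The scheme being criticised: add up the counters along the path. */
static uint64_t sum_generations(const struct toy_component *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].generation;
    return sum;
}

/* Feed one 64-bit value into a 64-bit FNV-1a state, byte by byte. */
static uint64_t fnv1a_mix(uint64_t h, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        h ^= (v >> (8 * i)) & 0xff;
        h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Order-sensitive fingerprint over (a) generations, (b) the inode
 * sequence, and (c) a stand-in value for mount changes. */
static uint64_t path_fingerprint(const struct toy_component *p, size_t n,
                                 uint64_t mount_epoch)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */

    assert(p != NULL || n == 0);
    h = fnv1a_mix(h, mount_epoch);
    for (size_t i = 0; i < n; i++) {
        h = fnv1a_mix(h, p[i].ino);
        h = fnv1a_mix(h, p[i].generation);
    }
    return h;
}
```

For instance, if the second component (inode 20, generation 7) is replaced by a different directory renamed into its place (inode 30, generation 7), the sums are identical while the fingerprints differ.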

> If the counter matches with that in the userspace cache entry then
> no need to re-create the cache. Such generation counters would be
> usable for multiple file servers as well. Hm?

I don't think it is worth it, for Samba.  It's quite complicated to
get a number which detects all feasible changes, and I don't think it
offers Samba any efficiency gain over the single "clean bit".

(It's an interesting idea in general though).

> i believe Samba already has what is in essence a duplication of the
> dcache. We could enable it to be fairly coherent, for Samba to be able
> to have an authoritative 'does this file exist' answer without any
> excessive readdir()s.

I'm pretty sure it can too, and in a way that's useful for many
applications, not just Samba.  (I've barely scratched the surface of
what's possible using the very simple O_CLEAN technique.)

-- Jamie
