public inbox for linux-kernel@vger.kernel.org
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Tridge <tridge@samba.org>,
	Al Viro <viro@parcelfarce.linux.theplanet.co.uk>,
	Jamie Lokier <jamie@shareable.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
Date: Fri, 20 Feb 2004 13:04:17 +0100
Message-ID: <20040220120417.GA4010@elte.hu>
In-Reply-To: <Pine.LNX.4.58.0402191124080.1270@ppc970.osdl.org>


* Linus Torvalds <torvalds@osdl.org> wrote:

> Basic approach: add two bits to the VFS dentry flags. That's all that
> is needed. Then you have two new system calls:
> 
>  - set_bit_one(dirfd)
>  - set_bit_two_if_one_is_set(dirfd);
>  - check_or_create_name(dirfd, name, case_table_pointer, newfd);

i believe Samba's problems can be solved in an even simpler way, by 
using only a single bit associated with the directory dentry, and by not 
putting any case-insensitivity code into the kernel. (not even as a 
separate module.)

One 'user-space cache is valid/clean' bit should be enough - where all
non-Samba accesses clear the 'valid bit', and Samba sets the bit
manually.

What Samba needs is a way to tell between two points in time whether the
directory contents have changed in any way - nothing more. Only one new
syscall is used to maintain the Samba dcache:

	long sys_mark_dir_clean(int dirfd);

the syscall marks the directory clean and returns whether it was
valid/clean already before the call.

this is how Samba name lookup would work:

repeat:
	if (sys_mark_dir_clean(dirfd)) {
		... pure user-space fast path, use Samba dcache ...
		return;
	}
	... fill Samba dcache ...
	readdir() loop

	goto repeat;

i.e. there will be two calls to sys_mark_dir_clean() in the slowpath
(the first one to set it, the second one to make sure it's still set). 
Races are handled automatically by the loop.
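
To make the loop above concrete, here is a minimal user-space sketch in C. It is only a model, not kernel code: sys_mark_dir_clean() does not exist, so struct sim_dir, sim_dir_modify(), sim_mark_dir_clean() and sim_lookup() are all hypothetical stand-ins, and a generation counter stands in for the directory contents that a real readdir() loop would collect. The one thing it tries to show faithfully is the test-and-set behaviour of the clean bit and the re-check after a refill.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space model of the per-directory kernel state. */
struct sim_dir {
	atomic_int clean;	/* the proposed 'user-space cache is clean' bit */
	int generation;		/* stands in for the directory contents */
};

/* Stands in for Samba's cached readdir() results. */
struct sim_cache {
	int cached_generation;
	int refills;		/* how often the slow path ran */
};

static void sim_dir_init(struct sim_dir *d)
{
	atomic_init(&d->clean, 0);
	d->generation = 0;
}

/* Any non-Samba change: alter the contents and clear the bit. */
static void sim_dir_modify(struct sim_dir *d)
{
	d->generation++;
	atomic_store(&d->clean, 0);
}

/* Model of sys_mark_dir_clean(): set the bit, return its old value. */
static long sim_mark_dir_clean(struct sim_dir *d)
{
	return atomic_exchange(&d->clean, 1);
}

/* The lookup loop: fast path if the bit was already set, else refill. */
static int sim_lookup(struct sim_dir *d, struct sim_cache *c)
{
	assert(d != NULL && c != NULL);
	for (;;) {
		if (sim_mark_dir_clean(d))
			return c->cached_generation;	/* fast path */
		/* slow path: refill, then loop to re-check the bit */
		c->cached_generation = d->generation;
		c->refills++;
	}
}
```

In this model a lookup on an untouched directory never refills, and a lookup after sim_dir_modify() refills exactly once before the second sim_mark_dir_clean() call confirms the bit is still set.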

this is how Samba could create a file atomically:

	sys_create(name, mode | O_CLEAN);

i.e. the create only succeeds if the directory has not been touched since
the Samba dcache last processed it. O_CLEAN would be a very simple check
in the open_namei() code: it returns -ENOTCLEAN if the parent directory
has not been marked clean.
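
The O_CLEAN check can be modelled the same way. The sketch below is a user-space simulation under stated assumptions: neither O_CLEAN nor ENOTCLEAN exists, so OC_ENOTCLEAN is a made-up value, struct oc_dir and oc_create() are hypothetical, and the o_clean parameter stands in for the flag. It also assumes that a successful O_CLEAN create leaves the bit set (the caller updates its own cache), while an ordinary create clears it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define OC_ENOTCLEAN	1000	/* hypothetical: no such errno exists */
#define OC_MAX_NAMES	8
#define OC_NAME_LEN	32

/* Hypothetical model of a directory with a clean bit. */
struct oc_dir {
	bool clean;
	int count;
	char names[OC_MAX_NAMES][OC_NAME_LEN];
};

/* Model of sys_mark_dir_clean(): set the bit, return its old value. */
static bool oc_mark_dir_clean(struct oc_dir *d)
{
	bool was_clean = d->clean;

	d->clean = true;
	return was_clean;
}

/* Model of a create; o_clean stands in for the proposed O_CLEAN flag. */
static int oc_create(struct oc_dir *d, const char *name, bool o_clean)
{
	assert(d != NULL && name != NULL);
	if (o_clean && !d->clean)
		return -OC_ENOTCLEAN;	/* cache may be stale: refuse */
	if (strlen(name) >= OC_NAME_LEN)
		return -ENAMETOOLONG;
	for (int i = 0; i < d->count; i++)
		if (strcmp(d->names[i], name) == 0)
			return -EEXIST;
	if (d->count == OC_MAX_NAMES)
		return -ENOSPC;
	strcpy(d->names[d->count++], name);
	if (!o_clean)
		d->clean = false;	/* ordinary create invalidates the cache */
	return 0;
}
```

A caller that gets -OC_ENOTCLEAN would rerun the lookup loop to refresh its cache, redo its own name-collision check against the fresh contents, and retry the create.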

i don't think there's any need to have a case-insensitive lookup module
in the kernel - Samba has all the information through the readdir() loop
already - all it needs to know is whether that info is valid or not via
the mark_dir_clean() syscall!

the impact of sys_mark_dir_clean() and O_CLEAN on the generic VFS is
quite minimal i believe. Also, it can be used as a caching method for
just about anything that wants to have a coherent user-space cache of
the VFS namespace. Note that there's nothing about case sensitivity or
insensitivity in this approach, yet it still gets rid of all of the
excessive readdir()s done in the Samba fastpath.

[ To get rid of all Samba overhead in this area we might need other
  syscall variants too, like rename_if_clean() and unlink_if_clean().
  Under this scheme Samba would never have to do a stat() call of the
  target file, because it always has a coherent copy of the kernel
  dcache, for directories it chooses to cache. ]

this approach differs from dnotify in a couple of key areas:

 - it's a synchronous solution that avoids signals, and is thus
   usable/robust in libraries too.

 - dnotify _forces_ action. mark_dir_clean() is only called when there's
   a use for it, and there's no overhead if the Samba workload is
   completely silent and there are only POSIX users. I.e. it should
   scale better than dnotify.

 - cache teardown can be done purely in userspace: the 'clean bit' has
   no state associated with it (unlike dnotify), so no kernel call is
   necessary to tear down state. User-space just forgets that it cached
   anything about that directory and it's done. No leaking state, and
   good scalability again.

 - but most importantly, it's fundamentally atomic for local filesystems
   and thus meets the needs of Samba in mixed POSIX/Samba workloads.

just in case anyone has followed me down to this point :-), there's yet
another, more advanced way to do the Samba-dcache fastpath 100% in
user-space:

We can export the 'directory clean bit' to userspace, via the same page
pinning and mapping techniques used by futexes. User-space could
register a 'clean bit' address via a new syscall, which the dcache then
uses from that point on. Thus there would be only a single syscall when
Samba sets up a directory cache in user-space [which needs those
readdir() calls so performance is down the drain anyway]; that syscall
lets userspace register a machine-word address to serve as the
'directory is clean' flag. Userspace and kernelspace may update this
flag in parallel, which is not a problem as long as userspace uses
atomic ops. This approach introduces some page pinning allocation
overhead but that's easy to solve. User-space would of course condense
the pinned range. Kernel-space would see very minimal overhead from
having the bit in an indirect pointer - at least on 64-bit systems where
all kernel RAM is mapped.
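
A rough model of the registered-flag variant, again purely in user-space C: the registration syscall is hypothetical, so ub_register_clean_word() merely stores a pointer, ub_dir_modify() plays the role of the kernel clearing the word, and the generation counter stands in for the directory contents. The model is single-threaded; it only illustrates that the fast path becomes one atomic load with no syscall, and that user-space sets the word before scanning so that a concurrent change is caught by the re-check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the kernel-side directory state. */
struct ub_dir {
	atomic_uint *clean_word;	/* user-space word registered for this dir */
	int generation;			/* stands in for the directory contents */
};

/* Stand-in for the hypothetical registration syscall. */
static void ub_register_clean_word(struct ub_dir *d, atomic_uint *word)
{
	d->clean_word = word;
}

/* What the kernel would do on any change: clear the registered word. */
static void ub_dir_modify(struct ub_dir *d)
{
	d->generation++;
	if (d->clean_word != NULL)
		atomic_store(d->clean_word, 0u);
}

/* User-space lookup: no syscall at all while the word stays set. */
static int ub_lookup(struct ub_dir *d, atomic_uint *word,
		     int *cached_generation, int *refills)
{
	assert(d != NULL && word != NULL);
	for (;;) {
		if (atomic_load(word))
			return *cached_generation;	/* fast path */
		atomic_store(word, 1u);		/* set before scanning */
		*cached_generation = d->generation;	/* readdir() stand-in */
		(*refills)++;
		/* loop: a change during the scan cleared the word again */
	}
}
```

Tearing the cache down in this model means forgetting the cached data; a real implementation would presumably also have to unregister or unpin the word, which is not modelled here.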

	Ingo
