From: Jamie Lokier <jamie@shareable.org>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>, Tridge <tridge@samba.org>,
    Al Viro <viro@parcelfarce.linux.theplanet.co.uk>,
    "H. Peter Anvin" <hpa@zytor.com>,
    Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
Date: Sat, 21 Feb 2004 00:38:31 +0000
Message-ID: <20040221003831.GB10928@mail.shareable.org>
In-Reply-To: <Pine.LNX.4.58.0402201017370.2533@ppc970.osdl.org>
Linus Torvalds wrote:
> > How about this: we clean up dnotify, so it can be used for
> > user<->kernel dcache coherency
>
> No can do.
>
> There is no _way_ dnotify can do a race-free update, exactly because any
> user-level state is fundamentally irrelevant because it isn't tested under
> the directory semaphore.
>
> See? You can have a user-level cache, but the flag and the notification
> absolutely has to be under the inode semaphore (and thus in kernel space)
> if you want to avoid all races with unrelated processes.
Eh? The flag and notification operations are set and tested
under the inode semaphore, when fcntl() is called.
The userspace cache is a slave to the kernel's cache, and I think it
is _fully_ coherent with the kernel.
Every read of the userspace cache is guaranteed to reflect the
contents of the kernel cache, atomically with respect to other
operations by unrelated processes. Operations on the directory
that depend on case-insensitive matching (create, link, rename etc.)
are also atomic with respect to unrelated processes.
The atomic read nature of the userspace cache comes from a loop.
It's very similar to the sequence lock in <linux/seqlock.h>:
    cache_lookup_names(names...) {
        while (fcntl(dirfd, F_NOTIFY, flags) != 0) {
            userspace_name_list = read_directory(dirfd);
        }
        return case_insensitive_lookup(userspace_name_list, names...);
    }
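A minimal single-threaded sketch of that loop in C, for illustration only. It is a simulation, not the proposed kernel interface: struct sim_dir, struct sim_cache and every sim_* function are hypothetical names, and sim_test_and_set_clean() merely stands in for the "set and test the flag" behaviour that the fcntl() call is assumed to have here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() */

#define MAX_NAMES 8

/* Hypothetical stand-in for a kernel directory carrying a "clean" flag. */
struct sim_dir {
    const char *names[MAX_NAMES];
    int count;
    bool clean;            /* cleared by every modification */
};

/* Hypothetical stand-in for the userspace copy of that directory. */
struct sim_cache {
    const char *names[MAX_NAMES];
    int count;
    int refills;           /* how often the directory was re-read */
};

/* Assumed semantics: set the flag and report whether it had been cleared. */
static int sim_test_and_set_clean(struct sim_dir *d)
{
    int was_dirty = !d->clean;
    d->clean = true;
    return was_dirty;
}

/* Models an unrelated process creating a file: the flag is cleared. */
static void sim_create(struct sim_dir *d, const char *name)
{
    assert(d->count < MAX_NAMES);
    d->names[d->count++] = name;
    d->clean = false;
}

static void sim_read_directory(const struct sim_dir *d, struct sim_cache *c)
{
    memcpy(c->names, d->names, sizeof(c->names));
    c->count = d->count;
    c->refills++;
}

/* The seqlock-like loop: re-read until the flag was still set on entry. */
static const char *sim_cache_lookup(struct sim_dir *d, struct sim_cache *c,
                                    const char *name)
{
    while (sim_test_and_set_clean(d) != 0)
        sim_read_directory(d, c);
    for (int i = 0; i < c->count; i++)
        if (strcasecmp(c->names[i], name) == 0)
            return c->names[i];
    return NULL;
}
```

In this sketch a lookup that follows sim_create() re-reads the directory once, and later lookups are served from the cache until the next modification. A real race would occur while the directory is being read; the loop handles that case the same way, by going around again.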
Atomic operations on the directory come from a higher level loop,
using Ingo's O_CLEAN idea:
    atomic_create(name, flags, mode) {
        do {
            ci_name = cache_lookup_names(name);
            if (ci_name && (flags & O_EXCL)) { return -EEXIST; }
            if (!ci_name && !(flags & O_CREAT)) { return -ENOENT; }
            /* Open the existing case-equivalent entry if there is one. */
            fd = clean_open(ci_name ? ci_name : name, flags, mode);
        } while (fd == -ENOENT || fd == -ENOTCLEAN);
        return fd;
    }
    atomic_stat(name, st) {
        do {
            ci_name = cache_lookup_names(name);
            if (!ci_name) { return -ENOENT; }
            result = stat(ci_name, st);
        } while (result == -ENOENT);
        return result;
    }
    /* This unlinks just one entry if there are multiple case-equivalent
       ones. If you want to remove _all_ case-equivalent entries, you'll
       need clean_unlink. */
    atomic_unlink(name) {
        do {
            ci_name = cache_lookup_names(name);
            if (!ci_name) { return -ENOENT; }
            result = unlink(ci_name);
        } while (result == -ENOENT);
        return result;
    }
    atomic_rename(old, new) {
        do {
            (ci_old, ci_new) = cache_lookup_names(old, new);
            if (!ci_old) { return -ENOENT; }
            result = clean_rename(ci_old, ci_new ? ci_new : new);
        } while (result == -ENOTCLEAN || result == -ENOENT);
        return result;
    }
    atomic_link(from, to) {
        do {
            (ci_from, ci_to) = cache_lookup_names(from, to);
            if (!ci_from) { return -ENOENT; }
            if (ci_to) { return -EEXIST; }
            result = clean_link(ci_from, to);
        } while (result == -ENOTCLEAN || result == -ENOENT);
        return result;
    }
(symlink, mkdir and rmdir are similar to link, create and unlink).
The operations clean_open, clean_mkdir, clean_rename, clean_link,
clean_symlink and clean_mknod are either new system calls, or use the
standard system calls with the fchdir() method I described.
Even path walking is atomic: Samba will do a path walk using a
case-insensitive lookup on each path component. That means every
directory that is involved will be cached in Samba and have a "clean bit".
It doesn't matter whether Samba prefers to fchdir() at each step (in
which case it gets the same atomicity it would get doing that with
normal kernel case-sensitive lookups), or instead passes the whole path
to the clean_*() operation. In the latter case, the clean_*()
operation will test all the clean bits involved in the target path
lookup, and return -ENOTCLEAN if any aren't set, thus providing the
normal atomicity guarantees.
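The whole-path check can be sketched as a small C simulation. Everything here is an assumption made for illustration: struct sim_dir, the sim_* functions and the SIM_ENOTCLEAN value are hypothetical, the path is modelled as a chain of parent pointers rather than a name lookup, and a directory move is assumed to clear the clean bit of both the old and the new parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_ENOTCLEAN 1000   /* hypothetical error value, for illustration */

/* Hypothetical directory node: a parent link plus its clean bit. */
struct sim_dir {
    struct sim_dir *parent;  /* NULL at the root */
    bool clean;
};

/* Models the test a clean_*() call would make: every directory on the
   way from the target up to the root must still have its clean bit. */
static int sim_check_path_clean(const struct sim_dir *target)
{
    for (const struct sim_dir *d = target; d != NULL; d = d->parent)
        if (!d->clean)
            return -SIM_ENOTCLEAN;
    return 0;
}

/* Models another process moving a directory. Both parents are modified,
   so (by assumption) both lose their clean bit. */
static void sim_move_dir(struct sim_dir *dir, struct sim_dir *new_parent)
{
    if (dir->parent != NULL)
        dir->parent->clean = false;
    new_parent->clean = false;
    dir->parent = new_parent;
}
```

After sim_move_dir() the check on the moved directory fails with -SIM_ENOTCLEAN, so a caller would refresh its cache and retry instead of acting on a stale path.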
Ingo's concern that a directory opened by Samba's cache might be moved
is not a problem: if that happens, it'll clear the clean bit of at
least one directory in the target path.
You gave an example before:
> On the other hand, even with a nice dnotify infrastructure, you
> simply _cannot_ get absolute atomicity guarantees. Because by the
> time you actually execute the "mv" operation, another process may
> create a new file with the "same" name (ie different name, but
> comparing the same ignoring case) on another CPU. By the time you
> get the dnotify, it's too late, and the move will have happened, and
> undoing the operation (and hiding it from the client) may well be
> impossible - possibly because another process creating a file with
> the old name.
The example is flawed: the attempted rename _is_ atomic. Either
another process succeeds on another CPU, in which case _our_ attempt
to "mv" returns -ENOTCLEAN and we will start again by refreshing our
cache, or we beat the other process to it.
This works because the clean bit checking is done by the kernel, under
the directory/inode semaphores.
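The race in that example can be replayed in a single-threaded C simulation. This is only a sketch under stated assumptions: all sim_* names and SIM_ENOTCLEAN are hypothetical, the racing process is injected by hand between the lookup and the rename, and sim_clean_rename() stands in for a clean_rename() that checks the clean bit and renames in one step.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() */

#define SIM_ENOTCLEAN 1000   /* hypothetical error value */
#define MAX_NAMES 8
#define NAME_LEN 32

struct sim_dir {
    char names[MAX_NAMES][NAME_LEN];
    int count;
    bool clean;              /* cleared by every modification */
};

static int sim_find(const struct sim_dir *d, const char *name, bool nocase)
{
    for (int i = 0; i < d->count; i++)
        if ((nocase ? strcasecmp(d->names[i], name)
                    : strcmp(d->names[i], name)) == 0)
            return i;
    return -1;
}

/* Models an unrelated process creating a file. */
static void sim_create(struct sim_dir *d, const char *name)
{
    assert(d->count < MAX_NAMES && strlen(name) < NAME_LEN);
    strcpy(d->names[d->count++], name);
    d->clean = false;
}

/* Case-sensitive rename that refuses to run on a directory that is not
   clean. Check and rename happen in one step, as if under the semaphore. */
static int sim_clean_rename(struct sim_dir *d, const char *old,
                            const char *new_name)
{
    if (!d->clean)
        return -SIM_ENOTCLEAN;
    int src = sim_find(d, old, false);
    if (src < 0)
        return -ENOENT;
    int dst = sim_find(d, new_name, false);
    if (dst == src)
        return 0;
    if (dst >= 0) {          /* target exists: it is replaced by the source */
        d->count--;
        memmove(d->names[src], d->names[d->count], NAME_LEN);
    } else {
        snprintf(d->names[src], NAME_LEN, "%s", new_name);
    }
    d->clean = false;
    return 0;
}

/* The retry loop. `racer`, if non-NULL, is created by "another process"
   between the first lookup and the first rename attempt. */
static int sim_atomic_rename(struct sim_dir *d, const char *old,
                             const char *new_name, const char *racer,
                             int *attempts)
{
    int result;
    *attempts = 0;
    do {
        char src[NAME_LEN], dst[NAME_LEN];
        d->clean = true;     /* models refreshing the cache */
        int io = sim_find(d, old, true);
        int in = sim_find(d, new_name, true);
        if (io < 0)
            return -ENOENT;
        snprintf(src, sizeof src, "%s", d->names[io]);
        snprintf(dst, sizeof dst, "%s", in >= 0 ? d->names[in] : new_name);
        if ((*attempts)++ == 0 && racer != NULL)
            sim_create(d, racer);
        result = sim_clean_rename(d, src, dst);
    } while (result == -SIM_ENOTCLEAN || result == -ENOENT);
    return result;
}
```

Renaming "report.doc" to "Summary.doc" while the racer creates "SUMMARY.DOC" fails the first attempt with -SIM_ENOTCLEAN. The second attempt sees the new entry and targets it, so the directory never holds two case-equivalent names.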
It's atomic w.r.t. both other POSIX processes _and_ other processes
with their own userspace caches.
> But then it should be documented as such. It's not coherent, it's only
> "almost coherent".
It's entirely possible I'm being dense, but I think both Ingo's
proposal and mine, which is based on it but uses dnotify, provide a
_fully_ coherent userspace cache and _atomic_ operations.
They do it by looping (like a spinlock or seqlock) rather than
sleeping until ready (like a semaphore), but that is ok as long as
there isn't excessive competition between Samba and other processes
modifying the same directory.
(If the excessive competition proves to be a performance problem, then
we can adapt F_SETLEASE to resolve that too. But I don't think it is
necessary).
-- Jamie