UTF-8 and case-insensitivity

public inbox for linux-kernel@vger.kernel.org

* UTF-8 and case-insensitivity
@ 2004-02-17  4:12 tridge
  2004-02-17  5:11 ` Linus Torvalds
                   ` (4 more replies)
  0 siblings, 5 replies; 123+ messages in thread
From: tridge @ 2004-02-17  4:12 UTC (permalink / raw)
  To: linux-kernel

Given how much pain the "kernel is agnostic to charset encoding"
attitude has cost me in terms of programming pain, I thought I should
de-cloak from lurk mode and put my 2c into the UTF-8 issue.

Personally I think that eventually the Linux kernel will have to
embrace the interpretation of the byte streams that applications have
given it, despite the fact that this will be very painful and
potentially quite complex. The reason is that I think that eventually
the Linux kernel will need to efficiently support a userspace policy
of case-insensitivity and the only way to do case-insensitive filename
operations is to interpret those byte streams as a particular
encoding.

Personally I much prefer the systems I use to be case-sensitive, but
there are important applications that require case-insensitivity for
interoperability. Right now it is not possible to write a case
insensitive application on Linux in an efficient manner. With the
current "encoding agnostic" APIs a simple open() or stat() call
becomes a horrendously expensive operation and one that is fraught
with race conditions. Providing the same functionality in the kernel
is dirt cheap by comparison (not cheap in terms of code complexity,
but cheap in terms of runtime efficiency).

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-17  4:12 UTF-8 and case-insensitivity tridge
@ 2004-02-17  5:11 ` Linus Torvalds
  2004-02-17  6:54   ` tridge
  2004-02-19  2:53   ` Daniel Newby
  2004-02-17  5:25 ` Tim Connors
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17  5:11 UTC (permalink / raw)
  To: Andrew Tridgell; +Cc: Kernel Mailing List, Al Viro

[ Al cc'd, because while I'm pretty certain that he agrees with me 100% on 
  the craziness of case-insensitive name lookups, he may have some input
  on the "samba helper" function approach. That input may well boil down 
  to "Linus is crazy", of course. Wouldn't be the first time ;)

  Andrew - you really should assume that case insensitivity is a hell of a 
  lot more costly than you think it is, and forget that particular idea. 
  Let's see if there are acceptable half-measures. ]

On Tue, 17 Feb 2004 tridge@samba.org wrote:
>
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
> 
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex.

I seriously doubt it. There just isn't any point.

>		 The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.

The thing is, if you want to do efficient user-space case-insensitive 
lookups, that is a _completely_ different matter from having the kernel do 
case-insensitivity.

Kernel-level case insensitivity is a total disaster, and your "very
painful and potentially quite complex" assertion is the understatement of
the year. The thing is, you can't sanely do dentry caching, since the case
insensitivity has to be per-open or at least per-process (you MUST NOT be
case-insensitive in a POSIX process).

So the only way to do case-insensitive names is to do all lookups very 
slowly. I'm willing to bet that WNT opens files a hell of a lot slower 
than Linux does, and one big portion of that is exactly the fact that 
Linux can do a really good job with the dentry cache.

And that _depends_ on a well-defined and unique filename setup (by
changing the hashing function and compare function, a filesystem can do a
limited kind of case-insensitivity right now in Linux, but then it will
have to be not only fairly slow, but also case-insensitive for _everybody_
which is unacceptable in a mixed POSIX/samba environment).

In other words, just forget the whole notion. The only set of people who have
any reason at _all_ to want it is the samba team, and we can solve the 
samba-specific problems other ways.

Just take that as a simple fact - case insensitivity in the kernel is such 
a horribly bad idea, that you really shouldn't go there.

With that destructive criticism out of the way, let's look at somewhat 
more constructive approaches, ie some way to allow certain processes that 
need it better help in their quest for case insensitivity.

Let's start with some assumptions:

 - MOST name lookups are likely results of some kind of "readdir()" 
   lookup, and tend to have the case right in the first place. So that 
   should go fast. Maybe Tridge has some statistics on this one?

 - samba probably has certain pretty well-defined special patterns for 
   what it wants to do with a filename, so you probably don't need a
   generic "everything that takes a filename should be case-insensitive", 
   and it would be acceptable to have a few _very_ specific system calls.

With those assumptions out of the way, we could think of an interface that
exports some partial functionality of the "lookup_path()" code in the kernel
as a special system call. In particular, something that takes an input
pathname, and is able to stop at any point of the name when a lookup
fails.

So some variation of the interface

	int magic_open(
		/* Input arguments */
		const char *pathname,
		unsigned long flags,
		mode_t mode,

		/* output arguments */
		int *fd,
		struct stat *st,
		int *successful_path_length);

ie the system call would:

 - look up as far into the pathname (using _exact_ lookup) as possible
 - return the error code of the last failure
 - the "flags" could be extended so that you can specify that you mustn't 
   traverse ".." or symlinks (ie those would count as failures)

but also:

 - fill in the "struct stat" information for the last _successful_ 
   pathname component.
 - fill in the "fd" with a fd of the last _successful_ pathname component.
 - tell how much of the pathname it could traverse.

so that the user can do a "readdir" and try to "fix up" the problem
without having to restart the whole thing. For the (hopefully common) case
where the cases match, this would just boil down to an "open with stat
information" thing.
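
A rough userspace approximation of the semantics described above may
help make them concrete. This is only a sketch under stated
assumptions: magic_open_sketch() is a hypothetical name, it walks the
path with one openat() per component instead of a single system call,
it does not handle O_CREAT or a mode argument, and it follows ".." and
symlinks normally. It is not atomic and is not the proposed kernel
interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical userspace emulation of the proposed magic_open().
 * Walks `pathname` one component at a time using exact (case-sensitive)
 * lookups and stops at the first component that cannot be opened.
 *
 * Returns 0 if the whole path resolved, otherwise the errno of the
 * failing lookup. On return *fd refers to the last component that did
 * resolve, *st holds its stat data, and *ok_len is the number of bytes
 * of `pathname` that were traversed successfully.
 */
int magic_open_sketch(const char *pathname, int flags,
                      int *fd, struct stat *st, size_t *ok_len)
{
    assert(pathname && fd && st && ok_len);

    const char *p = pathname;
    int err = 0;
    int cur = open(*p == '/' ? "/" : ".", O_RDONLY | O_DIRECTORY);

    *fd = -1;
    *ok_len = 0;
    if (cur < 0)
        return errno;

    for (;;) {
        while (*p == '/')               /* skip separators */
            p++;
        if (*p == '\0')
            break;                      /* whole path consumed */

        size_t n = strcspn(p, "/");     /* length of this component */
        char name[NAME_MAX + 1];
        if (n > NAME_MAX) {
            err = ENAMETOOLONG;
            break;
        }
        memcpy(name, p, n);
        name[n] = '\0';

        /* Only the final component is opened with the caller's flags. */
        int last = p[n + strspn(p + n, "/")] == '\0';
        int next = openat(cur, name,
                          last ? flags : (O_RDONLY | O_DIRECTORY));
        if (next < 0) {
            err = errno;                /* remember why we stopped */
            break;
        }
        close(cur);
        cur = next;
        p += n;
        *ok_len = (size_t)(p - pathname);
    }

    if (fstat(cur, st) < 0) {
        err = errno;
        close(cur);
        return err;
    }
    *fd = cur;
    return err;
}
```

On a miss, the caller can readdir() the returned descriptor, pick a
case-insensitive match for the component starting at
pathname + *ok_len, and continue from there rather than restarting the
whole lookup.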

We'd need something more interesting to guarantee unique filename on file
create, possibly even including letting a trusted process maintain some
locks in the VFS layer. The point being that the kernel can _help_ some 
specific usage, but making case-insensitive names be part of the VFS layer 
proper is not acceptable.

I suspect we can do case-insensitive names faster than WNT even with a 
fairly complex user-mode interface. Just because _not_ having them in the 
kernel allows us to have much faster default behaviour.

			Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17  4:12 UTF-8 and case-insensitivity tridge
  2004-02-17  5:11 ` Linus Torvalds
@ 2004-02-17  5:25 ` Tim Connors
  2004-02-17  7:43 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 123+ messages in thread
From: Tim Connors @ 2004-02-17  5:25 UTC (permalink / raw)
  To: linux-kernel

tridge@samba.org said on Tue, 17 Feb 2004 15:12:06 +1100:
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
> 
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it,

What applications?

> despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
> 
> Personally I much prefer the systems I use to be case-sensitive, but
> there are important applications that require case-insensitivity for
> interoperability. 

Why? Sounds pretty idiotic to me.

If you don't like it, use some microshit filesystem like vfat. I'll
keep using ext3 etc, thanks.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Conclusion to my thesis -- "It is trivial to show that it is 
clearly obvious that this is not woofly."

* Re: UTF-8 and case-insensitivity
  2004-02-17  5:11 ` Linus Torvalds
@ 2004-02-17  6:54   ` tridge
  2004-02-17  8:33     ` Neil Brown
  2004-02-17 15:13     ` Linus Torvalds
  2004-02-19  2:53   ` Daniel Newby
  1 sibling, 2 replies; 123+ messages in thread
From: tridge @ 2004-02-17  6:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Al Viro

Linus,

 > Kernel-level case insensitivity is a total disaster, and your "very
 > painful and potentially quite complex" assertion is the understatement of
 > the year. The thing is, you can't sanely do dentry caching, since the case
 > insensitivity has to be per-open or at least per-process (you MUST NOT be
 > case-insensitive in a POSIX process).

right, and the patches to add this support to Linux that I have been
involved with in the past have been per-process. You are right that it
is messy, but it is not *horribly* messy. In fact I'd say it is no
worse than many of the other things we already have in the kernel,
although it certainly is much harder than sticking to the "bag of
bytes" interpretation of filenames. I just think that in this case the
simple solution is also wrong.

 > So the only way to do case-insensitive names is to do all lookups very 
 > slowly.

I don't agree with this at all. I agree that the worst-case will get
worse, but I see absolutely no reason why the average case will get
significantly worse and I think that the worst case will be rare.

In fact, John Bonesio did a patch to the 2.4 kernel with XFS that
implemented per-process case-insensitivity. It's been a long time since
I played with that patch, but I certainly don't recall any significant
slowdowns. The patch was messy, but it wasn't grossly
inefficient. (that patch was just a proof of concept, and just used
strcasecmp() instead of doing a proper UTF-8 case-insensitive compare,
so there will be some amount of additional cost to adding that).

From memory, the patch added new classes of dentries to the current
"+ve" and "-ve" dentries. It added concepts like a "-ve
case-insensitive" dentry and a "-ve case-sensitive" dentry. It
certainly adds more code in trying to deal with these variants, but I
see no reason why it should be significantly computationally less
efficient.

 > I'm willing to bet that WNT opens files a hell of a lot slower 
 > than Linux does, and one big portion of that is exactly the fact that 
 > Linux can do a really good job with the dentry cache.

Anyone have any lmbench filesystem numbers for w2k3? The only windows
boxes I use are in vmware sessions, so running performance tests
myself is pretty pointless.

 > And that _depends_ on a well-defined and unique filename setup (by
 > changing the hashing function and compare function, a filesystem can do a
 > limited kind of case-insensitivity right now in Linux, but then it will
 > have to be not only fairly slow, but also case-insensitive for _everybody_
 > which is unacceptable in a mixed POSIX/samba environment).

right, and that's why bones made it per-process in his patch. It was
set using a process personality bit, which really wasn't ideal (that
was one of my contributions to the patch) but it did work.

 > In other words, just forget the whole notion. The only set of people who have
 > any reason at _all_ to want it is the samba team, and we can solve the 
 > samba-specific problems other ways.

Nope, it's not just Samba, though perhaps Samba is the app that cares
the most about the actual performance. The other obvious people who
care are wine and anyone porting an application from windows. Also,
the problem isn't just one of performance, it's also hard to make it
raceless from userspace.

I also think that if the choice were given then some linux distros
(the likes of Lindows comes to mind) would choose to run all processes
case-insensitive. These sorts of distros are aiming at the sorts of
users that would want everything to be case-insensitive.

 > Just take that as a simple fact - case insensitivity in the kernel is such 
 > a horribly bad idea, that you really shouldn't go there.

I'm yet to be convinced :)

 >  - MOST name lookups are likely results of some kind of "readdir()" 
 >    lookup, and tend to have the case right in the first place. So that 
 >    should go fast. Maybe Tridge has some statistics on this one?

ok, the first thing you need to understand about case-insensitivity on
a case-sensitive system is that the hardest thing to do is prove that
a file doesn't exist. File operations on non-existent files are *very*
common. If you can come up with a solution that allows me to prove
that a file doesn't exist in any case combination then we will be most
of the way there.

That immediately throws out most of the "why don't you just use a
cache" arguments that everyone seems to come up with. We *do* use a
cache that primes the "most likely" filename code, it's just that a
cache is almost useless when you are trying to prove that a file
definitely doesn't exist.

 >  - samba probably has certain pretty well-defined special patterns for 
 >    what it wants to do with a filename, so you probably don't need a
 >    generic "everything that takes a filename should be case-insensitive", 
 >    and it would be acceptable to have a few _very_ specific system calls.

yes, if we had a single function that took a pathname and gave us
either -1/ENOENT or the pathname of a file that matches
case-insensitively then that would be great. Then again, if we had
such a function then it would be really easy to use that function in
the VFS to make Linux case-insensitive on a per-process basis.

So let's imagine we have such a function like this:

  int ci_normalize(char *path);

Let's assume it takes a pathname and returns either -1/ENOENT or
modifies the pathname in place (totally ignoring the fact that the
length of the pathname could change, and that the "char *" is really a
"const char *" - pedants go home).

now let's build a ci_unlink() on top of that:

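   int ci_unlink(char *path)
   {
	/* case-sensitive tasks take the normal, dcache-friendly path */
	if (task_is_case_sensitive(current))
		return unlink(path);

	/* rewrite path to its on-disk spelling; fails if no case variant exists */
	if (ci_normalize(path) == -1)
		return -1;

	return unlink(path);
   }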

The problem is the negative dentries. If you do the above then
case-sensitive processes will be fast, but case-insensitive processes
will effectively be running without the negative dcache, so unlink()
on paths that don't exist will be slow each and every time. That's why
doing this with any sort of decent efficiency needs dcache changes.

btw, I already know that Al is completely and utterly opposed to
putting any case-insensitivity in the dcache (I think the phrase "over
my dead body" was mentioned), so I know that I'm fighting an uphill
battle here, but I like trying every now and again to see if I can
make any progress.

 > With those assumptions out of the way, we could think of an interface that
 > exports some partial functionality of the "lookup_path()" code in the kernel
 > as a special system call. In particular, something that takes an input
 > pathname, and is able to stop at any point of the name when a lookup
 > fails.
 > So some variation of the interface
 > 
 > 	int magic_open(
....

how would this interact with the negative dcache entries? That is the
key.

 > I suspect we can do case-insensitive names faster than WNT even with a 
 > fairly complex user-mode interface. Just because _not_ having them in the 
 > kernel allows us to have much faster default behaviour.

on this I completely disagree. Any solution that doesn't cope with
case insensitive properties of negative dentries is just going to
start filling the dcache with lots of useless entries (case
combinations) or effectively not end up using the dcache at
all. Either way it's a big loss compared to making the dcache know
about case insensitivity properly.

Cheers, Tridge

PS: ahh, what timing, someone just posted a request to the rsync list
asking for case-insensitivity in rsync.

* Re: UTF-8 and case-insensitivity
  2004-02-17  4:12 UTF-8 and case-insensitivity tridge
  2004-02-17  5:11 ` Linus Torvalds
  2004-02-17  5:25 ` Tim Connors
@ 2004-02-17  7:43 ` H. Peter Anvin
  2004-02-17  8:05   ` H. Peter Anvin
  2004-02-17 14:25 ` Dave Kleikamp
  2004-02-18  0:16 ` Robert White
  4 siblings, 1 reply; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-17  7:43 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <16433.38038.881005.468116@samba.org>
By author:    tridge@samba.org
In newsgroup: linux.dev.kernel
>
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
> 
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
> 

Realistically, the only sane way to do this is to put our foot down
and say: UTF-8 is *the* encoding.  A good step in that direction would
be to set utf-8 to be the default NLS in the kernel, but as long as
people keep the whole sick idea that we can continue to use
locale-dependent encoding we're in for a world of hurt.

That's really the long and short of it.  Until people are willing to
say "we support UTF-8, anything else and it's anyone's guess what
happens" then nothing is going to happen.

	-hpa
-- 
PGP public key available - finger hpa@zytor.com
Key fingerprint: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
"The earth is but one country, and mankind its citizens."  --  Bahá'u'lláh
Just Say No to Morden * The Shadows were defeated -- Babylon 5 is renewed!!

* Re: UTF-8 and case-insensitivity
  2004-02-17  7:43 ` H. Peter Anvin
@ 2004-02-17  8:05   ` H. Peter Anvin
  0 siblings, 0 replies; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-17  8:05 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <c0sgnc$ngo$1@terminus.zytor.com>
By author:    hpa@zytor.com (H. Peter Anvin)
In newsgroup: linux.dev.kernel
> 
> Realistically, the only sane way to do this is to put our foot down
> and say: UTF-8 is *the* encoding.  A good step in that direction would
> be to set utf-8 to be the default NLS in the kernel, but as long as
> people keep the whole sick idea that we can continue to use
> locale-dependent encoding we're in for a world of hurt.
> 
> That's really the long and short of it.  Until people are willing to
> say "we support UTF-8, anything else and it's anyone's guess what
> happens" then nothing is going to happen.
> 

Oh yes, on top of that, if you want case insensitivity, then you also
need to start thinking about a whole lot of other things, including
what normalization form(s) you care about.  Keeping normalization (as
well as case-conversion) data for the entire Unicode space in the
kernel is a boatload of memory.

Then, you have to deal with your filesystem going sour on you when two
files suddenly alias, because there is a new revision of the mapping
tables.

Case seemed simple when we were dealing with the "let's teach them all
English" world, but even when you're dealing with languages like
German (ß) or Dutch (Ĳ) things get fuzzy... what's worse, in
Turkish the uppercase equivalent of "i" (U+0069) isn't "I" (U+0049),
it's "İ" (U+0130)!  There is no table which can tell you that, since
it's context-dependent.  Thus, you may now need to consider larger
equivalence classes, but is the other user expecting the same thing?
You can't just use the same base letter being equivalent everywhere,
or a Swedish user would beat the sh*t out of you for confusing the
words "vas" and "väs".  On the other hand, the Swedish user would be
perfectly happy having "ä" equivalent with "æ" and "ü" equivalent
with "y"!

Therein lies madness.
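
The Turkish dotted/dotless i problem can be shown with a tiny
sketch. Everything here is illustrative: fold_cp() and
names_equal_ci() are hypothetical names, names are passed as arrays
of code points of equal length, and only ASCII letters plus U+0130
and U+0131 are handled. The point is only that the same two names
compare equal under one folding rule and different under the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which case-folding convention a comparison should follow. */
enum fold_rule { FOLD_DEFAULT, FOLD_TURKIC };

/*
 * Fold one code point to lower case. Deliberately tiny: ASCII letters
 * plus the two Turkic special cases, nothing else.
 */
static uint32_t fold_cp(uint32_t c, enum fold_rule rule)
{
    if (rule == FOLD_TURKIC) {
        if (c == 0x0049)            /* 'I' folds to dotless i, U+0131 */
            return 0x0131;
        if (c == 0x0130)            /* dotted capital I folds to 'i' */
            return 0x0069;
    }
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/* Compare two names of `n` code points each under the given rule. */
int names_equal_ci(const uint32_t *a, const uint32_t *b, size_t n,
                   enum fold_rule rule)
{
    assert(a && b);
    for (size_t i = 0; i < n; i++) {
        if (fold_cp(a[i], rule) != fold_cp(b[i], rule))
            return 0;
    }
    return 1;
}
```

With these rules "FILE" and "file" are the same name under
FOLD_DEFAULT but two distinct names under FOLD_TURKIC, so a directory
that is consistent for one user can contain aliases for another.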

	-hpa

-- 
PGP public key available - finger hpa@zytor.com
Key fingerprint: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
"The earth is but one country, and mankind its citizens."  --  Bahá'u'lláh
Just Say No to Morden * The Shadows were defeated -- Babylon 5 is renewed!!

* Re: UTF-8 and case-insensitivity
  2004-02-17  6:54   ` tridge
@ 2004-02-17  8:33     ` Neil Brown
  2004-02-17 22:48       ` tridge
  2004-02-17 15:13     ` Linus Torvalds
  1 sibling, 1 reply; 123+ messages in thread
From: Neil Brown @ 2004-02-17  8:33 UTC (permalink / raw)
  To: tridge; +Cc: Linus Torvalds, Kernel Mailing List, Al Viro

On Tuesday February 17, tridge@samba.org wrote:
> 
> I also think that if the choice were given then some linux distros
> (the likes of Lindows comes to mind) would choose to run all processes
> case-insensitive. These sorts of distros are aiming at the sorts of
> users that would want everything to be case-insensitive.

This is the bit I don't understand.

Surely the value of case-insensitivity is that you can type in a
filename from memory and not worry about what case you used when you
created the file.

Yet with Lindows / MS-Windows style interfaces, you virtually never
type the name of a pre-existing file.  So case-insensitivity doesn't
seem to be a win to the user.

I thought the value of case-insensitive filenames was for
legacy applications which have been written to the WIN32 API and took
lots of liberties with "pretty-casing" filenames between readdir and
open. 

NeilBrown

* Re: UTF-8 and case-insensitivity
  2004-02-17  4:12 UTF-8 and case-insensitivity tridge
                   ` (2 preceding siblings ...)
  2004-02-17  7:43 ` H. Peter Anvin
@ 2004-02-17 14:25 ` Dave Kleikamp
  2004-02-18  0:16 ` Robert White
  4 siblings, 0 replies; 123+ messages in thread
From: Dave Kleikamp @ 2004-02-17 14:25 UTC (permalink / raw)
  To: tridge; +Cc: linux-kernel

On Mon, 2004-02-16 at 22:12, tridge@samba.org wrote:
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
> 
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
> 
> Personally I much prefer the systems I use to be case-sensitive, but
> there are important applications that require case-insensitivity for
> interoperability. Right now it is not possible to write a case
> insensitive application on Linux in an efficient manner. With the
> current "encoding agnostic" APIs a simple open() or stat() call
> becomes a horrendously expensive operation and one that is fraught
> with race conditions. Providing the same functionality in the kernel
> is dirt cheap by comparison (not cheap in terms of code complexity,
> but cheap in terms of runtime efficiency).

This would be easy to do in JFS due to the baggage we carried over to be
compatible with OS/2-formatted volumes.  In OS/2, the directories were
ordered in a case-insensitive fashion.  This would have to be a mkfs
option, and would not be a per-process option.  The directories must be
created either case-sensitive or not.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


* Re: UTF-8 and case-insensitivity
  2004-02-17  6:54   ` tridge
  2004-02-17  8:33     ` Neil Brown
@ 2004-02-17 15:13     ` Linus Torvalds
  2004-02-17 16:57       ` Linus Torvalds
  2004-02-17 23:20       ` tridge
  1 sibling, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 15:13 UTC (permalink / raw)
  To: tridge; +Cc: Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004 tridge@samba.org wrote:
> 
> From memory, the patch added new classes of dentries to the current
> "+ve" and "-ve" dentries. It added concepts like a "-ve
> case-insensitive" dentry and a "-ve case-sensitive" dentry. It
> certainly adds more code in trying to deal with these variants, but I
> see no reason why it should be significantly computationally less
> efficient.

Yes, we could add context sensitivity to the dcache with a context 
bitmask.

However, it's _not_ correct.

It assumes that there is only one way to do lower/upper case, which just 
isn't true. What about different locales that have different case rules? 
Your "one bit per dentry" becomes "one bit per locale per dentry". That's 
just horribly hard to do.

I don't know how Windows does it, so maybe this thing is hardcoded, and 
you don't even want "true" case insensitivity. How "correct" is Windows?

(And don't even bother telling me about the translation table in NTFS 
volumes - I'm not interested. This would have to work on a sane filesystem 
to be useful, even for samba.)

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 15:13     ` Linus Torvalds
@ 2004-02-17 16:57       ` Linus Torvalds
  2004-02-17 19:44         ` viro
                           ` (2 more replies)
  2004-02-17 23:20       ` tridge
  1 sibling, 3 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 16:57 UTC (permalink / raw)
  To: tridge; +Cc: Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004, Linus Torvalds wrote:
> 
> It assumes that there is only one way to do lower/upper case, which just 
> isn't true. What about different locales that have different case rules? 
> Your "one bit per dentry" becomes "one bit per locale per dentry". That's 
> just horribly hard to do.

It's also hard to know what to do when there are two filenames that
literally _are_ the same when not comparing cases. Which can obviously
happen under Linux - you'd have a case-sensitive app that creates both
"makefile" and "Makefile", and now you have a case-insensitive app that
looks it up (or worse, removes it), and what the *heck* is the dcache now
supposed to really do?

This is why I'd hate for the generic Linux dcache to know about case
sensitivity, and I'd be a lot happier having a separate path (which isn't
as speed-critical) that can be used to help implement helper functions for
doing case-insensitive things.

That way the bugs and strange behaviour would all be limited to the
case-insensitive special code, and not pollute the "sane" side.

For example, I fundamentally can't easily do an atomic exclusive
case-insensitive "create" or "rename", but we _could_ expose things like
directory generation counts to the special interfaces, and thus allow at
least "local-atomic" operations (but they would _not_ be atomic over a
network, to give you an idea of the kinds of _fundamental_ limitations
there are here).

That's why I'd advocate having a few very special system calls for doing
the operations that samba (and I'll throw wine into the pot too) wants to
do. So you could literally do an atomic create with something like

 - regular atomic create of random case-_sensitive_ name using something 
   tempnam()-like (use a prefix that is invalid on windows or something: 
   make the first character be 0xff or whatever).
 - "read directory local sequence count"
 - readdir to make sure that the new name is still unique even in the
   case-insensitive sense
 - "atomic move conditionally on the local sequence count still being X"

The thing is, we can do hacks like the above, and yes, we could do them all
inside the kernel, and give user space a reasonably nice interface with 
"pseudo-atomic" behaviour (ie it will _not_ be atomic if multiple clients 
do this over NFS, but I doubt you care).

But it wouldn't be "open()" and "rename()". It would be a totally separate
kernel path. It would be in the "case-insensitivity-module". It would be 
_outside_ the regular VFS layer, although it would have some visibility 
into it (ie it could follow dentries on its own, and know about the RCU 
etc locking rules).

We can even allow that case-insensitive module to set some flags in the 
dentries (so that you can create negative dentries that have a flag set 
"this is negative for all cases").

Trust me, this is much less intrusive, and a lot easier to debug too. It 
won't be as fast as the regular path operations, but depending on what the 
common cases are (hopefully "look up name that is exact"), it would likely 
not be horrible either. And it could probably be debugged as a real 
module, without impacting any existing code, which would make it a lot 
easier to create.

See where I'm going? Would this be acceptable to you? Are there any samba 
people who are knowledgeable about the VFS-layer and have the time/energy 
to try something like this?

Al? What do you think?

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 16:57       ` Linus Torvalds
@ 2004-02-17 19:44         ` viro
  2004-02-17 20:10           ` Linus Torvalds
  2004-02-17 21:08         ` Robin Rosenberg
  2004-02-17 23:57         ` tridge
  2 siblings, 1 reply; 123+ messages in thread
From: viro @ 2004-02-17 19:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, Kernel Mailing List

On Tue, Feb 17, 2004 at 08:57:40AM -0800, Linus Torvalds wrote:
> Trust me, this is much less intrusive, and a lot easier to debug too. It 
> won't be as fast as the regular path operations, but depending on what the 
> common cases are (hopefully "look up name that is exact"), it would likely 
> not be horrible either. And it could probably be debugged as a real 
> module, without impacting any existing code, which would make it a lot 
> easier to create.
> 
> See where I'm going? Would this be acceptable to you? Are there any samba 
> people who are knowledgeable about the VFS-layer and have the time/energy 
> to try something like this?
> 
> Al? What do you think?

What will protect your generation counts during the operation itself?
->i_sem?

If anything, I'd suggest doing it as
	cretinous_rename(dir_fd, name1, name2)
with the following semantics:

	* if directory had been changed since open() that gave us dir_fd -
-EFOAD
	* otherwise, rename name1 to name2 (no cross-directory renames here).

No need to expose generation counts to userland - we can just compare the
count at open() time with that at operation time.  The rest can be done
in userland (including creation of files).

We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
near the kernel - it's insane.  If samba wants it, they get to pay the price,
both in performance and keeping butt-ugly code (after all, the goal of project
is to imitate butt-ugly system for butt-ugly clients).  The same goes for Wine.

And we really don't want to encourage those who port Windows userland in
not fixing the idiotic semantics.  As for Lindows... let's just say that
I can't find any way to describe what I really think of those clowns, their
intellect and their morals that wouldn't lead to a lawsuit from them.

* Re: UTF-8 and case-insensitivity
  2004-02-17 19:44         ` viro
@ 2004-02-17 20:10           ` Linus Torvalds
  2004-02-17 20:17             ` viro
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 20:10 UTC (permalink / raw)
  To: viro; +Cc: tridge, Kernel Mailing List

On Tue, 17 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> What will protect your generation counts during the operation itself?
> ->i_sem?

Yes. You have to take it anyway, so why not?

> If anything, I'd suggest doing it as
> 	cretinous_rename(dir_fd, name1, name2)
> with the following semantics:
> 
> 	* if directory had been changed since open() that gave us dir_fd -
>   -EFOAD
> 	* otherwise, rename name1 to name2 (no cross-directory renames here).

Sure, that works.

> No need to expose generation counts to userland - we can just compare the
> count at open() time with that at operation time.  The rest can be done
> in userland (including creation of files).

Note that I'm not sure we would expose generation counts at all to user 
space: we might keep all of this inside the "crapola windows behaviour" 
module, and user space could actually see some easier highlevel interface. 
Something like yours, but I suspect we'd want to see what the whole 
user-level loop would look like to know what the architecture should be 
like.

I do believe we'd need to have some way to "refresh" the fd in your
example, without restarting the whole lookup. So that when the user gets 
EFOAD, it can do

	refresh(fd);
	readdir(fd);
	/* Check that nothing clashes */
	goto try_again;

or similar. So the generation count _semantics_ would be exposed, even if 
the numbers themselves would be hidden inside the kernel.
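
The retry loop above can be sketched as a user-space toy. This is only an
illustration under stated assumptions: the directory is an in-memory array,
every toy_* name and the TOY_* codes are hypothetical, TOY_CHANGED stands in
for the tongue-in-cheek EFOAD, and strcasecmp() is an ASCII-only substitute
for whatever comparison user space would really use. No kernel interface of
this shape exists.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

#define TOY_MAX_NAMES 16

enum toy_status {
    TOY_OK      =  0,
    TOY_FULL    = -1,
    TOY_CHANGED = -2,   /* stand-in for the thread's EFOAD */
    TOY_CLASH   = -3
};

/* Toy directory: gen is bumped on every modification. */
struct toy_dir {
    const char *names[TOY_MAX_NAMES];
    int count;
    unsigned long gen;
};

/* Toy "fd": remembers the generation seen at open/refresh time. */
struct toy_fd {
    struct toy_dir *dir;
    unsigned long gen_seen;
};

/* Plays the role of refresh(fd), or lseek(fd, 0, 0). */
void toy_refresh(struct toy_fd *fd)
{
    assert(fd != NULL && fd->dir != NULL);
    fd->gen_seen = fd->dir->gen;
}

/* Unconditional create, as a case-sensitive application would do it. */
int toy_add(struct toy_dir *d, const char *name)
{
    if (d->count == TOY_MAX_NAMES)
        return TOY_FULL;
    d->names[d->count++] = name;
    d->gen++;
    return TOY_OK;
}

/* Create only if the directory is unchanged since the last refresh. */
int toy_create_if_unchanged(struct toy_fd *fd, const char *name)
{
    if (fd->dir->gen != fd->gen_seen)
        return TOY_CHANGED;
    return toy_add(fd->dir, name);
}

/* The "readdir and check that nothing clashes" step. */
int toy_clashes(const struct toy_dir *d, const char *name)
{
    for (int i = 0; i < d->count; i++)
        if (strcasecmp(d->names[i], name) == 0)
            return 1;
    return 0;
}

/* The whole user-level loop: refresh, scan, conditional create, retry. */
int ci_exclusive_create(struct toy_fd *fd, const char *name)
{
    for (;;) {
        toy_refresh(fd);
        if (toy_clashes(fd->dir, name))
            return TOY_CLASH;
        int rc = toy_create_if_unchanged(fd, name);
        if (rc != TOY_CHANGED)
            return rc;
        /* directory changed under us: go around again */
    }
}
```

In this single-threaded toy the loop never actually goes around twice; the
point is only that the conditional create refuses to proceed when the
generation seen at refresh time is stale.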

> We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
> near the kernel - it's insane.  If samba wants it, they get to pay the price,
> both in performance and keeping butt-ugly code (after all, the goal of project
> is to imitate butt-ugly system for butt-ugly clients).  The same goes for Wine.

I agree. We'd need to let user space do the equality comparisons, I just 
don't see how to sanely do it in kernel land.

> And we really don't want to encourage those who port Windows userland in
> not fixing the idiotic semantics.  As for Lindows... let's just say that
> I can't find any way to describe what I really think of those clowns, their
> intellect and their morals that wouldn't lead to a lawsuit from them.

Heh.

I suspect most people don't care that much, but I also suspect that 
projects like samba have to have a "anal mode" where they really act like 
Windows, even when it's "wrong". People can then choose to say "screw that 
idiocy", but by just _having_ a very compatible mode you deflect a lot of 
criticism. Regardless of whether people want the anal mode or not in real 
life.

Backwards compatibility is King. It's _hugely_ important. It's one of the
most important things to me in the kernel, and by the same logic I do see
that it is important to others as well - even when the backwards
compatibility ends up being inherited from a broken Windows setup. So
while I hate case-insensitive names, I do understand that people want to
have some way to emulate the braindamage for some _really_ "ass-backwards"
compatibility reasons.

So I think it's worth some pain, as long as we keep that compatibility 
from starting to encrust the _good_ stuff.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 20:10           ` Linus Torvalds
@ 2004-02-17 20:17             ` viro
  2004-02-17 20:23               ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: viro @ 2004-02-17 20:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, Kernel Mailing List

On Tue, Feb 17, 2004 at 12:10:23PM -0800, Linus Torvalds wrote:
> I do believe we'd need to have some way to "refresh" the fd in your
> example, without restarting the whole lookup. So that when the user gets 
> EFOAD, it can do
> 
> 	refresh(fd);

lseek(fd, 0, 0);

> > And we really don't want to encourage those who port Windows userland in
> > not fixing the idiotic semantics.  As for Lindows... let's just say that
> > I can't find any way to describe what I really think of those clowns, their
> > intellect and their morals that wouldn't lead to a lawsuit from them.
> 
> Heh.
> 
> I suspect most people don't care that much, but I also suspect that 
> projects like samba have to have a "anal mode" where they really act like 
> Windows, even when it's "wrong". People can then choose to say "screw that 
> idiocy", but by just _having_ a very compatible mode you deflect a lot of 
> criticism. Regardless of whether people want the anal mode or not in real 
> life.

Umm...  Samba deals with Windows clients.  Windows software allegedly being
ported to Linux is a different story and in that case there's no excuse for
demanding case-insensitive operations.

* Re: UTF-8 and case-insensitivity
  2004-02-17 20:17             ` viro
@ 2004-02-17 20:23               ` Linus Torvalds
  0 siblings, 0 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 20:23 UTC (permalink / raw)
  To: viro; +Cc: tridge, Kernel Mailing List

On Tue, 17 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> > 	refresh(fd);
> 
> lseek(fd, 0, 0);

Yes. We can make that implicitly refresh, I'm certainly ok with that.

> > I suspect most people don't care that much, but I also suspect that 
> > projects like samba have to have a "anal mode" where they really act like 
> > Windows, even when it's "wrong". People can then choose to say "screw that 
> > idiocy", but by just _having_ a very compatible mode you deflect a lot of 
> > criticism. Regardless of whether people want the anal mode or not in real 
> > life.
> 
> Umm...  Samba deals with Windows clients.  Windows software allegedly being
> ported to Linux is a different story and in that case there's no excuse for
> demanding case-insensitive operations.

"wine". It's not porting, it's emulation.

But yes, I agree, I don't see any other cases where we want it. 

We basically want to support broken clients - whether they be on the other 
side of the network, or the other side of an emulation interface. That is 
the only valid reason to do this crap.

It's a fairly sizeable reason, though. On another front ("World
Domination, Fast!") we'll try to fix the problem another way, but there's 
nothing wrong with fighting on multiple fronts if you have the man-power.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 16:57       ` Linus Torvalds
  2004-02-17 19:44         ` viro
@ 2004-02-17 21:08         ` Robin Rosenberg
  2004-02-17 21:17           ` Linus Torvalds
  2004-02-17 23:57         ` tridge
  2 siblings, 1 reply; 123+ messages in thread
From: Robin Rosenberg @ 2004-02-17 21:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, Kernel Mailing List, Al Viro

On Tuesday 17 February 2004 17.57, Linus Torvalds wrote:
[case-insanesititvity proposal ///]
> See where I'm going? Would this be acceptable to you? Are there any samba 
> people who are knowledgeable about the VFS-layer and have the time/energy 
> to try something like this?

So the same guy that strongly insists that a file is a string of bytes and nothing else,
now thinks it is sane to even think of "case" of a byte. That's impossible unless you
actually DO believe it's a bunch of characters.  What is it?

-- robin

* Re: UTF-8 and case-insensitivity
  2004-02-17 21:08         ` Robin Rosenberg
@ 2004-02-17 21:17           ` Linus Torvalds
  2004-02-17 22:27             ` Robin Rosenberg
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 21:17 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: tridge, Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004, Robin Rosenberg wrote:
>
> On Tuesday 17 February 2004 17.57, Linus Torvalds wrote:
> [case-insanesititvity proposal ///]
> > See where I'm going? Would this be acceptable to you? Are there any samba 
> > people who are knowledgeable about the VFS-layer and have the time/energy 
> > to try something like this?
> 
> So the same guy that strongly insists that a file is a string of bytes and nothing else,
> now thinks it is sane to even think of "case" of a byte. That's impossible unless you
> actually DO believe it's a bunch of characters.  What is it?

Which part of my argument don't you understand?

The kernel proper thinks it's just a stream of bytes, and all the existing 
interfaces do likewise.

But we'd have a kernel helper module to let samba do what it already does 
now, except help it do so more efficiently?

The fact that _I_ think pathnames are just a nice stream of bytes sadly 
doesn't make Windows clients do the same. Some day when I'm King Of The 
World, and I can outlaw windows clients, we'll finally get rid of the 
braindamage, but until then I'm pragmatic enough to say "let's help out 
the poor samba people who have to deal with the crap day in and day out".

What's your problem with that?

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 21:17           ` Linus Torvalds
@ 2004-02-17 22:27             ` Robin Rosenberg
  2004-02-18  3:02               ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Robin Rosenberg @ 2004-02-17 22:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, Kernel Mailing List, Al Viro

On Tuesday 17 February 2004 22.17, Linus Torvalds wrote:
> The fact that _I_ think pathnames are just a nice stream of bytes sadly 
> doesn't make Windows clients do the same. Some day when I'm King Of The 
> World, and I can outlaw windows clients, we'll finally get rid of the 
LPA = Linus' Patriot Act. 

> braindamage, but until then I'm pragmatic enough to say "let's help out 
> the poor samba people who have to deal with the crap day in and day out".
> 
> What's your problem with that?

Nothing wrong with helping people. 

Having to put up with the existence of Windows day in and out is the reason I'm still on
an eight-bit encoding.  Sorry for not explaining the REAL problem, but only a partial
problem. I need to support all kinds of clients on Windows with protocols that convey no
character set info. With samba that's no problem. Having to put up with a Unix world running 
ISO-8859-1 (or ISO-8859-15) is another. Of course that means Linux machines also add
to the disturbance by not storing things as unicode. The real obstacle is file names, 
everything else including content of files, I can handle (I think). Maybe I'll find a solution
for the filenames too, but usually some hot discussions are needed for the brain to kick
into the right gear. 

I want to switch to UTF-8 to work better with the outside world, but as things are people will 
start to take notice of what OS is running in the shadows when they see the filename problems, and 
start demanding Windows, and ...  You see; I'm not mean; I don't want to do that to them (or myself),

-- robin

* Re: UTF-8 and case-insensitivity
  2004-02-17  8:33     ` Neil Brown
@ 2004-02-17 22:48       ` tridge
  2004-02-18  0:06         ` Neil Brown
  0 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-17 22:48 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linus Torvalds, Kernel Mailing List, Al Viro

Neil,

 > I thought the value of a case-insensitive filenames was for
 > legacy applications which have been written to the WIN32 API and took
 > lots of liberties with "pretty-casing" filenames between readdir and
 > open. 

No, that's a common misconception. It does happen (the "pretty-casing")
but it's relatively rare these days. The real problem is *proving* that
a file doesn't exist. If a file does exist then there are all sorts of
heuristic and cache mechanisms that can be used to get the real
filename quickly on average, but if you have to prove absolutely that
a file does not exist then all of that stuff is pretty much useless.

Samba (and any other system that wants case-insensitive semantics on
Linux) can't make do with "oh, it probably doesn't exist". That way
leads to data loss. You have to know with 100% certainty that the file
doesn't exist in any case combination.

Unfortunately, that is also the hardest thing to do.
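
To make the cost concrete, here is a minimal sketch of the scan that user
space is left with today. It uses only the standard opendir()/readdir()
calls; ci_exists() and names_match_ci() are hypothetical names, strcasecmp()
is an ASCII-only stand-in for a real UTF-8 comparison, and readdir() errors
are not distinguished from end-of-directory.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <strings.h>

/* ASCII-only stand-in for a real case-insensitive comparison. */
int names_match_ci(const char *a, const char *b)
{
    return strcasecmp(a, b) == 0;
}

/*
 * Returns 1 if some entry in dirpath matches name ignoring case,
 * 0 if the whole directory was scanned without a match,
 * -1 if the directory could not be opened.
 */
int ci_exists(const char *dirpath, const char *name)
{
    assert(dirpath != NULL && name != NULL);

    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    int found = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (names_match_ci(ent->d_name, name)) {
            found = 1;      /* a hit can stop early... */
            break;
        }
    }
    /* ...but a miss is only proven once every entry has been read. */
    closedir(dir);
    return found;
}
```

A 0 result costs a full O(n) pass over the directory, and it is already
stale by the time it is returned: another process can create a clashing
name between the scan and whatever the caller does next.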

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-17 15:13     ` Linus Torvalds
  2004-02-17 16:57       ` Linus Torvalds
@ 2004-02-17 23:20       ` tridge
  2004-02-17 23:43         ` Linus Torvalds
  2004-02-18  2:37         ` H. Peter Anvin
  1 sibling, 2 replies; 123+ messages in thread
From: tridge @ 2004-02-17 23:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Al Viro

Linus,

 > Yes, we could add context sensitivity to the dcache with a context 
 > bitmask.
 > 
 > However, it's _not_ correct.
 > 
 > It assumes that there is only one way to do lower/upper case, which just 
 > isn't true. What about different locales that have different case rules? 
 > Your "one bit per dentry" becomes "one bit per locale per dentry". That's 
 > just horribly hard to do.

I think you're making it sound much harder than it really is.

We just add a VFS hook in the filesystems. The filesystem chooses the
encoding specific comparison function. If the filesystem doesn't
provide one then don't do case insensitivity. If the filesystem does
provide one (for example NTFS, JFS) then use it. Then all I need to do
is convince one of the filesystem maintainers to add a mount time
option to specify the case table (for example by specifying the name
of a file in the filesystem that holds it).

So, all the really ugly stuff is then in the per-filesystem code, and
all the VFS and dcache has to do is know about a single context bit
per dentry. 

 > I don't know how Windows does it, so maybe this thing is hardcoded, and 
 > you don't even want "true" case insensitivity. 

NTFS has a 128k table on disk, created at mkfs time and indexed by the
UCS2 character. The interesting thing about this table is that it
doesn't seem to vary between different locales as one might expect. I
have checked 3 locales so far (Swedish, Japanese and English) and all
have the same 128k table. I should check a few more locales to see if
it really is the same everywhere. Contact me off-list if you have a
NTFS filesystem created in a different locale and would be willing to
run a test program against it to see if the table is different from
the one we have in Samba.
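
As a rough illustration of why the table is 128k: one 16-bit entry per UCS2
code unit gives 65536 * 2 = 131072 bytes. The sketch below only shows that
layout and how a comparison through it would look. The names are
hypothetical and the contents are a placeholder (identity plus ASCII a-z),
not the table that NTFS actually stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCS2_CODE_UNITS 65536u

/* 65536 entries * 2 bytes each = 131072 bytes = 128k. */
static uint16_t upcase_table[UCS2_CODE_UNITS];

/* Placeholder contents: identity mapping, then fold ASCII a-z only. */
void upcase_table_init(void)
{
    for (uint32_t c = 0; c < UCS2_CODE_UNITS; c++)
        upcase_table[c] = (uint16_t)c;
    for (uint16_t c = 'a'; c <= 'z'; c++)
        upcase_table[c] = (uint16_t)(c - 'a' + 'A');
}

/* Size of the table in bytes. */
size_t upcase_table_bytes(void)
{
    return sizeof upcase_table;
}

/* Compare two UCS2 strings by mapping every unit through the table. */
int ucs2_equal_ci(const uint16_t *a, size_t alen,
                  const uint16_t *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (upcase_table[a[i]] != upcase_table[b[i]])
            return 0;
    return 1;
}
```

Because the lookup is a plain array index per code unit, the comparison
itself is cheap; the open question in this thread is where the table lives
and who gets to call it.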

There is stuff in the charset handling of every locale that does vary
in windows, but it isn't the case table, it's the "valid characters"
map used to determine what characters are allowed when converting
strings into legacy multi-byte encodings. Even I don't think that the
kernel will ever have to deal with that crap unless someone is foolish
enough to port Samba into the kernel (several people have actually
done that despite the insanity of the idea, but they all did an
absolutely terrible job of it and certainly didn't take care to get
all the charset handling right).

> How "correct" is Windows?

from my rather limited point of view I always have to assume that
windows is "correct", unless I can show that its behaviour leads to
data loss, a security hole or something equally extreme.

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-17 23:20       ` tridge
@ 2004-02-17 23:43         ` Linus Torvalds
  2004-02-18  3:26           ` tridge
  2004-02-18  2:37         ` H. Peter Anvin
  1 sibling, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-17 23:43 UTC (permalink / raw)
  To: tridge; +Cc: Kernel Mailing List, Al Viro

On Wed, 18 Feb 2004 tridge@samba.org wrote:
> 
> I think you're making it sound much harder than it really is.

I think I'm just making the mistake of assuming that anybody would care to 
do it "right", while everybody really only cares to get it be compatible 
with Windows.

For example, if you only want to be compatible with Windows, you don't 
have to worry about UCS-4, you only have the UCS-2 part, which means that 
you can do a silly array-lookup based thing or something.

> We just add a VFS hook in the filesystems. The filesystem chooses the
> encoding specific comparison function. If the filesystem doesn't
> provide one then don't do case insensitivity. If the filesystem does
> provide one (for example NTFS, JFS) then use it. Then all I need to do
> is convince one of the filesystem maintainers to add a mount time
> option to specify the case table (for example by specifying the name
> of a file in the filesystem that holds it).

Ugh. What a horrible kludge, and it won't work without "preparing" the 
filesystem at mount-time. I'd much rather leave the translation table in 
user space, and just give it as an argument to the "look up case 
insensitive" special thing.

That would mean that we can hold the directory semaphore over the whole 
thing, which would simplify _my_ kludge, since there would be no need to 
worry about user space having separate stages.

The hard part would be negative dentries. We'd have to invalidate all
"case-insensitive" negative dentries when creating any new file in a
directory, and that would be something the generic VFS layer would have to 
know about, and that might be unacceptable to Al.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17 16:57       ` Linus Torvalds
  2004-02-17 19:44         ` viro
  2004-02-17 21:08         ` Robin Rosenberg
@ 2004-02-17 23:57         ` tridge
  2 siblings, 0 replies; 123+ messages in thread
From: tridge @ 2004-02-17 23:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Al Viro

Linus,

 > It's also hard to know what to do when there are two filenames that
 > literally _are_ the same when not comparing cases. Which can obviously
 > happen under Linux - you'd have a case-sensitive app that creates a both
 > "makefile" and "Makefile", and now you have a case-insensitive app that
 > looks it up (or worse, removes it), and what the *heck* is the dcache now
 > supposed to really do?

This is really not as bad as it first seems. Just think what the
absolutely obvious thing to do is and do that. It's like all those
things in POSIX where it says "if you do XXX then the behaviour is
undefined" and the implementations end up doing whatever the heck they
find easiest to do. It's the same here.

In the example you give then you just give whatever file you come
across first or happen to have in the dcache. You can't do better than
that, as the problem is fundamentally insoluble in a sane fashion, so
just don't try. We've been doing exactly that in Samba for 12 years
(picking the first file we come across) and I can't recall a *single*
complaint about that behaviour. Users *expect* the server to just pick
one, and have no pre-conceived idea of which one it will pick.

Of course, some samba-tuned filesystem could have a mount option to
refuse to allow the creation of filenames that conflict in this way,
but don't even try to enforce this in the kernel core.

 > This is why I'd hate for the generic Linux dcache to know about case
 > sensitivity, and I'd be a lot happier having a separate path (which isn't
 > as speed-critical) that can be used to help implement helper functions for
 > doing case-insensitive things.

The problem is that if that separate path doesn't go via the dcache
then we won't get the invalidation of our negative dentries so we
won't be able to do any better than scanning the whole directory every
time to prove files don't exist. The dcache has to know about this as
it's the only place where all the information that is needed comes
together (I'm sure you'll correct me if I'm wrong about this).

 > That way the bugs and strange behaviour would be all be limited to the 
 > case-insensitive special code, and not pollute the "sane" side.

except when something like a file create happens on the "sane" side of
things and we then have no way of knowing that our name space has just
changed. I suppose we could create a completely new dcache in parallel
with the current one and have some sort of notify between the "sane"
and "insane" worlds, but I suspect the glue code between them would be
worse than just adding that context bit to the main dcache.

 > For example, I fundamentally can't easily do an atomic exclusive
 > case-insensitive "create" or "rename", but we _could_ expose things like
 > directory generation counts to the special interfaces, and thus allow at
 > least "local-atomic" operations (but they would _not_ be atomic over a
 > network, to give you an idea of the kinds of _fundamental_ limitations
 > there are here).

yes, doing atomic network file operations sucks, but please don't let
that stop us doing it in a reasonable fashion for local filesystems.

Doing a nice atomic case-insensitive create or rename is really *no*
different from what we do now in Linux, it just means that we need to
have case-insensitive dentries that mean "this is a negative dentry
that covers all possible case combinations of the name it
contains". It is up to the filesystem to provide you with that -ve
dentry (just like the filesystem provides the case-sensitive -ve
dentries now) and the dcache just has to use it in the same way that
it uses the existing ones.
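
A toy model of what such a negative entry buys, purely as a sketch: this is
not dcache code, all names are hypothetical, folding is ASCII-only, and the
invalidation is the blunt "drop everything on any create in the directory"
rule mentioned elsewhere in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define NEG_SLOTS    8
#define TOY_NAME_LEN 64

/* Toy per-directory cache of names proven absent in every case variant. */
struct neg_cache {
    char folded[NEG_SLOTS][TOY_NAME_LEN];
    int used;
};

/* ASCII-only fold into out; returns -1 if the name does not fit. */
int fold_name(const char *name, char *out, size_t outsz)
{
    size_t n = strlen(name);
    if (n >= outsz)
        return -1;
    for (size_t i = 0; i <= n; i++)   /* <= copies the terminating NUL too */
        out[i] = (char)tolower((unsigned char)name[i]);
    return 0;
}

/* Record that no case variant of name exists (after a scan proved it). */
void neg_cache_add(struct neg_cache *nc, const char *name)
{
    assert(nc != NULL);
    if (nc->used < NEG_SLOTS &&
        fold_name(name, nc->folded[nc->used], TOY_NAME_LEN) == 0)
        nc->used++;
}

/* Returns 1 if the cache proves name is absent in all case variants. */
int neg_cache_hit(const struct neg_cache *nc, const char *name)
{
    char key[TOY_NAME_LEN];
    if (fold_name(name, key, sizeof key) != 0)
        return 0;
    for (int i = 0; i < nc->used; i++)
        if (strcmp(nc->folded[i], key) == 0)
            return 1;
    return 0;
}

/* Any create in the directory drops every case-insensitive negative entry. */
void neg_cache_on_create(struct neg_cache *nc)
{
    nc->used = 0;
}
```

One entry answers "does any case variant exist?" for every spelling of the
name without touching the directory, which is the property a plain
case-sensitive negative entry cannot give.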

If you really don't want to do this then fine, in which case I'll ask
again in a year or two's time and see if I can convince you then. I
know this would make the code messier, and making code messier for the
sake of interoperability with windows is perhaps reason enough not to
do it. But please don't tell me it *can't* be done or that it is just
too hard. That's just not true.

 >  - regular atomic create of random case-_sensitive_ name using something 
 >    tempnam()-like (use a prefix that is invalid on windows or something: 
 >    make the first character be 0xff or whatever).
 >  - "read directory local sequence count"
 >  - readdir to make sure that the new name is still unique even in the
 >    case-insensitive sense
 >  - "atomic move conditionally on the local sequence count still being X"

that could make things atomic, but it won't make it fast. Think about
the fact that modern filesystems are now using better than linear
lists for directories. So in most cases lookups in large directories
can be done in much better than O(n) time (for reasonable values of
n). The above solution means Samba will never be better than O(n), so
for large directories we will always suck performance wise. It doesn't
have to be that way.

 > We can even allow that case-insensitive module to set some flags in the 
 > dentries (so that you can create negative dentries that have a flag set 
 > "this is negative for all cases").

ahh! yipee!

yes, if we have that dentry bit then we have a hope. Without that I
think it won't help much.

 > See where I'm going? Would this be acceptable to you? Are there any samba 
 > people who are knowledgeable about the VFS-layer and have the time/energy 
 > to try something like this?

I'll discuss this with some of the people here in OzLabs and see if we
can come up with a plan. I suspect most of OzLabs will be avoiding me
for a day or two in an attempt to not be the one to do this :-)

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-17 22:48       ` tridge
@ 2004-02-18  0:06         ` Neil Brown
  2004-02-18  9:47           ` Helge Hafting
  0 siblings, 1 reply; 123+ messages in thread
From: Neil Brown @ 2004-02-18  0:06 UTC (permalink / raw)
  To: tridge; +Cc: Linus Torvalds, Kernel Mailing List, Al Viro

On Wednesday February 18, tridge@samba.org wrote:
> 
> Samba (and any other system that wants case-insensitive semantics on
> Linux) can't make do with "oh, it probably doesn't exist". That way
> leads to data loss. You have to know with 100% certainty that the file
> doesn't exist in any case combination.
> 
> Unfortunately, that is also the hardest thing to do.

Hi Tridge,

Maybe if it is so hard, we should just define it to be easy.... just
change the universe a bit.....

I'm sure you've thought about this a lot more than I have or will, so
I must be missing something, but there seems to be a solution that is
efficient, predictable, and should be acceptable.

The first observation is that POSIX applications and WIN32 applications
cannot both get exactly the file system semantics they expect in the
same directory. The example:
    POSIX:
       create "Makefile"
       create "makefile"
    WIN32:
       unlink "MakeFile"
seems to show that.

So decide up front that a WIN32 application will see something
different, and decide what the best thing for it to see would be
(i.e. change the universe).

First cut:
   An application that wants case-insensitive filenames only
   sees those filenames that are in a case-insensitive-canonical-form.
   So the interface maps all file names in requests to a canonical
   form, and the readdir equivalent discards all non-canonical names. 

   Thus in the above example, the WIN32 app would unlink "makefile"
   and never notice that "Makefile" exists.

   This has (to me) two problems.
    1/ case gets lost, so if I save "My File", I will find "my file"
    has been created (unless the application pretty-cases things, in
    which case I can expect case to change anyway).

    2/ Files created by posix apps might be invisible.

    To answer 2/, I'd say "tough".  If you want posix files to be
    visible to WIN32 apps, choose appropriate names.  However I would
    allow there to be a process, either once-off or periodic, which
    creates symlinks from canonical names to non-canonical filenames.
    This would allow you to access pre-existing files where there was
    no ambiguity.

    To answer 1/ I would suggest a second cut at the problem...

Second cut:
    As above, but readdir tries to be clever.  If it sees two (or
    more) names which have the same canonical form, it chooses just
    one of them (predictably), preferring a non-canonical name which is
    a symlink to the canonical name.

    Then when creating an object, you create it with the canonical
    name and (if that succeeds) subsequently create a symlink from the
    requested name to the canonical name (if that is possible, don't
    worry if it isn't).
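
A small sketch of the first cut, assuming "canonical" simply means ASCII
lower case (the proposal does not say which form to use, so that choice and
all the function names here are hypothetical). The symlink handling of the
second cut is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Write the ASCII-lowercased form of name into out; -1 if it doesn't fit. */
int canonical_name(const char *name, char *out, size_t outsz)
{
    assert(name != NULL && out != NULL);
    size_t n = strlen(name);
    if (n >= outsz)
        return -1;
    for (size_t i = 0; i <= n; i++)   /* <= copies the terminating NUL too */
        out[i] = (char)tolower((unsigned char)name[i]);
    return 0;
}

/* A name is canonical when canonicalising it would change nothing. */
int is_canonical(const char *name)
{
    for (const char *p = name; *p; p++)
        if (tolower((unsigned char)*p) != (unsigned char)*p)
            return 0;
    return 1;
}

/*
 * First-cut readdir filter: copy only the canonical names from in[] to
 * out[], returning how many were kept.
 */
size_t filter_canonical(const char *const in[], size_t n, const char *out[])
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (is_canonical(in[i]))
            out[kept++] = in[i];
    return kept;
}
```

With "Makefile" and "makefile" both present, the filter shows only
"makefile", and a request for "MakeFile" is mapped to "makefile" before it
reaches the filesystem, matching the unlink example above.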

Given this approach:

  If only case-insensitive apps use a linux filesystem, they will see
  exactly the semantics they expect, with minimal performance impact.

  If case-sensitive and case-insensitive apps use a linux filesystem,
  they will each see a consistent view and though they may not see the
  same view, there will be well-defined mechanisms which can work at a
  user-space level to resolve or highlight any issues.

The biggest cost I see with this is with large directories.  The
"readdir" equivalent would need to read the whole directory before it
could reliably return any of it.
However  dropping the "guarantee to preserve case" semantic on really
large directories probably isn't an enormous cost (and could be
configurable).

NeilBrown

* RE: UTF-8 and case-insensitivity
  2004-02-17  4:12 UTF-8 and case-insensitivity tridge
                   ` (3 preceding siblings ...)
  2004-02-17 14:25 ` Dave Kleikamp
@ 2004-02-18  0:16 ` Robert White
  2004-02-18  0:20   ` Linus Torvalds
  2004-02-18  2:48   ` tridge
  4 siblings, 2 replies; 123+ messages in thread
From: Robert White @ 2004-02-18  0:16 UTC (permalink / raw)
  To: tridge, linux-kernel
  Cc: 'Linus Torvalds', 'Kernel Mailing List',
	'Al Viro', 'Neil Brown'

OK, so I wrote the below, but then in the summary I realized that there was
a significant factor that doesn't fit in with the rest of the post.  Case
insensitivity, and more generally locale equivalence rules, is a security
nightmare.  Consider the number of different file names that "su" could map
to if you apply case insensitivity (4) and/or worse yet the various accents
and umlauts (?,etc) that are sort-equivalent to "u" in some locales.  The user
types "su" and runs "S(u-umlaut)" etc. 

====

In point of fact (ok in point of "technically abstract truth"), it is a "bad
thing" that Windows (and seemingly only Windows these days) is case
insensitive.  It is sometimes said that windows is really an application and
not an OS.  If you ignore the occasionally snide *way* it is said you can
find some technical truth to the matter.

In point of fact the entire windows application space has a singular active
locale at any one time and there is a well-defined but horrible layer of
indirection where "long names" like "My Documents" become "real names" like
"MYDOCU~1".  Essentially every windows file name is subject to a
double-indirect file name translation.  The first pass is the strcasecmp()
locale-dependent traversal of the "long name" list.  The second is the
strcasecmp() frozen-locale-spec-dependent traversal of (US Latin?) 8.3 file
naming standard list of media elements (files/directories).

In point of fact, Windows is *not* "properly" case insensitive at the file
system level.  Use "dir /x" more often on your windows box to relive the
experience.  The "real" file names are mangled to good old 8.3 uppercase
internally(1).  You don't usually have to think about this, but if you have
ever lost the long-to-short file name mapping on a drive you know the hell
that ensues.  (see also iso9660.)

So the application file naming interface wedge thingy (in windows) creates
and maintains the mixed case names as an illusion.  It just happens to be an
illusion planted so deeply in the application space that it appears to be
coming up from the "operating system level".

OK, as time has moved on, some later versions of later file systems *may* (I
honestly don't know) have modified the double-indirection model, but if they
have, they must have done so in a guaranteed-to-look-the-same way.  Either
way it ends up being quite costly.

Further, the model only really works because a DOS (and therefore windows)
based program invariably and individually takes responsibility for doing all
sorts of tasks like wildcard expansions (etc) in the application space
(often "free" through comctl32.dll).  [This tends to be foreign to Linux
(UNIX) programmers where shells and such do the expansion.]

The line is then blurred further by the subsequent steady creep of
wildcarding and file selection back into common DLLs.  (more comctl32.dll
and friends.)

The thing is, to match this ersatz "functionality" on a system where more
than one locale may be used at the same time, you end up with a kind of
Cartesian product of user locales and filesystem native locales.  The cost
could get extreme and can only really be amortized if Linux were to declare
our own 8.3 style pronouncement for the character classes used for the
"real" file name storage (etc).

Late stage case insensitivity isn't that hard to put in a linux application,
just crack open your file selection dialog boxes and have them use
strcasecmp() in all their select/sort logic.  Also then replace open() with
CaseOpen() which does a find/search operation before daring to creat().
That is, in every practical way, how Windows handles these problems.  It
just happens in some fairly interesting and hard-to-predict places depending
on context.
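
A minimal sketch of what such a userspace CaseOpen() might look like,
assuming Python's os module and a hypothetical helper named case_open
(neither the name nor the signature comes from any real library). The
comments mark where the per-open directory scan and the race windows
come from:

```python
import os
import tempfile


def case_open(directory, name, create=False):
    """Hypothetical CaseOpen(): match an existing entry case-insensitively,
    and only create `name` when nothing matches."""
    wanted = name.casefold()
    # Every call scans the whole directory: O(entries) work per open.
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            # Race: `entry` can be renamed or removed before this open runs.
            return os.open(os.path.join(directory, entry), os.O_RDWR)
    if not create:
        raise FileNotFoundError(name)
    # Race: another process can create "NAME" between the scan and here,
    # leaving two entries that differ only in case.
    return os.open(os.path.join(directory, name),
                   os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o644)


with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Makefile"), "w").close()
    os.close(case_open(d, "MAKEFILE"))              # reuses "Makefile"
    os.close(case_open(d, "readme", create=True))   # nothing matched: create
    names = sorted(os.listdir(d))
```

The scan-then-open sequence is not atomic, so this only behaves when a
single process owns the directory.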

It is easier, IMHO, to bring the users into the 20th century (let alone the
21st 8-) by making them mean what they say (if they deign to step out from
behind their GUIs).

So what was I saying... Oh yeah...

-- Single Locale storage standard required to prevent multiplicative cost.
-- Not that hard to fake case insensitivity "when necessary".
-- Cheaper in CPU/Space to mix case.
-- Native file names in native locales simplifies administration and
expectations. (not elaborated above, but true.)
-- Case insensitivity and locale equivalence leads to uncertainties about
what/which file may be intended in a given context, which could often lead
to exploitable error.

Rob.

(1) The actual truth is a tad uglier than this, the media can have the 8.3
names stored in interesting ways, but essentially a "toupper()" is done on
every file name as it is retrieved and processed.  This cuts out a lot of
possibilities and leads to a lot of "tildes of shame" in even some of the
more harmless seeming name conflicts.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: UTF-8 and case-insensitivity
  2004-02-18  0:16 ` Robert White
@ 2004-02-18  0:20   ` Linus Torvalds
  2004-02-18  1:03     ` Robert White
  2004-02-18 21:48     ` Ville Herva
  2004-02-18  2:48   ` tridge
  1 sibling, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18  0:20 UTC (permalink / raw)
  To: Robert White
  Cc: tridge, 'Kernel Mailing List', 'Al Viro',
	'Neil Brown'



On Tue, 17 Feb 2004, Robert White wrote:
>
> OK, so I wrote the below, but then in the summary I realized that there was
> a significant factor that doesn't fit in with the rest of the post.  Case
> insensitivity, and more generally locale equivalence rules, is a security
> nightmare.  Consider the number of different file names that "su" could map
> to if you apply case insensitivity (4) and/or worse yet the various accents
> and umlauts (ü, etc.) that are sort-equivalent to "u" in some locales.  The user
> types "su" and runs "S(u-umlaut)" etc.

This is but one reason why I will _refuse_ to make case insensitivity
magically start happening on regular "open()" etc calls.

You'd literally have to use a _different_ system call to do a 
case-insensitive file open. Exactly because anything else would be very 
confusing to existing apps (and thus be potential security holes).
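
To make the "(4)" in the quoted text concrete, here is a small sketch
that enumerates the spellings of a name that differ only in letter
case. The helper case_variants is hypothetical, and it only considers
simple upper/lower pairs, not the accent equivalences that a locale's
collation rules would add on top:

```python
from itertools import product


def case_variants(name):
    """All spellings of `name` that differ only in letter case."""
    choices = [sorted({c.lower(), c.upper()}) for c in name]
    return ["".join(p) for p in product(*choices)]


variants = case_variants("su")   # four on-disk names one lookup could hit
```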

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: UTF-8 and case-insensitivity
  2004-02-18  0:20   ` Linus Torvalds
@ 2004-02-18  1:03     ` Robert White
  2004-02-18 21:48     ` Ville Herva
  1 sibling, 0 replies; 123+ messages in thread
From: Robert White @ 2004-02-18  1:03 UTC (permalink / raw)
  To: tridge
  Cc: 'Kernel Mailing List', 'Al Viro',
	'Neil Brown', 'Linus Torvalds'

P.S. Given that the GUI libraries (almost invariably) already deal with
displaying things in a case insensitive way, the "best place to cut" to add
case insensitivity to the user command-line experience would be adding a
flag to file name completion in bash.  Bash is already doing file name finds
and lookups when you press tab; and the user is actively looking at the
correctness and singularity/duality of the results.
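
A rough sketch of the matching step such a completion flag would need,
assuming a hypothetical function complete() that is handed the prefix
typed so far and the directory listing (this is an illustration, not
how bash or readline implement completion):

```python
def complete(prefix, entries, ignore_case=True):
    """Return the directory entries that `prefix` could expand to."""
    def fold(s):
        return s.casefold() if ignore_case else s

    key = fold(prefix)
    return sorted(e for e in entries if fold(e).startswith(key))


entries = ["README", "ReadMe.txt", "readme.md", "main.c"]
loose = complete("readm", entries)                      # flag set
strict = complete("readm", entries, ignore_case=False)  # flag clear
```

With a single match the shell would substitute it; with several it
would list them and leave the choice to the user.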

So the proverbial "vi makef{tab}" would, if the flag was set, show you
makefile, Makefile, and MakeFile (etc) as existent or just switch makef to
"Makefile" if the name were unique.

It doesn't make lives easier for the API level project programmer people
(c.f. samba), but it could uber-happy the incoming newbies, and people like
me who have to interoperate within a vast wasteland of directories full of
inconsistently named files created by windows programmers (like SOCKET.C,
Socket.H, constants.h, and ss_switch.c all in one directory tree with
hundreds of their friends. 8-)

I would however, be forced to throttle myself with my own intestine if
the kernel started doing this magic mapping "for me", especially "in some
calls/contexts but not in others".  (Not that I want to provide my possible
death as a strong motivation for adding the feature. 8-)

Rob.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-17 23:20       ` tridge
  2004-02-17 23:43         ` Linus Torvalds
@ 2004-02-18  2:37         ` H. Peter Anvin
  2004-02-18  3:03           ` Linus Torvalds
  2004-02-18  4:08           ` tridge
  1 sibling, 2 replies; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-18  2:37 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <16434.41376.453823.260362@samba.org>
By author:    tridge@samba.org
In newsgroup: linux.dev.kernel
> 
>  > I don't know how Windows does it, so maybe this thing is hardcoded, and 
>  > you don't even want "true" case insensitivity. 
> 
> NTFS has a 128k table on disk, created at mkfs time and indexed by the
> UCS2 character.

So you're hosed if anyone uses characters outside the UCS-2 character
set...

> The interesting thing about this table is that it doesn't seem to
> vary between different locales as one might expect. I have checked 3
> locales so far (Swedish, Japanese and English) and all have the same
> 128k table. I should check a few more locales to see if it really is
> the same everywhere. Contact me off-list if you have a NTFS
> filesystem created in a different locale and would be willing to run
> a test program against it to see if the table is different from the
> one we have in Samba.

There is a "standard" table, which is published by the Unicode
consortium.  However, the "standard" table isn't what you want in
certain locales, e.g. Turkish.
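
A short sketch of the Turkish case, assuming Python's locale-independent
str.upper() as a stand-in for the "standard" table. The two-entry
override table and the function upper_turkish are hypothetical and
cover only the dotted/dotless i pair:

```python
# Turkish pairs i with İ (U+0130) and ı (U+0131) with I.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}


def upper_default(s):
    """Upper-case with the locale-independent Unicode mapping."""
    return s.upper()


def upper_turkish(s):
    """Upper-case with the Turkish i/ı rule applied first."""
    return "".join(TURKISH_UPPER.get(c, c.upper()) for c in s)
```

Under the default table "file" and "FILE" are the same name; under
the Turkish rule "file" upper-cases to "FİLE", so the two tables
disagree about which names collide.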

> There is stuff in the charset handling of every locale that does vary
> in windows, but it isn't the case table, it's the "valid characters"
> map used to determine what characters are allowed when converting
> strings into legacy multi-byte encodings. Even I don't think that the
> kernel will ever have to deal with that crap unless someone is foolish
> enough to port Samba into the kernel (several people have actually
> done that despite the insanity of the idea, but they all did an
> absolutely terrible job of it and certainly didn't take care to get
> all the charset handling right).
> 
> > How "correct" is Windows?
> 
> from my rather limited point of view I always have to assume that
> windows is "correct", unless I can show that its behaviour leads to
> data loss, a security hole or something equally extreme.

Well, we don't want to support a bunch of hacks to make it behave like
Windows if what Windows does doesn't make sense.  If so you should use
a metalayer where you canonicalize the filenames and don't store
"Makefile" on the disk; store "makefile" and keep the "real" filename
stashed elsewhere, perhaps an EA.
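
A toy in-memory model of that metalayer, assuming a hypothetical class
FoldedDirectory in which the dictionary key plays the role of the
canonical on-disk name and the stored value plays the role of the
"real" name kept in an EA (no real filesystem or xattr call is
involved):

```python
class FoldedDirectory:
    """Directory keyed by canonical name; display name stored alongside."""

    def __init__(self):
        self._entries = {}   # canonical name -> name as the user typed it

    def create(self, name):
        key = name.casefold()          # what would be stored on disk
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name      # what would be stashed in the EA

    def lookup(self, name):
        return self._entries[name.casefold()]

    def listing(self):
        return sorted(self._entries.values())
```

Lookups become a single exact-match operation on the canonical name,
which is what makes a userspace implementation easier.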

	-hpa


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: UTF-8 and case-insensitivity
  2004-02-18  0:16 ` Robert White
  2004-02-18  0:20   ` Linus Torvalds
@ 2004-02-18  2:48   ` tridge
  2004-02-18 20:56     ` Robert White
  1 sibling, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18  2:48 UTC (permalink / raw)
  To: Robert White
  Cc: linux-kernel, 'Linus Torvalds', 'Al Viro',
	'Neil Brown'

Robert,

Just about everything in your posting is either years out of date or
just totally wrong. 

 > OK, so I wrote the below, but then in the summary I realized that there was
 > a significant factor that doesn't fit in with the rest of the post.  Case
 > insensitivity, and more generally locale equivalence rules, is a security
 > nightmare.  Consider the number of different file names that "su" could map
 > to if you apply case insensitivity (4) and/or worse yet the various accents
 > and umlauts (ü, etc.) that are sort-equivalent to "u" in some locales.  The user
 > types "su" and runs "S(u-umlaut)" etc.

This is no different from the "stupid admin puts . in $PATH"
problem. Simple solutions:

 1) don't mount your root filesystem with case insensitive naming
 2) use a sane $PATH
 3) don't allow untrusted users to create files in your $PATH
 4) don't run bash in case insensitive mode if for some reason
    you can't do (1) or (2) or (3)

any of (1), (2) or (3) solves this. 

 > In point of fact the entire windows application space has a
 > singular active locale at any one time and there is a well-defined
 > but horrible layer of indirection where "long names" like "My
 > Documents" become "real names" like "MYDOCU~1".  Essentially every
 > windows file name is subject to a double-indirect file name
 > translation.  The first pass is the strcasecmp() locale-dependent
 > traversal of the "long name" list.  The second is the strcasecmp()
 > frozen-locale-spec-dependent traversal of (US Latin?) 8.3 file
 > naming standard list of media elements (files/directories).

this is just total crap. That might have been true for msdos and even
possibly win9x, but it's totally untrue for NTFS. There are enough
stupidities in windows without having to invent more.

NTFS is case insensitive at the filesystem level. In fact, it's
selectable whether it's case sensitive or case insensitive per-process
(a process can switch between the two models). The case mapping table
is built into the filesystem itself. That mapping has absolutely
*zero* to do with US Latin or any other legacy multi-byte encoding.

What you have done is the equivalent of stating that Linux can only do
14 character filenames, because once upon a time Linux had a
filesystem called minix. We've moved beyond that and so has windows. 

 > In point of fact, Windows is *not* "properly" case insensitive at the file
 > system level.  Use "dir /x" more often on your windows box to relive the
 > experience.  The "real" file names are mangled to good old 8.3 uppercase
 > internally(1).  You don't usually have to think about this, but if you have
 > ever lost the long-to-short file name mapping on a drive you know the hell
 > that ensues.  (see also iso9660.)

again, this is just complete crap. NTFS has had the ability to
completely disable 8.3 "alternative name" support for ages. Microsoft
is even starting to use this switch in their published benchmark
results, and I suspect it will become the default in a couple of
years. 

We've been through the same transition in Samba:

  - Samba 0.x only supported 8.3
  - Samba 1.x was oriented towards 8.3, but also supported long names
  - Samba 2.x and 3.x are oriented towards long names, and can disable 8.3
    names to some extent

by the time Samba 4.x comes out (I am working on it now) we may see a
significant number of sites disabling 8.3 completely. 

 > The thing is, to match this ersatz "functionality" on a system where more
 > than one locale may be used at the same time, you end up with a kind of
 > Cartesian product of user locales and filesystem native locales.  The cost
 > could get extreme and can only really be amortized if Linux were to declare
 > our own 8.3 style pronouncement for the character classes used for the
 > "real" file name storage (etc).

you are *way* out of date here. All recent windows apps use the UCS-2
interfaces, which provide a single charset encoding across all
locales. I've heard that they may be redefining this as UCS-16 to
allow for an even larger range of characters, although I haven't seen
this popping up on the wire yet (then again, I just might not have
noticed). I wish they had chosen UTF-8 instead of UCS-2, but at least
they chose something and got it into every part of the OS years ago.

 > Late stage case insensitivity isn't that hard to put in a linux application,
 > just crack open your file selection dialog boxes and have them use
 > strcasecmp() in all their select/sort logic.  Also then replace open() with
 > CaseOpen() which does a find/search operation before daring to
 > creat().

Have you read *any* of what I've been saying about how expensive this is??

 > That is, in every practical way, how Windows handles these problems.  It
 > just happens in some fairly interesting and hard-to-predict places depending
 > on context.

No, that is *not* how current versions of windows do things. 

 > So what was I saying... Oh yea...
 > 
 > -- Single Locale storage standard required to prevent multiplicative cost.

windows has this. Linux doesn't.

 > -- Not that hard to fake case insensitivity "when necessary".

ditto

 > -- Cheaper in CPU/Space to mix case.

ditto

 > -- Native file names in native locales simplifies administration and
 > expectations. (not elaborated above, but true.)

?? single locale storage makes this just a no-op

 > -- Case insensitivity and locale equivalence leads to uncertainties about
 > what/which file may be intended in a given context, which could often lead
 > to exploitable error.

and that is just a complete load of crap. Windows has had exploitable
bugs due to case insensitivity, but the cause was things like leaving
directories in the search path writeable by unprivileged users. It was
*not* due to anything fundamentally insecure about case-insensitive
names in filesystems. 

 > (1) The actual truth is a tad uglier than this, the media can have the 8.3
 > names stored in interesting ways, but essentially a "toupper()" is done on
 > every file name as it is retrieved and processed.  This cuts out a lot of
 > possibilities and leads to a lot of "tildes of shame" in even some of the
 > more harmless seeming name conflicts.

oh i get it, you're just a troll ....

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-17 22:27             ` Robin Rosenberg
@ 2004-02-18  3:02               ` tridge
  0 siblings, 0 replies; 123+ messages in thread
From: tridge @ 2004-02-18  3:02 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linus Torvalds, Kernel Mailing List, Al Viro

Robin,

 > Having to put up with the existence of Windows day in and out is
 > the reason I'm still on an eight-bit encoding.  Sorry for not
 > explaining the REAL problem, but only a partial problem. I need to
 > support all kinds of clients on Windows with protocols that convey
 > no character set info. With samba that's no problem. Having to put
 > up with a Unix world running ISO-8859-1 (or ISO-8859-15) is
 > another. Of course that means Linux machines also add to the
 > disturbance by not storing things as unicode. The real obstacle is
 > file names, everything else including content of files, I can
 > handle (I think). Maybe I'll find a solution for the filenames too,
 > but usually some hot discussions are needed for the brain to kick
 > into the right gear.

I suspect you are running Samba 2.x, which negotiated all that
multi-byte stuff on the wire. Samba 3.x does the same as windows
servers have done for years and negotiates UCS-2, which means that
every windows box that connects to it no matter what locale it is in
uses the same charset encoding as every other windows box.

There are still some legacy interfaces on the wire that use the old
encodings, but they are rare and getting rarer. To support these,
Samba3 juggles 4 character set encodings internally:

  * the unix-charset, which it uses to talk to the OS, and defaults to
    UTF-8

  * the windows wire charset, which is always UCS-2

  * the dos-charset for legacy parts of the protocol, which you have
    to configure in the samba config if you care about these legacy
    parts of the protocol (for example if you have older apps). It
    defaults to either CP850 or ASCII depending on what autoconf
    discovers. 

  * the display-charset which is used to put stuff on an admin's
    terminal for utilities like smbclient. The default depends on your
    LOCALE setting, or if nothing is set it uses ASCII.

Internally Samba3 only ever stores stuff in the "unix-charset"
encoding, which is usually UTF-8. It converts to the others as needed
when talking on the wire or to terminals.
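
A small sketch of that "store in one charset, convert at the edges"
arrangement, using Python's built-in codecs. The constant names and
the convert() helper are hypothetical, "utf-16-le" stands in for UCS-2
(the two agree for characters in the 16-bit range), and replacing
unconvertible characters with "?" is an assumption rather than a
statement about what Samba does:

```python
UNIX_CHARSET = "utf-8"        # how names are stored internally
WIRE_CHARSET = "utf-16-le"    # stand-in for UCS-2 on the wire
DOS_CHARSET = "cp850"         # legacy parts of the protocol
DISPLAY_CHARSET = "ascii"     # terminal with no locale set


def convert(data, src, dst):
    """Re-encode `data` from charset `src` to charset `dst`."""
    return data.decode(src).encode(dst, errors="replace")


stored = "Gr\u00f6\u00dfe.txt".encode(UNIX_CHARSET)   # "Größe.txt"
on_wire = convert(stored, UNIX_CHARSET, WIRE_CHARSET)
legacy = convert(stored, UNIX_CHARSET, DOS_CHARSET)
shown = convert(stored, UNIX_CHARSET, DISPLAY_CHARSET)
```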

 > I want to switch to UTF-8 to work better with the outside world,
 > but as things are people will start to take notice of what OS is
 > running in the shadows when they see the filename problems, and
 > start demanding Windows, and ...  You see; I'm not mean; I don't
 > want to do that to them (or myself),

If you use Samba3 then they will not notice what charset you are using
on your Linux filesystems. The windows clients will just see UCS-2.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  2:37         ` H. Peter Anvin
@ 2004-02-18  3:03           ` Linus Torvalds
  2004-02-18  3:14             ` H. Peter Anvin
  2004-02-18  4:08           ` tridge
  1 sibling, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:03 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
> Well, we don't want to support a bunch of hacks to make it behave like
> Windows if what Windows does doesn't make sense.

I'd disagree, for a very simple reason: case-insensitivity itself simply 
does not make sense, so the _only_ reason for having a bunch of hacks is 
literally to support windows file exports and nothing else.

I obviously agree with the fact that we should _not_ put those hacks into 
the VFS layer proper - we should keep them as a separate thing, and we 
should make it clear that it makes no sense _except_ for Windows 
compatibility.

Think of it as nothing more than a binary compatibility layer, the same 
way we have hooks to support "lcall 7,0" for binary compatibility with 
some silly (and much less interesting) x86 OSes through external modules.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  3:03           ` Linus Torvalds
@ 2004-02-18  3:14             ` H. Peter Anvin
  2004-02-18  3:27               ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
>>Well, we don't want to support a bunch of hacks to make it behave like
>>Windows if what Windows does doesn't make sense.
> 
> 
> I'd disagree, for a very simple reason: case-insensitivity itself simply 
> does not make sense, so the _only_ reason for having a bunch of hacks is 
> literally to support windows file exports and nothing else.
> 
> I obviously agree with the fact that we should _not_ put those hacks into 
> the VFS layer proper - we should keep them as a separate thing, and we 
> should make it clear that it makes no sense _except_ for Windows 
> compatibility.
> 
> Think of it as nothing more than a binary compatibility layer, the same 
> way we have hooks to support "lcall 7,0" for binary compatibility with 
> some silly (and much less interesting) x86 OSes through external modules.
> 

Well, this is also true :)  I still say it belongs in userspace.

For 100% bug-compatibility with Windows, though, it is probably
worthwhile to have the filename in the native filesystem be not what a
Windows user would see, but rather the normalized filename.  That makes
a userspace implementation much easier.

	-hpa


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-17 23:43         ` Linus Torvalds
@ 2004-02-18  3:26           ` tridge
  2004-02-18  5:33             ` H. Peter Anvin
  2004-02-18  7:54             ` Marc Lehmann
  0 siblings, 2 replies; 123+ messages in thread
From: tridge @ 2004-02-18  3:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Al Viro

Linus,

 > For example, if you only want to be compatible with Windows, you don't 
 > have to worry about UCS-4, you only have the UCS-2 part, which means that 
 > you can do a silly array-lookup based thing or something.

Even within UCS-2 land the case-mapping table is sparse as only some
characters have an upper/lower mapping. In fact, there are just 636
characters out of 64k that have an upper/lower case mapping that isn't
the identity. That is across *all* languages that windows uses for
UCS-2.

In Samba that's not sparse enough that it's worth saving the single
mmap of 128k to encode it sparsely in memory, but in UCS-4 land you
would obviously use a sparse mapping, and that mapping table would
probably be just a few k in size. If you allow for extents then I
expect you could encode it in a couple of hundred bytes.

(I experimented with using a sparse mapping in Samba, and it was a
slight loss on the machine I was testing on compared to just doing the
mmap, so I went with the mmap. Maybe someone else can do a better
sparse encoding than I did and actually get a win due to better cache
behaviour.)
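
For comparison, a sketch of the three layouts discussed above: the
flat 128k table, a sparse map, and extents. It assumes Python's own
Unicode data as the source of the mapping, so the number of
non-identity entries depends on the Unicode version and will not
match the Windows table. All function names are hypothetical:

```python
from array import array


def build_full_table():
    """Upper-case table indexed by 16-bit character, 2 bytes per entry."""
    table = array("H", range(0x10000))
    for cp in range(0x10000):
        up = chr(cp).upper()
        if len(up) == 1 and ord(up) < 0x10000:   # keep 1:1 mappings only
            table[cp] = ord(up)
    return table


def build_sparse(table):
    """Only the entries that are not the identity."""
    return {cp: up for cp, up in enumerate(table) if up != cp}


def build_extents(sparse):
    """Runs of consecutive characters sharing one offset: (start, len, delta)."""
    extents = []
    for cp in sorted(sparse):
        delta = sparse[cp] - cp
        if extents and sum(extents[-1][:2]) == cp and extents[-1][2] == delta:
            start, length, _ = extents[-1]
            extents[-1] = (start, length + 1, delta)
        else:
            extents.append((cp, 1, delta))
    return extents


def lookup(extents, cp):
    for start, length, delta in extents:
        if start <= cp < start + length:
            return cp + delta
    return cp   # everything else maps to itself


table = build_full_table()
sparse = build_sparse(table)
extents = build_extents(sparse)
full_bytes = len(table) * table.itemsize   # 64k entries * 2 bytes
```

The flat table costs one index operation per character; the extent
list is much smaller but needs a search, which is the trade-off
described above.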

 > Ugh. What a horrible kludge, and it won't work without "preparing" the 
 > filesystem at mount-time. I'd much rather leave the translation table in 
 > user space, and just give it as an argument to the "look up case 
 > insensitive" special thing.

The case mapping table must remain the same for the lifetime of the
mounted filesystem, otherwise you'd get chaos.  That's why tying it to
the filesystem (ie. hanging it off the superblock) makes sense.

 > The hard part would be negative dentries. We'd have to invalidate all
 > "case-insensitive" negative dentries when creating any new file in a
 > directory, and that would be something the generic VFS layer would have to 
 > know about

Right, the handling of negative dentries is the key. I don't think it's
quite as bad as you say though, as you can do this:

1) use a filesystem provided case-insensitive hash in the dcache. If
   the filesystem provided hash isn't case-insensitive then don't try
   to do case-insensitive lookups on this filesystem.

2) you only need to potentially invalidate entries in the same hash
   bucket as the name you are creating. 

3) Even better, you don't need to invalidate entries that don't have
   the same hash value (presuming your hash values are larger than
   your truncated hash keys).
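
The three steps above can be modelled in a few lines. This is a toy
userspace sketch, not dcache code: the class NegativeCache, the
function ci_hash and the use of CRC32 over the case-folded name are
all assumptions chosen for illustration.

```python
import zlib


def ci_hash(name):
    """Case-insensitive hash: equal for names differing only in case."""
    return zlib.crc32(name.casefold().encode("utf-8"))


class NegativeCache:
    """Cached 'this name does not exist' results for one directory."""

    def __init__(self, nbuckets=64):
        self.nbuckets = nbuckets
        self.buckets = [{} for _ in range(nbuckets)]   # name -> full hash

    def add_negative(self, name):
        h = ci_hash(name)
        self.buckets[h % self.nbuckets][name] = h

    def is_negative(self, name):
        return name in self.buckets[ci_hash(name) % self.nbuckets]

    def on_create(self, name):
        """Drop only entries in the same bucket with the same full hash."""
        h = ci_hash(name)
        bucket = self.buckets[h % self.nbuckets]
        stale = [n for n, nh in bucket.items() if nh == h]
        for n in stale:
            del bucket[n]
        return stale


cache = NegativeCache()
cache.add_negative("MAKEFILE")
cache.add_negative("readme")
dropped = cache.on_create("Makefile")
```

Creating "Makefile" has to drop the cached miss for "MAKEFILE", but
entries whose full hash differs are left alone, so a create does not
flush the whole directory.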

> and that might be unacceptable to Al.

yes, and I'm quite sympathetic to that point of view. I just want to
make sure that if we don't do this then we use honest reasons for not
doing it, not "that's impossible" reasons which are bogus when you
examine them.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  3:14             ` H. Peter Anvin
@ 2004-02-18  3:27               ` Linus Torvalds
  2004-02-18 21:31                 ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:27 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> Well, this is also true :)  I still say it belongs in userspace.

The thing is, I do agree with Tridge on one simple fact: it's very hard 
indeed to do atomic file operations from user space.

That's not necessarily a problem if samba is the only process accessing
the directories in question, since then samba could do all locking
internally and make sure that it never does anything inconsistent.

However, clearly people who run samba on a machine want to potentially 
_also_ export that same filesystem as a NFS volume, as a way to have both 
Windows and UNIX clients access the same data. And that pretty much means 
that other people _will_ access the directories, and that samba can't do 
its internal locking in that kind of environment.

This is why I am sympathetic to the need to add _some_ kind of support 
for this. And the only common place ends up being the kernel.

> For 100% bug-compatibility with Windows, though, it is probably
> worthwhile to have the filename in the native filesystem be not what a
> Windows user would see, but rather the normalized filename.  That makes
> a userspace implementation much easier.

Oh, absolutely. But that's something that samba can easily do internally: 
it can choose to just entirely ignore filenames that aren't normalized, or 
it can export it on the wire (obviously in the normalized UCS-2 format), 
and just consider non-normalized names to be another "case". In fact, 
that's what the naive implementation would do anyway, so that's not any 
added complexity.

(And samba clearly _cannot_ show the client a non-normalized name per se, 
since the smb protocol ends up using UCS-2).

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  2:37         ` H. Peter Anvin
  2004-02-18  3:03           ` Linus Torvalds
@ 2004-02-18  4:08           ` tridge
  2004-02-18 10:05             ` Robin Rosenberg
  1 sibling, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18  4:08 UTC (permalink / raw)
  To: hpa; +Cc: Kernel Mailing List

Hpa,

> So you're hosed if anyone uses characters outside the UCS-2 character
> set...

I've heard they are re-defining all those 16 bit numbers to be UCS-16
instead of UCS-2 for exactly that reason. This is rather similar to
the move in the Unix community to start using UTF-8.

Note that I am not at all proposing that we use UCS-2 in the Linux
kernel (except in places where you have to, like the NTFS
filesystem). I am proposing that the filesystems be able to offer a
case-insensitive hash function to the dcache, and I would expect that
this function would be based on UTF-8. 

The function might operate internally by converting UTF-8 to UCS-2, or
it might use a sparse mapping table. It would almost certainly have a
fast-path that looked first to see if there are any bytes with the top
bit set, and if there are none then it can do a really easy 7 bit
table based hash which would make this really fast for most users.
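
A sketch of such a hash with the 7-bit fast path, written in Python
for brevity. The function ci_hash, the table ASCII_FOLD, the choice of
a simple multiply-by-33 hash and the fallback for byte strings that
are not valid UTF-8 are all assumptions for illustration:

```python
# 256-entry table that lower-cases A-Z and leaves every other byte alone.
ASCII_FOLD = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in range(256))


def ci_hash(name):
    """Case-insensitive hash of a filename given as UTF-8 bytes."""
    if not any(b & 0x80 for b in name):
        # Fast path: no top bits set, so a byte-wise table lookup suffices.
        folded = name.translate(ASCII_FOLD)
    else:
        try:
            # Slow path: interpret as UTF-8 and fold via the Unicode tables.
            folded = name.decode("utf-8").casefold().encode("utf-8")
        except UnicodeDecodeError:
            folded = name   # not valid UTF-8: hash the raw bytes
    h = 5381
    for b in folded:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h
```

The caller only ever sees the integer, which matches the point that
the VFS and dcache would be consumers of the hash and nothing more.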

The point is that the kernel proper (the VFS and dcache in particular)
won't have to care how this hash works. They're just consumers of it. 

> There is a "standard" table, which is published by the Unicode
> consortium. 

The table used in windows is not exactly the same as the one on
unicode.org. Which is "correct" I will leave up to the pedants to
discuss, as all that Samba cares about is that it uses the same table
as w2k.

> However, the "standard" table isn't what you want in certain
> locales, e.g. Turkish.

I'd really like someone to confirm this for me by volunteering to run
a tool I provide on a Turkish NTFS filesystem or sending me a
compressed empty Turkish NTFS volume (please ask first by email - I
only need one of these). Up to now I have only ever seen the one 128k
table used across all windows locales. If this table really *is*
different in some locales then I need to know.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  3:26           ` tridge
@ 2004-02-18  5:33             ` H. Peter Anvin
  2004-02-18  7:54             ` Marc Lehmann
  1 sibling, 0 replies; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-18  5:33 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <16434.56190.639555.554525@samba.org>
By author:    tridge@samba.org
In newsgroup: linux.dev.kernel
> 
> In Samba that's not sparse enough that it's worth saving the single
> mmap of 128k to encode it sparsely in memory, but in UCS-4 land you
> would obviously use a sparse mapping, and that mapping table would
> probably be just a few k in size. If you allow for extents then I
> expect you could encode it in a couple of hundred bytes.
> 

If all you care about is the UTF-16-compatible range, you only need
1088K entries in your table; small enough that it can be reasonably
had in userspace.
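
The 1088K figure follows from the 17 planes of 64K code points that
UTF-16 can address. A quick check of the arithmetic, where the 4-byte
entry size is an assumption (any entry wide enough for a 21-bit code
point would do):

```python
PLANES = 17                      # BMP plus 16 supplementary planes
PER_PLANE = 0x10000              # 64K code points per plane

entries = PLANES * PER_PLANE     # 0x110000
entries_k = entries // 1024      # in units of K

BYTES_PER_ENTRY = 4              # assumed entry width
table_bytes = entries * BYTES_PER_ENTRY
table_mib = table_bytes / (1024 * 1024)
```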

> (I experimented with using a sparse mapping in Samba, and it was a
> slight loss on the machine I was testing on compared to just doing the
> mmap, so I went with the mmap. Maybe someone else can do a better
> sparse encoding than I did and actually get a win due to better cache
> behaviour.)

The thing is, you're probably only touching small parts of your table,
so the kernel and the CPU cache work quite well on the large table as
it is.

Wouldn't work in kernel space, though.

	-hpa

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  3:26           ` tridge
  2004-02-18  5:33             ` H. Peter Anvin
@ 2004-02-18  7:54             ` Marc Lehmann
  1 sibling, 0 replies; 123+ messages in thread
From: Marc Lehmann @ 2004-02-18  7:54 UTC (permalink / raw)
  To: linux-kernel

On Wed, Feb 18, 2004 at 02:26:54PM +1100, tridge@samba.org wrote:
> Even within UCS-2 land the case-mapping table is sparse as only some
> characters have an upper/lower mapping. In fact, there are just 636
> characters out of 64k that have an upper/lower case mapping that isn't
> the identity. That is across *all* languages that windows uses for
> UCS-2.

This is because scripts differentiating between upper and lower case are
rare exceptions in the world.

Unfortunately, commonly used exceptions, and still locale dependent.

Having a samba-helper kernel module that would contain this table (I am
confident that it's only a single table in existing versions of windows,
but maybe they improve that in future versions) could solve this problem.

I still wonder whether it ever can be made efficient, though.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  0:06         ` Neil Brown
@ 2004-02-18  9:47           ` Helge Hafting
  0 siblings, 0 replies; 123+ messages in thread
From: Helge Hafting @ 2004-02-18  9:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel

Neil Brown wrote:

>     1/ case gets lost, so if I save "My File", I will find "my file"
>     has been created (unless the application pretty-cases things, in
>     which case I can expect case to change anyway).
> 
>     2/ Files created by posix apps might be invisible.
> 
> 
>     To answer 2/, I'd say "tough".  If you want posix files to be

This is a bit worse than just "tough".
win32: rmdir foo
       directory not empty!
win32: there are _no_ files there?

Helge Hafting


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-18  4:08           ` tridge
@ 2004-02-18 10:05             ` Robin Rosenberg
  2004-02-18 11:43               ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Robin Rosenberg @ 2004-02-18 10:05 UTC (permalink / raw)
  To: tridge; +Cc: hpa, Kernel Mailing List

On Wednesday 18 February 2004 05.08, tridge@samba.org wrote:
> Hpa,
> 
> > So you're hosed if anyone uses characters outside the UCS-2 character
> > set...
> 
> I've heard they are re-defining all those 16 bit numbers to be UCS-16
> instead of UCS-2 for exactly that reason. This is rather similar to
> the move in the Unix community to start using UTF-8.

I've read it also: http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
"The fundamental representation of text in Windows NT-based operating systems is UTF-16"

-- robin

* Re: UTF-8 and case-insensitivity
  2004-02-18 10:05             ` Robin Rosenberg
@ 2004-02-18 11:43               ` tridge
  2004-02-18 12:31                 ` Robin Rosenberg
  0 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18 11:43 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: hpa, Kernel Mailing List

Robin,

 > I've read it also:
 > http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
 > "The fundamental representation of text in Windows NT-based
 > operating systems is UTF-16"

yep, in this thread I've been mistakenly using the term UCS-16 when I
should have said UTF-16 (ie. the variable length, 2 byte encoding).

Samba currently treats the bytes on the wire from windows as UCS-2 (a
2 byte fixed width encoding), whereas perhaps it should be treating
them as UTF-16. I should write a smbtorture test to detect the
difference and see what different versions of windows actually use.
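
A minimal sketch of the check such a test would hinge on, assuming the
wire data is little-endian 16-bit code units. The helper name
utf16le_has_surrogates is hypothetical and not part of Samba or
smbtorture. A strict UCS-2 peer never sends code units in the range
0xD800..0xDFFF, while a UTF-16 peer uses them in pairs for code points
above U+FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a UTF-16LE buffer contains any surrogate code unit
 * (0xD800..0xDFFF), 0 otherwise. Hypothetical helper for illustration;
 * a trailing odd byte is ignored. */
int utf16le_has_surrogates(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        /* Assemble one little-endian 16-bit code unit. */
        uint16_t unit = (uint16_t)(buf[i] | (buf[i + 1] << 8));
        if (unit >= 0xD800 && unit <= 0xDFFF)
            return 1;
    }
    return 0;
}
```

A filename containing such a unit can only be interpreted correctly as
UTF-16, so seeing one on the wire would answer the question for that
version of windows.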

luckily the new charset handling stuff in samba3 and samba4 will make
this easy to fix :-)

* Re: UTF-8 and case-insensitivity
  2004-02-18 11:43               ` tridge
@ 2004-02-18 12:31                 ` Robin Rosenberg
  2004-02-18 16:48                   ` H. Peter Anvin
  0 siblings, 1 reply; 123+ messages in thread
From: Robin Rosenberg @ 2004-02-18 12:31 UTC (permalink / raw)
  To: tridge; +Cc: hpa, Kernel Mailing List

On Wednesday 18 February 2004 12.43, tridge@samba.org wrote:
> Robin,
>  > I've read it also:
>  > http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
>  > "The fundamental representation of text in Windows NT-based
>  > operating systems is UTF-16"

I believe (please correct me if this is wrong) that Windows never actually
supported any of the UCS-2 code that were in conflict with UTF-16. The cost
of this operation was that some of the "private" code blocks of unicode 2.0, i.e. 
U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the 
UTF-16 encoding more or less backwards compatible with UCS-2. And it's 
UTF-16LE and UCS-2LE, but I suspect you knew that :-)

> yep, in this thread I've been mistakenly using the term UCS-16 when I
> should have said UTF-16 (ie. the variable length, 2 byte encoding).
> 
> Samba currently treats the bytes on the wire from windows as UCS-2 (a
> 2 byte fixed width encoding), whereas perhaps it should be treating
> them as UTF-16. I should write a smbtorture test to detect the
> difference and see what different versions of windows actually use.
See above, and most importantly the definition in Amendment 1 of the unicode 
3.0 standard.

> luckily the new charset handling stuff in samba3 and samba4 will make
> this easy to fix :-)
Happy man!

-- robin

* Re: UTF-8 and case-insensitivity
  2004-02-18 12:31                 ` Robin Rosenberg
@ 2004-02-18 16:48                   ` H. Peter Anvin
  2004-02-18 20:00                     ` H. Peter Anvin
  0 siblings, 1 reply; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-18 16:48 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: tridge, Kernel Mailing List

Robin Rosenberg wrote:
> 
> I believe (please correct me if this is wrong) that Windows never actually
> supported any of the UCS-2 code that were in conflict with UTF-16. The cost
> of this operation was that some of the "private" code blocks of unicode 2.0, i.e. 
> U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the 
> UTF-16 encoding more or less backwards compatible with UCS-2. And it's 
> UTF-16LE and UCS-2LE, but I suspect you knew that :-)
> 

Make that Unicode 1.0 and 1.1, and you're correct.

	-hpa

* Re: UTF-8 and case-insensitivity
  2004-02-18 16:48                   ` H. Peter Anvin
@ 2004-02-18 20:00                     ` H. Peter Anvin
  0 siblings, 0 replies; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:00 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <4033974F.4090706@zytor.com>
By author:    "H. Peter Anvin" <hpa@zytor.com>
In newsgroup: linux.dev.kernel
>
> Robin Rosenberg wrote:
> > 
> > I believe (please correct me if this is wrong) that Windows never actually
> > supported any of the UCS-2 code that were in conflict with UTF-16. The cost
> > of this operation was that some of the "private" code blocks of unicode 2.0, i.e. 
> > U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the 
> > UTF-16 encoding more or less backwards compatible with UCS-2. And it's 
> > UTF-16LE and UCS-2LE, but I suspect you knew that :-)
> > 
> 
> Make that Unicode 1.0 and 1.1, and you're correct.
> 

Err, that was supposed to be 1.1 and 2.0.

Unicode 1.1 reshuffled the private use range from Unicode 1.0, in
order to make room for surrogates in Unicode 2.0.
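
For reference, a short sketch of how a code point above U+FFFF is
split into a surrogate pair. The function name utf16_encode is an
assumption for illustration; the arithmetic is the standard UTF-16
encoding rule:

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if the code
 * point is itself a surrogate or lies beyond U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* same value as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Everything below U+10000 comes out as a single unit identical to its
UCS-2 value, which is why the two encodings are mostly compatible.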

UTF-16, what a horrible ugly hack.

	-hpa

* RE: UTF-8 and case-insensitivity
  2004-02-18  2:48   ` tridge
@ 2004-02-18 20:56     ` Robert White
  0 siblings, 0 replies; 123+ messages in thread
From: Robert White @ 2004-02-18 20:56 UTC (permalink / raw)
  To: tridge
  Cc: linux-kernel, 'Linus Torvalds', 'Al Viro',
	'Neil Brown'

I guess I don't get it...

tridge@samba.org [mailto:tridge@samba.org] said:

> NTFS is case insensitive at the filesystem level. In fact, it's
> selectable whether it's case sensitive or case insensitive per-process
> (a process can switch between the two models). The case mapping table
> is built into the filesystem itself. That mapping has absolutely
> *zero* to do with US Latin or any other legacy multi-byte encoding.

If the process selects whether it wants to be case insensitive or not how is
NTFS case insensitive "at the file-system level"?  Let me guess, they have
two complete paths through the logic?  Lots of DLLs?  Redundant conflicting
access semantics^Wfeatures?

> you are *way* out of date here. All recent windows apps use the UCS-2
> interfaces which provides a single charset encoding across all locales.

Which kind of directly supports where I said that to amortize the expense
Linux would have to set up its *own* canon about all file systems using the
same encoding.  The fact that I kept bringing up 8.3 was out of date.  Point
to you.  The point that picking an arbitrary encoding will lead Linux
getting out of date, or at least require a catastrophic realignment of every
program that deigns to open() any file anywhere, remains germane.

> Have you read *any* of what I've been saying about how expensive this is??

Yes, I understand the expense.  I have *paid* that expense in excruciating
detail on several occasions.  You want to have the kernel pay that expense
(in place of the application) as a fixed (amortized) cost or you want to
codify the file names with a standard encoding which would penalize the
entire system uniformly by raising the base cost to localize.

I appreciate the unbounded regex-like expense of iteratively applying
case/encoding insensitivity to a list of files.  I really don't want to pay
that cost in every application when I only need it at the front end.  Sue
me.

I also understand the pain of having to load any/each entire directory into
memory one blasted dirent at a time, and appreciate that since the kernel is
bulk loading them at the filesystem interface it seems (is) wasteful to have
to spoon them across the kernel/user-space interface.  I really do
understand.  (ASIDE: a bulk-fetch-directory-into-buffer call might be nice,
I haven't looked lately, but I presume none such exists.)

Your proposed "single locale storage" would penalize all us embedded systems
types with our space-sensitive embedded file systems and low-powered CPUs so
that the larger systems that _can_ afford to pay the cost only when necessary
don't have to.  Two bytes for one in every file name isn't a good trade-off
when you are dealing with a 32k file system image.

I kind of tried (and apparently utterly failed) to make the points about how
the Windows model worked and what it would cost by describing the basis for
the model, not the current implementation.  That is kind of why I *started*
the message with "(ok in point of "technically abstract truth")" and
mentioned later that what I was saying may have changed, but if so, it
changed in a way consistent with the model as described.

Windows has been digging themselves steadily out of the deep hole of
case-insensitive file name handling for years; which does nothing to entice
me to jump in and join them.  So bully for windows that they have, iteration
after iteration, managed to reduce the cost of their mistake.

Even *with* a standardized file name character set/encoding case
insensitivity would still be very bad-off in some important areas.  Consider
a simple security log.  "[date] user command xx satisfied with executive
Xx." etc.  I can think of *lots* of times when I would have to open a file
and then have to ask what the real name of the file I opened actually was.
"I asked for 'Bob', what did I get?" isn't a fun question to have to answer
*after* an open.  Yes, all this *can* be addressed by scrubbing paths, but
history suggests that this doesn't happen and the more the system does for
you, the more likely you are to miss something.

At the application level, since I have to sort file names for a picklist
anyway, I'd rather pay the case insensitivity cost while I was sorting.
It's actually cleaner and I am already paying to sort.
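
A minimal sketch of that application-level approach, assuming the
names are already in memory and that ASCII/locale folding via POSIX
strcasecmp() is good enough for the picklist. The function name
sort_names_nocase is hypothetical:

```c
#include <stdlib.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* qsort() comparator: compare two C strings ignoring case. */
static int cmp_nocase(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcasecmp(*sa, *sb);
}

/* Sort an array of names case-insensitively, in place. */
void sort_names_nocase(const char **names, size_t count)
{
    qsort(names, count, sizeof *names, cmp_nocase);
}
```

Note that strcasecmp() does not understand UTF-8 case folding, so this
only covers the simple case.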

I used to write SMB based applications (yes, I'm still way out of date) and
I appreciate the painful tit-for-tat non-streaming ugliness.  I feel your
pain at having to read a whole directory and doing the sort/search.  I
understand the race condition that occurs between the directory read and the
actual open where the file could be renamed or replaced.  I really do.

But "fixing" Linux so that it can share Windows' pain doesn't seem wise.

I can imagine a mod/module that would graft a localized and/or
case-insensitive companion hash onto the dirent(s) as the central facility
was doing its work.  I can imagine an alternate open that traversed this
alternate tree.  Creating sort of a giant look-aside into the current file
information tree. But I can't imagine any winning scenario that came from
making that alternate hash the normal access method.  Too many people and
projects would suddenly break.

(And I try not to troll, but I apparently have a knack for getting people's
dander in a bunch when I write.  I think it is because I write as I speak,
and the loss of tone and inflection in writing makes my turn-of-phrase come
off very priggish.  I'm not sure how to fix that.  /sigh 8-)

Rob.

* Re: UTF-8 and case-insensitivity
  2004-02-18  3:27               ` Linus Torvalds
@ 2004-02-18 21:31                 ` tridge
  2004-02-18 22:23                   ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18 21:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus,

 > The thing is, I do agree with Tridge on one simple fact: it's very hard 
 > indeed to do atomic file operations from user space.

I'm glad I'm making progress :)

The second basic fact that I think is relevant is that it's not
possible to do case-insensitive filesystem operations efficiently
without the filesystem having knowledge of the fact that you want a
case-insensitive lookup.

The reason for this is that modern filesystems do much better than an
O(n) linear scan for lookups in directories. They use a hash, or a
tree or whatever you like to take advantage of an ordering function on
the names in the directory. The days of linear scans in directories
are fast dwindling.

The only way you are going to avoid the linear scan for a
case-insensitive lookup is to make that ordering function
case-insensitive. The question really is whether we are willing to pay
the price in terms of complexity for doing that. I've tried to make
the claim in this thread that the code complexity cost of doing this
isn't really all that high, but it is definitely non-zero.

So your magic_open() proposal would probably be a help, and would
certainly reduce the amount of code we would need in userspace, but it
doesn't change the fundamental linear scan of directories problem at
all. 

That doesn't mean I won't take you up on the magic_open() proposal,
it's just that I'd need to try it to see if it's a sufficient win to
justify using it given the limitations.

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-18  0:20   ` Linus Torvalds
  2004-02-18  1:03     ` Robert White
@ 2004-02-18 21:48     ` Ville Herva
  1 sibling, 0 replies; 123+ messages in thread
From: Ville Herva @ 2004-02-18 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert White, tridge, 'Kernel Mailing List',
	'Al Viro', 'Neil Brown'

On Tue, Feb 17, 2004 at 04:20:26PM -0800, you [Linus Torvalds] wrote:
> 
> This is but one reason why I will _refuse_ to make case insensitivity
> magically start happening on regular "open()" etc calls.
> 
> You'd literally have to use a _different_ system call to do a 
> case-insensitive file open. 

Tongue-in-cheek:

  int Open(const char *pathname, int flags); ?




-- v --

v@iki.fi

* Re: UTF-8 and case-insensitivity
  2004-02-18 21:31                 ` tridge
@ 2004-02-18 22:23                   ` Linus Torvalds
  2004-02-18 22:28                     ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18 22:23 UTC (permalink / raw)
  To: tridge; +Cc: H. Peter Anvin, linux-kernel



On Thu, 19 Feb 2004 tridge@samba.org wrote:
> 
> The second basic fact that I think is relevant is that it's not
> possible to do case-insensitive filesystem operations efficiently
> without the filesystem having knowledge of the fact that you want a
> case-insensitive lookup.

That's not my problem. That is _your_ problem, and I don't care. I 
disagree violently with the notion that we would push this down to a 
filesystem level.

Sorry, but there are limits to how much we care about broken operating 
systems.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-18 22:23                   ` Linus Torvalds
@ 2004-02-18 22:28                     ` Linus Torvalds
  2004-02-18 22:50                       ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18 22:28 UTC (permalink / raw)
  To: tridge; +Cc: H. Peter Anvin, linux-kernel



On Wed, 18 Feb 2004, Linus Torvalds wrote:
> 
> That's not my problem. That is _your_ problem, and I don't care. I 
> disagree violently with the notion that we would push this down to a 
> filesystem level.
> 
> Sorry, but there are limits to how much we care about broken operating 
> systems.

Side note: this only matters for cold cache entries anyway, so I doubt 
you'll see any performance improvement on a file server from passing the 
brain damage down to the lower levels. 

And I bet the performance advantages of _not_ doing native case 
insensitivity are likely to dominate hugely.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-18 22:28                     ` Linus Torvalds
@ 2004-02-18 22:50                       ` tridge
  2004-02-18 22:59                         ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus,

 > And I bet the performance advantages of _not_ doing native case 
 > insensitivity are likely to dominate hugely.

This part I just don't understand at all. The proposed changes would
be extremely cheap performance wise as you are just replacing one hash
with another, and dealing with one extra context bit in the
dcache. There is no way that this could come anywhere near the cost of
doing linear directory scans.

The hash function would be slightly more expensive (when enabled), but
not much, especially when you put in the obvious optimisation for 7
bit characters. The string comparison function in a couple of places
would also become more expensive, but once again it would only be
expensive for case-insensitive processes and benefits from the 7 bit
optimisation so that the average case will only be very slightly more
expensive than the current function.
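
A rough userspace sketch of the hash and compare with the 7 bit fast
path, to make the cost argument concrete. Everything here is an
assumption for illustration: the names fold_byte, ci_hash and ci_equal
are hypothetical, FNV-1a is used only as a convenient hash and is not
the dcache's hash, and bytes with the high bit set are passed through
unfolded where a real version would consult a Unicode case table:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold one byte. 7-bit characters take the cheap path; anything else
 * is left untouched in this sketch (a real implementation would need
 * a UTF-8 decoder and a case-mapping table here). */
static unsigned char fold_byte(unsigned char c)
{
    if (c < 0x80)
        return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
    return c;
}

/* FNV-1a over the folded bytes, so names differing only in ASCII
 * case hash to the same value. */
uint32_t ci_hash(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= fold_byte((unsigned char)*name);
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the names are equal after folding, 0 otherwise. */
int ci_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (fold_byte((unsigned char)*a) != fold_byte((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* both must end at the same point */
}
```

For pure ASCII names the extra work over a case-sensitive hash is one
range check and at most one addition per byte.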

Fair enough that you don't want to do this for code complexity
reasons, but please don't tell me it would be slower than what we have
to do now. 

Try an strace of Samba trying to unlink() a non-existent file in a
large directory. It's enough to make you want to curl up and die :)

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-18 22:50                       ` tridge
@ 2004-02-18 22:59                         ` Linus Torvalds
  2004-02-18 23:09                           ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18 22:59 UTC (permalink / raw)
  To: tridge; +Cc: H. Peter Anvin, linux-kernel

On Thu, 19 Feb 2004 tridge@samba.org wrote:
> 
>  > And I bet the performance advantages of _not_ doing native case 
>  > insensitivity are likely to dominate hugely.
> 
> This part I just don't understand at all. The proposed changes would
> be extremely cheap performance wise as you are just replacing one hash
> with another, and dealing with one extra context bit in the
> dcache. There is no way that this could come anywhere near the cost of
> doing linear directory scans.

Why do you focus on linear directory scans?

They simply do not happen under any reasonable IO patterns. You look up 
names under the same name that they are on the disk. So the _only_ thing 
that should matter is the exact match.

The inexact matches should be a case of "make them correct". Screw 
performance. And tell people that they are slower.

Sure, I can imagine that MS would make some benchmark to show that case, 
but at that point I just don't care. 

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-18 22:59                         ` Linus Torvalds
@ 2004-02-18 23:09                           ` tridge
  2004-02-18 23:16                             ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-18 23:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus,

 > Why do you focus on linear directory scans?

Because a large number of file operations are on filenames that don't
exist. I have to *prove* they don't exist. That includes:

 * every file create. I have to prove there wasn't an existing file
   under a different case combination.

 * every rename. Again, I have to prove that the destination name
   doesn't exist.

 * every open of a non-existent name (*very* common, it's what MS
   office does all the time).

 etc etc.

If I had a single function that could quickly tell me that a file does
not exist in any case combination then I would be much better off.
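
To make the cost concrete, here is a minimal sketch of what the
userspace side has to do today. It is not Samba's actual code; the
name ci_lookup is hypothetical and strcasecmp() only folds
ASCII/locale case. The point is that a "not found" answer is only
available after readdir() has returned every entry:

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stddef.h>
#include <stdio.h>
#include <strings.h>   /* strcasecmp: ASCII/locale folding only */

/* Scan dirpath for an entry matching `wanted` without regard to case.
 * Returns 1 and copies the on-disk name into `found` on a match,
 * 0 if the whole directory was read without a match, -1 on error. */
int ci_lookup(const char *dirpath, const char *wanted,
              char *found, size_t found_size)
{
    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    int result = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcasecmp(ent->d_name, wanted) == 0) {
            snprintf(found, found_size, "%s", ent->d_name);
            result = 1;
            break;
        }
    }
    closedir(dir);
    return result;
}
```

Even when this returns 0, another process can create a matching name
before the caller acts on the answer, which is the race mentioned
earlier in the thread.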

 > They simply do not happen under any reasonable IO patterns. You look up 
 > names under the same name that they are on the disk. So the _only_ thing 
 > that should matter is the exact match.

nope, see above. The most common pattern of accesses involves doing a
full directory scan on every access.

 > Sure, I can imagine that MS would make some benchmark to show that case, 
 > but at that point I just don't care. 

It's not just "some benchmark". It's the normal use case.

Cheers, Tridge

* Re: UTF-8 and case-insensitivity
  2004-02-18 23:09                           ` tridge
@ 2004-02-18 23:16                             ` Linus Torvalds
  2004-02-19  8:10                               ` Jamie Lokier
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-18 23:16 UTC (permalink / raw)
  To: tridge; +Cc: H. Peter Anvin, linux-kernel

On Thu, 19 Feb 2004 tridge@samba.org wrote:
> 
>  > Why do you focus on linear directory scans?
> 
> Because a large number of file operations are on filenames that don't
> exist. I have to *prove* they don't exist.

And you only need to do that ONCE per name.

There is zero reason to do it over and over again, and there is zero 
reason to push case insensitivity deep into the filesystem.

Have you checked how many filesystems we have? Hint: 

	ls -l fs/ | grep '^d' | wc

The thing is, you have to realize that Windows-compatibility is very very 
much second-class. If you refuse to realize that, you can't argue 
effectively, because you are arguing for things that simply WILL NOT 
happen.

So instead of having this crazy windows-centric idea, I would suggest you 
try to come up with ways to make it easier for you. I can tell you already 
that it won't be everything you want or need, but quite frankly, your 
choice is between _nada_ and something reasonable.

So give it up. We're not making the same STUPID mistakes that Microsoft 
has done. 

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-17  5:11 ` Linus Torvalds
  2004-02-17  6:54   ` tridge
@ 2004-02-19  2:53   ` Daniel Newby
  1 sibling, 0 replies; 123+ messages in thread
From: Daniel Newby @ 2004-02-19  2:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Tridgell, Kernel Mailing List

Linus Torvalds wrote:
> So some variation of the interface
> 
> 	int magic_open(
> 		/* Input arguments */
> 		const char *pathname,
> 		unsigned long flags,
> 		mode_t mode,

What about making the pathname hold the alternative cases for each 
character, not just an exact string?  If Samba wanted to open
"A File.txt", it would do

     magic_open( "[a|A][ ][f|F][i|I][l|L][e|E][.][t|T][x|X][t|T]", ... )

The syntax shown is conceptual; the actual code would use binary 
packing.  Characters would be variable length to support UTF-8 and 
the like.

Userland would be responsible for making a useful pathname.  If it 
tried something like "[aL|P|#][m|m]", the kernel would cheerfully 
use it.  The only sanity checking would be that special characters 
like "/" and ":" cannot have alternatives.
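
As an illustration of the userland half, here is a sketch that builds
the conceptual text syntax for a plain ASCII name. The function name
build_alt_pattern is hypothetical, the output is the readable form
rather than the binary packing, and bytes outside 7-bit ASCII simply
get a single alternative because a real table lookup is out of scope:

```c
#include <ctype.h>
#include <stddef.h>

/* Expand `name` into the "[x|X]" alternatives syntax.
 * Returns the pattern length, or -1 if `out` is too small.
 * Caseless characters (including '/' and ':') and non-ASCII bytes
 * are emitted with a single alternative. */
int build_alt_pattern(const char *name, char *out, size_t out_size)
{
    size_t pos = 0;
    for (; *name; name++) {
        unsigned char c = (unsigned char)*name;
        int cased = c < 0x80 && isalpha(c);
        size_t need = cased ? 5 : 3;      /* "[a|A]" or "[c]" */
        if (pos + need + 1 > out_size)    /* +1 keeps room for NUL */
            return -1;
        out[pos++] = '[';
        if (cased) {
            out[pos++] = (char)tolower(c);
            out[pos++] = '|';
            out[pos++] = (char)toupper(c);
        } else {
            out[pos++] = (char)c;
        }
        out[pos++] = ']';
    }
    if (pos + 1 > out_size)
        return -1;
    out[pos] = '\0';
    return (int)pos;
}
```

In this readable form each cased character grows from one byte to
five.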

Pros:

1.  Filesystem names are looked up in kernel mode, where it might be 
efficient.  (Less grossly slow at least.)

2.  But the kernel doesn't care about encodings and character sets.

3.  No new kernel infrastructure needed.  (I hope?)  The case- 
insensitive system calls don't take a performance hit.

4.  The kernel can detect name collisions and decide what to do 
based on a flag.

5.  Lookup tables are totally in userland and outside locks.  Each 
app can use the table it finds appropriate.

6.  A naughty app can't deadlock the filesystem.

7.  Case-insensitive calls can be atomic, if you're willing to pay 
the performance price.  It's straightforward for magic_creat() to 
refuse to create collisions.

Cons:

1.  Looking up multiple alternatives is hairy.  (Not that the other 
approaches are much prettier.)

2.  Massive filenames would get turned into something *really* 
massive (five times as many bytes for a simple packing).  Does this 
break anything?

     -- Daniel Newby

* Re: UTF-8 and case-insensitivity
  2004-02-18 23:16                             ` Linus Torvalds
@ 2004-02-19  8:10                               ` Jamie Lokier
  2004-02-19 16:09                                 ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-19  8:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> >  > Why do you focus on linear directory scans?
> > 
> > Because a large number of file operations are on filenames that don't
> > exist. I have to *prove* they don't exist.
> 
> And you only need to do that ONCE per name.
> 
> There is zero reason to do it over and over again, and there is zero 
> reason to push case insensitivity deep into the filesystem.

Linus, while I agree with you wholeheartedly on everything else in
this thread - how can Samba only do that lookup ONCE per name if a
client is issuing many requests for non-existent opens or stats?

Example: A client has a search path for executables or libraries.

Each time SomeThing.DLL is looked up by the client, it will issue an
open() for each entry in the path, until it finds the file it wants.

For each request, Samba must readdir() every directory in the path
until the file is found.

If a directory doesn't change between requests, Samba can use dnotify
to cache the negative lookups.

However, if any change occurs in a directory, or if the directory is
not dnotify-capable, Samba is not allowed to cache these negative
results: It has to do the readdir() for _every_ request.

-- Jamie

* Re: UTF-8 and case-insensitivity
  2004-02-19  8:10                               ` Jamie Lokier
@ 2004-02-19 16:09                                 ` Linus Torvalds
  2004-02-19 16:38                                   ` Jamie Lokier
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 16:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: tridge, H. Peter Anvin, linux-kernel

On Thu, 19 Feb 2004, Jamie Lokier wrote:
> 
> Linus, while I agree with you wholeheartedly on everything else in
> this thread - how can Samba only do that lookup ONCE per name if a
> client is issuing many requests for non-existent opens or stats?

While I'm not willing to push case insensitivity deep into the
filesystems, I _am_ willing to entertain the notion of an extra flag to a
dcache entry that the regular VFS operations ignore (apart from clearing
it when they change anything and having to flush them under some
circumstances), which would basically be "this dentry has been judged
unique in a case-insensitive environment".

So assuming nobody else is touching the directory, the case-insensitive 
special module could create these kinds of dentries to its heart's content 
when it does a lookup.

> Example: A client has a search path for executables or libraries.
> 
> Each time SomeThing.DLL is looked up by the client, it will issue an
> open() for each entry in the path, until it finds the file it wants.
> 
> For each request, Samba must readdir() every directory in the path
> until the file is found.
> 
> If a directory doesn't change between requests, Samba can use dnotify
> to cache the negative lookups.
> 
> However, if any change occurs in a directory, or if the directory is
> not dnotify-capable, Samba is not allowed to cache these negative
> results: It has to do the readdir() for _every_ request.

But this is exactly what I _am_ willing to entertain: have some limited 
special logic inside the kernel (but outside the VFS layer proper), that 
allows samba to use special interfaces that avoid this.

For example, the rule can be that _any_ regular dentry create will 
invalidate all the "case-insensitive" dentries. Just to be simple about 
it. But if samba is the only thing that accesses a certain directory (or 
the directory is not written to, like / and /usr etc usually behave), the 
"windows hack" interface will be able to populate it with its fake 
dentries all it wants.
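
A toy userspace model of that simple rule, purely for illustration and
not kernel code. The generation counter and all names here are
assumptions of the sketch; nothing in this thread says how the dcache
would actually track it. Each "proven absent in any case" record
remembers the directory generation it was made under, and any regular
create bumps the generation, which invalidates every record at once:

```c
#include <stdbool.h>

/* Per-directory state: bumped on every regular (case-sensitive)
 * create. Hypothetical stand-in for whatever the kernel would keep. */
struct dir_state {
    unsigned long generation;
};

/* A cached "this name does not exist in any case combination". */
struct ci_negative {
    unsigned long seen_generation;
};

/* Record a negative result under the directory's current generation. */
void ci_negative_record(struct ci_negative *neg,
                        const struct dir_state *dir)
{
    neg->seen_generation = dir->generation;
}

/* A regular create invalidates all case-insensitive records. */
void dir_regular_create(struct dir_state *dir)
{
    dir->generation++;
}

/* The record is usable only if nothing was created since. */
bool ci_negative_valid(const struct ci_negative *neg,
                       const struct dir_state *dir)
{
    return neg->seen_generation == dir->generation;
}
```

In a directory that is rarely written to, the records stay valid and
repeated lookups of the same missing name cost one comparison.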

Or something like this. Basically, I'm convinced that the problem _can_ be 
solved without going deep into the VFS layer. Maybe I'm wrong. But I'd 
better not be, because we're definitely not going to screw up the VFS 
layer for Windows.

			Linus

* Re: UTF-8 and case-insensitivity
  2004-02-19 16:09                                 ` Linus Torvalds
@ 2004-02-19 16:38                                   ` Jamie Lokier
  2004-02-19 16:54                                     ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-19 16:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> For example, the rule can be that _any_ regular dentry create will 
> invalidate all the "case-insensitive" dentries. Just to be simple about 
> it.

If that's the rule, then with exactly the same algorithmic efficiency,
readdir+dnotify can be used to maintain the cache in userspace
instead.  There is nothing gained by using the helper module in that case.

It follows that a helper module is only useful if readdir+dnotify
isn't fast enough, and the invalidation rule has to be more selective.

(Although, maybe there are atomicity concerns I haven't thought of).

-- Jamie

* Re: UTF-8 and case-insensitivity
  2004-02-19 16:38                                   ` Jamie Lokier
@ 2004-02-19 16:54                                     ` Linus Torvalds
  2004-02-19 18:29                                       ` Jamie Lokier
  2004-02-19 19:08                                       ` UTF-8 and case-insensitivity Helge Hafting
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 16:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: tridge, H. Peter Anvin, linux-kernel



On Thu, 19 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > For example, the rule can be that _any_ regular dentry create will 
> > invalidate all the "case-insensitive" dentries. Just to be simple about 
> > it.
> 
> If that's the rule, then with exactly the same algorithmic efficiency,
> readdir+dnotify can be used to maintain the cache in userspace
> instead.  There is nothing gained by using the helper module in that case.

Wrong.

Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.

Think about it. Think about samba doing a "rename()" within the directory.

		Linus

* Re: UTF-8 and case-insensitivity
  2004-02-19 16:54                                     ` Linus Torvalds
@ 2004-02-19 18:29                                       ` Jamie Lokier
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
  2004-02-19 19:08                                       ` UTF-8 and case-insensitivity Helge Hafting
  1 sibling, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-19 18:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tridge, H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> > > For example, the rule can be that _any_ regular dentry create will 
> > > invalidate all the "case-insensitive" dentries. Just to be simple about 
> > > it.
> > 
> > If that's the rule, then with exactly the same algorithmic efficiency,
> > readdir+dnotify can be used to maintain the cache in userspace
> > instead.  There is nothing gained by using the helper module in that case.
> 
> Wrong.
> Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.

Ah, I didn't know you meant "_any_ regular dentry create (except for
Samba operations)".

To apply that rule, you either need alternate versions of rename() and
other file syscalls, or something akin to a process-specific flag (set
by the helper module) saying that this is a Samba process and dentry
creation _by this process_ shouldn't invalidate case-insensitive
dentries.

And if you have either of those, the bit of code which says "don't
invalidate case-insensitive dentries because this is a Samba process"
can just as easily say "don't send dnotify events to the current
process".

And once you've done that, it's easier just to add a DN_IGNORE_SELF
flag to dnotify meaning to ignore events caused by the current
process, and forget about the helper module.  That'd be useful for
other programs, too.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: UTF-8 and case-insensitivity
  2004-02-19 16:54                                     ` Linus Torvalds
  2004-02-19 18:29                                       ` Jamie Lokier
@ 2004-02-19 19:08                                       ` Helge Hafting
  1 sibling, 0 replies; 123+ messages in thread
From: Helge Hafting @ 2004-02-19 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, tridge, H. Peter Anvin, linux-kernel

On Thu, Feb 19, 2004 at 08:54:51AM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 19 Feb 2004, Jamie Lokier wrote:
> > Linus Torvalds wrote:
> > > For example, the rule can be that _any_ regular dentry create will 
> > > invalidate all the "case-insensitive" dentries. Just to be simple about 
> > > it.
> > 
> > If that's the rule, then with exactly the same algorithmic efficiency,
> > readdir+dnotify can be used to maintain the cache in userspace
> > instead.  There is nothing gained by using the helper module in that case.
> 
> Wrong.
> 
> Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.
> 
> Think about it. Think about samba doing a "rename()" within the directory.

Avoiding its own operations is a nice one.  Could dnotify pass
some information, such as the inode number involved to samba?
samba could then look up the filename in its cache and take a
closer look at that file only.  That would avoid losing the cache,
even in case of other processes intruding.

Helge Hafting

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:51                                           ` Linus Torvalds
@ 2004-02-19 19:48                                             ` H. Peter Anvin
  2004-02-19 20:04                                               ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-19 19:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tridge, Al Viro, Jamie Lokier, Kernel Mailing List

Linus Torvalds wrote:
> 
> On Thu, 19 Feb 2004, Linus Torvalds wrote:
> 
>>Basic approach: add two bits to the VFS dentry flags. That's all that is 
>>needed. Then you have two new system calls:
> 
>                         ^^^
> 
>> - set_bit_one(dirfd)
>> - set_bit_two_if_one_is_set(dirfd);
>> - check_or_create_name(dirfd, name, case_table_pointer, newfd);
> 
> 
>  [ deletia ]
> 
> 
>>Am I a super-intelligent bastard, or am I a complete nincompoop? You
>>decide.
> 
> 
> I think my lack of counting ability basically answers that question.
> 
> Damn.
> 

How about a compromise - super-intelligent complete nincompoop bastard?

[:^)

	-hpa


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 18:29                                       ` Jamie Lokier
@ 2004-02-19 19:48                                         ` Linus Torvalds
  2004-02-19 19:51                                           ` Linus Torvalds
                                                             ` (4 more replies)
  0 siblings, 5 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 19:48 UTC (permalink / raw)
  To: Tridge, Al Viro; +Cc: Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Ok, 
 I think I've got it. Here's an algorithm that will have "perfect" 
behaviour under normal circumstances as long as you've got enough memory. 

Admittedly the "you've got enough memory" part is a downside, but it's so 
_damn_ clean and simple that it is, I think, a reasonable trade-off. 
Besides, if you want good file serving numbers, you'd better have enough 
memory anyway.

Basic approach: add two bits to the VFS dentry flags. That's all that is 
needed. Then you have two new system calls:

 - set_bit_one(dirfd)
 - set_bit_two_if_one_is_set(dirfd);
 - check_or_create_name(dirfd, name, case_table_pointer, newfd);

The VFS rule is:
 - all new dentries start off with the two magic bits clear
 - whenever we shrink a dentry, we clear the two magic bits in the parent

and that is _all_ the VFS layer ever does. Even Al won't find this 
obnoxious (yeah, we might clear the bits after a timeout on things that 
need re-validation, but that's in the noise).

The "set_bit_one()" system call will set one of the magic bits (with the
dcache lock held) in the dentry that is pointed to by the file descriptor.
Nothing more.

The "set_bit_two_if_one_is_set()" system call will set the _other_ magic 
bit (with the dcache lock held) in the dentry, if the first bit is set. 
Otherwise it will just return.

Let's leave the "check_or_create_name()" thing for now, and see how we can
use this in user space (and realize that we only do this on cache failure,
so this is the "slow case"):

	set_bit_one(dirfd);
	lseek(dirfd, 0, SEEK_SET);
	while (readdir(dirfd, de)) {
		stat(de->d_name);
		/* .. might also compare the name here with whatever it is
		   working on right now.. */
	}
	set_bit_two_if_one_is_set(dirfd);

Notice what the above does? After the above loop, bit two will be set IFF 
the dentry cache now contains every single name in the directory. 
Otherwise it will be clear. Bit two will basically be a "dcache complete" 
bit.
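
The bit protocol described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: struct
dir_model and every model_*() function are hypothetical names, not kernel
code, and "shrinking" a child dentry is reduced to a single function call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two magic bits on a directory's dentry. */
struct dir_model {
	bool bit_one;	/* a fill has started and nothing was shrunk since */
	bool bit_two;	/* "dcache complete" */
};

/* VFS rule: shrinking any child dentry clears both bits in the parent. */
static void model_shrink_child(struct dir_model *d)
{
	d->bit_one = false;
	d->bit_two = false;
}

static void model_set_bit_one(struct dir_model *d)
{
	d->bit_one = true;
}

static void model_set_bit_two_if_one_is_set(struct dir_model *d)
{
	if (d->bit_one)
		d->bit_two = true;
}

/*
 * Simulate the readdir+stat fill loop over nr_entries names.  If evict_at
 * is a valid index, one child dentry is shrunk while that entry is being
 * processed.  Returns true iff the directory ends up marked complete.
 */
static bool model_fill(struct dir_model *d, int nr_entries, int evict_at)
{
	assert(nr_entries >= 0);
	model_set_bit_one(d);
	for (int i = 0; i < nr_entries; i++) {
		/* stat(de->d_name) would instantiate the child dentry here */
		if (i == evict_at)
			model_shrink_child(d);
	}
	model_set_bit_two_if_one_is_set(d);
	return d->bit_two;
}
```

A fill that is not disturbed leaves bit two set; a fill during which any
child is shrunk leaves it clear, so the caller knows to fall back to the
slow path again.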

Now, let's go to "check_or_create_name()", which can thus do:

 - for each name in the dcache name list, compare the dang thing 
   without case.
 - return "lookup succeeded" (the file descriptor of the thing it 
   successfully looked up) if a match with a positive dentry occurs.
 - check bit two, return -ENOCACHE if it was clear.
 - create the new dentry with the new name and the new file descriptor 
   inode, and return success.

Notice? Basically _ZERO_ changes to the VFS layer, together with basically 
perfect hot-cache-case behaviour.
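
As a rough userspace sketch of the "check_or_create_name()" steps listed
above (not the proposed system call itself): struct model_dir,
model_check_or_create() and MODEL_ENOCACHE are hypothetical, the cached
names are a plain array rather than the dcache child list, and POSIX
strcasecmp() stands in for the caller-supplied case table.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp(), POSIX */

#define MODEL_MAX_NAMES 16
#define MODEL_ENOCACHE 1000	/* stand-in: ENOCACHE is not a real errno */

enum model_result { MODEL_FOUND, MODEL_CREATED };

struct model_dir {
	const char *names[MODEL_MAX_NAMES];	/* cached positive names */
	size_t nr;
	bool complete;				/* bit two, "dcache complete" */
};

/*
 * Returns MODEL_FOUND or MODEL_CREATED on success and stores the matching
 * (or newly created) name in *match; returns a negative value on failure.
 */
static int model_check_or_create(struct model_dir *d, const char *name,
				 const char **match)
{
	assert(d && name && match);

	/* 1. Compare every cached name without regard to case. */
	for (size_t i = 0; i < d->nr; i++) {
		if (strcasecmp(d->names[i], name) == 0) {
			*match = d->names[i];
			return MODEL_FOUND;
		}
	}

	/* 2. A miss only proves absence if the cache is complete. */
	if (!d->complete)
		return -MODEL_ENOCACHE;

	/* 3. Safe to create: no name differing only in case exists. */
	if (d->nr == MODEL_MAX_NAMES)
		return -ENOSPC;
	d->names[d->nr] = name;
	*match = d->names[d->nr++];
	return MODEL_CREATED;
}
```

The ordering matters: a positive match can be returned even from an
incomplete cache, while a create is only allowed once the cache is known
to hold every name in the directory.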

Yeah, yeah, the above is probably glossing over a lot of issues (there's a
race if somebody does both the "readdir loop" and the "create" case at the
same time, so that would need a lock around it in user space, but please
realize that the readdir loop only happens if the "check_or_create()" 
thing fails, so the readdir loop should basically never happen in the 
hot-cache case).

And the above allows perfect behaviour even for new filenames that we have 
never seen before (ie a create of a new file with a random name). At least 
as long as the dcache for that directory remains "complete" (which it will 
do, until the kernel needs to throw something out).

Am I a super-intelligent bastard, or am I a complete nincompoop? You
decide.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
@ 2004-02-19 19:51                                           ` Linus Torvalds
  2004-02-19 19:48                                             ` H. Peter Anvin
  2004-02-19 20:05                                           ` viro
                                                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 19:51 UTC (permalink / raw)
  To: Tridge, Al Viro; +Cc: Jamie Lokier, H. Peter Anvin, Kernel Mailing List



On Thu, 19 Feb 2004, Linus Torvalds wrote:
> 
> Basic approach: add two bits to the VFS dentry flags. That's all that is 
> needed. Then you have two new system calls:
                        ^^^
>  - set_bit_one(dirfd)
>  - set_bit_two_if_one_is_set(dirfd);
>  - check_or_create_name(dirfd, name, case_table_pointer, newfd);

 [ deletia ]

> Am I a super-intelligent bastard, or am I a complete nincompoop? You
> decide.

I think my lack of counting ability basically answers that question.

Damn.

		Linus "complete nincompoop" Torvalds

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:48                                             ` H. Peter Anvin
@ 2004-02-19 20:04                                               ` Linus Torvalds
  0 siblings, 0 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 20:04 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Tridge, Al Viro, Jamie Lokier, Kernel Mailing List

On Thu, 19 Feb 2004, H. Peter Anvin wrote:
> 
> How about a compromise - super-intelligent complete nincompoop bastard?

Ok, but in the meantime I think I can save face by saying that you only 
need two system calls, by simply making a "lseek(fd, 0, SEEK_SET)" 
implicitly set the first bit. So then the "set second bit if first is set" 
just becomes a "dcache fill complete" notifier.

So I'll take half credit.

		Linus "super-complete bastard" Torvalds

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
  2004-02-19 19:51                                           ` Linus Torvalds
@ 2004-02-19 20:05                                           ` viro
  2004-02-19 20:23                                             ` Linus Torvalds
  2004-02-19 23:37                                           ` tridge
                                                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: viro @ 2004-02-19 20:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> The VFS rule is:
>  - all new dentries start off with the two magic bits clear
>  - whenever we shrink a dentry, we clear the two magic bits in the parent
> 
> and that is _all_ the VFS layer ever does. Even Al won't find this 
> obnoxious (yeah, we might clear the bits after a timeout on things that 
> need re-validation, but that's in the noise).
 
> Notice what the above does? After the above loop, bit two will be set IFF 
> the dentry cache now contains every single name in the directory. 
> Otherwise it will be clear. Bit two will basically be a "dcache complete" 
> bit.

What about dentry getting dropped in the middle of that loop _and_
another task setting the first bit again before the loop ends?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:05                                           ` viro
@ 2004-02-19 20:23                                             ` Linus Torvalds
  2004-02-19 20:32                                               ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 20:23 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> > The VFS rule is:
> >  - all new dentries start off with the two magic bits clear
> >  - whenever we shrink a dentry, we clear the two magic bits in the parent
> > 
> > and that is _all_ the VFS layer ever does. Even Al won't find this 
> > obnoxious (yeah, we might clear the bits after a timeout on things that 
> > need re-validation, but that's in the noise).
>  
> > Notice what the above does? After the above loop, bit two will be set IFF 
> > the dentry cache now contains every single name in the directory. 
> > Otherwise it will be clear. Bit two will basically be a "dcache complete" 
> > bit.
> 
> What about dentry getting dropped in the middle of that loop _and_
> another task setting the first bit again before the loop ends?

Hey, you snipped the part where I said that the application has to have 
its own locking around the loop and around the lookup to avoid races. 

We can avoid that requirement by using sequence numbers and making it a
bit more complex, but the simple version was for samba only (ie "only one
app that wants this").
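
One possible reading of the sequence-number idea, sketched as a userspace
model. The thread does not spell out the scheme, so this is an assumption:
struct seq_dir and the seq_*() helpers are hypothetical names, and counter
wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-directory state replacing the two magic bits. */
struct seq_dir {
	unsigned long shrink_seq;	/* bumped whenever a child is shrunk */
	unsigned long complete_seq;	/* shrink_seq when marked complete */
	bool ever_complete;
};

static void seq_shrink_child(struct seq_dir *d)
{
	assert(d != NULL);
	d->shrink_seq++;
}

/* Start of the readdir loop: remember the current sequence number. */
static unsigned long seq_fill_begin(const struct seq_dir *d)
{
	return d->shrink_seq;
}

/*
 * End of the readdir loop: mark the directory complete only if no child
 * was shrunk since this caller's own seq_fill_begin().
 */
static bool seq_fill_end(struct seq_dir *d, unsigned long token)
{
	if (token != d->shrink_seq)
		return false;
	d->complete_seq = token;
	d->ever_complete = true;
	return true;
}

static bool seq_is_complete(const struct seq_dir *d)
{
	return d->ever_complete && d->complete_seq == d->shrink_seq;
}
```

Because each filler carries its own token, a second task starting a fill
after a dentry was dropped cannot make the first task's stale fill look
complete, which is the race raised above.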

Realize that none of this makes the internal kernel (or filesystem) data
structures be wrong, so even if the app has a bug and doesn't do the right
locking, at worst that just results in problems for that application, not
for the rest of the system.

But yes, if we want to make others use this, we'd need to have the kernel 
actually support some kind of locking, probably by just making the whole 
readdir loop be inside the kernel itself (at which point we can use the 
inode semaphore for this).

The "dcache full" bit could be potentially useful regardless of any
case-ignorant operating system emulation crap, although I don't see any
really obvious applications (we could speed up regular "readdir()", but we
don't have the d_offset thing, so..)

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:23                                             ` Linus Torvalds
@ 2004-02-19 20:32                                               ` Linus Torvalds
  2004-02-19 20:45                                                 ` viro
  2004-02-19 20:48                                                 ` Jamie Lokier
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 20:32 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
> > 
> > What about dentry getting dropped in the middle of that loop _and_
> > another task setting the first bit again before the loop ends?
> 
> Hey, you snipped the part where I said that the application has to have 
> its own locking around the loop and around the lookup to avoid races. 

[ That, btw, implies that we do need to make the "set bit one" a system 
  call of its own, so that somebody else's "lseek(fd, 0, SEEK_SET)" wouldn't 
  mess up. Mea culpa. ]

Anyway, if we're willing to make some other changes to the VFS layer, we 
could make all of this a bit more efficient by _not_ requiring the actual 
filesystem lookup to take place.

If we had a flag that allowed a dentry to not have a d_inode pointer, but
still _not_ be considered automatically negative, then we could just make
a loop that fills the dcache directly from the readdir() data inside the
kernel, without calling down to the filesystem to look up the inode.

That would save a _lot_ of memory - quite often we'd only need the dentry 
itself.

That would require a third bit in the VFS dentry flags (something like
D_DENTRY_LIKELY_POSITIVE), and would require that "d_lookup()" not just 
assume that a dentry without an inode is always negative (check the new 
flag, and if so, do the filesystem lookup when the lookup actually 
happens).

Doesn't look _too_ bad, and considering the potential memory savings (and 
not having to seek around the disk to look up the inode data), it would 
probably be worth thinking about at least as a "second stage".

So then we could have a dcache that is fully populated, even though the
actual inode data hasn't been loaded yet.

Comments?

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:32                                               ` Linus Torvalds
@ 2004-02-19 20:45                                                 ` viro
  2004-02-19 21:26                                                   ` Linus Torvalds
  2004-02-19 20:48                                                 ` Jamie Lokier
  1 sibling, 1 reply; 123+ messages in thread
From: viro @ 2004-02-19 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 12:32:55PM -0800, Linus Torvalds wrote:
> Anyway, if we're willing to make some other changes to the VFS layer, we 
> could make all of this a bit more efficient by _not_ requiring the actual 
> filesystem lookup to take place.
> 
> If we had a flag that allowed a dentry to not have a d_inode pointer, but
> still _not_ be considered automatically negative, then we could just make
> a loop that fills the dcache directly from the readdir() data inside the
> kernel, without calling down to the filesystem to look up the inode.
> 
> That would save a _lot_ of memory - quite often we'd only need the dentry 
> itself.

> So then we could have a dcache that is fully populated, even though the
> actual inode data hasn't been loaded yet.
> 
> Comments?

*Ugh*

	That will cause all sorts of nastiness for filesystems that _have_
case-insensitive lookups.  Remember the crap we had to deal with to avoid
multiple dentries for directory?  It will come back, AFAICS.

	Another thing I really don't like is that we now get real lookups
on hashed dentry.  That potentially changes a lot and can lead to very
interesting results for some filesystems.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:32                                               ` Linus Torvalds
  2004-02-19 20:45                                                 ` viro
@ 2004-02-19 20:48                                                 ` Jamie Lokier
  2004-02-19 21:30                                                   ` Linus Torvalds
  1 sibling, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-19 20:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> Comments?

Yes: The slow part of my brain thinks dnotify with a new flag
DN_IGNORE_SELF, meaning don't notify for things done by the process
which is watching, would provide equivalent functionality.

That is:

Samba looks up a name:

    1. Look up cache entry in Samba's cache; fails.
    2. Try exact name; fails.
    3. Open directory.
    4. Register dnotify (DN_IGNORE_SELF | DN_CREATE | DN_RENAME | DN_DELETE).
    5. readdir(); no case-insensitive match.
    6. Stores negative cache entry in Samba.

Future lookups just succeed in Samba's cache.

Negative cache entries are simply invalidated whenever a dnotify is
received for that directory.

Samba already maintains a cache for positive entries, so this would be
very little logic to add.
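
A minimal sketch of the negative cache described in the steps above,
modelled entirely in userspace. Everything here is an assumption made for
illustration: struct neg_cache and the neg_*() helpers are hypothetical,
DN_IGNORE_SELF is only a proposal, and the notification is simulated by a
function call carrying the pid of whoever changed the directory instead
of a real dnotify signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp(), POSIX */

#define NEG_MAX 8

struct neg_cache {
	const char *names[NEG_MAX];	/* names known not to exist */
	size_t nr;
	int watcher_pid;		/* process that registered the watch */
	bool ignore_self;		/* models the proposed DN_IGNORE_SELF */
};

/* True if "name" is cached as absent, compared without regard to case. */
static bool neg_lookup(const struct neg_cache *c, const char *name)
{
	for (size_t i = 0; i < c->nr; i++)
		if (strcasecmp(c->names[i], name) == 0)
			return true;
	return false;
}

/* Step 6: remember that a full readdir() found no match for "name". */
static void neg_store(struct neg_cache *c, const char *name)
{
	assert(c->nr < NEG_MAX);
	c->names[c->nr++] = name;
}

/* A create/rename/delete happened in the directory, done by actor_pid. */
static void neg_dir_event(struct neg_cache *c, int actor_pid)
{
	if (c->ignore_self && actor_pid == c->watcher_pid)
		return;		/* our own operation: no notification */
	c->nr = 0;		/* anyone else: drop all negative entries */
}
```

Note that with self-notifications suppressed, the watcher would have to
update its own cache whenever it creates or renames a file itself.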

In what way is your two bit proposal better?

- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:45                                                 ` viro
@ 2004-02-19 21:26                                                   ` Linus Torvalds
  2004-02-19 21:38                                                     ` Linus Torvalds
  2004-02-19 21:45                                                     ` Linus Torvalds
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 21:26 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> > So then we could have a dcache that is fully populated, even though the
> > actual inode data hasn't been loaded yet.
> > 
> > Comments?
> 
> *Ugh*
> 
> 	That will cause all sorts of nastiness for filesystems that _have_
> case-insensitive lookups.  Remember the crap we had to deal with to avoid
> multiple dentries for directory?  It will come back, AFAICS.

No no. Look at how this works:
 - only one dentry actually exists. It is marked "tentative", which means 
   that nobody will use it as-such without doing a lookup on it. It has 
   zero impact on aliases etc, because it's really just a place-holder: it
   doesn't point to any inodes at all, it only says "there may or may not
   be a file here"

   NOTE! This dentry is in no way case-insensitive. It happens to have 
   _exactly_ the contents (and hash) that the readdir entry had, but it
   has no meaning outside of that.

 - each caller of "__d_lookup()" will have to check if it's a tentative 
   dentry and basically ignore it if so.

   There aren't that many of them, and I think it all comes together in
   "do_lookup()", which may be the _only_ place that actually cares right
   now. Look at how that works right now:

		dentry = __d_lookup(..);
		if (!dentry)
			goto needs_lookup;	/* This case will allocate a whole 
					new dentry and use that for lookup */

		/* NEW CASE! */
		if (dentry->d_flags & D_TENTATIVE)
			goto needs_lookup_with_this_dentry;
	done:
		path->mnt = mnt;
		path->dentry = dentry;
		return 0;

	/*
	 * NEW CASE!!
	 *
	 * Unhash the tentative one, and look up a real one.
	 */
	needs_lookup_with_this_dentry:
		d_drop(dentry);
		dentry = NULL;

	/* OLD REGULAR CASE */
	needs_lookup:
		...

In other words, neither the low-level filesystem NOR anything else really 
ever sees the tentative dentry (the above is the really stupid approach: a 
slightly more clever one will avoid the "real_lookup()" alloc_dentry() 
thing and just use the tentative dentry after having unhashed it and 
verified that it's the only user).

See? Nobody actually ever sees the "raw dentry". They all go through 
__d_lookup(), and the rule would be:

 - if "d_lookup()" sees a tentative dentry, it will just unhash it and 
   drop it (it has the dcache lock, so it can do that)
 - all callers of "__d_lookup()" will have to check for D_TENTATIVE, and 
   decide what to do with it. I think there are exactly _three_ callers, 
   and one of them is d_lookup() itself.

See? Very very minimal impact that I can see (really, the biggest part
would be to do the dentry re-use in the better version of "do_lookup()" -
that would mean some re-organization, but maybe that optimization isn't
even worth it).

Or did I miss anything?

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 20:48                                                 ` Jamie Lokier
@ 2004-02-19 21:30                                                   ` Linus Torvalds
  2004-02-20  0:00                                                     ` Jamie Lokier
  2004-02-20  1:39                                                     ` Junio C Hamano
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 21:30 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Jamie Lokier wrote:
> 
> Yes: The slow part of my brain thinks dnotify with a new flag
> DN_IGNORE_SELF, meaning don't notify for things done by the process
> which is watching, would provide equivalent functionality.

Basically, yes. However, I can tell you that directory name caching is 
damn hard, and the kernel does it better than anybody else. 

The hardest part of caching is not filling the cache - it's knowing when 
to release it. In other words, forget the filling part, and think about 
the replacement policy (balancing between the page cache, the directory 
cache, and regular pages). The kernel already has that.

Besides, I really think that we can do this with basically just a few 
lines of code in the kernel (apart from the actual case comparison, which 
I'm not even going to worry about - that's totally independent of the 
cache handling itself, and I don't care about how to write a 
"windows_equivalent_strncasecmp()").

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:26                                                   ` Linus Torvalds
@ 2004-02-19 21:38                                                     ` Linus Torvalds
  2004-02-19 21:45                                                     ` Linus Torvalds
  1 sibling, 0 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 21:38 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
> 
> No no. Look at how this works:
>  - only one dentry actually exists.

That was really badly phrased. There can be _millions_ of these things,
but they are all "unique" - they have zero impact on each other, and have
no linkages. They never shadow any existing dentries (ie when we create
these, we'd obviously never create a tentative dentry with the same name
as an existing _valid_ dentry), and they are never visible to the 
filesystem. 

So it's not that "only one dentry" exists, but that this tentative
dentry only exists as a unique marker of "a dentry of this name _may_
exist".

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:45                                                     ` Linus Torvalds
@ 2004-02-19 21:43                                                       ` viro
  2004-02-19 21:53                                                         ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: viro @ 2004-02-19 21:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> So we'd see very quickly if these tentative dentries were to escape 
> outside of __d_lookup().

Ahem...  You'll see them (at least) in dcache pruning codepaths.  And
those will dereference inodes...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:26                                                   ` Linus Torvalds
  2004-02-19 21:38                                                     ` Linus Torvalds
@ 2004-02-19 21:45                                                     ` Linus Torvalds
  2004-02-19 21:43                                                       ` viro
  1 sibling, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 21:45 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List



On Thu, 19 Feb 2004, Linus Torvalds wrote:
> 
> See? Nobody actually ever sees the "raw dentry". They all go through 
> __d_lookup(), and the rule would be:
> 
>  - if "d_lookup()" sees a tentative dentry, it will just unhash it and 
>    drop it (it has the dcache lock, so it can do that)
>  - all callers of "__d_lookup()" will have to check for D_TENTATIVE, and 
>    decide what to do with it. I think there are exactly _three_ callers, 
>    and one of them is d_lookup() itself.

Actually, I've got a better setup: instead of having a D_TENTATIVE flag in 
the dentry flags, just do

	#define TENTATIVE_INODE ((struct inode *) 1)

and just have "dentry->d_inode = TENTATIVE_INODE" for the dentries that 
were filled directly from "readdir()" data.

This not only avoids using a bit in the dentry flags, but it pretty much
guarantees that everybody is forced to use them correctly. It would be
very hard to have a buggy user: the dentry will clearly not be a negative
dentry (since d_inode is not NULL), but if anybody ever uses it as a
positive dentry, you'll get a nice and immediate oops.

So we'd see very quickly if these tentative dentries were to escape 
outside of __d_lookup().
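
The sentinel idea can be illustrated with a small userspace model. The
struct definitions below are simplified stand-ins rather than the kernel's
struct dentry and struct inode, and dentry_state(), dentry_resolve() and
demo_fs_lookup() are hypothetical helpers. Unlike the "unhash and look up
a fresh dentry" approach, this sketch resolves the tentative entry in
place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures. */
struct inode { unsigned long i_ino; };
struct dentry { const char *d_name; struct inode *d_inode; };

/* Never dereferenced: marks "filled from readdir, not yet looked up". */
#define TENTATIVE_INODE ((struct inode *) 1)

enum dentry_state { DENTRY_NEGATIVE, DENTRY_TENTATIVE, DENTRY_POSITIVE };

static enum dentry_state dentry_state(const struct dentry *d)
{
	if (d->d_inode == NULL)
		return DENTRY_NEGATIVE;
	if (d->d_inode == TENTATIVE_INODE)
		return DENTRY_TENTATIVE;
	return DENTRY_POSITIVE;
}

/* Toy "filesystem": exactly one file, called "exists". */
static struct inode demo_inode = { 42 };

static struct inode *demo_fs_lookup(const char *name)
{
	return strcmp(name, "exists") == 0 ? &demo_inode : NULL;
}

/*
 * Turn a tentative dentry into a positive or negative one by asking the
 * filesystem; dentries in any other state are returned unchanged.
 */
static struct inode *dentry_resolve(struct dentry *d,
				    struct inode *(*fs_lookup)(const char *))
{
	assert(d && fs_lookup);
	if (dentry_state(d) == DENTRY_TENTATIVE)
		d->d_inode = fs_lookup(d->d_name);
	return d->d_inode;
}
```

Any code that skips the state check and dereferences d_inode on a
tentative entry would fault on address 1, which is the "immediate oops"
property described above.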

			Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:43                                                       ` viro
@ 2004-02-19 21:53                                                         ` Linus Torvalds
  2004-02-19 22:21                                                           ` David Lang
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-19 21:53 UTC (permalink / raw)
  To: viro; +Cc: Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> > So we'd see very quickly if these tentative dentries were to escape 
> > outside of __d_lookup().
> 
> Ahem...  You'll see them (at least) in dcache pruning codepaths.  And
> those will dereference inodes...

Yea, you be right. Many of those paths would not need to care about
TENTATIVE at all, so using the d_inode thing would make them uglier, I
agree. Maybe the flag is better after all (and it really should be pretty
well contained by just checking all __d_lookup callers, so it should be 
hard to get it wrong, but maybe I've forgotten some path).

We could do it both ways - do the TENTATIVE_INODE thing as a debugging 
thing at first to make sure none of these dentries escape, and then remove 
it (and the unnecessary tests in the pruning paths) once everybody is 
convinced that it is working correctly.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:53                                                         ` Linus Torvalds
@ 2004-02-19 22:21                                                           ` David Lang
  0 siblings, 0 replies; 123+ messages in thread
From: David Lang @ 2004-02-19 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: viro, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

surfacing from normal lurk mode.

it looks to me like this could also end up allowing pruning lots of the
existing negative dcache entries.

if you fully cache the directory (and set the new flag) then any lookup
that isn't found is known to not exist.

is it worth freeing the negative dcache entries when you set the flag to
say that you have things fully cached? if so this could end up being a
significant memory savings.

David Lang

On Thu, 19 Feb 2004, Linus Torvalds wrote:

> Date: Thu, 19 Feb 2004 13:53:43 -0800 (PST)
> From: Linus Torvalds <torvalds@osdl.org>
> To: viro@parcelfarce.linux.theplanet.co.uk
> Cc: Tridge <tridge@samba.org>, Jamie Lokier <jamie@shareable.org>,
>      H. Peter Anvin <hpa@zytor.com>,
>      Kernel Mailing List <linux-kernel@vger.kernel.org>
> Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
>
>
>
> On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> >
> > On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> > > So we'd see very quickly if these tentative dentries were to escape
> > > outside of __d_lookup().
> >
> > Ahem...  You'll see them (at least) in dcache pruning codepaths.  And
> > those will dereference inodes...
>
> Yea, you be right. Many of those paths would not need to care about
> TENTATIVE at all, so using the d_inode thing would make them uglier, I
> agree. Maybe the flag is better after all (and it really should be pretty
> well contained by just checking all __d_lookup callers, so it should be
> hard to get it wrong, but maybe I've forgotten some path).
>
> We could do it both ways - do the TENTATIVE_INODE thing as a debugging
> thing at first to make sure none of these dentries escape, and then remove
> it (and the unnecessary tests in the pruning paths) once everybody is
> convinced that it is working correctly.
>
> 		Linus

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
  2004-02-19 19:51                                           ` Linus Torvalds
  2004-02-19 20:05                                           ` viro
@ 2004-02-19 23:37                                           ` tridge
  2004-02-20  0:02                                             ` Linus Torvalds
  2004-02-20  2:30                                           ` Theodore Ts'o
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
  4 siblings, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-19 23:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Linus,

I'm probably just thicker than a complete set of superman comics, and
probably haven't had enough coffee this morning, but I'm still trying
to understand exactly how much this is going to gain us.

If I understand it, your suggestion gives us:

 - a way of telling if a directory is fully cached in the dcache
 - a way of scanning that full cache with whatever braindead
   comparison algorithm we want

At first I didn't understand the scanning part at all, because I
didn't realise that you could scan just the dentries associated with a
single directory. Al was kind enough to correct me on that.

What your proposal doesn't give us is case-insensitive indexing into
the dcache. The reason the dcache is such a great thing in Linux is
that it is indexed by name, so you rarely do any scanning at all, and
even in the case where you have never seen the name before we avoid
scanning because fast filesystems also use an "indexed by name"
scheme. Now maybe I'm just over-obsessed about this scanning stuff and
I'd need some profiles to see how much it would cost (although the
cost as the directory size gets really large seems obvious).

The really interesting part of your proposal is that it opens up the
possibility of a coherence mechanism between a cache that is indexed
by some Windows-like scheme and the real dcache. If those two bits
could be used by the windows_braindead module to determine if its own
separately indexed cache was current then we'd really be getting
somewhere. 

If we didn't do the separate cache at all, then your proposal still
should hugely reduce the number of times we ask the filesystem for a
list of files in the directory, although as those calls are already
cached at the block device level what I suppose it does is move the
cache up a level. I don't have a clear idea of how much faster it is
to do this scanning in the dcache versus in the filesystem in the
hot-cache case, so I am not clear on how much this wins us. I'm
prepared to believe it could be quite significant though.

I really need more coffee-and-think time on this, plus maybe some
quick and dirty profiling tests to see what the various costs are
like. 

While I'm here I should point out that I'm thinking of the 2.7/2.8
kernel (or even 3.0) for any change, not 2.6. Maybe that's obvious
anyway, but the corresponding userspace changes in Samba definitely
won't be happening in Samba 3.0, so this is a Samba 4.0 thing, which
is a fair way off. This means we've got plenty of time to try some
experiments and see what schemes really help.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:30                                                   ` Linus Torvalds
@ 2004-02-20  0:00                                                     ` Jamie Lokier
  2004-02-20  0:17                                                       ` Linus Torvalds
  2004-02-20  1:39                                                     ` Junio C Hamano
  1 sibling, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20  0:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> The hardest part of caching is not filling the cache - it's knowing when 
> to release it. In other words, forget the filling part, and think about 
> the replacement policy (balancing between the page cache, the directory 
> cache, and regular pages). The kernel already has that.

It's worth noting that Samba already has a dcache in userspace: tridge
mentioned that positive case-insensitive lookups are cached, so the
replacement policy is already skewed by that.

Will your proposal eliminate Samba's positive cache as well?

> Besides, I really think that we can do this with basically just a few 
> lines of code in the kernel (apart from the actual case comparison, which 
> I'm not even going to worry about - that's totally independent of the 
> cache handling itself, and I don't care about how to write a 
> "windows_equivalent_strncasecmp()").

What I like about my idea is that no windows_equivalent_strncasecmp()
needs to go into the kernel.  I.e. no need for a Samba-specific module.

The other thing I like is that DN_IGNORE_SELF would be useful for
other applications too.

What I like about your idea is that it'll be a bit faster, the dcache
replacement policy will be nicer, and if there are atomicity
conditions we haven't thought of, it'll be easier to handle them.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 23:37                                           ` tridge
@ 2004-02-20  0:02                                             ` Linus Torvalds
  2004-02-20  0:16                                               ` tridge
  2004-02-20  1:07                                               ` H. Peter Anvin
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20  0:02 UTC (permalink / raw)
  To: Andrew Tridgell
  Cc: Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004 tridge@samba.org wrote:
> 
> What your proposal doesn't give us is case-insensitive indexing into
> the dcache.

Correct.

And I've told you OVER AND OVER again that you have a choice: better than 
what you do now, or nothing. Whining about the fact that Windows is 
stupid will only make me convinced that there is no point to even helping 
samba, since what you really want is WNT.

If what you want is WNT, then go away. That's not what I'm offering. And 
it's not going to _be_ that I offer. 

I offer you _sane_ VFS semantics, with some accelerators for your insane 
needs. If that isn't enough, then please just stop bothering me. 

Comprende?

>	 The reason the dcache is such a great thing in Linux is
> that it is indexed by name, so you rarely do any scanning at all

And that is still true of any exact matches.

If you have a fuzzy lookup of a name that does exist, but doesn't match,
or you have a new name that simply doesn't _exist_ in the dcache, then you
will have to scan all dentries. But now you can scan them in-memory by
following the pointers directly without having to index through the
filesystem data structures and worrying about disk reads. And you can
optimize that to do a fast mismatch (ie in most cases you can probably
look at the first one or two characters and determine immediately that
there is no match).
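
A rough userspace sketch of that scan, assuming the names of one directory
are already sitting in memory as a plain array (standing in for the dentries
hanging off a directory). fold(), ci_equal() and ci_scan() are hypothetical
names, and the ASCII-only tolower() is just a placeholder for whatever
Windows-compatible comparison would really be needed:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only case fold; a placeholder for the real comparison rules. */
static int fold(unsigned char c)
{
	return tolower(c);
}

/* Case-insensitive equality of two NUL-terminated names (ASCII only). */
static int ci_equal(const char *a, const char *b)
{
	while (*a && *b) {
		if (fold((unsigned char)*a) != fold((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	/* Equal only if both strings ended at the same point. */
	return *a == *b;
}

/*
 * Linear scan over the cached names of one directory.  Returns the index
 * of the first case-insensitive match, or -1 if there is none.
 */
static int ci_scan(const char *const *names, size_t n, const char *want)
{
	assert(names != NULL && want != NULL);
	int first = fold((unsigned char)want[0]);

	for (size_t i = 0; i < n; i++) {
		/* Fast mismatch: reject on the first character alone. */
		if (fold((unsigned char)names[i][0]) != first)
			continue;
		if (ci_equal(names[i], want))
			return (int)i;
	}
	return -1;
}
```

The scan is still O(n) in the size of the directory, which is the cost
being argued about here; the first-character check only makes each step
cheap.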

The only way to avoid that is to make the hash weaker. Which I'm not 
willing to do: I'm not willing to make the _proper_ lookups go slower 
because of some insane crap generated by Microsoft.

In other words, put up or shut up. If you are only going to repeat your 
whine about how you want the Linux VFS layer to look like Windows, I'm 
simply NOT INTERESTED.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:02                                             ` Linus Torvalds
@ 2004-02-20  0:16                                               ` tridge
  2004-02-20  0:37                                                 ` Linus Torvalds
  2004-02-20  1:07                                               ` H. Peter Anvin
  1 sibling, 1 reply; 123+ messages in thread
From: tridge @ 2004-02-20  0:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Linus,

 > And I've told you OVER AND OVER again that you have a choice: better than 
 > what you do now, or nothing. Whining about the fact that Windows is 
 > stupid will only make me convinced that there is no point to even helping 
 > samba, since what you really want is WNT.

yes, I've acknowledged that. I know you aren't going to give me the
ideal solution, I'm just exploring how far this is from the ideal and
trying to get a feel for how much it actually gains us compared to
what we do now. 

If I understand things correctly then I think that your suggestion
probably does gain us a fair bit, but I think that biting my head off
for exploring just how much the gain is versus the current code and
the "ideal" code is a bit much.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:00                                                     ` Jamie Lokier
@ 2004-02-20  0:17                                                       ` Linus Torvalds
  2004-02-20  0:24                                                         ` Linus Torvalds
  2004-02-20  0:46                                                         ` Jamie Lokier
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20  0:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004, Jamie Lokier wrote:
> 
> Will your proposal eliminate Samba's positive cache as well?

Samba has to work on different kernels, so they'll have to have their own 
code anyway. Whether they want to turn it off or not if better 
alternatives are found is up to them. Right now it appears that what 
Tridge wants is a WNT dcache, and since he's not going to get it, I guess 
the whole discussion is moot.

> What I like about my idea is that no windows_equivalent_strncasecmp()
> needs to go into the kernel.  I.e. no need for a Samba-specific module.
> 
> The other thing I like is that DN_IGNORE_SELF would be useful for
> other applications too.

I agree. It might even be acceptable not as a new flag, but as a 
modification to existing behaviour. I can't imagine that a file manager is 
all that interested in seeing the changes it itself does be reported back 
to it. And I don't really know of any other uses of dnotify.

(That said, clearly it's better to just have a new flag, since that way 
there is no possibility of anything breaking).

On the other hand, even with a nice dnotify infrastructure, you simply
_cannot_ get absolute atomicity guarantees. Because by the time you
actually execute the "mv" operation, another process may create a new file
with the "same" name (ie different name, but comparing the same ignoring
case) on another CPU. By the time you get the dnotify, it's too late, and
the move will have happened, and undoing the operation (and hiding it from
the client) may well be impossible - possibly because another process has
created a file with the old name.

NOTE! Even an in-kernel implementation fundamentally cannot fix this race 
on something like NFS. So the in-kernel version would only help for local 
filesystems that the kernel has exclusive write access to.

			Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:17                                                       ` Linus Torvalds
@ 2004-02-20  0:24                                                         ` Linus Torvalds
  2004-02-20  0:30                                                           ` Trond Myklebust
                                                                             ` (4 more replies)
  2004-02-20  0:46                                                         ` Jamie Lokier
  1 sibling, 5 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20  0:24 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
> 
> I agree. It might even be acceptable not as a new flag, but as a 
> modification to existing behaviour. I can't imagine that a file manager is 
> all that interested in seeing the changes it itself does be reported back 
> to it. And I don't really know of any other uses of dnotify.

I take that back. Even a file manager may very well be interested in moves 
that it does itself - most of them have some sort of multi-window view 
capability, and if they use dnotify, they might well be using it to keep 
the different views coherent.

So yes, a new flag would likely be required. 

That said, who actually _uses_ dnotify? The only time dnotify seems to 
come up in discussions is when people complain how badly designed it is, 
and I don't think I've ever heard anybody say that they use it and 
that they liked it ;)

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:24                                                         ` Linus Torvalds
@ 2004-02-20  0:30                                                           ` Trond Myklebust
  2004-02-20  0:54                                                           ` Jamie Lokier
                                                                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 123+ messages in thread
From: Trond Myklebust @ 2004-02-20  0:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Thu, 19/02/2004 at 16:24, Linus Torvalds wrote:
> That said, who actually _uses_ dnotify? The only time dnotify seems to 
> come up in discussions is when people complain how badly designed it is, 
> and I don't think I've ever heard anybody say that they use it and 
> that they liked it ;)

We use it in the idmapper and RPCSEC_GSS userland daemons in order to
track which NFS clients are up and running (by peeking inside the
rpc_pipefs). Works fine there...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:16                                               ` tridge
@ 2004-02-20  0:37                                                 ` Linus Torvalds
  2004-02-20  1:26                                                   ` tridge
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20  0:37 UTC (permalink / raw)
  To: Andrew Tridgell
  Cc: Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004 tridge@samba.org wrote:
> 
> yes, I've acknowledged that. I know you aren't going to give me the
> ideal solution, I'm just exploring how far this is from the ideal and
> trying to get a feel for how much it actually gains us compared to
> what we do now. 

I suspect the only way to know that is to code something up.

The kernel side (with the full "readdir()" loop and a TENTATIVE flag etc)  
is not likely to be that many lines of code, but it's definitely something
where the person who writes those lines needs to really understand the
kernel code to get anywhere at all. And it's in an "interesting" area of
the kernel, so you have to be really careful. And you'd need somebody who
is used to samba too, in order to make the path component walk on the user
space side work right with the new interface. So..

I can try to see if I can write something - I'd not do the actual
comparison function, but I have the rough framework in my mind. I won't
get to it for another day or two, at _least_, though.

With that set up, getting numbers and doing a kernel profile to see where
the time goes is probably not hard - again, if you have a samba setup with
benchmarks already set up. I just don't know anybody who knows both pieces
of the puzzle..

(This, btw, was the big problem with pthreads too. The 2.6.x threading
improvements were things that had been discussed for years, but it took
until Ingo, Uli and Roland actually sat down and looked at both the user
side and the kernel side before anything really happened).

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:17                                                       ` Linus Torvalds
  2004-02-20  0:24                                                         ` Linus Torvalds
@ 2004-02-20  0:46                                                         ` Jamie Lokier
  2004-02-23 10:13                                                           ` Tim Connors
  1 sibling, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20  0:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:

> I can't imagine that a file manager is all that interested in seeing
> the changes it itself does be reported back to it.

No, but any file manager that's made of libraries where one thread is
showing the window and another thread is doing operations will care -
unless they explicitly communicate.  Right now they might, or they
might not.

> (That said, clearly it's better to just have a new flag, since that way 
> there is no possibility of anything breaking).

Quite.

> And I don't really know of any other uses of dnotify.

High performance web template cache:
   dnotify is used to invalidate cached info about prerequisite files,
   so that quite a lot of files can be used to create a page, the
   output is cached, and validating the cache for each page request
   is actually zero cost (because dnotify is a signal, so validating is
   just checking that you didn't receive the signal).

Accelerated Make:
   dnotify is used to invalidate cached stat() results between runs.
   A daemon runs in the background to retain the information.
   (Communicating with the daemon is only faster than calling stat()
   if the retained information includes precomputed dependencies,
   pre-parsed Makefiles and such.)

Java VM accelerator:
   Let the JVM precompile class files to a machine-specific code and
   keep that in a mmaped file between invocations.  When a new JVM
   process is started, it checks that all the class files for a
   particular program haven't changed; a daemon using dnotify can
   speed up this check, or even provide a stronger guarantee, if you
   don't trust stat() mtimes.

Fontconfig accelerator:
   When a program using fontconfig (e.g. any GNOME program and many
   others) starts, it calls stat() on every font file in ~/.fonts.
   This is lovely to use because you just drop font files in there,
   but the stat() calls are slow when you have a very large number.
   A daemon using dnotify can monitor this and allow a program to
   skip those calls.

Maildir accelerator:
   Similar to fontconfig, but on mail directories for validating
   the cached summary information about all mails in a folder.

Shared cache directory:
   A program stores files in a shared cache, e.g. like a web browser.
   dnotify can be used to monitor the cached files, to invalidate
   in-memory data structures parsed from them if other programs are
   modifying the same cached file data structures.

Shared database in a file (like Berkeley DB et al):
   dnotify is used to notice when another process modifies the file.
   You still need to lock and write updates, but you can avoid reading
   and parsing the database file between queries and use calculated
   in-memory data for the queries, if you know the file hasn't been
   changed by another process.

One thing you can't do is real-time updatedb+locate, because of the
need to have an open file descriptor for every directory that's monitored.
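
The "validating is just checking that you didn't receive the signal"
pattern looks roughly like this. It is a minimal sketch, assuming Linux
with dnotify available through fcntl(F_NOTIFY); install_handler(),
watch_dir() and cache_still_valid() are hypothetical helper names, and the
choice of signal is up to the caller:

```c
#define _GNU_SOURCE		/* for F_NOTIFY, F_SETSIG and the DN_* flags */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; the application clears it when it rebuilds its cache. */
static volatile sig_atomic_t dir_changed;

static void on_dnotify(int sig)
{
	(void)sig;
	dir_changed = 1;
}

/* Install the handler for the signal that dnotify will deliver. */
static int install_handler(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_dnotify;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, NULL);
}

/*
 * Start watching one directory.  Returns the open descriptor, which has to
 * stay open for as long as the watch is wanted, or -1 on failure.
 */
static int watch_dir(const char *path, int sig)
{
	int fd = open(path, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETSIG, sig) < 0 ||
	    fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE |
				DN_RENAME | DN_MULTISHOT) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Validation: the cache is good unless a notification has arrived. */
static int cache_still_valid(void)
{
	return !dir_changed;
}
```

A caller would do something like install_handler(SIGRTMIN) followed by
watch_dir() on each prerequisite directory (one descriptor per directory,
which is the limitation noted above), and then test cache_still_valid() on
every request.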

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:24                                                         ` Linus Torvalds
  2004-02-20  0:30                                                           ` Trond Myklebust
@ 2004-02-20  0:54                                                           ` Jamie Lokier
  2004-02-20  0:57                                                           ` tridge
                                                                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20  0:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> That said, who actually _uses_ dnotify? The only time dnotify seems to 
> come up in discussions is when people complain how badly designed it is, 
> and I don't think I've ever heard anybody say that they use it and 
> that they liked it ;)

I've not used it, but I have plenty of ideas (see the other email),
and one big project I'm working on that intends to use it, which isn't
a file manager.

I must say it is badly designed and I don't like it :)

Actually the design is ok because it's easy to understand.  It is just
a bit limiting for more adventurous purposes than a file manager.

Something that fitted nicely into the epoll style of event queue, and
also allowed whole directory trees to be monitored, and told exactly
what changed, and let you take out leases on files that caught writing
as well as opens, and worked even across reboots or with no program
running (using generation numbers of some kind).... that I'd like a
tiny bit more :)

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:24                                                         ` Linus Torvalds
  2004-02-20  0:30                                                           ` Trond Myklebust
  2004-02-20  0:54                                                           ` Jamie Lokier
@ 2004-02-20  0:57                                                           ` tridge
  2004-02-20  1:07                                                           ` Paul Wagland
  2004-02-20 13:31                                                           ` Chris Wedgwood
  4 siblings, 0 replies; 123+ messages in thread
From: tridge @ 2004-02-20  0:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, viro, H. Peter Anvin, Kernel Mailing List

Linus,

 > That said, who actually _uses_ dnotify? The only time dnotify seems to 
 > come up in discussions is when people complain how badly designed it is, 
 > and I don't think I've ever heard anybody say that they use it and 
 > that they liked it ;)

This may not be the example you want, but Samba uses it and it is
absolutely vital to good performance.

The common situation is this:

  - 1000 windows drones sitting in an office with their windows
    explorer windows open on their home directory on the server, but
    not doing any real work.

  - all those windows boxes ask the Samba server "let me know when the
    directory changes so I can refresh this window that nobody is
    looking at anyway"

  - before we had dnotify samba had to continuously poll all those
    directories, looking for a change in a checksum of the directory
    contents. We had tunable parameters for how often to poll, whether
    to poll etc, but basically it sucked, because windows users with
    nothing better to do ask "why doesn't it behave just like NT"

  - now samba just watches for dnotify events

The other situation where it really sucked was for windows developers
using visual C. The builtin make-like system in that braindead tool
actually got compilations wrong if the file server didn't tell it that
a file in its directory had changed. It would say "nothing to do" when
you do a build and we hadn't polled recently enough. Cue the angry
windows developers and people screaming to put a real NT box in
instead of Samba.

So dnotify has been a huge bonus for Samba, I just wish a few more
non-Samba tools would use it so it doesn't run the risk of being
removed because only Samba cares. It sucks being the ugly duckling,
and knowing that nobody is ever going to tell you you're really a
swan :)

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:24                                                         ` Linus Torvalds
                                                                             ` (2 preceding siblings ...)
  2004-02-20  0:57                                                           ` tridge
@ 2004-02-20  1:07                                                           ` Paul Wagland
  2004-02-20 13:31                                                           ` Chris Wedgwood
  4 siblings, 0 replies; 123+ messages in thread
From: Paul Wagland @ 2004-02-20  1:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, viro, Tridge, H. Peter Anvin, Kernel Mailing List


On Fri, 2004-02-20 at 01:24, Linus Torvalds wrote:

> That said, who actually _uses_ dnotify? The only time dnotify seems to 
> come up in discussions is when people complain how badly designed it is, 
> and I don't think I've ever heard anybody say that they use it and 
> that they liked it ;)

Well, in the desktop land both kde and gnome use fam, and fam can use
dnotify as its backend to watch files. Server side, courier can use fam
as well, so although there are not a lot of programs that use dnotify
directly, there are a lot that can use it indirectly, and will fall back
to polling on a non-dnotify system. I don't know if the famd people like
it or not though ;-)

Cheers,
Paul



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:02                                             ` Linus Torvalds
  2004-02-20  0:16                                               ` tridge
@ 2004-02-20  1:07                                               ` H. Peter Anvin
  1 sibling, 0 replies; 123+ messages in thread
From: H. Peter Anvin @ 2004-02-20  1:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Tridgell, Al Viro, Jamie Lokier, Kernel Mailing List

Linus Torvalds wrote:
> 
> The only way to avoid that is to make the hash weaker. Which I'm not 
> willing to do: I'm not willing to make the _proper_ lookups go slower 
> because of some insane crap generated by Microsoft.
> 

Or, to be fair, have a secondary set of hash entries (effectively a
parallel dcache, which would optimize on normalized names instead of
true names.)

A multi-dcache approach seems scary as hell, though...

	-hpa



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:37                                                 ` Linus Torvalds
@ 2004-02-20  1:26                                                   ` tridge
  0 siblings, 0 replies; 123+ messages in thread
From: tridge @ 2004-02-20  1:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Linus,

 > I can try to see if I can write something - I'd not do the actual
 > comparison function, but I have the rough framework in my mind. I won't
 > get to it for another day or two, at _least_, though.

ok, that would be excellent. Please don't think there is a huge rush
on this though, whatever we come up with won't be in wide use for a
year at least, and probably longer. The sort of changes in Samba we
need really are most suited for the NTVFS layer in Samba4, and we may
even end up with a ntvfs_linux backend completely separate from the
ntvfs_posix backend that we would use on other unixes. That won't
happen overnight (heck, ntvfs_posix doesn't even exist yet for
Samba4). 

 > With that set up, getting numbers and doing a kernel profile to see where
 > the time goes is probably not hard - again, if you have a samba setup with
 > benchmarks already set up. I just don't know anybody who knows both pieces
 > of the puzzle..

I'm happy to provide the load and profiling tools, probably using
something like dbench but with a different load and a proportion of
case-insensitive lookups (dbench is currently case-sensitive).

One minor thing about your design. You talked about making the new
call actually do the open(). It would be better to just return the
stat information and the real (case sensitive) name. Windows clients
do stat() like calls (Trans2_qpathinfo) roughly 10x as much as they
actually do open() like calls.

We also like to avoid doing open() whenever possible because of the
silly "lose all your locks on close" problem. I know that we've
discussed before fixing that locking stupidity, but even so I think
just returning the stat() info and real name is easiest. Samba needs
to know the name anyway, as there are calls in SMB that ask "what is
the name of the file for this file descriptor I've got open", and we
really should return the case-preserving name.
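
For comparison, here is roughly what that lookup costs in user space
today. This is only a sketch, not Samba's actual code: ci_stat() is a
hypothetical name, strcasecmp() stands in for the real Windows matching
rules, and the gap between readdir() and stat() is the kind of race an
in-kernel call could close.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <sys/stat.h>

/*
 * Hypothetical userspace stand-in for the proposed call: find an entry in
 * `dir` whose name matches `want` ignoring case, and return its real
 * (case-preserved) name plus its stat information, without opening it.
 * Returns 0 on success, -1 if nothing matches or on error.
 */
static int ci_stat(const char *dir, const char *want,
		   char *real, size_t real_len, struct stat *st)
{
	DIR *d;
	struct dirent *de;
	int ret = -1;

	assert(dir && want && real && st);
	d = opendir(dir);
	if (!d)
		return -1;

	/* Full directory scan: this is the expensive part. */
	while ((de = readdir(d)) != NULL) {
		char path[4096];
		int len;

		if (strcasecmp(de->d_name, want) != 0)
			continue;

		/* First match wins; give up if the buffers are too small. */
		len = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (len < 0 || len >= (int)sizeof(path))
			break;
		if (strlen(de->d_name) >= real_len)
			break;

		/* The entry may have been renamed since readdir() saw it. */
		if (stat(path, st) == 0) {
			strcpy(real, de->d_name);
			ret = 0;
		}
		break;
	}
	closedir(d);
	return ret;
}
```

Every miss in Samba's own cache pays for the whole loop, and a lookup of a
name that does not exist always scans the entire directory.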

Cheers, Tridge

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 21:30                                                   ` Linus Torvalds
  2004-02-20  0:00                                                     ` Jamie Lokier
@ 2004-02-20  1:39                                                     ` Junio C Hamano
  2004-02-20 12:54                                                       ` Jamie Lokier
  1 sibling, 1 reply; 123+ messages in thread
From: Junio C Hamano @ 2004-02-20  1:39 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, viro, Tridge, H. Peter Anvin, Kernel Mailing List

>>>>> "JL" == Jamie Lokier <jamie@shareable.org> writes:

JL> The other thing I like is that DN_IGNORE_SELF would be useful for
JL> other applications too.

While I agree in principle that DN_IGNORE_SELF would be quite an
effective and clean way to solve the Samba problem and also
applicable to other situations, I also imagine that the value of
DN_IGNORE_SELF would be greatly affected by how the "self" is
defined.  A server implementation may be multithreaded, and you
may or may not want to count all your threads in that server
process as self; another may be implemented as one master
process spawning multiple worker bee processes, in which case it
would be more convenient if all the processes in one process
group are counted as self.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
                                                             ` (2 preceding siblings ...)
  2004-02-19 23:37                                           ` tridge
@ 2004-02-20  2:30                                           ` Theodore Ts'o
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
  4 siblings, 0 replies; 123+ messages in thread
From: Theodore Ts'o @ 2004-02-20  2:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> Let's leave the "check_or_create_name()" thing for now, and see how we can
> use this in user space (and realize that we only do this on cache failure,
> so this is the "slow case"):
> 
> 	set_bit_one(dir);
> 	lseek(dir, 0, SEEK_SET);
> 	while (readdir(dir, de)) {
> 		stat(de->d_name);
> 		.. might also compare the name here with whatever it is 
> 		   working on right now..
> 	}
> 	set_bit_two_if_one_is_set(dirfd);
> 
> Notice what the above does? After the above loop, bit two will be set IFF 
> the dentry cache now contains every single name in the directory. 
> Otherwise it will be clear. Bit two will basically be a "dcache complete" 
> bit.

Why do this in user space?  The set_bit_one() and
set_bit_two_if_one_is_set() calls can't really be used for anything
else, so why not let check_or_create_name() do the above loop if
necessary to populate all of the entries in the dentry cache?

That way we only expose one system call (check_or_create_name()), and
we let the internal dcache flags be an internal implementation detail.
It will also make it much easier to avoid races.
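
A rough model of the suggestion above, with the populate loop folded into
the kernel side. This is a sketch under a stated assumption, not code from
the thread: it assumes the kernel clears both bits whenever the dcache
drops an entry of the directory. All names (struct dir_flags,
dcache_drop_entry(), ensure_dcache_complete()) are hypothetical, and
`evictions_during_fill` only simulates memory pressure hitting in the
middle of a fill pass:

```c
#include <assert.h>
#include <stdbool.h>

struct dir_flags {
	bool fill_started;	/* "bit one" */
	bool dcache_complete;	/* "bit two" */
};

/* Assumption: dropping any dentry of the directory clears both bits. */
static void dcache_drop_entry(struct dir_flags *d)
{
	d->fill_started = false;
	d->dcache_complete = false;
}

static void set_bit_one(struct dir_flags *d)
{
	d->fill_started = true;
}

static bool set_bit_two_if_one_is_set(struct dir_flags *d)
{
	if (d->fill_started)
		d->dcache_complete = true;
	return d->dcache_complete;
}

/*
 * What check_or_create_name() could do internally before looking at
 * names: repeat the fill until "dcache complete" sticks. Returns the
 * number of fill passes that were needed.
 */
static int ensure_dcache_complete(struct dir_flags *d, int evictions_during_fill)
{
	int passes = 0;

	assert(evictions_during_fill >= 0);

	while (!d->dcache_complete) {
		set_bit_one(d);
		passes++;		/* readdir() + lookup of every name */
		if (evictions_during_fill > 0) {
			evictions_during_fill--;
			dcache_drop_entry(d);	/* entry lost mid-fill */
		}
		set_bit_two_if_one_is_set(d);
	}
	return passes;
}
```

With the loop hidden behind one call, user space never sees the two bits,
and a fill pass that loses an entry is simply repeated.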

						- Ted

^ permalink raw reply	[flat|nested] 123+ messages in thread

* explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
                                                             ` (3 preceding siblings ...)
  2004-02-20  2:30                                           ` Theodore Ts'o
@ 2004-02-20 12:04                                           ` Ingo Molnar
  2004-02-20 13:19                                             ` Jamie Lokier
                                                               ` (3 more replies)
  4 siblings, 4 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 12:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

* Linus Torvalds <torvalds@osdl.org> wrote:

> Basic approach: add two bits to the VFS dentry flags. That's all that
> is needed. Then you have three new system calls:
> 
>  - set_bit_one(dirfd)
>  - set_bit_two_if_one_is_set(dirfd);
>  - check_or_create_name(dirfd, name, case_table_pointer, newfd);

i believe Samba's problems can be solved in an even simpler way, by 
using only a single bit associated with the directory dentry, and by not 
putting any case-insensitivity code into the kernel. (not even as a 
separate module.)

One 'user-space cache is valid/clean' bit should be enough - where all
non-Samba accesses clear the 'valid bit', and Samba sets the bit
manually.

What Samba needs is a way to tell between two points in time whether the
directory contents have changed in any way - nothing more. Only one new
syscall is used to maintain the Samba dcache:

	long sys_mark_dir_clean(dirfd);

the syscall returns whether the directory was valid/clean already.

this is how Samba name lookup would work:

repeat:
	if (sys_mark_dir_clean(dirfd)) {
		... pure user-space fast path, use Samba dcache ...
		return;
	}
	... fill Samba dcache ...
	readdir() loop

	goto repeat;

i.e. there will be two calls to sys_mark_dir_clean() in the slowpath
(the first one to set it, the second one to make sure it's still set). 
Races are handled automatically by the loop.
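
A minimal user-space model of the loop above, assuming the kernel clears
the bit on every namespace change in the directory. The names (struct
dir_bit, vfs_change(), mark_dir_clean(), samba_lookup()) are hypothetical
stand-ins rather than the proposed kernel interface, and `pending_races`
only simulates another process changing the directory while the readdir()
pass runs:

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel side (modelled): one "user-space cache is clean" bit per directory. */
struct dir_bit {
	bool clean;
};

/* Any create/unlink/rename in the directory clears the bit. */
static void vfs_change(struct dir_bit *dir)
{
	dir->clean = false;
}

/* Model of sys_mark_dir_clean(): set the bit, return its previous value. */
static long mark_dir_clean(struct dir_bit *dir)
{
	long was_clean = dir->clean;

	dir->clean = true;
	return was_clean;
}

/* User side (modelled): counts how often the name cache had to be rebuilt. */
struct name_cache {
	int refills;
};

/*
 * The "repeat:" loop. pending_races simulates that many concurrent
 * changes, each landing in the middle of one readdir() pass.
 */
static void samba_lookup(struct dir_bit *dir, struct name_cache *cache,
			 int pending_races)
{
	assert(pending_races >= 0);

	for (;;) {
		if (mark_dir_clean(dir))
			return;		/* fast path: cache is coherent */

		cache->refills++;	/* slow path: readdir() loop */
		if (pending_races > 0) {
			pending_races--;
			vfs_change(dir);	/* raced with a writer */
		}
	}
}
```

With no writers, the first lookup refills once and later lookups take the
fast path; a change that lands during the readdir() pass costs exactly one
extra refill.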

this is how Samba could create a file atomically:

	sys_create(name, mode | O_CLEAN);

ie. the create only succeeds if the directory has not been touched since
the Samba dcache has processed it last time. O_CLEAN would be a very
simple check in the open_namei() code, it returns -ENOTCLEAN if the
parent directory has not been marked clean.

i don't think there's any need to have a case-insensitive lookup module
in the kernel - Samba has all the information through the readdir() loop
already - all it needs to know is whether that info is valid or not via
the mark_dir_clean() syscall!

the impact of sys_mark_dir_clean() and O_CLEAN is quite minimal on the
generic VFS i believe. Also, it can be used as a caching method for just
about everything that wants to have a coherent user-space cache of the
VFS namespace. Note that there's nothing about case sensitivity or
insensitivity in this approach, it still gets rid of all of the
excessive readdir()s done in the Samba fastpath.

[ To get rid of all Samba overhead in this area we might need other
  syscall variants too, like rename_if_clean() and unlink_if_clean().
  Under this scheme Samba would never have to do a stat() call of the
  target file, because it always has a coherent copy of the kernel
  dcache, for directories it chooses to cache. ]

this approach differs from dnotify in a couple of key areas:

 - it's a synchronous solution that avoids signals, and is thus
   usable/robust in libraries too.

 - dnotify _forces_ action. mark_dir_clean() you use only when you need
   it, and there's no overhead if the Samba workload is completely silent
   and there are only POSIX users. I.e. it should scale better than
   dnotify.

 - cache teardown can be done in userspace purely: the 'clean bit' has
   no state associated with it (unlike dnotify), so no kernel call is
   necessary to tear down state. User-space just forgets that it cached
   anything about that directory and it's done. No leaking state, and
   good scalability again.

 - but most importantly, it's fundamentally atomic for local filesystems
   and thus meets the needs of Samba in mixed POSIX/Samba workloads.

just in case anyone has followed me down to this point :-), there's yet
another, more advanced way to do the Samba-dcache fastpath 100% in
user-space:

We can export the 'directory clean bit' to userspace, via the same page
pinning and mapping techniques used by futexes. User-space could
register a 'clean bit' address via a new syscall, which the dcache then
uses from that point on. Thus there would be only a single syscall when
Samba sets up a directory cache in user-space [which needs those
readdir() calls so performance is down the drain anyway], which syscall
lets userspace register a machine-word address to serve as the
'directory is clean' flag. Userspace and kernelspace will set this flag
possibly in parallel which is not a problem as long as userspace uses
atomic ops. This approach introduces some page pinning allocation
overhead but that's easy to solve.  User-space would of course condense
the pinned range. Kernel-space would see very minimal overhead from
having the bit in an indirect pointer - at least on 64-bit systems where
all kernel RAM is mapped.

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  1:39                                                     ` Junio C Hamano
@ 2004-02-20 12:54                                                       ` Jamie Lokier
  0 siblings, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20 12:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, viro, Tridge, H. Peter Anvin, Kernel Mailing List

Junio C Hamano wrote:
> JL> The other thing I like is that DN_IGNORE_SELF would be useful for
> JL> other applications too.
> 
> While I agree in principle that DN_IGNORE_SELF would be quite an
> effective and clean way to solve the Samba problem and also
> applicable to other situations,

> I also imagine that the value of DN_IGNORE_SELF would be greatly
> affected by how the "self" is defined.  A server implementation may
> be multithreaded, and you may or may not want to count all your
> threads in that server process as self; another may be implemented
> as one master process spawning multiple worker bee processes, in
> which case it would be more convenient if all the processes in one
> process group are counted as self.

Yes, indeed this is an issue.  A multi-threaded program I'm working on
would want each thread to count separately - because the threads don't
know much about each other.  Samba is more likely to want all threads
treated as a single unit.

Even in a program like Samba, you can imagine a plugin architecture or
something where 3rd party add-ons spawn threads, and those 3rd party
threads want to monitor a directory they are using, independent of the
main Samba threads.

-- Jamie


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
@ 2004-02-20 13:19                                             ` Jamie Lokier
  2004-02-20 13:37                                               ` Ingo Molnar
  2004-02-20 13:23                                             ` [patch] " Ingo Molnar
                                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20 13:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Ingo Molnar wrote:
>  - it's a synchronous solution that avoids signals, and is thus
>    usable/robust in libraries too.

I do like the robustness, due to the way the flag is associated with
dirfd.

>  - dnotify _forces_ action. mark_dir_clean() you use only when you need
>    it, and there's no overhead if the Samba workload is completely silent
>    and there are only POSIX users. I.e. it should scale better than
>    dnotify.

Firstly, it doesn't scale better than dnotify in this scenario if
you're using signals and one-shot mode.  With one-shot mode, there's
also no overhead if there are only POSIX users.

Secondly, dnotify is synchronous if you _block_ the dnotify signal and
use sigtimedwait() to collect events.

I'd personally like to see the robustness problem solved with epoll
or something similar extended to queue more general events to
userspace, more reliably.

Hey!  That's an idea: Use select/poll on the dirfd to read this bit.
That gives you the flexibility to wait, or collect events from
multiple directories.  (Philosophically, every event should be
accessible through select/poll/epoll, right?).

> We can export the 'directory clean bit' to userspace, via the same page
> pinning and mapping techniques used by futexes. User-space could
> register a 'clean bit' address via a new syscall, which the dcache then
> uses from that point on. Thus there would be only a single syscall when
> Samba sets up a directory cache in user-space [which needs those
> readdir() calls so performance is down the drain anyway], which syscall
> lets userspace register a machine-word address to serve as the
> 'directory is clean' flag. Userspace and kernelspace will set this flag
> possibly in parallel which is not a problem as long as userspace uses
> atomic ops. This approach introduces some page pinning allocation
> overhead but that's easy to solve.  User-space would of course condense
> the pinned range. Kernel-space would see very minimal overhead from
> having the bit in an indirect pointer - at least on 64-bit systems where
> all kernel RAM is mapped.

That is _far_ too overcomplicated to do just for this one obscure bit
of information.

If you're going to do kernel-triggers-futex event transmission (like
CLONE_CLEARTID already does), it would be nice to have a framework for
more general event types, like pending-but-blocked signals, dnotify
events (individual ones, not all funnelled through one signal number
as they are now), fd-ready-to-read events and such.  The sort of
things that select/poll/epoll ought to be able to wait for, and in
most cases can.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [patch] explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
  2004-02-20 13:19                                             ` Jamie Lokier
@ 2004-02-20 13:23                                             ` Ingo Molnar
  2004-02-20 18:00                                               ` viro
  2004-02-20 15:41                                             ` Linus Torvalds
  2004-02-20 20:38                                             ` Christer Weinigel
  3 siblings, 1 reply; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 13:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List, Jamie Lokier

[-- Attachment #1: Type: text/plain, Size: 1262 bytes --]


> What Samba needs is a way to tell between two points in time whether the
> directory contents have changed in any way - nothing more. Only one new
> syscall is used to maintain the Samba dcache:
> 
> 	long sys_mark_dir_clean(dirfd);

> this is how Samba could create a file atomically:
> 
> 	sys_create(name, mode | O_CLEAN);

i've attached a quick patch (against 2.6.3) that implements the new
sys_mark_dir_clean() syscall and O_CLEAN support in all open() variants,
just to give a rough idea of what it looks like. (It's incomplete -
e.g. there's no explicit way to do an atomic unlink or rename.)

i've also attached dir-cache.c, a simple test program for the new
functionality. It marks the current directory clean and tries to open
the "./1" file via O_CLEAN with a 1 second delay. Start this in one shell
and do VFS-namespace modifying ops in another window (eg. "rm -f 2;
touch 2") and see the dir-cache code react to it - the 'clean' bit is
lost, and the file open-create does not succeed if the directory is not
clean.

there's a new dentry flag that is maintained under the directory's i_sem
semaphore. (It would be simpler to have the flag on the inode level,
that way the invalidation could be done as a simple filter to the
dnotify function.)

	Ingo

[-- Attachment #2: dir-mark-clean-2.6.3-A3 --]
[-- Type: text/plain, Size: 4310 bytes --]

--- linux/arch/i386/kernel/entry.S.orig	
+++ linux/arch/i386/kernel/entry.S	
@@ -882,5 +882,6 @@ ENTRY(sys_call_table)
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+ 	.long sys_mark_dir_clean
 
 syscall_table_size=(.-sys_call_table)
--- linux/include/linux/dcache.h.orig	
+++ linux/include/linux/dcache.h	
@@ -153,9 +153,25 @@ d_iput:		no		no		yes
 
 #define DCACHE_REFERENCED	0x0008  /* Recently used, don't discard. */
 #define DCACHE_UNHASHED		0x0010	
+#define DCACHE_USER_CLEAN	0x0020	/* userspace cache coherent */
 
 extern spinlock_t dcache_lock;
 
+static inline void d_user_flush(struct dentry *dentry)
+{
+	dentry->d_vfs_flags &= ~DCACHE_USER_CLEAN;
+}
+
+static inline void d_user_mark_clean(struct dentry *dentry)
+{
+	dentry->d_vfs_flags |= DCACHE_USER_CLEAN;
+}
+
+static inline long d_user_valid(struct dentry *dentry)
+{
+	return (dentry->d_vfs_flags & DCACHE_USER_CLEAN) != 0;
+}
+
 /**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
--- linux/include/asm-generic/errno.h.orig	
+++ linux/include/asm-generic/errno.h	
@@ -96,5 +96,6 @@
 
 #define	ENOMEDIUM	123	/* No medium found */
 #define	EMEDIUMTYPE	124	/* Wrong medium type */
+#define	EFLUSH		125	/* cache not valid */
 
 #endif
--- linux/include/asm-i386/fcntl.h.orig	
+++ linux/include/asm-i386/fcntl.h	
@@ -20,6 +20,7 @@
 #define O_LARGEFILE	0100000
 #define O_DIRECTORY	0200000	/* must be a directory */
 #define O_NOFOLLOW	0400000 /* don't follow links */
+#define O_CLEAN	       01000000 /* parent dir must be clean */
 
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
--- linux/fs/open.c.orig	
+++ linux/fs/open.c	
@@ -747,6 +747,7 @@ struct file *filp_open(const char * file
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
 		namei_flags |= 2;
+	namei_flags |= flags & O_CLEAN;
 
 	error = open_namei(filename, namei_flags, mode, &nd);
 	if (!error)
@@ -1029,6 +1030,26 @@ out_unlock:
 
 EXPORT_SYMBOL(sys_close);
 
+asmlinkage long sys_mark_dir_clean(unsigned int fd)
+{
+	struct file *filp;
+	long ret = -EBADF;
+
+	filp = fget(fd);
+	if (!filp)
+		return ret;
+
+	down(&filp->f_dentry->d_inode->i_sem);
+	ret = d_user_valid(filp->f_dentry);
+	d_user_mark_clean(filp->f_dentry);
+	up(&filp->f_dentry->d_inode->i_sem);
+
+	fput(filp);
+
+	return ret;
+}
+
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
--- linux/fs/namei.c.orig	
+++ linux/fs/namei.c	
@@ -1295,11 +1295,23 @@ do_last:
 		goto exit;
 	}
 
+	/*
+	 * Did user-space require the parent directory to be clean
+	 * but it was invalid?:
+	 */
+	error = -EFLUSH;
+	if ((flag & O_CLEAN) && !d_user_valid(dir)) {
+		up(&dir->d_inode->i_sem);
+		goto exit;
+	}
+
 	/* Negative dentry, just create the file */
 	if (!dentry->d_inode) {
 		if (!IS_POSIXACL(dir->d_inode))
 			mode &= ~current->fs->umask;
 		error = vfs_create(dir->d_inode, dentry, mode, nd);
+		if (!error)
+			d_user_flush(dir);
 		up(&dir->d_inode->i_sem);
 		dput(nd->dentry);
 		nd->dentry = dentry;
@@ -1493,6 +1505,8 @@ asmlinkage long sys_mknod(const char __u
 		}
 		dput(dentry);
 	}
+	if (!error)
+		d_user_flush(nd.dentry);
 	up(&nd.dentry->d_inode->i_sem);
 	path_release(&nd);
 out:
@@ -1545,6 +1559,8 @@ asmlinkage long sys_mkdir(const char __u
 			if (!IS_POSIXACL(nd.dentry->d_inode))
 				mode &= ~current->fs->umask;
 			error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+			if (!error)
+				d_user_flush(nd.dentry);
 			dput(dentry);
 		}
 		up(&nd.dentry->d_inode->i_sem);
@@ -1653,6 +1669,8 @@ asmlinkage long sys_rmdir(const char __u
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
 		error = vfs_rmdir(nd.dentry->d_inode, dentry);
+		if (!error)
+			d_user_flush(nd.dentry);
 		dput(dentry);
 	}
 	up(&nd.dentry->d_inode->i_sem);
@@ -1728,6 +1746,8 @@ asmlinkage long sys_unlink(const char __
 		if (inode)
 			atomic_inc(&inode->i_count);
 		error = vfs_unlink(nd.dentry->d_inode, dentry);
+		if (!error)
+			d_user_flush(nd.dentry);
 	exit2:
 		dput(dentry);
 	}
@@ -2099,6 +2119,10 @@ static inline int do_rename(const char *
 
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
+	if (!error) {
+		d_user_flush(old_dir);
+		d_user_flush(new_dir);
+	}
 exit5:
 	dput(new_dentry);
 exit4:

[-- Attachment #3: dir-cache.c --]
[-- Type: text/plain, Size: 881 bytes --]

/*
 * Copyright (C) Ingo Molnar, 2002
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/times.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <linux/unistd.h>

#define __NR_sys_mark_dir_clean 274
_syscall1(int, sys_mark_dir_clean, int, fd);

#define O_DIRECTORY        0200000 /* must be a directory */

#define O_CLEAN        01000000 /* parent dir must be clean */

int main(int argc, char **argv)
{
	int fd, fd2, clean;

	fd = open(".", O_RDONLY | O_DIRECTORY);
	if (fd < 0) {	/* 0 is a valid descriptor; only negative means failure */
		perror("open .");
		exit(1);
	}

	for (;;) {
		clean = sys_mark_dir_clean(fd);
		printf("clean:%d ", clean); fflush(stdout);
		sleep(1);
		fd2 = open("./1", O_CREAT|O_TRUNC|O_CLEAN, 0777);
		printf("fd:%d\n", fd2);
		if (fd2 >= 0)	/* the open fails if the directory is not clean */
			close(fd2);
		sleep(1);
	}

	return 0;
}


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:24                                                         ` Linus Torvalds
                                                                             ` (3 preceding siblings ...)
  2004-02-20  1:07                                                           ` Paul Wagland
@ 2004-02-20 13:31                                                           ` Chris Wedgwood
  4 siblings, 0 replies; 123+ messages in thread
From: Chris Wedgwood @ 2004-02-20 13:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, viro, Tridge, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 04:24:18PM -0800, Linus Torvalds wrote:

> That said, who actually _uses_ dnotify? The only time dnotify seems
> to come up in discussions is when people complain how badly designed
> it is, and I don't think I've ever heard anybody say that they use
> it and that they liked it ;)

I have code which watches maildir mailboxes using dnotify and it works
great.  I'm not sure I love dnotify but for this purpose it works very
well.


   --cw

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 13:19                                             ` Jamie Lokier
@ 2004-02-20 13:37                                               ` Ingo Molnar
  2004-02-20 14:00                                                 ` Ingo Molnar
  2004-02-20 16:31                                                 ` Jamie Lokier
  0 siblings, 2 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 13:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

* Jamie Lokier <jamie@shareable.org> wrote:

> >  - it's a synchronous solution that avoids signals, and is thus
> >    usable/robust in libraries too.
> 
> I do like the robustness, due to the way the flag is associated with
> dirfd.
> 
> >  - dnotify _forces_ action. mark_dir_clean() you use only when you need
> >    it, and there's no overhead if the Samba workload is completely silent
> >    and there are only POSIX users. I.e. it should scale better than
> >    dnotify.
> 
> Firstly, it doesn't scale better than dnotify in this scenario if
> you're using signals and one-shot mode.  With one-shot mode, there's
> also no overhead if there are only POSIX users.

the overriding argument is that the bit needs to be maintained by Samba
explicitly - i.e. no way around sys_mark_dir_clean() and O_CLEAN. And if
we've done that it costs us _nothing_ to return the previous bit value
in sys_mark_dir_clean() - which Samba can use to build a 100% coherent
name cache. At which point i can see no reason at all to use any variant
of dnotify/select/poll/epoll.

Also, one-shot mode will make you lose multiple outstanding events from
multiple directories, unless you associate separate signals with each
directory watched - which will make you run out of the 64 signals very
quickly.

> Secondly, dnotify is synchronous if you _block_ the dnotify signal and
> use sigtimedwait() to collect events.

it still queues up kernel structures (dnotify event structures and
signal structures) which can get lost or can overflow, etc. The 'clean
bit' has no such queueing problem.

> I'd personally like to see the robustness problem solved with epoll or
> something similar extended to queue more general events to userspace,
> more reliably.

why? A synchronous lookup is just that - a synchronous lookup. There's
no need at all for Samba to know about all namespace activities. What it
wants to have is coherency between its own cache and the VFS.

dnotify (or epoll) is useful for applications that need to know about
all events: eg. directory visualisation apps. For everything else (like
Samba) it's too much overhead i believe.

> Hey!  That's an idea: Use select/poll on the dirfd to read this bit.
> That gives you the flexibility to wait, or collect events from
> multiple directories.  (Philosophically, every event should be
> accessible through select/poll/epoll, right?).

Samba doesn't want to 'wait' for any event. It just wants to keep in sync
with the VFS in a minimalistic way: update the cache only when
absolutely necessary. Think of it as a CPU's cache validation protocol -
you don't want to re-read invalid cachelines all the time, only if they
are really needed.

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 13:37                                               ` Ingo Molnar
@ 2004-02-20 14:00                                                 ` Ingo Molnar
  2004-02-20 16:31                                                 ` Jamie Lokier
  1 sibling, 0 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 14:00 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

for Samba, the use of sys_mark_dir_clean() & O_CLEAN would be that it
could fully cache all namespace data in user-space (as it already does),
and could assume that the userspace cache is up to date and ensure that a
particular name does not exist - under whatever namespace rules it wants
to use.

create(O_CLEAN) will return with -EFLUSH if this assumption is not true
anymore - in which case it can re-read that directory.

this way the fastpath would not involve any readdir() calls at all -
even in a mixed environment.
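
A small self-contained model of that flow, for a case-insensitive create.
It is a sketch under assumptions, not Samba code: the directory and the
clean bit are simulated in memory, every function and struct name is
hypothetical, and EFLUSH is defined locally with the value from the patch.
As in the patch, every successful create clears the bit, including
Samba's own:

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

#define MAX_NAMES 16
#define NAME_LEN  32
#ifndef EFLUSH
#define EFLUSH 125	/* value used by the patch; not a standard errno */
#endif

/* Kernel side (modelled): directory contents plus the clean bit. */
struct model_dir {
	char names[MAX_NAMES][NAME_LEN];
	int count;
	bool clean;
};

/* Plain case-sensitive create: adds the name and clears the clean bit. */
static int posix_create(struct model_dir *dir, const char *name)
{
	if (dir->count == MAX_NAMES || strlen(name) >= NAME_LEN)
		return -ENOSPC;
	for (int i = 0; i < dir->count; i++)
		if (strcmp(dir->names[i], name) == 0)
			return -EEXIST;
	strcpy(dir->names[dir->count++], name);
	dir->clean = false;
	return 0;
}

/* Model of create(O_CLEAN): refuse unless the directory is still clean. */
static int create_clean(struct model_dir *dir, const char *name)
{
	if (!dir->clean)
		return -EFLUSH;
	return posix_create(dir, name);
}

/* User side (modelled): the name cache filled by the readdir() loop. */
struct user_cache {
	char names[MAX_NAMES][NAME_LEN];
	int count;
};

/* Slow path: mark the directory clean, then re-read it. */
static void refill(struct model_dir *dir, struct user_cache *cache)
{
	dir->clean = true;	/* stands in for sys_mark_dir_clean() */
	memcpy(cache->names, dir->names, sizeof(cache->names));
	cache->count = dir->count;
}

static bool cache_has_nocase(const struct user_cache *cache, const char *name)
{
	for (int i = 0; i < cache->count; i++)
		if (strcasecmp(cache->names[i], name) == 0)
			return true;
	return false;
}

/*
 * Create `name` only if no case-insensitive match exists. This model has
 * no unlink, so a cache hit is never stale. *refills counts re-reads.
 */
static int create_nocase(struct model_dir *dir, struct user_cache *cache,
			 const char *name, int *refills)
{
	for (;;) {
		int err;

		if (cache_has_nocase(cache, name))
			return -EEXIST;
		err = create_clean(dir, name);
		if (err != -EFLUSH)
			return err;
		refill(dir, cache);	/* cache was stale: re-read, retry */
		(*refills)++;
	}
}
```

The directory is only re-read when create_clean() reports -EFLUSH; a
lookup that hits the cache is answered without touching the directory.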

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
  2004-02-20 13:19                                             ` Jamie Lokier
  2004-02-20 13:23                                             ` [patch] " Ingo Molnar
@ 2004-02-20 15:41                                             ` Linus Torvalds
  2004-02-20 17:04                                               ` Ingo Molnar
                                                                 ` (2 more replies)
  2004-02-20 20:38                                             ` Christer Weinigel
  3 siblings, 3 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20 15:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

On Fri, 20 Feb 2004, Ingo Molnar wrote:
> 
> One 'user-space cache is valid/clean' bit should be enough - where all
> non-Samba accesses clear the 'valid bit', and Samba sets the bit
> manually.

Yes, that, together with O_CLEAN would work.

The problem is that you'd still need other system calls: it's not like 
open(O_CREAT) is the only way to create a file. So you'd have to add 
versions of "link()" etc, which means that O_CLEAN is really pretty 
pointless, and you might as well just do it in a new system call.

Your version is also not multi-threaded: you can never allow more than one 
thread doing the "sys_mark_dir_clean()". That was the reason for having 
two bits: so that anybody can do a lookup in parallel, and only the 
"filldir" part needs to be serialized.

So I do believe you'd want two bits anyway.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 13:37                                               ` Ingo Molnar
  2004-02-20 14:00                                                 ` Ingo Molnar
@ 2004-02-20 16:31                                                 ` Jamie Lokier
  1 sibling, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20 16:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Ingo Molnar wrote:
> Also, one-shot mode will make you lose multiple outstanding events from
> multiple directories, unless you associate separate signals with each
> directory watched - which will make you run out of the 64 signals very
> quickly.

Eh?  Signals are queued, and the file descriptor number is stored in
the siginfo.

> dnotify (or epoll) is useful for applications that need to know about
> all events: eg. directory visualisation apps. For everything else (like
> Samba) it's too much overhead i believe.

dnotify _should_ be quite low overhead, although not as low as your
syscall of course.  I've never measured it, nor looked much at the
kernel code, but in principle delivering a siginfo through
sigtimedwait() should be quite fast.

But I agree it's too much.  Hence:

> > Hey!  That's an idea: Use select/poll on the dirfd to read this bit.
> > That gives you the flexibility to wait, or collect events from
> > multiple directories.  (Philosophically, every event should be
> > accessible through select/poll/epoll, right?).
> 
> Samba doesn't want to 'wait' for any event. It just wants to keep in sync
> with the VFS in a minimalistic way: update the cache only when
> absolutely necessary. Think of it as a CPU's cache validation protocol -
> you don't want to re-read invalid cachelines all the time, only if they
> are really needed.

Yes yes, but I'm thinking of the _many_ other applications that aren't
Samba and could benefit from a similar facility.  Also, dnotify is
there already - the signalling mechanism isn't pretty, but the hooks
to catch directory changes are fine.

By the way, what is the scope of O_CLEAN?  Does it fail if _any_
operation was done to the directory, or only if operations were done
by non-Samba tasks?  If the former, it has the same scalability
problem that Linus mentioned: Samba shouldn't have to evict its cache
when it creates files itself.  If the latter, is the test
process-wide, thread-wide, or CLONE_FILES-wide?

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 15:41                                             ` Linus Torvalds
@ 2004-02-20 17:04                                               ` Ingo Molnar
  2004-02-20 17:19                                                 ` Linus Torvalds
  2004-02-20 17:33                                               ` Jamie Lokier
  2004-02-20 17:47                                               ` Jamie Lokier
  2 siblings, 1 reply; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List


* Linus Torvalds <torvalds@osdl.org> wrote:

> Your version is also not multi-threaded: you can never allow more than
> one thread doing the "sys_mark_dir_clean()". That was the reason for
> having two bits: so that anybody can do a lookup in parallel, and
> only the "filldir" part needs to be serialized.
> 
> So I do believe you'd want two bits anyway.

hm, right. So for the lookup to be lockless, it would have to be managed
via a syscall variant similar in mechanism to the two-bit approach you 
suggested:

	ret = sys_manage_dir_cache(fd, op);

where the following cache states are defined:

	(invalid, refill_in_progress, valid)

the following type of cache ops are defined:

	(lookup, cache_filled)

the semantics of the sys_manage_dir_cache() syscall are the following:

- op is 'lookup': the syscall returns 'valid' if state is valid. If the 
  state is 'refill_in_progress' then lookup returns refill_in_progress. 
  If the state is 'invalid', then the state goes to 'refill_in_progress' 
  and 'invalid' is returned.

- op is 'cache_filled': the syscall moves the state to 'valid' if state
  is still 'refill_in_progress'. It goes to 'refill_in_progress' if the
  state was 'invalid'.

the kernel does the valid->invalid and refill_in_progress->invalid
transitions automatically, when relevant VFS events occur. All dentries
start out in state invalid.
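
The transitions above can be written down as a small state machine. This
is only a single-threaded sketch: the enum names and the function
signature are hypothetical (the real call would take an fd and run under
the directory's lock), and since the return value of 'cache_filled' is not
specified above, the sketch assumes it returns the resulting state:

```c
#include <assert.h>

enum cache_state { CACHE_INVALID, CACHE_REFILL_IN_PROGRESS, CACHE_VALID };
enum cache_op { OP_LOOKUP, OP_CACHE_FILLED };

/* Kernel side: any relevant VFS event drops the state back to invalid. */
static void vfs_event(enum cache_state *state)
{
	*state = CACHE_INVALID;
}

/*
 * Model of sys_manage_dir_cache(), operating on the state directly.
 * OP_LOOKUP returns the state seen on entry; OP_CACHE_FILLED returns
 * the state after the transition (an assumption of this sketch).
 */
static enum cache_state manage_dir_cache(enum cache_state *state,
					 enum cache_op op)
{
	enum cache_state seen = *state;

	switch (op) {
	case OP_LOOKUP:
		/* the caller that sees 'invalid' becomes the one filler */
		if (seen == CACHE_INVALID)
			*state = CACHE_REFILL_IN_PROGRESS;
		return seen;
	case OP_CACHE_FILLED:
		if (seen == CACHE_REFILL_IN_PROGRESS)
			*state = CACHE_VALID;
		else if (seen == CACHE_INVALID)
			*state = CACHE_REFILL_IN_PROGRESS; /* raced: fill again */
		return *state;
	}
	assert(0 && "unknown op");
	return seen;
}
```

Only the caller that gets 'invalid' back from a lookup does the readdir()
pass; everyone else sees 'refill_in_progress' or 'valid' and never has to
serialize on the fill.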

there's another class of problems: is it an issue that directory renames
that move this directory (higher up in the directory hierarchy of this
directory) do not invalidate the cache? In that case there's no dnotify
event either.

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 17:04                                               ` Ingo Molnar
@ 2004-02-20 17:19                                                 ` Linus Torvalds
  2004-02-20 18:48                                                   ` Ingo Molnar
  2004-02-20 23:00                                                   ` tridge
  0 siblings, 2 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20 17:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List



On Fri, 20 Feb 2004, Ingo Molnar wrote:
> 
> there's another class of problems: is it an issue that directory renames
> that move this directory (higher up in the directory hierarchy of this
> directory) do not invalidate the cache? In that case there's no dnotify
> event either.

This is one of the reasons why I worry about user-space caching. It's just 
damn hard to get right. 

It's hard in kernel space too, of course, but we've had smart people
working on the dcache for years. So if we can sanely avoid duplication, 
that would be a good thing.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 15:41                                             ` Linus Torvalds
  2004-02-20 17:04                                               ` Ingo Molnar
@ 2004-02-20 17:33                                               ` Jamie Lokier
  2004-02-20 18:22                                                 ` Linus Torvalds
  2004-02-20 17:47                                               ` Jamie Lokier
  2 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> > One 'user-space cache is valid/clean' bit should be enough - where all
> > non-Samba accesses clear the 'valid bit', and Samba sets the bit
> > manually.
> 
> Yes, that, together with O_CLEAN would work.
> 
> The problem is that you'd still need other system calls: it's not like 
> open(O_CREAT) is the only way to create a file. So you'd have to add 
> versions of "link()" etc, which means that O_CLEAN is really pretty 
> pointless, and you might as well just do it in a new system call.
> 
> Your version is also not multi-threaded: you can never allow more than one 
> thread doing the "sys_mark_dir_clean()". That was the reason for having 
> two bits: so that anybody can do a lookup in parallel, and only the
> "filldir" part needs to be serialized.

How about this: we clean up dnotify, so it can be used for
user<->kernel dcache coherency, efficiently, and implement O_CLEAN in
a different way, which works with multiple threads, without extra
system calls for rename/link etc. and where the scope of "this
process/thread/whatever doesn't make a directory unclean" is flexible
enough for Samba, multi-threaded file viewers, maildir mail trackers
and so on.

Ok, marketing aside:

    1. open() a directory
    2. fcntl() for dnotify on the directory, as with the current interface,
       but adding a flag called DN_POLL.

Normally dnotify sends a queued signal with each event.  It can listen
in one-shot or multiple event modes.  I'm surprised a signal was ever
used, because we have a perfectly good file descriptor to read from.

So DN_POLL means "register this dnotify but don't send a signal".
Instead, you'll call fcntl() again to read the dnotify status bits.

For Samba, the dnotify is equivalent to sys_mark_dir_clean().
Samba's dcache works like this, following Ingo's logic:

repeat:
	if (fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_RENAME | DN_POLL) == 0) {
		... pure user-space fast path, use Samba dcache ...
		return;
	}
	... fill Samba dcache ...
	readdir() loop
	goto repeat;

(Note that DN_DELETE isn't needed: the negative userspace dcache
doesn't care about deletions).

See, that is obviously equivalent and uses an obvious (and tiny)
improvement to the existing dnotify feature.

The argument that O_CLEAN _requires_ sys_mark_dir_clean() is obviously
bogus: if O_CLEAN will abort when _this_ process/thread/whatever has
the directory marked clean, then it can just as easily abort when
this process/thread/whatever has a dnotify listening.

We might, however, like to add a flag DN_CLEAN so that O_CLEAN only
aborts for dnotifies with that flag set.  Just to stay friendly with
libraries.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 15:41                                             ` Linus Torvalds
  2004-02-20 17:04                                               ` Ingo Molnar
  2004-02-20 17:33                                               ` Jamie Lokier
@ 2004-02-20 17:47                                               ` Jamie Lokier
  2 siblings, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-20 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote to Ingo Molnar:
> Your version is also not multi-threaded: you can never allow more than one 
> thread doing the "sys_mark_dir_clean()".

It's fine as long as each thread has its own dirfd.  The "clean bit"
applies to an fd.  Or did I miss something obvious?

> The problem is that you'd still need other system calls

Here's a thought.  It's a bit ugly, but it offers O_CLEAN-like
functionality without extra system calls for each operation:

    fchdir(dirfd);

That means change to dirfd in the current process (or thread if
CLONE_FS), and when the "clean bit" is set on dirfd, then any creation
of a name _with no directory component_ will abort.

For example, these operations all create names which will check
dirfd's clean bit, and abort if it's set:

    open("file1", O_CREAT|O_TRUNC|O_RDWR, 0666);
    link("file1", "file2");
    symlink("file1","file3");
    rename("file1", "file4");
    link("subdir/file1", "file2");
    symlink("subdir/file1","file3");
    rename("subdir/file1", "file4");

These operations don't check any clean bits:

    open("/tmp/file1", O_CREAT|O_TRUNC|O_RDWR, 0666);
    open("./file1", O_CREAT|O_TRUNC|O_RDWR, 0666);
    link("file1", "subdir/file2");
    symlink("file1","subdir/file3");
    rename("file1", "subdir/file4");

If dirfd is closed, then of course the current directory stays the
same, but there is no clean bit to check any more.  chdir() also
implies no clean bit to check.

In other words the notion of current directory is extended to be
(inode, fd).  fd can be NULL, or an fd whose clean bit must be checked
before allowing name creation for "/"-less paths.
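
The rule for which names consult the clean bit can be sketched as a
one-line predicate. This is only an illustration of the "/"-less test
described above; the function name is hypothetical, and the real check
would live in the kernel's name-creation paths. The argument is the name
being created, i.e. the second argument of link(), symlink() and rename(),
and the first argument of open().

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if creating new_name should check the clean bit of the
 * current directory's fd, 0 otherwise.  Only names with no directory
 * component qualify: "file1" does, but "./file1", "subdir/file2" and
 * "/tmp/file1" do not.
 */
int created_name_checks_clean_bit(const char *new_name)
{
    assert(new_name != NULL);
    return strchr(new_name, '/') == NULL;
}
```

Note that only the created name matters: link("subdir/file1", "file2")
checks the bit because "file2" has no directory component, even though the
source path does.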

(As you know I prefer to use dnotify on dirfd to represent the "clean
bit", but that's the subject of another mail).

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [patch] explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 13:23                                             ` [patch] " Ingo Molnar
@ 2004-02-20 18:00                                               ` viro
  0 siblings, 0 replies; 123+ messages in thread
From: viro @ 2004-02-20 18:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

On Fri, Feb 20, 2004 at 02:23:52PM +0100, Ingo Molnar wrote:
 
> i've also attached dir-cache.c, a simple testcode for the new
> functionality. It marks the current directory clean and tries to open
> the "./1" file via O_CLEAN with 1 second delay. Start this in one shell
> and do VFS-namespace modifying ops in another window (eg. "rm -f 2;
> touch 2") and see the dir-cache code react to it - the 'clean' bit is
> lost, and the file open-create does not succeed if the directory is not
> clean.
> 
> there's a new dentry flag that is maintained under the directory's i_sem
> semaphore. (It would be simpler to have the flag on the inode level,
> that way the invalidation could be done as a simple filter to the
> dnotify function.)

IMO putting that in dentry (let alone inode) is fundamentally broken.
Basically, your flag says "somebody in userland knows the contents
of directory".  So your create-if-clean is inherently racy - if we get

task A                          task B                  task C
had learnt the contents
marked clean
                                changed the contents
                                                        had learnt the contents
                                                        marked clean
did create-if-clean, assuming
its knowledge to be accurate

then A will succeed just fine.
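
The interleaving above can be replayed in a small single-threaded model.
This is a sketch under stated assumptions: one clean bit shared by all
tasks, and a "version" counter that exists only so the model can show what
each task believes; none of these names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a directory with a single shared clean bit. */
struct shared_dir {
    int clean;    /* the one flag stored with the dentry/inode */
    int version;  /* bumped on every change; for illustration only */
};

/* What a userland task believes the directory looks like. */
struct cache_task {
    int known_version;
};

/* "had learnt the contents" + "marked clean" */
void learn_and_mark_clean(struct shared_dir *d, struct cache_task *t)
{
    assert(d != NULL && t != NULL);
    t->known_version = d->version;
    d->clean = 1;
}

/* "changed the contents": any modification clears the shared bit. */
void change_contents(struct shared_dir *d)
{
    assert(d != NULL);
    d->version++;
    d->clean = 0;
}

/* create-if-clean: returns 0 on success, -1 if the bit is not set. */
int create_if_clean(const struct shared_dir *d)
{
    assert(d != NULL);
    return d->clean ? 0 : -1;
}
```

Replaying the table: A calls learn_and_mark_clean(), B calls
change_contents(), C calls learn_and_mark_clean(), and A's
create_if_clean() then returns 0 even though A's known_version no longer
matches the directory. The bit records that somebody is up to date, not
who.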

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 17:33                                               ` Jamie Lokier
@ 2004-02-20 18:22                                                 ` Linus Torvalds
  2004-02-21  0:38                                                   ` Jamie Lokier
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-20 18:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004, Jamie Lokier wrote:
> 
> How about this: we clean up dnotify, so it can be used for
> user<->kernel dcache coherency

No can do.

There is no _way_ dnotify can do a race-free update, exactly because any 
user-level state is fundamentally irrelevant because it isn't tested under 
the directory semaphore.

See? You can have a user-level cache, but the flag and the notification 
absolutely have to be under the inode semaphore (and thus in kernel space)
if you want to avoid all races with unrelated processes.

Now, for samba this isn't necessarily a huge problem, because you can 
basically say "don't do that, then", and just document that you shouldn't 
mess with a samba export using anything but SMB accesses. So in a sense, 
the samba unix-side coherency is nothing more than politeness, and then 
dnotify or similar works fine (by virtue of not being an absolute 
coherency guarantee, just a "best effort").

But then it should be documented as such. It's not coherent, it's only 
"almost coherent".

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 17:19                                                 ` Linus Torvalds
@ 2004-02-20 18:48                                                   ` Ingo Molnar
  2004-02-21  1:44                                                     ` Jamie Lokier
  2004-02-21  7:58                                                     ` Ingo Molnar
  2004-02-20 23:00                                                   ` tridge
  1 sibling, 2 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-20 18:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

* Linus Torvalds <torvalds@osdl.org> wrote:

> > there's another class of problems: is it an issue that directory renames
> > that move this directory (higher up in the directory hierarchy of this
> > directory) do not invalidate the cache? In that case there's no dnotify
> > event either.
> 
> This is one of the reasons why I worry about user-space caching. It's
> just damn hard to get right.

this particular problem could be solved by walking down to the root
dentry for every sys_manage_dir_cache() lookup and check that each
dentry is still cache-valid. This involves some overhead, but it's still
faster than doing the same from userspace. (ie. validating each previous
path component at lookup time.) Since this doesn't change the dcache it
ought to be doable via the rcu-read path and would thus still have
pretty good SMP properties. [except when traversing mountpoints :-( ].

but this scheme also has other problems: who decides who is the 'cache
manager'? What if there are two instances of fileservers both using the
same fileset and also trying to do caching this way?

perhaps using a simple 64-bit generation counter would be better. Samba
would get a new syscall to get the sum of each generation counter down
to the root dentry - a total validation of the pathname. If the counter
matches with that in the userspace cache entry then no need to re-create
the cache. Such generation counters would be usable for multiple file
servers as well. Hm?

> It's hard in kernel space too, of course, but we've had smart people
> working on the dcache for years. So if we can sanely avoid
> duplication, that would be a good thing.

i believe Samba already has what is in essence a duplication of the
dcache. We could enable it to be fairly coherent, for Samba to be able
to have an authoritative 'does this file exist' answer without any
excessive readdir()s.

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
                                                               ` (2 preceding siblings ...)
  2004-02-20 15:41                                             ` Linus Torvalds
@ 2004-02-20 20:38                                             ` Christer Weinigel
  2004-02-22 15:07                                               ` Jamie Lokier
  3 siblings, 1 reply; 123+ messages in thread
From: Christer Weinigel @ 2004-02-20 20:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

Ingo Molnar <mingo@elte.hu> writes:

> * Linus Torvalds <torvalds@osdl.org> wrote:
> 
> i believe Samba's problems can be solved in an even simpler way, by 
> using only a single bit associated with the directory dentry, and by not 
> putting any case-insensitivity code into the kernel. (not even as a 
> separate module.)
> 
> One 'user-space cache is valid/clean' bit should be enough - where all
> non-Samba accesses clear the 'valid bit', and Samba sets the bit
> manually.
> 
> What Samba needs is a way to tell between two points in time whether the
> directory contents have changed in any way - nothing more. Only one new
> syscall is used to maintain the Samba dcache:
> 
> 	long sys_mark_dir_clean(dirfd);
> 
> the syscall returns whether the directory was valid/clean already.

Isn't this rather bad?  It's only possible to have one process that
does this magic clean bit thing.  Other applications such as Wine or a
DOS emulator might want to get the same speedups.

Instead of a bit, why not just use the generation number idea that
has been tossed around?  Let each directory have a generation number
which can be read with a function:

    long sys_get_generation(dirfd);

Then the name lookup would work with multiple processes and with some
code like this:

repeat:
	new_generation = sys_get_generation(dirfd);
	if (new_generation == old_generation) {
		... pure user-space fast path, use Samba dcache ...
		return;
	}
	old_generation = new_generation;
	... fill Samba dcache ...
	readdir() loop

	goto repeat;

Add a new create syscall with the same idea as your one bit syscall,
which checks that the generation number matches.  If the generation
number doesn't match the create call fails.

    int create_synchronized(name, mode, generation);
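
A minimal model of the generation-number idea, to show why a stale caller
is refused. Everything here is an assumption for illustration: the struct,
the function names with the gen_ prefix, the -1 failure value, and the
choice that a successful create bumps the generation itself (so the caller
has to re-read it afterwards).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-directory state: one counter, bumped on every change. */
struct gen_dir {
    long generation;
};

/* Stand-in for sys_get_generation(dirfd). */
long gen_get_generation(const struct gen_dir *d)
{
    assert(d != NULL);
    return d->generation;
}

/* Any create/rename/unlink by an unrelated process. */
void gen_modify_dir(struct gen_dir *d)
{
    assert(d != NULL);
    d->generation++;
}

/*
 * Stand-in for create_synchronized(name, mode, generation).
 * Fails with -1 if the caller's generation is stale; otherwise the
 * create happens, which is itself a change and bumps the counter.
 * In a kernel, compare and create would run under the directory lock.
 */
int gen_create_synchronized(struct gen_dir *d, long expected_generation)
{
    assert(d != NULL);
    if (d->generation != expected_generation)
        return -1;
    d->generation++;
    return 0;
}
```

Because each process keeps its own copy of the number it last saw, two
caching processes do not interfere with each other the way they do with a
single shared bit.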

  /Christer

-- 
"Just how much can I get away with and still go to heaven?"

Freelance consultant specializing in device driver programming for Linux 
Christer Weinigel <christer@weinigel.se>  http://www.weinigel.se

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 17:19                                                 ` Linus Torvalds
  2004-02-20 18:48                                                   ` Ingo Molnar
@ 2004-02-20 23:00                                                   ` tridge
  1 sibling, 0 replies; 123+ messages in thread
From: tridge @ 2004-02-20 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

Linus,

 > This is one of the reasons why I worry about user-space caching. It's just 
 > damn hard to get right. 

yes, I'm concerned about that too. It does have the potential to be
very fast though, as it allows us to index any way we want to (in the
hot-cache paths at least).

One thing that may be important to know is with the normal Samba
process model there may be thousands of processes accessing this cache
as Samba creates a new process for each connection. With futexes we
have some chance of sanely managing access to a shared cache in user
space between such a large pool of processes, so I don't think that is
an insurmountable problem, but it's something to consider when thinking
of the normal use case of whatever solution is decided on.

The current user-space positive name cache is per-process, largely
because it was designed to be portable and nice things like futexes
weren't available. At the time we also were trying to avoid too much
OS specific stuff in Samba. We've got much better infrastructure for
OS specific stuff now, so there is no problem with a Linux specific
solution. The other unixes can just continue to be slow.

Cheers, Tridge

PS: Thanks _very_ much for all the effort on this!

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 18:22                                                 ` Linus Torvalds
@ 2004-02-21  0:38                                                   ` Jamie Lokier
  2004-02-21  1:10                                                     ` Linus Torvalds
  0 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-21  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> > How about this: we clean up dnotify, so it can be used for
> > user<->kernel dcache coherency
> 
> No can do.
> 
> There is no _way_ dnotify can do a race-free update, exactly because any 
> user-level state is fundamentally irrelevant because it isn't tested under 
> the directory semaphore.
> 
> See? You can have a user-level cache, but the flag and the notification 
> absolutely have to be under the inode semaphore (and thus in kernel space)
> if you want to avoid all races with unrelated processes.

Eh?  The flag and notification operations are set and tested
under the inode semaphore, when fcntl() is called.

The userspace cache is a slave to the kernel's cache, and I think it
is _fully_ coherent with the kernel.

Every read of the userspace cache is guaranteed to reflect the
contents of the kernel cache, atomically with respect to other
operations by unrelated processes.  Also, operations on the directory
that depend on case-insensitive matching (create, link, rename etc.)
are also atomic with respect to unrelated processes.

The atomic read nature of the userspace cache comes from a loop.
It's very similar to the sequence lock in <linux/seqlock.h>:

    cache_lookup_names(names...) {
        while (fcntl(dirfd, F_NOTIFY, flags) != 0) {
            userspace_name_list = read_directory(dirfd);
        }
        return case_insensitive_lookup(userspace_name_list, names...);
    }

Atomic operations on the directory come from a higher level loop,
using Ingo's O_CLEAN idea:

    atomic_create(name, flags, mode) {
        do {
            ci_name = cache_lookup_names(name);
            if (ci_name && (flags & O_EXCL)) { return -EEXIST; }
            if (!ci_name && !(flags & O_CREAT)) { return -ENOENT; }
            fd = clean_open (ci_name ? ci_name : name, flags, mode);
        } while (fd == -ENOENT || fd == -ENOTCLEAN);
        return fd;
    }

    atomic_stat(name, st) {
        do {
            ci_name = cache_lookup_names(name);
            if (!ci_name) { return -ENOENT; }
            result = stat (ci_name, st);
        } while (result == -ENOENT);
        return result;
    }

    /* This unlinks just one entry if there are multiple case-equivalent
       ones. If you want to remove _all_ case-equivalent entries, you'll
       need clean_unlink. */
    atomic_unlink(name) {
        do {
            ci_name = cache_lookup_names(name);
            if (!ci_name) { return -ENOENT; }
            result = unlink (ci_name);
        } while (result == -ENOENT);
        return result;
    }

    atomic_rename(old, new) {
        do {
            (ci_old, ci_new) = cache_lookup_names(old, new);
            if (!ci_old) { return -ENOENT; }
            result = clean_rename(ci_old, ci_new ? ci_new : new);
        } while (result == -ENOTCLEAN || result == -ENOENT);
        return result;
    }

    atomic_link(from, to) {
        do {
            (ci_from, ci_to) = cache_lookup_names(from, to);
            if (!ci_from) { return -ENOENT; }
            if (ci_to) { return -EEXIST; }
            result = clean_link(ci_from, to);
        } while (result == -ENOTCLEAN || result == -ENOENT);
        return result;
    }

(symlink, mkdir and rmdir are similar to link, create and unlink).

The operations clean_open, clean_mkdir, clean_rename, clean_link,
clean_symlink and clean_mknod are either new system calls, or use the
standard system calls with the fchdir() method I described.

Even path walking is atomic: Samba will do a path walk using a
case-insensitive lookup on each path component.  That means every
directory that is involved will be cached in Samba and have a "clean bit".

It doesn't matter whether Samba prefers to fchdir() each step (in
which case it'll get the atomicity that it would get doing that with
normal kernel case-sensitive lookups), or not and pass the whole path
to the clean_*() operation.  In the latter case, the clean_*()
operation will test all the clean bits involved in the target path
lookup, and return -ENOTCLEAN if any aren't set, thus providing the
normal atomicity guarantees.

Ingo's concern that a directory opened by Samba's cache might be moved
is not a problem: if that happens, it'll clear the clean bit of at
least one directory in the target path.

You gave an example before:

> On the other hand, even with a nice dnotify infrastructure, you
> simply _cannot_ get absolute atomicity guarantees. Because by the
> time you actually execute the "mv" operation, another process may
> create a new file with the "same" name (ie different name, but
> comparing the same ignoring case) on another CPU. By the time you
> get the dnotify, it's too late, and the move will have happened, and
> undoing the operation (and hiding it from the client) may well be
> impossible - possibly because another process creating a file with
> the old name.

The example is flawed: the attempted rename _is_ atomic.  Either
another process succeeds on another CPU, in which case _our_ attempt
to "mv" returns -ENOTCLEAN and we will start again by refreshing our
cache, or we beat the other process to it.

This works because the clean bit checking is done by the kernel, under
the directory/inode semaphores.

It's atomic w.r.t. both other POSIX processes _and_ other processes
with their own userspace caches.

> But then it should be documented as such. It's not coherent, it's only 
> "almost coherent".

It's entirely possible I'm being dense, but I think both Ingo's
proposal and mine, which is based on it but uses dnotify, both provide a
_fully_ coherent userspace cache and _atomic_ operations.

They do it by looping (like a spinlock or seqlock) rather than
sleeping until ready (like a semaphore), but that is ok as long as
there isn't excessive competition between Samba and other processes
modifying the same directory.

(If the excessive competition proves to be a performance problem, then
we can adapt F_SETLEASE to resolve that too.  But I don't think it is
necessary).

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  0:38                                                   ` Jamie Lokier
@ 2004-02-21  1:10                                                     ` Linus Torvalds
  2004-02-21  3:01                                                       ` Jamie Lokier
  0 siblings, 1 reply; 123+ messages in thread
From: Linus Torvalds @ 2004-02-21  1:10 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

On Sat, 21 Feb 2004, Jamie Lokier wrote:
> 
> Eh?  The flag and notification operations are set and tested
> under the inode semaphore, when fcntl() is called.

Doesn't matter. Because you will drop the inode semaphore before you
actually create a new file. So you'll always have a window open for a
race.

That's what Ingo's O_CLEAN thing did. And if you do Ingo's O_CLEAN, then
there's no point to notifiers in the first place - Ingo's algorithm works 
regardless of them (it had other problems, but that's another issue and 
just requires a bit of extending on the basic concept).

So why do you care about dnotify? It doesn't help at all once you have 
O_CLEAN (or equivalent).

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 18:48                                                   ` Ingo Molnar
@ 2004-02-21  1:44                                                     ` Jamie Lokier
  2004-02-21  7:58                                                     ` Ingo Molnar
  1 sibling, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-21  1:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Ingo Molnar wrote:
> > > there's another class of problems: is it an issue that directory renames
> > > that move this directory (higher up in the directory hierarchy of this
> > > directory) do not invalidate the cache? In that case there's no dnotify
> > > event either.
> > 
> > This is one of the reasons why I worry about user-space caching. It's
> > just damn hard to get right.
> 
> this particular problem could be solved by walking down to the root
> dentry for every sys_manage_dir_cache() lookup and check that each
> dentry is still cache-valid. This involves some overhead, but it's still
> faster than doing the same from userspace. (ie. validating each previous
> path component at lookup time.)

All that's required is that Samba has a dcache entry for each path
component.  In Samba's case, every path component from the share root
is case-insensitive, so that'll always be true.  For the path from the
filesystem root to the share root, Samba can either keep a dcache
entry for each of those components (just single entries; readdir isn't
required), or it can do fstat() on each request.  The former is
faster.

When you do an O_CLEAN operation, that'll check the clean bits of
every component during the kernel side path walk, so that validates
Samba's dcache for the whole path.

Samba doesn't need to call sys_mark_dir_clean(), or my preferred
dnotify equivalent, for each step in its userspace dcache walk.  It's
fine to just do the kernel O_CLEAN operation after verifying that
every path component is in Samba's dcache, without validating each
component.

That means Samba does the path walk in userspace, but with no system
calls, and then it calls the O_CLEAN operation.

(In other words, all those atomic_open, atomic_rename etc. operations
in my previous mail are fine, but they can be optimised much better).

Example of Samba code:

    atomic_open(name, flags, mode) {
        ci_name = soft_cache_lookup(name, found);
        if (found) {
            fd = clean_open (ci_name ? ci_name : name, flags, mode);
            if (fd != -ENOTCLEAN)
                return fd;
        }
        do {
            ci_name = hard_cache_lookup(name);
            if (ci_name && (flags & O_EXCL)) { return -EEXIST; }
            if (!ci_name && !(flags & O_CREAT)) { return -ENOENT; }
            fd = clean_open (ci_name ? ci_name : name, flags, mode);
        } while (fd == -ENOTCLEAN);
        return fd;
    }

Remember, if Samba's dcache has an entry, whether positive or
negative, one of these is true:

    - the dcache entry matches what the clean_*() operation will
      find in the kernel; or
    - the clean_*() operation will return -ENOTCLEAN

(If you use the slightly fancier method of a dcache that doesn't care
about deletions, using dnotify with DN_CREATE|DN_RENAME only (not
DN_DELETE), then -ENOENT can be returned instead).

> Since this doesn't change the dcache it ought to be doable via the
> rcu-read path and would thus still have pretty good SMP
> properties. [except when traversing mountpoints :-( ].

Mount changes need to count as changes anyway.  I'd like DN_MOUNT
added, if DN_MODIFY doesn't already get sent for mount changes.

> What if there are two instances of fileservers both using the
> same fileset and also trying to do caching this way?

I'm fairly sure the scheme described in my long mail (the one with the
atomic_open etc. explanation) works just fine with different
fileservers accessing the same tree.

> perhaps using a simple 64-bit generation counter would be better.

I think that isn't needed.

> Samba would get a new syscall to get the sum of each generation
> counter down to the root dentry - a total validation of the
> pathname.

You can't just sum per-directory counters along the path, because the
path may be rearranged by renaming directories, and different path
components could easily sum to the same generation value.

So that's going to have to be a careful strong hash of (a) the
generation counters of individual directories, (b) _plus_ the path
sequence e.g. as inode numbers, (c) _plus_ something to handle mount
changes.
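
To make the objection concrete, here is a small sketch comparing a plain
sum of generation counters with an order-sensitive hash over (inode
number, generation) pairs. The struct and function names are hypothetical.
FNV-1a is used only because it is short; it is not a strong hash, and part
(c), mount changes, is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One path component as seen by the validator (hypothetical layout). */
struct path_component {
    uint64_t ino;
    uint64_t generation;
};

/* The scheme being criticised: add up the per-directory counters. */
uint64_t path_generation_sum(const struct path_component *p, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += p[i].generation;
    return sum;
}

/* Feed one 64-bit value into a 64-bit FNV-1a state, low byte first. */
static uint64_t fnv1a_u64(uint64_t h, uint64_t v)
{
    int i;

    for (i = 0; i < 8; i++) {
        h ^= (v >> (8 * i)) & 0xff;
        h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Order-sensitive digest of the path: covers parts (a) and (b) only. */
uint64_t path_generation_hash(const struct path_component *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    size_t i;

    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++) {
        h = fnv1a_u64(h, p[i].ino);
        h = fnv1a_u64(h, p[i].generation);
    }
    return h;
}
```

Two paths whose directories carry generations {3, 5} and {5, 3} have the
same sum, as do two paths that differ only in which inode sits at one
position. The hash separates such cases unless it happens to collide,
which is why a careful, strong hash would be needed in practice.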

> If the counter matches with that in the userspace cache entry then
> no need to re-create the cache. Such generation counters would be
> usable for multiple file servers as well. Hm?

I don't think it is worth it, for Samba.  It's quite complicated to
get a number which detects all feasible changes, and I don't think it
offers Samba any efficiency gain over the single "clean bit".

(It's an interesting idea in general though).

> i believe Samba already has what is in essence a duplication of the
> dcache. We could enable it to be fairly coherent, for Samba to be able
> to have an authoritative 'does this file exist' answer without any
> excessive readdir()s.

I'm pretty sure it can too - and in a way that's useful for many
applications not just Samba.  (I've hardly touched the surface of
what's possible using the very simple O_CLEAN technique).

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  1:10                                                     ` Linus Torvalds
@ 2004-02-21  3:01                                                       ` Jamie Lokier
  0 siblings, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-21  3:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Tridge, Al Viro, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> That's what Ingo's O_CLEAN thing did. And if you do Ingo's O_CLEAN, then
> there's no point to notifiers in the first place - Ingo's algorithm works 
> regardless of them (it had other problems, but that's another issue and 
> just requires a bit of extending on the basic concept).
> 
> So why do you care about dnotify? It doesn't help at all once you have 
> O_CLEAN (or equivalent).

Please look at my pseudo-code carefully.  It uses dnotify to
test-and-set the bit; there isn't a "notify" event.

In other words, I'm making dnotify simpler by getting rid of the
signal, so it becomes exactly the same as Ingo's syscall:

        while (sys_mark_dir_clean(dirfd) == 0) {
            do_readdir(dirfd);
        }
        /* use results */

becomes:

        while (fcntl(dirfd, F_NOTIFY,
               DN_CREATE|DN_RENAME|DN_DELETE|DN_NOSIGNAL) != 0) {
            do_readdir(dirfd);
        }
        /* use results */

In my scheme, we still have O_CLEAN.  (Have I said that's a great idea
enough times yet?)

The reason I prefer to add DN_NOSIGNAL to dnotify instead of a new
syscall should be obvious: it's a simple change, equally fast, and
dnotify is a _lot_ more versatile.

For Samba, dnotify lets you be more selective for various cache types,
and poll() can do multiple tests in a single syscall - good for path
walk algorithms (although I've shown in another email how the tests
can be elided completely).

The combination of O_CLEAN with dnotify is useful for many other
applications.  I don't want to complicate this explanation by
describing them.  The dnotify change by itself is also good.

In short, it's a good thing, with no bad sides.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 18:48                                                   ` Ingo Molnar
  2004-02-21  1:44                                                     ` Jamie Lokier
@ 2004-02-21  7:58                                                     ` Ingo Molnar
  2004-02-21  8:04                                                       ` viro
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-21  7:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

* Ingo Molnar <mingo@elte.hu> wrote:

> perhaps using a simple 64-bit generation counter would be better.
> Samba would get a new syscall to get the sum of each generation
> counter down to the root dentry - a total validation of the pathname.
> If the counter matches with that in the userspace cache entry then no
> need to re-create the cache. Such generation counters would be usable
> for multiple file servers as well. Hm?

generation counters are problematic if they are not persistent. But
there's a pretty natural persistent 'generation counter' which could be
used for Samba's purpose: the mtime of the directory. The problem right
now is that it doesnt have enough resolution to be a true unique
generation counter. But having high-resolution mtime is a goal anyway.

XFS is one filesystem that has high-resolution mtime:

 typedef struct xfs_timestamp {
         __int32_t       t_sec;          /* timestamp seconds */
         __int32_t       t_nsec;         /* timestamp nanoseconds */
 } xfs_timestamp_t;

monotonicity is important: two successive directory operations must not
be possible within the same nanosecond. This is not possible with
current hardware - but how about future hardware? Can we make an
assumption like this?

hardware that has no high-resolution clock can be supported too: by
forcing mtime to be monotonic: if current time <= last_mtime, then
last_mtime++.
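
a minimal sketch of that rule, assuming a userspace struct timespec
stands in for the in-kernel timestamp; next_mtime() and ts_cmp() are
hypothetical names, not existing kernel functions:

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Return <0, 0, >0 as a is earlier than, equal to, or later than b. */
static int ts_cmp(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec ? -1 : 1;
    if (a->tv_nsec != b->tv_nsec)
        return a->tv_nsec < b->tv_nsec ? -1 : 1;
    return 0;
}

/* Pick the next mtime so that it is strictly later than the last one. */
struct timespec next_mtime(struct timespec last, struct timespec now)
{
    assert(last.tv_nsec >= 0 && last.tv_nsec < NSEC_PER_SEC);

    if (ts_cmp(&now, &last) > 0)
        return now;              /* the clock moved on: use it as-is */

    last.tv_nsec++;              /* clock too coarse: force an increment */
    if (last.tv_nsec >= NSEC_PER_SEC) {
        last.tv_nsec = 0;        /* carry into the seconds field */
        last.tv_sec++;
    }
    return last;
}
```

the forced increment makes the stored mtime run slightly ahead of the
real clock until the clock catches up.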

so there's only one new syscall necessary for Samba to validate its name
cache:

	sys_get_path_timestamp(char *path, struct timeval *tv);

this returns the _largest_ (youngest) timestamp out of the dentry chain
down to the root dentry. This is in essence the 'age' of the whole path,
with all components taken into account. If any directory along the path
is renamed, the age changes automatically.
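
the 'age' computation can be sketched in plain C as a maximum over the
mtimes of the chain. This is only an illustration: path_age() is a
hypothetical helper, the chain is passed in as an array instead of
being walked through dentries, and struct timespec is used in place of
the struct timeval in the prototype so nanoseconds are kept:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Return 1 if a is strictly later than b, otherwise 0. */
static int ts_later(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec > b->tv_sec;
    return a->tv_nsec > b->tv_nsec;
}

/*
 * chain[0] is the mtime of the root directory, chain[n - 1] that of
 * the last path component.  Returns the youngest timestamp found.
 */
struct timespec path_age(const struct timespec *chain, size_t n)
{
    struct timespec youngest;

    assert(chain != NULL && n >= 1);
    youngest = chain[0];
    for (size_t i = 1; i < n; i++) {
        if (ts_later(&chain[i], &youngest))
            youngest = chain[i];
    }
    return youngest;
}
```

a cached entry would stay valid as long as the value returned for its
path is not later than the value stored with the entry.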

filesystems that dont have 64-bit, monotonic timestamps will return
-ENOSYS. This should include even XFS at the moment, because the
timestamp is not guaranteed to be monotonic.

if any path component down the tree doesnt support monotonic timestamps,
then -ENOSYS is returned. (In mixed-type filesystem installations
chroot() can be used to limit Samba's root to a monotonic timestamp
capable filesystem.)

there's at least one problem with this approach:

- the 'age' of a path changes more often than what Samba's caching needs 
  are: e.g. it changes if any directory within the path is written to.

but this is not a big problem i believe - the fastpath is preserved and
a mechanism is presented to validate the cache with a single syscall. 

any other problems with this concept?

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  7:58                                                     ` Ingo Molnar
@ 2004-02-21  8:04                                                       ` viro
  2004-02-21 17:46                                                         ` Ingo Molnar
  2004-02-21 18:15                                                         ` Linus Torvalds
  2004-02-21  8:26                                                       ` Keith Owens
  2004-02-23 10:59                                                       ` Pavel Machek
  2 siblings, 2 replies; 123+ messages in thread
From: viro @ 2004-02-21  8:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

On Sat, Feb 21, 2004 at 08:58:53AM +0100, Ingo Molnar wrote:
> filesystems that dont have 64-bit, monotonic timestamps will return
> -ENOSYS. This should include even XFS at the moment, because the
> timestamp is not guaranteed to be monotonic.
 
> any other problems with this concept?

If we are demanding specific filesystems, we could simply say "use
JFS in case-insensitive mode" and be done with that.  Which deals
with all problems, since fs code will guarantee uniqueness, etc.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  7:58                                                     ` Ingo Molnar
  2004-02-21  8:04                                                       ` viro
@ 2004-02-21  8:26                                                       ` Keith Owens
  2004-02-23 10:59                                                       ` Pavel Machek
  2 siblings, 0 replies; 123+ messages in thread
From: Keith Owens @ 2004-02-21  8:26 UTC (permalink / raw)
  To: Kernel Mailing List

On Sat, 21 Feb 2004 08:58:53 +0100, 
Ingo Molnar <mingo@elte.hu> wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> perhaps using a simple 64-bit generation counter would be better.
>> Samba would get a new syscall to get the sum of each generation
>> counter down to the root dentry - a total validation of the pathname.
>> If the counter matches with that in the userspace cache entry then no
>> need to re-create the cache. Such generation counters would be usable
>> for multiple file servers as well. Hm?
>
>generation counters are problematic if they are not persistent. But
>there's a pretty natural persistent 'generation counter' which could be
>used for Samba's purpose: the mtime of the directory ...
>... monotonicity is important: two successive directory operations must
>not be possible within the same nanosecond.

Why do you need monotonicity?  Samba only cares if the dcache entry
changes, the indicator from kernel to user space does not have to be
monotonically increasing, just different.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  8:04                                                       ` viro
@ 2004-02-21 17:46                                                         ` Ingo Molnar
  2004-02-21 18:15                                                         ` Linus Torvalds
  1 sibling, 0 replies; 123+ messages in thread
From: Ingo Molnar @ 2004-02-21 17:46 UTC (permalink / raw)
  To: viro
  Cc: Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List


* viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk> wrote:

> On Sat, Feb 21, 2004 at 08:58:53AM +0100, Ingo Molnar wrote:
> > filesystems that dont have 64-bit, monotonic timestamps will return
> > -ENOSYS. This should include even XFS at the moment, because the
> > timestamp is not guaranteed to be monotonic.
>  
> > any other problems with this concept?
> 
> If we are demanding specific filesystems, we could simply say "use JFS
> in case-insensitive mode" and be done with that.  Which deals with all
> problems, since fs code will guarantee uniqueness, etc.

what i propose is a pretty generic feature that we need anyway (current
32-bit, 1-sec granular mtime in most filesystems is already problematic
for things like make dependencies), while "use JFS in case-insensitive
mode" is to degrade a filesystem to a non-POSIX mode. I dont think the
two approaches are equivalent. Having good, monotonic, fine-grained
timestamps is a thing of the future - case-insensitive lowlevel
filesystems are a thing of the past.

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  8:04                                                       ` viro
  2004-02-21 17:46                                                         ` Ingo Molnar
@ 2004-02-21 18:15                                                         ` Linus Torvalds
  1 sibling, 0 replies; 123+ messages in thread
From: Linus Torvalds @ 2004-02-21 18:15 UTC (permalink / raw)
  To: viro
  Cc: Ingo Molnar, Tridge, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

On Sat, 21 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> If we are demanding specific filesystems, we could simply say "use
> JFS in case-insensitive mode" and be done with that.  Which deals
> with all problems, since fs code will guarantee uniqueness, etc.

Don't be silly. You can't use JFS in case-insensitive mode and do anything 
sane.

That will terminally confuse a lot of UNIX applications, including NFS
serving.  Which makes the whole thing completely useless _except_ as a
pure Windows-compatible partition.

If you are going to limit a partition to _only_ doing Samba serving, then 
you have no problems _anyway_, since then samba can do all locking and all 
name translation totally on its own.

In short, a case-insensitive filesystem is fundamentally uninteresting. It 
buys _nothing_ that samba can't do already, since it only means that you 
can't really do anything else on it.

		Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-20 20:38                                             ` Christer Weinigel
@ 2004-02-22 15:07                                               ` Jamie Lokier
  2004-02-22 16:55                                                 ` Miquel van Smoorenburg
  0 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-22 15:07 UTC (permalink / raw)
  To: Christer Weinigel
  Cc: Ingo Molnar, Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Christer Weinigel wrote:
> > 	long sys_mark_dir_clean(dirfd);
> > 
> > the syscall returns whether the directory was valid/clean already.
> 
> Isn't this rather bad, it's only possible to have one process that
> does this magic clean bit thing.  Other applications such as Wine or
> a DOS emulator might want to get the same speedups.

No.  The magic clean bit is associated with dirfd - different file
descriptors have separate magic clean bits.
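
A toy model of that, as a sketch only: struct dir_model, the slot
index standing in for a file descriptor, and both function names are
made up for illustration and are not the proposed kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WATCHERS 8

/* One directory; every open descriptor on it owns a separate bit. */
struct dir_model {
    bool clean[MAX_WATCHERS];  /* indexed by a hypothetical fd slot */
};

/* Models sys_mark_dir_clean(): set the bit, report its old value. */
bool mark_dir_clean(struct dir_model *dir, int slot)
{
    bool was_clean;

    assert(slot >= 0 && slot < MAX_WATCHERS);
    was_clean = dir->clean[slot];
    dir->clean[slot] = true;
    return was_clean;
}

/* Models a create/rename/delete: every descriptor's bit is cleared. */
void dir_modified(struct dir_model *dir)
{
    for (int i = 0; i < MAX_WATCHERS; i++)
        dir->clean[i] = false;
}
```

Two processes (say Samba and Wine) holding different descriptors on
the same directory each see their own bit go dirty after a change, and
each can re-mark it clean without disturbing the other.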

> Add a new create syscall with the same idea as your one bit syscall,
> which checks that the generation number matches.  If the generation
> number doesn't match the create call fails.
> 
>     int create_synchronized(name, mode, generation);

Hmm.  That's an interesting idea.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-22 15:07                                               ` Jamie Lokier
@ 2004-02-22 16:55                                                 ` Miquel van Smoorenburg
  0 siblings, 0 replies; 123+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-22 16:55 UTC (permalink / raw)
  To: linux-kernel

In article <20040222150753.GB25664@mail.shareable.org>,
Jamie Lokier  <jamie@shareable.org> wrote:
>Christer Weinigel wrote:
>> > 	long sys_mark_dir_clean(dirfd);
>> > 
>> > the syscall returns whether the directory was valid/clean already.
>> 
>> Isn't this rather bad, it's only possible to have one process that
>> does this magic clean bit thing.  Other applications such as Wine or
>> a DOS emulator might want to get the same speedups.
>
>No.  The magic clean bit is associated with dirfd - different file
>descriptors have separate magic clean bits.
>
>> Add a new create syscall with the same idea as your one bit syscall,
>> which checks that the generation number matches.  If the generation
>> number doesn't match the create call fails.
>> 
>>     int create_synchronized(name, mode, generation);
>
>Hmm.  That's an interesting idea.

Generalize it. sys_set_required_generation(generation) - works
with all create/rename/delete/link calls. Setting it to zero
turns it off.
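
A small model of how that could behave, as a sketch under several
assumptions: the structures, checked_create(), and the choice of
-ESTALE as the failure code are all hypothetical, and directory
generations are assumed to start above zero so that zero can mean
"off".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct dir_state {
    uint64_t generation;           /* bumped on every change */
};

struct task_state {
    uint64_t required_generation;  /* 0 means no check is done */
};

/* Models sys_set_required_generation(): remember the expected value. */
void set_required_generation(struct task_state *task, uint64_t generation)
{
    assert(task != NULL);
    task->required_generation = generation;
}

/* Models a create: refuse if the directory changed since it was read. */
int checked_create(struct dir_state *dir, const struct task_state *task)
{
    if (task->required_generation != 0 &&
        task->required_generation != dir->generation)
        return -ESTALE;            /* cache is stale, caller must re-read */

    dir->generation++;             /* the create itself is a change */
    return 0;
}
```

The same check would sit in front of rename, delete and link.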

Mike.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Eureka! (was Re: UTF-8 and case-insensitivity)
  2004-02-20  0:46                                                         ` Jamie Lokier
@ 2004-02-23 10:13                                                           ` Tim Connors
  0 siblings, 0 replies; 123+ messages in thread
From: Tim Connors @ 2004-02-23 10:13 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> said on Fri, 20 Feb 2004 00:46:05 +0000:
> One thing you can't do is real-time updatedb+locate, because of the
> need to have an open file descriptor for every directory that's monitored.

That would be so sweet (I hate that 2 hour long slocate run every
morning). It'd also help those of us who like our hd's to spin down,
but get confused by the zillions of lines output by laptop-mode (with
most of the changed "files" really coming from kjournald, etc) :)



-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
You see, wire telegraph is a kind of a very, very long cat. You pull
his tail in New York and his head is meowing in Los Angeles.  Do you
understand this?  And radio operates exactly the same way:  you send
signals here,  they receive them there.  The only difference is that
there is no cat.   -- Albie E. on radios. 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-21  7:58                                                     ` Ingo Molnar
  2004-02-21  8:04                                                       ` viro
  2004-02-21  8:26                                                       ` Keith Owens
@ 2004-02-23 10:59                                                       ` Pavel Machek
  2004-02-23 13:55                                                         ` Jamie Lokier
  2 siblings, 1 reply; 123+ messages in thread
From: Pavel Machek @ 2004-02-23 10:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tridge, Al Viro, Jamie Lokier, H. Peter Anvin,
	Kernel Mailing List

Hi!

> generation counters are problematic if they are not persistent. But
> there's a pretty natural persistent 'generation counter' which could be
> used for Samba's purpose: the mtime of the directory. The problem right
> now is that it doesnt have enough resolution to be a true unique
> generation counter. But having high-resolution mtime is a goal anyway.
> 
> XFS is one filesystem that has high-resolution mtime:
> 
>  typedef struct xfs_timestamp {
>          __int32_t       t_sec;          /* timestamp seconds */
>          __int32_t       t_nsec;         /* timestamp nanoseconds */
>  } xfs_timestamp_t;
> 
> monotonicity is important: two successive directory operations must not
> be possible within the same nanosecond. This is not possible with
> current hardware - but how about future hardware? Can we make an
> assumption like this?

Wouldn't an ndelay(1) in the samba code, when samba notices the mtime
is too young, fix that?
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-23 10:59                                                       ` Pavel Machek
@ 2004-02-23 13:55                                                         ` Jamie Lokier
  2004-02-23 16:45                                                           ` Ingo Molnar
  0 siblings, 1 reply; 123+ messages in thread
From: Jamie Lokier @ 2004-02-23 13:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Pavel Machek wrote:
> > monotonicity is important: two successive directory operations must not
> > be possible within the same nanosecond. This is not possible with
> > current hardware - but how about future hardware? Can we make an
> > assumption like this?
> 
> Wouldn't an ndelay(1) in the samba code, when samba notices the mtime
> is too young, fix that?

No, because Samba cannot tell.  To Samba, it looks like the directory
hasn't changed, when it has.

Another issue is that most machines don't have nanosecond resolution
clocks (e.g. m68k is limited to timer interrupt resolution, and some
x86 machines cannot use the cycle counter), and most filesystems don't
store them either.

The right place to put the delay is in the kernel, when mtime is set
or when it is read, or both.

Important: you *don't* have to delay unless someone has _read_ the
mtime since the last time it was set, _and_ if the mtime wouldn't be
changed by the current operation, _and_ if you cannot simply increment
the low-order bits of mtime due to known limits on the system clock
resolution.

So maintain a flag per inode that says "this inode's mtime has been
read".  It is set whenever the mtime is read, or whenever the inode is
written to disk if you are implementing persistence (because you never
know if someone reads it from the disk at another time).  The flag
does not have to be stored - this works with all filesystems.
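
The decision can be sketched as below.  This is a model, not kernel
code: struct inode_model, the enum, and both functions are
hypothetical, mtime and the clock are counted in units of the
filesystem's stored resolution, and spare_low_bits says whether the
stored resolution is finer than the clock's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mtime_action {
    MTIME_KEEP,        /* nobody saw the old value: leave it alone */
    MTIME_FROM_CLOCK,  /* the clock advanced: store the new time */
    MTIME_BUMP,        /* increment the low-order bits instead */
    MTIME_DELAY        /* caller must wait for the clock, then retry */
};

struct inode_model {
    uint64_t mtime;       /* in units of the stored resolution */
    bool mtime_was_read;  /* in-memory only, never written to disk */
};

/* Any read of mtime (or writing the inode to disk) sets the flag. */
uint64_t read_mtime(struct inode_model *inode)
{
    assert(inode != NULL);
    inode->mtime_was_read = true;
    return inode->mtime;
}

/* Called for each modification; now is the current clock reading. */
enum mtime_action touch_mtime(struct inode_model *inode, uint64_t now,
                              bool spare_low_bits)
{
    assert(inode != NULL);
    if (now > inode->mtime) {
        inode->mtime = now;
        inode->mtime_was_read = false;
        return MTIME_FROM_CLOCK;
    }
    if (!inode->mtime_was_read)
        return MTIME_KEEP;
    if (spare_low_bits) {
        inode->mtime++;
        inode->mtime_was_read = false;
        return MTIME_BUMP;
    }
    return MTIME_DELAY;
}
```

Only the last case costs anything, and it needs all three conditions
from the paragraph above to hold at once.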

Also, you don't have to put the delay where the mtime is updated.  You
can also put it where the mtime is _read_, and then only when the flag
is not set.  Or you can balance it equally between both operations, so
that neither operation can be a DOS for the other.

The delaying strategy works very nicely for filesystems that store low
resolution clocks (i.e. nearly all of them), because even though the
delay is longer (e.g. sleep(1) for ext2, sleep(2) for FAT), that flag
is rarely alternated so you hardly ever need the delay - and mtimes
are still observed to be strictly monotonic.

(You can further eliminate the need for delays by assigning some bits
as a sub-generation counter instead of a timestamp.  This is
equivalent to pretending the system clock has lower resolution than
the filesystem can store. In fact, taking bits _away_ from mtime and
using them as a sub-generation counter instead provides better
performance with the guarantee of monotonicity.)
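
A sketch of the sub-generation idea, with assumed names and sizes
throughout: SUBGEN_BITS, struct stamp_state and next_stamp() are
hypothetical, and 10 bits is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SUBGEN_BITS 10
#define SUBGEN_MAX  ((1u << SUBGEN_BITS) - 1)

struct stamp_state {
    uint64_t coarse_time;  /* clock reading with the low bits removed */
    uint32_t subgen;       /* changes seen within this clock tick */
};

/*
 * Produce a stamp strictly greater than the previous one.  Returns
 * false when the counter is exhausted within one tick, which is the
 * only case where a delay would still be needed.
 */
bool next_stamp(struct stamp_state *s, uint64_t coarse_now, uint64_t *stamp)
{
    assert(s != NULL && stamp != NULL);

    if (coarse_now > s->coarse_time) {
        s->coarse_time = coarse_now;  /* new tick: restart the counter */
        s->subgen = 0;
    } else if (s->subgen < SUBGEN_MAX) {
        s->subgen++;                  /* same tick: count instead */
    } else {
        return false;
    }
    *stamp = (s->coarse_time << SUBGEN_BITS) | s->subgen;
    return true;
}
```

With 10 bits given up, up to 1024 changes per coarse tick get distinct,
ordered stamps before any delay is needed.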

It's a generally useful feature, but I'm not sure why we're looking at
this for Samba, which needs the O_CLEAN mechanism more than it needs
change-detection - for this, Samba can already use the existing
dnotify even though the interface is a bit cumbersome, whereas O_CLEAN
or its equivalent isn't available yet.

-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-23 13:55                                                         ` Jamie Lokier
@ 2004-02-23 16:45                                                           ` Ingo Molnar
  2004-02-23 17:32                                                             ` Jamie Lokier
  0 siblings, 1 reply; 123+ messages in thread
From: Ingo Molnar @ 2004-02-23 16:45 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Pavel Machek, Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List


* Jamie Lokier <jamie@shareable.org> wrote:

> Another issue is that most machines don't have nanosecond resolution
> clocks (e.g. m68k is limited to timer interrupt resolution, and some
> x86 machines cannot use the cycle counter), [...]

this is not an issue, with the monotonicity solution i suggested: if
prev_time == curr_time then curr_time.tv_nsec++.

> [...] and most filesystems don't store them either.

their problem. There's at least one filesystem that does it right (XFS),
the rest will be handled via natural selection.

> The right place to put the delay is in the kernel, when mtime is set
> or when it is read, or both.

no need to delay anything - just do the tv_nsec++ thing!

	Ingo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN
  2004-02-23 16:45                                                           ` Ingo Molnar
@ 2004-02-23 17:32                                                             ` Jamie Lokier
  0 siblings, 0 replies; 123+ messages in thread
From: Jamie Lokier @ 2004-02-23 17:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Linus Torvalds, Tridge, Al Viro, H. Peter Anvin,
	Kernel Mailing List

Ingo Molnar wrote:
> > Another issue is that most machines don't have nanosecond resolution
> > clocks (e.g. m68k is limited to timer interrupt resolution, and some
> > x86 machines cannot use the cycle counter), [...]
> 
> this is not an issue, with the monotonicity solution i suggested: if
> prev_time == curr_time then curr_time.tv_nsec++.

So did you read this paragraph?:

   Important: you *don't* have to delay unless someone has _read_ the
   mtime since the last time it was set, _and_ if the mtime wouldn't be
   changed by the current operation, _and_ if you cannot simply increment
   the low-order bits of mtime due to known limits on the system clock
   resolution.

According to that logic, you are correct that the delay isn't
required, and tv_nsec++ works, except on very fast machines which
don't exist yet, _or_ on highly concurrent machines (maybe the 256-way
NUMA boxes?)

When it's called a "generation counter" or "uniqueness stamp", there
is no problem just changing the number.

When it's called a "timestamp", programs will also compare the
timestamps of _different_ files to see which order they were written.
("make" is the simplest example.)  Therefore it's important to get
this order logically correct.

> > The right place to put the delay is in the kernel, when mtime is set
> > or when it is read, or both.
> 
> no need to delay anything - just do the tv_nsec++ thing!

As you yourself pointed out, that doesn't work on machines in 5 years'
time:  Think 40GHz machines and dynamic translation which optimises
fast paths to a few highly concurrent cycles.  Then it's easy to
imagine tv_nsec incrementing multiple times within a nanosecond.

Natural selection isn't the problem: programs assuming it is logically
dependable and returning erroneous results is the problem.

There are _lots_ of programs that could use a reliable indicator of
whether a file has changed.  tv_nsec when it isn't stored isn't one of
them, so it would be a shame to have a half-engineered solution that
complex programs like, say, a Java generated-code-precacher must not
use because of rare logical failures.

I've offered a solution that offers correct results for _all_
filesystems and _all_ machines, and it will behave exactly like your
code if you run it with XFS or a similar filesystem, it's efficient
(the delay is minimal and only done when needed), and logically
dependable.  So please consider it.

Thanks :)
-- Jamie

^ permalink raw reply	[flat|nested] 123+ messages in thread

end of thread, other threads:[~2004-02-23 17:32 UTC | newest]

Thread overview: 123+ messages
2004-02-17  4:12 UTF-8 and case-insensitivity tridge
2004-02-17  5:11 ` Linus Torvalds
2004-02-17  6:54   ` tridge
2004-02-17  8:33     ` Neil Brown
2004-02-17 22:48       ` tridge
2004-02-18  0:06         ` Neil Brown
2004-02-18  9:47           ` Helge Hafting
2004-02-17 15:13     ` Linus Torvalds
2004-02-17 16:57       ` Linus Torvalds
2004-02-17 19:44         ` viro
2004-02-17 20:10           ` Linus Torvalds
2004-02-17 20:17             ` viro
2004-02-17 20:23               ` Linus Torvalds
2004-02-17 21:08         ` Robin Rosenberg
2004-02-17 21:17           ` Linus Torvalds
2004-02-17 22:27             ` Robin Rosenberg
2004-02-18  3:02               ` tridge
2004-02-17 23:57         ` tridge
2004-02-17 23:20       ` tridge
2004-02-17 23:43         ` Linus Torvalds
2004-02-18  3:26           ` tridge
2004-02-18  5:33             ` H. Peter Anvin
2004-02-18  7:54             ` Marc Lehmann
2004-02-18  2:37         ` H. Peter Anvin
2004-02-18  3:03           ` Linus Torvalds
2004-02-18  3:14             ` H. Peter Anvin
2004-02-18  3:27               ` Linus Torvalds
2004-02-18 21:31                 ` tridge
2004-02-18 22:23                   ` Linus Torvalds
2004-02-18 22:28                     ` Linus Torvalds
2004-02-18 22:50                       ` tridge
2004-02-18 22:59                         ` Linus Torvalds
2004-02-18 23:09                           ` tridge
2004-02-18 23:16                             ` Linus Torvalds
2004-02-19  8:10                               ` Jamie Lokier
2004-02-19 16:09                                 ` Linus Torvalds
2004-02-19 16:38                                   ` Jamie Lokier
2004-02-19 16:54                                     ` Linus Torvalds
2004-02-19 18:29                                       ` Jamie Lokier
2004-02-19 19:48                                         ` Eureka! (was Re: UTF-8 and case-insensitivity) Linus Torvalds
2004-02-19 19:51                                           ` Linus Torvalds
2004-02-19 19:48                                             ` H. Peter Anvin
2004-02-19 20:04                                               ` Linus Torvalds
2004-02-19 20:05                                           ` viro
2004-02-19 20:23                                             ` Linus Torvalds
2004-02-19 20:32                                               ` Linus Torvalds
2004-02-19 20:45                                                 ` viro
2004-02-19 21:26                                                   ` Linus Torvalds
2004-02-19 21:38                                                     ` Linus Torvalds
2004-02-19 21:45                                                     ` Linus Torvalds
2004-02-19 21:43                                                       ` viro
2004-02-19 21:53                                                         ` Linus Torvalds
2004-02-19 22:21                                                           ` David Lang
2004-02-19 20:48                                                 ` Jamie Lokier
2004-02-19 21:30                                                   ` Linus Torvalds
2004-02-20  0:00                                                     ` Jamie Lokier
2004-02-20  0:17                                                       ` Linus Torvalds
2004-02-20  0:24                                                         ` Linus Torvalds
2004-02-20  0:30                                                           ` Trond Myklebust
2004-02-20  0:54                                                           ` Jamie Lokier
2004-02-20  0:57                                                           ` tridge
2004-02-20  1:07                                                           ` Paul Wagland
2004-02-20 13:31                                                           ` Chris Wedgwood
2004-02-20  0:46                                                         ` Jamie Lokier
2004-02-23 10:13                                                           ` Tim Connors
2004-02-20  1:39                                                     ` Junio C Hamano
2004-02-20 12:54                                                       ` Jamie Lokier
2004-02-19 23:37                                           ` tridge
2004-02-20  0:02                                             ` Linus Torvalds
2004-02-20  0:16                                               ` tridge
2004-02-20  0:37                                                 ` Linus Torvalds
2004-02-20  1:26                                                   ` tridge
2004-02-20  1:07                                               ` H. Peter Anvin
2004-02-20  2:30                                           ` Theodore Ts'o
2004-02-20 12:04                                           ` explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Ingo Molnar
2004-02-20 13:19                                             ` Jamie Lokier
2004-02-20 13:37                                               ` Ingo Molnar
2004-02-20 14:00                                                 ` Ingo Molnar
2004-02-20 16:31                                                 ` Jamie Lokier
2004-02-20 13:23                                             ` [patch] " Ingo Molnar
2004-02-20 18:00                                               ` viro
2004-02-20 15:41                                             ` Linus Torvalds
2004-02-20 17:04                                               ` Ingo Molnar
2004-02-20 17:19                                                 ` Linus Torvalds
2004-02-20 18:48                                                   ` Ingo Molnar
2004-02-21  1:44                                                     ` Jamie Lokier
2004-02-21  7:58                                                     ` Ingo Molnar
2004-02-21  8:04                                                       ` viro
2004-02-21 17:46                                                         ` Ingo Molnar
2004-02-21 18:15                                                         ` Linus Torvalds
2004-02-21  8:26                                                       ` Keith Owens
2004-02-23 10:59                                                       ` Pavel Machek
2004-02-23 13:55                                                         ` Jamie Lokier
2004-02-23 16:45                                                           ` Ingo Molnar
2004-02-23 17:32                                                             ` Jamie Lokier
2004-02-20 23:00                                                   ` tridge
2004-02-20 17:33                                               ` Jamie Lokier
2004-02-20 18:22                                                 ` Linus Torvalds
2004-02-21  0:38                                                   ` Jamie Lokier
2004-02-21  1:10                                                     ` Linus Torvalds
2004-02-21  3:01                                                       ` Jamie Lokier
2004-02-20 17:47                                               ` Jamie Lokier
2004-02-20 20:38                                             ` Christer Weinigel
2004-02-22 15:07                                               ` Jamie Lokier
2004-02-22 16:55                                                 ` Miquel van Smoorenburg
2004-02-19 19:08                                       ` UTF-8 and case-insensitivity Helge Hafting
2004-02-18  4:08           ` tridge
2004-02-18 10:05             ` Robin Rosenberg
2004-02-18 11:43               ` tridge
2004-02-18 12:31                 ` Robin Rosenberg
2004-02-18 16:48                   ` H. Peter Anvin
2004-02-18 20:00                     ` H. Peter Anvin
2004-02-19  2:53   ` Daniel Newby
2004-02-17  5:25 ` Tim Connors
2004-02-17  7:43 ` H. Peter Anvin
2004-02-17  8:05   ` H. Peter Anvin
2004-02-17 14:25 ` Dave Kleikamp
2004-02-18  0:16 ` Robert White
2004-02-18  0:20   ` Linus Torvalds
2004-02-18  1:03     ` Robert White
2004-02-18 21:48     ` Ville Herva
2004-02-18  2:48   ` tridge
2004-02-18 20:56     ` Robert White
