* Directory updates in filesystems
@ 2004-10-28  9:23 Simon Waters
  2004-10-29  5:26 ` David Masover
  2004-10-29 21:15 ` Hans Reiser
  0 siblings, 2 replies; 6+ messages in thread
From: Simon Waters @ 2004-10-28  9:23 UTC
  To: reiserfs-list

There is discussion on maildir happening in the dovecot mailing list.

My question is mostly curiosity-driven, as I thought I understood "enough"
about filesystems to answer such questions, and it is not reiserfs-specific
(but I know it is dear to your hearts).

The discussion focuses on the use of renaming files to add attributes to the 
end of the filename.

If you have multiple processes simultaneously accessing a directory with
rename of contents (additions and removals add more fun), a simple
opendir/readdir loop will on occasion fail to find a file that exists (i.e.
the stem part of the file name is the same) because the file has been
renamed, on most filesystems on most *nix-like systems tested.

This seems to be because readdir without locks is not atomic on most
filesystems: it reads a set number of directory entries, then reads
further entries at a later stage.

AIUI BSD FFS is supposed to try to ensure that new records are added to the
end of the list the pointer points to, so that at worst a file is seen twice,
but this doesn't seem to completely address the problem when testing the most
general case.

Are there any file systems that fully address this issue, or POSIX calls that
are guaranteed to make readdir atomic without specific locking, or must a
lock be obtained on the directory to ensure that the read is consistent? I
think that locking is needed in the application if complete consistency is
required, because the underlying behaviour of the OSes/filesystems is so
variable in this regard, but I'd be interested in understanding what
characteristics a filesystem would have to have to avoid this.
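
As a rough illustration of the application-level locking mentioned above,
here is a minimal sketch in Python. It assumes a Unix-like system where
flock() works on a directory file descriptor, and it only helps if every
process touching the directory cooperates, since the lock is advisory. The
names directory_lock, consistent_listing and locked_rename are made up for
this example.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def directory_lock(directory, exclusive):
    """Hold an advisory flock() on the directory itself (hypothetical helper)."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def consistent_listing(directory):
    """Readers take a shared lock, so no cooperating writer renames mid-scan."""
    with directory_lock(directory, exclusive=False):
        return sorted(os.listdir(directory))


def locked_rename(directory, old, new):
    """Writers take an exclusive lock around each rename."""
    with directory_lock(directory, exclusive=True):
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Every scan now blocks every rename and vice versa, which is exactly the
performance cost on busy directories that makes this unattractive.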

I think a lock and full read will have significant performance implications,
since the problem only manifests itself on busy directories. In a journalled
metadata environment, though, all the application wants is a consistent read;
if we later stat the file and it is missing, we can look for renamed versions.
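
A minimal sketch of that "stat, then look for renamed versions" fallback, in
Python. It assumes the attributes are appended after a ":" separator (as in
maildir's ":2,FLAGS" suffix), so the stem is everything before the first
":". The names stem_of and resolve, and the retry count, are made up for
this example. A rescan can itself miss an entry for the reasons discussed
here, so the bounded retry only narrows the window; it does not close it.

```python
import os


def stem_of(name):
    """Part of the filename before the attribute suffix (assumed ':' separator)."""
    return name.split(":", 1)[0]


def resolve(directory, name, attempts=3):
    """Return the current name of the file, following renames that keep the stem."""
    try:
        os.stat(os.path.join(directory, name))
        return name  # still there under the name we saw
    except FileNotFoundError:
        pass
    wanted = stem_of(name)
    for _ in range(attempts):
        # Rescan the directory looking for any entry with the same stem.
        for candidate in os.listdir(directory):
            if stem_of(candidate) == wanted:
                return candidate
    return None  # deleted, or missed on every rescan
```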

Of course in a real filesystem you'd just store the attributes in something 
designed for storing custom attributes..... :)

* Re: Directory updates in filesystems
  2004-10-28  9:23 Directory updates in filesystems Simon Waters
@ 2004-10-29  5:26 ` David Masover
  2004-10-29 16:38   ` Valdis.Kletnieks
  2004-10-29 20:04   ` Hans Reiser
  2004-10-29 21:15 ` Hans Reiser
  1 sibling, 2 replies; 6+ messages in thread
From: David Masover @ 2004-10-29  5:26 UTC
  To: Simon Waters; +Cc: reiserfs-list

Simon Waters wrote:
[...]
| If you have multiple processes simultaneously accessing a directory with
| rename of contents (additions and removals add more fun), a simple
| opendir/readdir loop will on occasion fail to find a file that exists (i.e.
| the stem part of the file name is the same) because the file has been
| renamed, on most filesystems on most *nix like systems tested.

If this is about locking not working well with NFS, why not ensure that
the directory itself is owned by root and read-only before attempting?
Wait -- don't answer that...

| Are there any file systems that fully address this issue, or POSIX calls that
| guaranteed to make an atomic readdir, without specific locking, or must a
| lock be obtained on the directory to ensure that the read is consistent. I
| think that locking is needed in the application if complete consistency is
| required because the underlying behaviour of the OSes/filesystems is so
| variable in this regard, but I'd be interested in understanding what
| characteristics a filesystem would have to have to avoid this.

Maybe an atomic readdir operation?  Does reiser4 do atomic reads?

I know reiser4 has (or at least should have by 4.1) a sys_reiser4 API which
does atomic write operations.  That is:  application starts the
transaction, does a bunch of writes, ends the transaction.  If at any
point there is a failure, filesystem tells application to roll back.

<rant>

Actually, though I can see the design decision to roll back, I think
that under certain circumstances, the new transaction should be required
to be complete before being committed on top of the old state, and that
meanwhile, apps should be allowed to read the consistent, old state.
Like how vim, maildir, and many other things work now -- write to temp
file, rename temp file on top of original, only at the filesystem level
and for entire directories.
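
For reference, the single-file version of that pattern (write to a temp
file, rename it on top of the original) looks roughly like this in Python.
This is only a sketch of the existing per-file idiom, not of the
directory-level transactions proposed here; the name atomic_write and the
".tmp-" prefix are made up for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) for rename to work.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before the rename
        os.replace(tmp, path)  # rename over the original in one step
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A reader that opened the old file before the rename keeps reading the old,
consistent contents, which is the "slightly stale snapshot" behaviour.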

This allows read-only consumers, such as a web server, to operate on
the slightly stale "snapshots" this would create.  When faced with a
decision of:

- serving a slightly stale page immediately
- making users wait for a write of a newer version to complete
- serving a half-written newer version

I am sure most web admins would choose the first option, which is what
they would get if the pages were being updated with vim.  The difference
is that the filesystem solution works on larger units than single files.

This would eliminate the need for application-level rollback support,
though that might be a nice feature.  Also, applications aren't always
available -- maybe mysql or some such was in the middle of a transaction
when my box crashed and my boot sector died, and maybe mysql will never
be on any rescue disk, and maybe I want my rescue disk to be able to
roll back the transactions for me and give me a consistent filesystem.

Obviously this fails if the readdir is part of a larger,
do-something-to-all-files-in-directory operation.  In this case, the
filesystem needs to know the difference between read and read/write
transactions -- read-only access need not lock, read/write must lock,
even if the write may not be necessary (write to file a if and only if
file b exists.)  It also implies that applications need to specify what
to lock.

I haven't really done my homework, but I think that the transactions I
just described are what databases do, and that the ones already in the
reiser4 whitepaper are easier.  Specifically because of that "specify
what to lock" thing.

The nice thing about this is that it gives fully scalable snapshots,
assuming all programs support it.  For example, a traditional snapshot
of a system with lots and lots of random activity in temporary files
will use a lot more resources than necessary, especially if you only
wanted one directory.  This approach gives you something as simple as a
patched cp or rsync which can work on anything from a single file to a
whole filesystem.  I'd like to see some benchmarks, but I think
filesystems are getting to a point where they will be more efficient
than humans at organizing data, meaning no real performance reason for a
bunch of tiny little partitions.  It'd suck to have to have a separate /
partition, just so you can atomically back up /etc.

| I think a lock and full read will have significant performance
implications,
| since the problem only manifests itself on busy directories, but in a
| journalled metadata environment all it wants is a consistent read, if we
| later stat the file and it is missing we can look for renamed versions.

From what I can gather, it seems you'd be happy with the above system,
using read-only transactions in place of locks.

Before I completely run away with my ego, I have absolutely no idea how
difficult this would be to implement on top of reiser4, or anything
else, for that matter.  Also, even if the filesystem in question was
done today, it still might be a while before application support was
developed enough for it to work -- how can I atomically back up /etc
while a package manager is installing, if said package manager doesn't
know about those locking/transactions?

</rant>

Don't let my ranting discourage you from Reiser or Linux in general.
Most geeks know how to be concise...

* Re: Directory updates in filesystems
  2004-10-29  5:26 ` David Masover
@ 2004-10-29 16:38   ` Valdis.Kletnieks
  2004-10-29 20:04   ` Hans Reiser
  1 sibling, 0 replies; 6+ messages in thread
From: Valdis.Kletnieks @ 2004-10-29 16:38 UTC
  To: David Masover; +Cc: Simon Waters, reiserfs-list

On Fri, 29 Oct 2004 00:26:18 CDT, David Masover said:

> If this is about locking not working well with NFS, why not ensure that
> the directory itself is owned by root and read-only before attempting?
> Wait -- don't answer that...

No, this is a different problem.

Imagine a directory with 10K files called 0001, 0002, 0003, .... , 9999.
You start a 'readdir()' loop, and get to 5497 or so.  At this point,
another process removes 1260 through 1265, and then another process renames 8534 to
1263, putting it in the slot just vacated - and you reach the end of
the readdir() loop never seeing that file.
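
The scenario above can be replayed with a toy model in Python. This is only
a sketch: it assumes a directory is a flat array of slots, that a new entry
takes the first free slot, and that the scan is a cursor moving forward one
slot at a time. Real filesystems differ in all three respects. The function
name simulate_scan and the file_id bookkeeping are made up for the example.

```python
def simulate_scan(n_files=9999, pause_at=5497):
    """Scan a toy slot-array directory while renames happen behind the cursor."""
    # Each slot holds (file_id, name), or None when the slot is free.
    slots = [(i, f"{i:04d}") for i in range(1, n_files + 1)]
    seen_ids = set()
    for cursor in range(len(slots)):
        if cursor == pause_at:
            # Another process removes 1260 through 1265, freeing their slots.
            for i in range(1259, 1265):
                slots[i] = None
            # Yet another process renames 8534 to 1263: the entry leaves its
            # old slot and lands in the first free slot, behind the cursor.
            old = 8533
            file_id, _ = slots[old]
            slots[old] = None
            slots[slots.index(None)] = (file_id, "1263")
        entry = slots[cursor]
        if entry is not None:
            seen_ids.add(entry[0])
    # Files that exist at the end of the scan but were never returned by it.
    live = {e[0]: e[1] for e in slots if e is not None}
    return {fid: name for fid, name in live.items() if fid not in seen_ids}
```

In this model the scan finishes without ever returning the file that
started out as 8534, although it existed for the whole duration of the scan.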

> | Are there any file systems that fully address this issue, or POSIX calls that
> | guaranteed to make an atomic readdir, without specific locking, or must a
> | lock be obtained on the directory to ensure that the read is consistent. I
> | think that locking is needed in the application if complete consistency is
> | required because the underlying behaviour of the OSes/filesystems is so
> | variable in this regard, but I'd be interested in understanding what
> | characteristics a filesystem would have to have to avoid this.
> 
> Maybe an atomic readdir operation?  Does reiser4 do atomic reads?

Do you *REALLY* want to lock the *entire* dir (probably in memory, which
can hurt for directories with 10Ks or 100Ks of entries, which is where the
problem is most evident)?  Even if it's not locked in memory, the mere
locking against updates can be *painful* performance-wise.

> I know reiser4 (or at least should by 4.1) have a sys_reiser4 api which
> does atomic write operations.  That is:  application starts the
> transaction, does a bunch of writes, ends the transaction.  If at any
> point there is a failure, filesystem tells application to roll back.

Atomic operations don't help you here, unless you're willing to take a
locking performance hit.  Remember that rename() is *already* atomic (at least
from other processes' viewpoint), and you still have the "rename into a slot
you've passed" problem mentioned above...


> This alows read-only access, such as a web server, to operate on
> slightly stale "snapshots" as this would create.  When faced with a
> decision of:
> 
> - - serving a slightly stale page immediately
> - - making users wait for a write of a newer version to complete
> - - serving a half-written newer version
> 
> I am sure most web admins would choose the first option, which is what
> they would get if the pages were being updated with vim.  The difference
> is that the filesystem solution works on larger units than single files.

The problem is that if you're a mail server, you probably *don't* want to
be sending a slightly stale version of the mail that just got queued.  There,
the only realistic option is your "make users wait" - which may be intolerable
when you're trying to do millions of transactions an hour...

* Re: Directory updates in filesystems
  2004-10-29  5:26 ` David Masover
  2004-10-29 16:38   ` Valdis.Kletnieks
@ 2004-10-29 20:04   ` Hans Reiser
  1 sibling, 0 replies; 6+ messages in thread
From: Hans Reiser @ 2004-10-29 20:04 UTC
  To: David Masover; +Cc: Simon Waters, reiserfs-list

David Masover wrote:

> Simon Waters wrote:
> [...]
> | If you have multiple processes simultaneously accessing a directory with
> | rename of contents (additions and removals add more fun), a simple
> | opendir/readdir loop will on occasion fail to find a file that exists (i.e.
> | the stem part of the file name is the same) because the file has been
> | renamed, on most filesystems on most *nix like systems tested.
>
> If this is about locking not working well with NFS, why not ensure that
> the directory itself is owned by root and read-only before attempting?
> Wait -- don't answer that...
>
> | Are there any file systems that fully address this issue, or POSIX calls that
> | guaranteed to make an atomic readdir, without specific locking, or must a
> | lock be obtained on the directory to ensure that the read is consistent. I
> | think that locking is needed in the application if complete consistency is
> | required because the underlying behaviour of the OSes/filesystems is so
> | variable in this regard, but I'd be interested in understanding what
> | characteristics a filesystem would have to have to avoid this.
>
> Maybe an atomic readdir operation?  Does reiser4 do atomic reads?

Only writes at this time.  Will try to get the government to pay for us 
to do atomic reads someday.....;-)


* Re: Directory updates in filesystems
  2004-10-28  9:23 Directory updates in filesystems Simon Waters
  2004-10-29  5:26 ` David Masover
@ 2004-10-29 21:15 ` Hans Reiser
  2004-11-01 10:32   ` Simon Waters
  1 sibling, 1 reply; 6+ messages in thread
From: Hans Reiser @ 2004-10-29 21:15 UTC
  To: Simon Waters; +Cc: reiserfs-list

Simon Waters wrote:

>There is discussion on maildir happening in the dovecot mailing list.
>
>My question is mostly curiosity driven as I thought I understood "enough" 
>about filesystems to answer such questions, and not reiserfs specific (but I 
>know it is dear to your hearts).
>
>The discussion focuses on the use of renaming files to add attributes to the 
>end of the filename.
>
>If you have multiple processes simultaneously accessing a directory with 
>rename of contents (additions and removals add more fun), a simple 
>opendir/readdir loop will on occasion fail to find a file that exists (i.e. 
>the stem part of the file name is the same) because the file has been 
>renamed, on most filesystems on most *nix like systems tested.
>
>This seems to be a result of readdir without locks not being atomic on most 
>filesystems, but reading a set amount of directory entries then rereading 
>further at a later stage.
>
>AIUI BSD FFS is suppose to try and ensure that new records are added to the 
>end of list the pointer points to, so that at worst a file is seen twice, but 
>this doesn't seem to completely address the problem when testing the most 
>general case.
>
>Are there any file systems that fully address this issue
>
I think no.  It is quite fixable in a variety of ways.  If someone wants 
to fix it or have it fixed, let me know.

> or POSIX calls that 
>guaranteed to make an atomic readdir, without specific locking, or must a 
>lock be obtained on the directory to ensure that the read is consistent. I 
>think that locking is needed in the application if complete consistency is 
>required because the underlying behaviour of the OSes/filesystems is so 
>variable in this regard, but I'd be interested in understanding what 
>characteristics a filesystem would have to have to avoid this.
>
>I think a lock and full read will have significant performance implications, 
>since the problem only manifests itself on busy directories, but in a 
>journalled metadata environment all it wants is a consistent read, if we 
>later stat the file and it is missing we can look for renamed versions.
>
>Of course in a real filesystem you'd just store the attributes in something 
>designed for storing custom attributes..... :)


* Re: Directory updates in filesystems
  2004-10-29 21:15 ` Hans Reiser
@ 2004-11-01 10:32   ` Simon Waters
  0 siblings, 0 replies; 6+ messages in thread
From: Simon Waters @ 2004-11-01 10:32 UTC
  To: Hans Reiser; +Cc: reiserfs-list

On Friday 29 Oct 2004 10:15 pm, Hans Reiser wrote:
> Simon Waters wrote:
> 
> >Are there any file systems that fully address this issue
>
> I think no.  It is quite fixable in a variety of ways.  If someone wants
> to fix it or have it fixed, let me know.

I mostly wanted to make sure I'd understood the problem correctly.

I can see it being a desirable option for a filesystem, indeed perhaps it 
ought to be a default behaviour that can be switched off for performance, as 
the "behaviour of least surprise". 

However, I think it isn't immediately obvious that it is needed, as in many
cases people are using disparate file systems or NFS, and we have lived so
far without it. But then, if it did exist, people might see it as a compelling
reason to use the filesystem supplying it for specific purposes, much as DJB
recommends BSD and FFS without soft updates for aspects of qmail (if I
understood it correctly).
