From: Jamie Lokier <jamie@shareable.org>
To: viro@parcelfarce.linux.theplanet.co.uk
Cc: Linus Torvalds <torvalds@osdl.org>,
David Woodhouse <dwmw2@infradead.org>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Andrew Morton <akpm@osdl.org>, Carsten Otte <cotte@de.ibm.com>,
Carsten Otte <cotte@freenet.de>,
linux-fsdevel@vger.kernel.org, sct@redhat.com,
Dave Kleikamp <shaggy@austin.ibm.com>
Subject: Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
Date: Sat, 3 Apr 2004 01:39:06 +0100
Message-ID: <20040403003906.GG653@mail.shareable.org>
In-Reply-To: <20040402210813.GI31500@parcelfarce.linux.theplanet.co.uk>
viro@parcelfarce.linux.theplanet.co.uk wrote:
> > The trouble with that is the many programs which assume that if
> > st_nlink == 1, then the file has only one path. This is a critical
> > optimisation for any program which looks at a lot of files checking
> > for equivalent files, and is widely assumed. I've always thought it a
> > reliable basic unix assumption.
>
> Even aside of (in)accuracy of ->st_nlink, that assumption is obviously
> racy.
Yes it is. It's nevertheless a common assumption, and it used to be
accurate when that area of the filesystem was known not to be
changing. It's still accurate when that area is known to have no bind
mounts.
(Fwiw, I'll admit that I have 1 user who has a bind mount in their
home directory and doesn't know it...)
> > For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
> > You may argue that using "rsync -H" or "cp -a" on a tree that contains
> > bind mounts is broken by design.
>
> What would you expect from cp -a in that case? To create a link? That
> would be a bug, plain and simple...
In fact, cp -a, rsync -H and tar all do this: if the file has st_nlink
== 1, each mounted occurrence of the file is copied. If the file has
st_nlink != 1, all mounted occurrences are hard linked together in the
destination.
I'm surprised none of them implement the optimisation where, after
seeing st_nlink links, they can forget about that (dev, inode) pair.
Perhaps it's not reliable to assume that on other OSes either.
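A minimal sketch of the (dev, inode) bookkeeping described above, with
the "forget the pair after st_nlink names" optimisation added. This is
Python for illustration, not what cp, rsync or tar actually do
internally; plan_copy and its action tuples are made-up names.

```python
import os

def plan_copy(root):
    """Return (actions, pending) for copying the tree under root.

    actions is a list of ("copy", path) or ("link", path, first_path).
    pending maps (st_dev, st_ino) to [first_path, names_not_yet_seen]
    for files with names outside the tree (or hidden by a bind mount).
    """
    pending = {}
    actions = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_nlink == 1:
                # Assumed to have exactly one path: copy, remember nothing.
                actions.append(("copy", path))
                continue
            key = (st.st_dev, st.st_ino)
            if key not in pending:
                actions.append(("copy", path))
                pending[key] = [path, st.st_nlink - 1]
            else:
                entry = pending[key]
                actions.append(("link", path, entry[0]))
                entry[1] -= 1
            if pending[key][1] <= 0:
                # Every name counted by st_nlink has been seen: forget it.
                del pending[key]
    return actions, pending
```

Note the st_nlink == 1 branch is exactly the assumption under
discussion: a file reachable through a bind mount takes that branch
once per visible path and gets copied each time.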
> > Then there are programs such as optimised Make and caching systems
> > and servers which use dnotify. dnotify is reliable for single-linked
> > files, and when there are multiple links, it's still reliable if you
> > discover all the st_nlink paths to a file. However, it isn't reliable
> > if there are any bind mounts, because you don't know whether you have
> > all paths to a file (without grubbing inside /proc/mounts, and that
> > has race conditions anyway).
>
> dnotify is not reliable, period.
Everyone keeps saying how dnotify is unreliable, crap etc. despite the
multitude of good performance enhancing uses for it. The principle is
not _intrinsically_ crap even though the implementation could do with
much improvement.
> We do NOT generate any events on the
> parent of target; only on parent of new link.
I'm stunned. An attribute of the file is changed (nlink), by an
operation which follows the target path, and no DN_ATTRIB event is
generated. I consider that a dnotify implementation disaster.
> So "discover all the st_nlink paths" is both racy and unreliable -
> you won't get notified when links are created.
I agree, but you _should_ be notified. Not least because without this
dnotify is useless, and with it dnotify is reliable (provided you only
use it to watch single-linked files that aren't mount points).
> > The obvious strategy for such programs is to ignore bind mounts and
> > give incorrect results if there are any. However it would be much
> > better if they could detect when a file has multiple paths that aren't
> > mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.
>
> How would they detect when such paths appear in the future? For that matter,
> who said that you can see all paths in question?
First question: I didn't realise DN_ATTRIB events weren't sent when
nlink is incremented by link(). Obviously they should be; it's an
extremely important attribute for dnotify users.
Second question: it doesn't matter. There are lots of situations
where you can't use dnotify, and you'll have to call stat() or
whatever each time you want to check if a file has changed. That's
one of them.
From my point of view, the useful aspect of dnotify is to avoid large
numbers of stat() calls per complex cached operation which reads from
a filesystem. Hence mutterings of optimised Make etc. It's not about
saving syscalls, it's an algorithmic scalability thing: O(n) becomes
O(1). Sure, I could stop using files and store everything in a
database, but that's a huge loss of convenience. Filesystems are
nice. Have you ever tried to work with source code in databases? :)
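To make the O(n) versus O(1) point concrete, here is a rough sketch of
a cached result validated two ways. It is Python for illustration and
does not use dnotify itself; CachedResult, snapshot and the dirty flag
are hypothetical names, and the flag stands in for whatever a
change-notification handler would set.

```python
import os

def snapshot(paths):
    """Record the stat fields used to decide whether a file has changed."""
    snap = {}
    for p in paths:
        st = os.stat(p)
        snap[p] = (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)
    return snap

class CachedResult:
    def __init__(self, value, deps):
        self.value = value
        self.deps = snapshot(deps)
        self.dirty = False    # hypothetical: set by a notification handler
        self.stat_calls = 0   # counts the work done by the stat() path

    def valid_by_stat(self):
        """O(n): one stat() per dependency, every time we ask."""
        for p, old in self.deps.items():
            self.stat_calls += 1
            try:
                st = os.stat(p)
            except FileNotFoundError:
                return False
            if (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns) != old:
                return False
        return True

    def valid_by_notification(self):
        """O(1): trust the flag. Only sound if every change sets it."""
        return not self.dirty
```

The second method is only as good as the notifications behind it, which
is why a missed event (a new link, or a path through a bind mount)
forces a fall back to the first.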
> > There's a precedent for different views of a filesystem having
> > different st_dev values. Think about loopback NFS: possibly multiple
> > NFS filesystems, and a real one, all referring to the same set of
> > files, and _all_ of them have different st_dev values. Semantically a
> > bind mount does not seem so different from that.
>
> Except that bind mount is not different from "normal" one - same situation
> as with hard links.
I agree, I misunderstood something: bind mounts are not much different
from multiple mounts of the same filesystem. I was comparing with
traditional unix and old linux, where a filesystem could be mounted once.
Multiple/bind mounts have two significant differences from traditional
single mounts: 1. st_nlink doesn't reflect the number of different
paths visible to a program; 2. st_dev doesn't change when you cross a
mount point.
st_nlink affects programs like "cp -a", and we don't know what sane
semantics of "cp -a" should be anyway, so it's not really a problem.
The Emacs backup-by-copying-when-linked problem is trickier: if a file
has multiple names, we really do want backup by copying, not renaming.
But there is no way for it to know that a file has another name, so
Emacs will rename a file to make a backup, which usually destroys the
intent of the bind mount. (It'll fail to rename a mount point, but
renaming the target of a mount succeeds, and that can be a file).
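The decision in question looks roughly like this. It's a Python sketch
of the idea, not Emacs's actual code; make_backup and the "~" suffix
default are assumptions for illustration.

```python
import os
import shutil

def make_backup(path, suffix="~"):
    """Back up path before it is rewritten; return "copied" or "renamed"."""
    backup = path + suffix
    st = os.stat(path)
    if st.st_nlink > 1:
        # Other names exist: copy, so they keep pointing at the live file.
        shutil.copy2(path, backup)
        return "copied"
    # st_nlink == 1 is taken to mean "no other path". A bind mount that
    # exposes this file elsewhere is invisible here, so the rename
    # detaches that other path from the file about to be written.
    os.rename(path, backup)
    return "renamed"
```

After "renamed", the editor writes a fresh file at the original path, so
any path that reached the old inode some other way now sees the backup
contents rather than the new ones.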
st_dev is not a problem as such. The problem is that the mount point
isn't detectable. In many cases, an O_NOMOUNT flag would sort that
out: the flag meaning to fail a path walk that tries to cross mounts.
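For comparison, about the best userspace can do without such a flag is
to walk the path and compare st_dev at each step. A sketch in Python,
with device_changes as a made-up name; it is racy, and it reports
nothing for a bind mount of the same filesystem, because st_dev is
unchanged there.

```python
import os

def device_changes(path):
    """Return the path prefixes whose st_dev differs from their parent's.

    Symlinks are resolved first so each component is examined directly.
    """
    path = os.path.realpath(path)
    parts = [] if path == os.sep else path.strip(os.sep).split(os.sep)
    crossings = []
    current = os.sep
    prev_dev = os.lstat(current).st_dev
    for part in parts:
        current = os.path.join(current, part)
        dev = os.lstat(current).st_dev
        if dev != prev_dev:
            # Crossed onto a different device: a conventional mount point.
            crossings.append(current)
        prev_dev = dev
    return crossings
```

An empty result therefore does not mean "no mounts crossed", which is
the gap a path walk that refuses to cross mounts would close.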
-- Jamie