From: Jamie Lokier
Subject: Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
Date: Sat, 3 Apr 2004 01:39:06 +0100
To: viro@parcelfarce.linux.theplanet.co.uk
Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte, linux-fsdevel@vger.kernel.org, sct@redhat.com, Dave Kleikamp
Message-ID: <20040403003906.GG653@mail.shareable.org>
In-Reply-To: <20040402210813.GI31500@parcelfarce.linux.theplanet.co.uk>
References: <1080653969.24117.192.camel@hades.cambridge.redhat.com> <20040402161223.GZ31500@parcelfarce.linux.theplanet.co.uk> <20040402180111.GB31500@parcelfarce.linux.theplanet.co.uk> <20040402191752.GB653@mail.shareable.org> <20040402204057.GD653@mail.shareable.org> <20040402210813.GI31500@parcelfarce.linux.theplanet.co.uk>
List-Id: linux-fsdevel.vger.kernel.org

viro@parcelfarce.linux.theplanet.co.uk wrote:
> > The trouble with that is the many programs which assume that if
> > st_nlink == 1, then the file has only one path. This is a critical
> > optimisation for any program which looks at a lot of files checking
> > for equivalent files, and is widely assumed. I've always thought it a
> > reliable basic unix assumption.
>
> Even aside of (in)accuracy of ->st_nlink, that assumption is obviously
> racy.

Yes it is, it's nevertheless a common assumption, and it used to be
accurate when that area of the filesystem was known to be not changing.
It's still accurate when that area is known to have no bind mounts.

(Fwiw, I'll admit that I have 1 user who has a bind mount in their home
directory and doesn't know it...)
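
For illustration, here is a minimal sketch of the st_nlink == 1 shortcut
described above, written in Python. The function name and the tree walk
are assumptions made for the example; this is not code taken from rsync,
cp or Emacs. It records a (st_dev, st_ino) pair only for files that
report more than one link, and treats every other file as having a
single path:

```python
import os
import stat
from collections import defaultdict


def find_hard_link_groups(root):
    """Group regular files under root that share a (st_dev, st_ino) pair.

    Files reporting st_nlink == 1 are assumed to have exactly one path and
    are never entered in the table. That is the shortcut under discussion:
    it only holds if nothing (such as a bind mount) gives the file a second
    path that st_nlink does not count.
    """
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # only regular files are compared here
            if st.st_nlink == 1:
                continue  # assumed unique, so no bookkeeping is needed
            seen[(st.st_dev, st.st_ino)].append(path)
    # Keep only the inodes that were actually reached by several paths.
    return [sorted(paths) for paths in seen.values() if len(paths) > 1]
```

A file that is visible at two paths through a bind mount but still
reports st_nlink == 1 would be skipped at both paths by a walker like
this, which is the failure mode being argued about.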

> > For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
> > You may argue that using "rsync -H" or "cp -a" on a tree that contains
> > bind mounts is broken by design.
>
> What would you expect from cp -a in that case? To create a link? That
> would be a bug, plain and simple...

In fact, cp -a, rsync -H and tar all do this: if the file has
st_nlink == 1, each mounted occurrence of the file is copied. If the
file has st_nlink != 1, the mounted occurrences are hard linked together
in the destination.

I'm surprised none of them implement the optimisation where after
seeing st_nlink links, it can forget about that (dev, inode) pair.
Perhaps it's not reliable to assume that on other OSes either.

> > Then there are programs such as optimised Make and cacheing systems
> > and servers which use dnotify. dnotify is reliable for single-linked
> > files, and when there are multiple links, it's still reliable if you
> > discover all the st_nlink paths to a file. However, it isn't reliable
> > if there are any bind mounts, because you don't know whether you have
> > all paths to a file (without grubbing inside /proc/mounts, and that
> > has race conditions anyway).
>
> dnotify is not reliable, period.

Everyone keeps saying how dnotify is unreliable, crap etc. despite the
multitude of good performance enhancing uses for it. The principle is
not _intrinsically_ crap even though the implementation could do with
much improvement.

> We do NOT generate any events on the
> parent of target; only on parent of new link.

I'm stunned. An attribute of the file is changed (nlink), by an
operation which follows the target path, and no DN_ATTRIB event is
generated. I consider that a dnotify implementation disaster.

> So "discover all the st_nlink paths" is both racy and unreliable -
> you won't get notified when links are created.

I agree, but you _should_ be notified.
Not least because without this dnotify is useless, and with it dnotify
is reliable (provided you only use it to watch single-linked files that
aren't mount points).

> > The obvious strategy for such programs is to ignore bind mounts and
> > give incorrect results if there are any. However it would be much
> > better if they could detect when a file has multiple paths that aren't
> > mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.
>
> How would they detect when such paths appear in the future? For that matter,
> who said that you can see all paths in question?

First question: I didn't realise DN_ATTRIB events weren't sent when
nlink is incremented by link(). Obviously they should be, it's an
extremely important attribute for dnotify users.

Second question: it doesn't matter. There are lots of situations where
you can't use dnotify, and you'll have to call stat() or whatever each
time you want to check if a file has changed. That's one of them.

From my point of view, the useful aspect of dnotify is to avoid large
numbers of stat() calls per complex cached operation which reads from a
filesystem. Hence mutterings of optimised Make etc. It's not about
saving syscalls, it's an algorithmic scalability thing: O(n) becomes
O(1).

Sure, I could stop using files and store everything in a database, but
that's a huge loss of convenience. Filesystems are nice. Have you ever
tried to work with source code in databases? :)

> > There's a precedent for different views of a filesystem having
> > different st_dev values. Think about loopback NFS: possibly multiple
> > NFS filesystems, and a real one, all referring to the same set of
> > files, and _all_ of them have different st_dev values. Semantically a
> > bind mount does not seem so different from that.
>
> Except that bind mount is not different from "normal" one - same situation
> as with hard links.

I agree, I misunderstood something: bind mounts are not much different
from multiple mounts of the same filesystem. I was comparing with
traditional unix and old linux, where a filesystem could be mounted
once.

Multiple/bind mounts have two significant differences from traditional
single mounts:

   1. st_nlink doesn't reflect the number of different paths visible
      to a program;

   2. st_dev doesn't change when you cross a mount point.

st_nlink affects programs like "cp -a", and we don't know what sane
semantics of "cp -a" should be anyway, so it's not really a problem.

The Emacs backup-by-copying-when-linked problem is trickier: if a file
has multiple names, we really do want backup by copying, not renaming.
But there is no way for it to know that a file has another name, so
Emacs will rename a file to make a backup, which usually destroys the
intent of the bind mount. (It'll fail to rename a mount point, but
renaming the target of a mount succeeds, and that can be a file).

st_dev is not a problem as such. The problem is that the mount point
isn't detectable. In many cases, an O_NOMOUNT flag would sort that out:
the flag meaning to fail a path walk that tries to cross mounts.

-- Jamie