From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jamie Lokier
Subject: Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
Date: Fri, 2 Apr 2004 21:40:57 +0100
Sender: linux-fsdevel-owner@vger.kernel.org
Message-ID: <20040402204057.GD653@mail.shareable.org>
References: <1080653969.24117.192.camel@hades.cambridge.redhat.com>
 <20040402161223.GZ31500@parcelfarce.linux.theplanet.co.uk>
 <20040402180111.GB31500@parcelfarce.linux.theplanet.co.uk>
 <20040402191752.GB653@mail.shareable.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: viro@parcelfarce.linux.theplanet.co.uk, David Woodhouse ,
 Martin Schwidefsky , Andrew Morton , Carsten Otte , Carsten Otte ,
 linux-fsdevel@vger.kernel.org, sct@redhat.com, Dave Kleikamp
Return-path:
Received: from mail.shareable.org ([81.29.64.88]:10902 "EHLO mail.shareable.org")
 by vger.kernel.org with ESMTP id S264167AbUDBUn3 (ORCPT );
 Fri, 2 Apr 2004 15:43:29 -0500
To: Linus Torvalds
Content-Disposition: inline
In-Reply-To:
List-Id: linux-fsdevel.vger.kernel.org

Linus Torvalds wrote:
> Naah. You _want_ user space to see that they are the same file, and then
> the algorithm should be: "open+open+fstat+fstat+cmp st_dev/st_ino".

The trouble with that is the many programs which assume that if
st_nlink == 1, then the file has only one path. This is a critical
optimisation for any program which looks at a lot of files checking
for equivalent files, and is widely assumed. I've always thought it a
reliable basic unix assumption.

For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.

You may argue that using "rsync -H" or "cp -a" on a tree that contains
bind mounts is broken by design.

Then there are occasions when you want to traverse a tree, but not
cross mounts. Programs do that by checking whether the st_dev field
returned by stat() or fstat() on a directory is different from its
parent.

For example: find -xdev, tar --one-file-system, cp -x.

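To make the three checks above concrete, here is a rough user-space
sketch in Python. It is only an illustration, not code taken from rsync,
cp, find or tar, and the helper names (same_file, has_single_link,
crosses_device) are made up for this example. It shows the
"open+open+fstat+fstat+cmp st_dev/st_ino" comparison, the st_nlink == 1
shortcut, and the st_dev comparison between a directory and its parent.

```python
import os
import tempfile


def same_file(path_a, path_b):
    """open+open+fstat+fstat, then compare (st_dev, st_ino)."""
    fd_a = os.open(path_a, os.O_RDONLY)
    try:
        fd_b = os.open(path_b, os.O_RDONLY)
        try:
            st_a = os.fstat(fd_a)
            st_b = os.fstat(fd_b)
            return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
        finally:
            os.close(fd_b)
    finally:
        os.close(fd_a)


def has_single_link(path):
    """The shortcut many tools take: st_nlink == 1 is read as
    'this file has exactly one path', so it is never compared
    against other files."""
    return os.lstat(path).st_nlink == 1


def crosses_device(parent_dir, child_dir):
    """The -xdev style test: does st_dev change between a directory
    and its parent?"""
    return os.stat(parent_dir).st_dev != os.stat(child_dir).st_dev


if __name__ == "__main__":
    # Hypothetical demo using a hard link in a scratch directory.
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        with open(original, "w") as f:
            f.write("data\n")
        alias = os.path.join(tmp, "alias")
        os.link(original, alias)
        print(same_file(original, alias))
        print(has_single_link(original))
```

With a hard link, same_file() and st_nlink agree with each other. The
problem described in this mail is the bind mount case, where the
st_dev/st_ino comparison would report the same file while st_nlink can
still be 1, so a tool relying on has_single_link() never gets as far as
calling same_file().
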
You may argue that bind mounts shouldn't count for the purpose of
--one-file-system, but there should be some reasonable way for
programs to recognise the bind mount topology, to offer the behaviour
of --one-file-system-not-crossing-bind-mounts.

Regular files can be bind-mounted. So even if bind mounts did change
st_dev, programs which check st_dev when entering a directory wouldn't
recognise bind mounted files.

Then there are programs such as optimised Make and caching systems
and servers which use dnotify. dnotify is reliable for single-linked
files, and when there are multiple links, it's still reliable if you
discover all the st_nlink paths to a file. However, it isn't reliable
if there are any bind mounts, because you don't know whether you have
all paths to a file (without grubbing inside /proc/mounts, and that
has race conditions anyway).

The obvious strategy for such programs is to ignore bind mounts and
give incorrect results if there are any. However it would be much
better if they could detect when a file has multiple paths that aren't
mentioned in st_nlink, so they wouldn't depend on dnotify in such
cases.

> But I agree that it might be good to also have a way to enquire about
> mount information. One logical place for that might be "fstatfs()".
> We've got a few spare bytes there, so it wouldn't be impossible to do.

Sounds like a good idea.

I appreciate Al's point about st_dev having an actual real meaning. I
was thinking that with 64-bit dev_t, it might be ok to reserve some
bits of that to distinguish bind mounts, so that a program can still
get the underlying device id if it wants that.

There's a precedent for different views of a filesystem having
different st_dev values. Think about loopback NFS: possibly multiple
NFS filesystems, and a real one, all referring to the same set of
files, and _all_ of them have different st_dev values. Semantically a
bind mount does not seem so different from that.

-- Jamie