From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
Date: Fri, 2 Apr 2004 21:40:57 +0100
Sender: linux-fsdevel-owner@vger.kernel.org
Message-ID: <20040402204057.GD653@mail.shareable.org>
References: <OFB7BF1E31.8824C0B7-ONC1256E67.004015BB-C1256E67.0041B813@de.ibm.com> <1080653969.24117.192.camel@hades.cambridge.redhat.com> <Pine.LNX.4.58.0403300745040.1096@ppc970.osdl.org> <20040402161223.GZ31500@parcelfarce.linux.theplanet.co.uk> <20040402180111.GB31500@parcelfarce.linux.theplanet.co.uk> <Pine.LNX.4.58.0404021050220.1122@ppc970.osdl.org> <20040402191752.GB653@mail.shareable.org> <Pine.LNX.4.58.0404021127200.1122@ppc970.osdl.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: viro@parcelfarce.linux.theplanet.co.uk,
	David Woodhouse <dwmw2@infradead.org>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Andrew Morton <akpm@osdl.org>, Carsten Otte <cotte@de.ibm.com>,
	Carsten Otte <cotte@freenet.de>, linux-fsdevel@vger.kernel.org,
	sct@redhat.com, Dave Kleikamp <shaggy@austin.ibm.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail.shareable.org ([81.29.64.88]:10902 "EHLO
	mail.shareable.org") by vger.kernel.org with ESMTP id S264167AbUDBUn3
	(ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 2 Apr 2004 15:43:29 -0500
To: Linus Torvalds <torvalds@osdl.org>
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.58.0404021127200.1122@ppc970.osdl.org>
List-Id: linux-fsdevel.vger.kernel.org

Linus Torvalds wrote:
> Naah. You _want_ user space to see that they are the same file, and then 
> the algorithm should be: "open+open+fstat+fstat+cmp st_dev/st_ino".

The trouble with that is the many programs which assume that if
st_nlink == 1, then the file has only one path.  This is a critical
optimisation for any program which looks at a lot of files checking
for equivalent files, and is widely assumed.  I've always thought it a
reliable basic unix assumption.

For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
You may argue that using "rsync -H" or "cp -a" on a tree that contains
bind mounts is broken by design.

Then there are occasions when you want to traverse a tree, but not
cross mounts.  Programs do that by checking whether the st_dev field
returned by stat() or fstat() on a directory is different from its parent.

For example: find -xdev, tar --one-file-system, cp -x.
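
A minimal sketch of that st_dev comparison, assuming the caller has
the parent and child paths to hand (crosses_device is a hypothetical
name, not a function from find, tar or cp):

```c
#include <sys/stat.h>

/*
 * Returns 1 if `child` lives on a different device from `parent`,
 * 0 if both report the same st_dev, and -1 if either lstat() fails.
 * lstat() is used so that a symlink is judged by where the link
 * itself lives, not by its target.
 */
static int crosses_device(const char *parent, const char *child)
{
    struct stat ps, cs;

    if (lstat(parent, &ps) != 0 || lstat(child, &cs) != 0)
        return -1;
    return ps.st_dev != cs.st_dev;
}
```

A bind mount of a directory from the same filesystem reports the same
st_dev as its parent, so this check returns 0 for it and the traversal
walks straight in.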

You may argue that bind mounts shouldn't count for the purpose of
--one-file-system, but there should be some reasonable way for
programs to recognise the bind mount topology, to offer the behaviour
of --one-file-system-not-crossing-bind-mounts.

Regular files can be bind-mounted.  So even if bind mounts did change
st_dev, programs which check st_dev when entering a directory wouldn't
recognise bind-mounted files.

Then there are programs such as optimised Make and caching systems
and servers which use dnotify.  dnotify is reliable for single-linked
files, and when there are multiple links, it's still reliable if you
discover all the st_nlink paths to a file.  However, it isn't reliable
if there are any bind mounts, because you don't know whether you have
all paths to a file (without grubbing inside /proc/mounts, and that
has race conditions anyway).

The obvious strategy for such programs is to ignore bind mounts and
give incorrect results if there are any.  However it would be much
better if they could detect when a file has multiple paths that aren't
mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.

> But I agree that it might be good to also have a way to enquire about
> mount information. One logical place for that might be "fstatfs()".  
> We've got a few spare bytes there, so it wouldn't be impossible to do.

Sounds like a good idea.  I appreciate Al's point about st_dev having
an actual real meaning.  I was thinking that with 64-bit dev_t, it
might be ok to reserve some bits of that to distinguish bind mounts,
so that a program can still get the underlying device id if it wants that.

There's a precedent for different views of a filesystem having
different st_dev values.  Think about loopback NFS: possibly multiple
NFS filesystems, and a real one, all referring to the same set of
files, and _all_ of them have different st_dev values.  Semantically a
bind mount does not seem so different from that.

-- Jamie