From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Hansen <hansendc@us.ibm.com>
Subject: Re: [PATCH 00/26] Mount writer count and read-only bind mounts
Date: Mon, 25 Jun 2007 08:45:06 -0700
Message-ID: <1182786306.26162.102.camel@localhost>
References: <20070622200303.82D9CC3A@kernel>
	 <20070623095246.a9061585.akpm@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: Dave Hansen <haveblue@us.ibm.com>, linux-fsdevel@vger.kernel.org,
	hch@infradead.org, viro@ftp.linux.org.uk
To: Andrew Morton <akpm@linux-foundation.org>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e34.co.us.ibm.com ([32.97.110.152]:41597 "EHLO
	e34.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752584AbXFYPpL (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Mon, 25 Jun 2007 11:45:11 -0400
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106])
	by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l5PFj9VD020322
	for <linux-fsdevel@vger.kernel.org>; Mon, 25 Jun 2007 11:45:09 -0400
Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168])
	by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l5PFj8mf137694
	for <linux-fsdevel@vger.kernel.org>; Mon, 25 Jun 2007 09:45:08 -0600
Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1])
	by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l5PFj8d6016454
	for <linux-fsdevel@vger.kernel.org>; Mon, 25 Jun 2007 09:45:08 -0600
In-Reply-To: <20070623095246.a9061585.akpm@linux-foundation.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 13:03:03 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> > Why do we need r/o bind mounts?
> > 
> > This feature allows a read-only view into a read-write filesystem.
> > In the process of doing that, it also provides infrastructure for
> > keeping track of the number of writers to any given mount.
> > 
> > This has a number of uses.  It allows chroots to have parts of
> > filesystems writable.  It will be useful for containers in the future
> > because users may have root inside a container, but should not
> > be allowed to write to some filesystems.  This also replaces
> > patches that vserver has had out of the tree for several years.
> > 
> > It allows security enhancement by making sure that parts of
> > your filesystem are read-only (such as when you don't trust your
> > FTP server), when you don't want to have entire new filesystems
> > mounted, or when you want atime selectively updated.
> > 
> > I've been using the following script to test that the feature is
> > working as desired.  It takes a directory and makes a regular
> > bind and a r/o bind mount of it.  It then performs some normal
> > filesystem operations on the three directories, including ones
> > that are expected to fail, like creating a file on the r/o
> > mount.
> 
> Doesn't selinux do some of this?
> 
> My overall reaction: owch.  There's a ton of tricksy code here and great
> potential for us to accidentally break it in the future by forgetting a
> mnt_may_write() as the kernel evolves.

This is definitely a tricky thing.  It takes a static, single check and
replaces it with a matched set of operations.  But, it's not much
different than adding a mutex to something.  People can always miss one
side of the lock pair.

People won't miss the mnt_may_write() call because it will become the
only valid way to check whether a mounted fs may be written to.
IS_RDONLY() will not be available for these kinds of checks.

> And then there's the added complexity and the added runtime overhead.
>
> Balance that against some pretty obscure-looking benefits and I'm
> struggling to see how a merge is justifiable?

One reason Al had me go through using these paired operations instead of
just passing the mount all over the vfs is that this fixes some
existing, fundamental problems: we do not properly track when writers
to our filesystems are _finished_, and may allow a remount-r/o operation
to succeed while writes are still occurring.  We needed to separate out
the logical "users can write to this fs" from the physical "this fs is
on r/o media" or "this fs is dying and writes will only kill it more".
That's what these patches do in the end.  
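
To make the pairing and the remount check concrete, here is a minimal
userspace sketch in C.  It is an illustration under stated assumptions,
not the code in the patches: struct mount_model and the model_*()
helpers are made-up names, the counter is a plain int with no locking,
and returning -EBUSY for a refused remount is an assumption made for
the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a mount; not the kernel's structure. */
struct mount_model {
    bool readonly;  /* logical "users may not write through this mount" */
    int  writers;   /* number of writes currently in progress */
};

/* Take a write reference; refused with -EROFS on a read-only mount. */
int model_want_write(struct mount_model *m)
{
    if (m->readonly)
        return -EROFS;
    m->writers++;
    return 0;
}

/* Drop a reference taken by a successful model_want_write(). */
void model_drop_write(struct mount_model *m)
{
    assert(m->writers > 0);  /* an unmatched drop is a caller bug */
    m->writers--;
}

/* Remount read-only only once every writer has finished. */
int model_remount_ro(struct mount_model *m)
{
    if (m->writers > 0)
        return -EBUSY;       /* assumed error code for this sketch */
    m->readonly = true;
    return 0;
}
```

A caller brackets each write with model_want_write() and
model_drop_write().  In between, model_remount_ro() is refused; once
the count is back to zero it succeeds, and later writers get -EROFS.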

One set of things that I'm going to tack on here once these go in is the
ability to increment the writer count upon a decrement of i_nlink to
zero.  We'll drop the write count when the file is actually truncated.
As it stands right now, since there is never an open filp on those
files, you might unlink a file, remount the fs r/o, then still
write to it when the truncate occurs.  I think fixing that was one of
Al's long-term goals with this strategy.
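
A rough userspace sketch of that follow-on idea, in C.  Everything
here is hypothetical and only models the accounting described above:
the sketch_*() names, the two structs, and the -EBUSY return are
assumptions for illustration, and there is no locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical models holding only the fields this example needs. */
struct sketch_mount { bool readonly; int writers; };
struct sketch_file  { struct sketch_mount *mnt; int nlink; bool pending_truncate; };

/* Unlink: when the link count drops to zero, take a writer count that
 * stays held until the deferred truncate has run. */
int sketch_unlink(struct sketch_file *f)
{
    if (f->mnt->readonly)
        return -EROFS;
    assert(f->nlink > 0);
    if (--f->nlink == 0) {
        f->mnt->writers++;
        f->pending_truncate = true;
    }
    return 0;
}

/* Deferred truncate: the final write to the file; release the count. */
void sketch_truncate(struct sketch_file *f)
{
    if (!f->pending_truncate)
        return;
    f->pending_truncate = false;
    f->mnt->writers--;
}

/* Remount read-only is refused while any writer count is held. */
int sketch_remount_ro(struct sketch_mount *m)
{
    if (m->writers > 0)
        return -EBUSY;       /* assumed error code for this sketch */
    m->readonly = true;
    return 0;
}
```

With this accounting, a remount r/o attempted between the unlink and
the truncate is refused instead of racing with the truncate's write.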

-- Dave