public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
From: Valerie Aurora <vaurora@redhat.com>
To: Michal Suchanek <hramrach@centrum.cz>
Cc: linux-fsdevel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Nick Piggin <npiggin@suse.de>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: UnionMount status?
Date: Tue, 23 Mar 2010 19:02:58 -0400	[thread overview]
Message-ID: <20100323230258.GA8432@shell> (raw)
In-Reply-To: <a5d587fb1003191447r7d37106ax3f62452893b7e9b3@mail.gmail.com>

On Fri, Mar 19, 2010 at 10:47:15PM +0100, Michal Suchanek wrote:
> Hello
> 
> On 19 March 2010 19:03, Valerie Aurora <vaurora@redhat.com> wrote:
> 
> >
> > Where union mounts is right now is in need of more review from VFS
> > experts (and thanks to those who have already reviewed it). ??I'm
> 
> I don't count myself among VFS experts so I'm sorry if I am restating
> or missing something obvious.

Thanks for taking a look!

> > rewriting the in-file copyup code right now, which is dependent on a
> > lot of ongoing VFS work by Al Viro, Nick Piggin, Dmitriy Monakhov, and
> > others. ??Here's my description of the problem I'm currently working,
> > which is where I could use review the most:
> >
> > http://groups.google.com/group/linux.kernel/msg/217ca5aedbd7bfd0
> >
> 
> On Mar 16, 7:20 pm, Valerie Aurora <vaur...@redhat.com> wrote:
> > This patch shows the basic data flow necessary to implement efficient
> > union mount file copyup (without the actual file copyup code).  I need
> > input from other VFS people on design, especially since namei.c is
> > getting some much needed reorganization.  Some background:
> >
> > In union mounts, a file on the lower, read-only file system layer is
> > copied up to the top layer when an application attempts to write to
> > it.  This is in-kernel file copyup.  Some constraints make this
> > difficult:
> >
> > 1. Don't copyup if the operation would fail (e.g., open(O_WRONLY) on a
> > file with mode 444).  It's inefficient and a possible security hole to
> > copy up a file if no write (or maybe even read) can occur anyway.
> 
> The open fails in that case anyway so I see no reason to copy
> anything. Why would you copy before you open?

Basically, it's easiest to copy up the file at pathname lookup time,
before you do all the permission checks.  This constraint really says
that you have to wait to do the copy up until you have finished any
check that might fail.

> On the other hand, when the open succeeds there is nothing stopping
> the writes from happening save things like hardware failure or lack of
> disk space. It's appropriate to create an empty inode in this case.
> Did you consider creating the files as sparse and handling holes by
> looking into the lower layer before /dev/zero? But then you would
> perhaps need a flag that differentiates them from real sparse files.

No, I haven't considered that.  Currently, the design is to copy up
the file at open() time to simplify the code.

> Actually the file has to be copied even when it is open for reading
> because if somebody writes it later the readonly bottom handle would
> never receive the top updates.

Copying up on every single open costs too much.  Copy up on
open-for-write does have this odd effect, but I consider it the moral
equivalent of a process updating a file by copying it to a temporary
file and then renaming it over the original.

> > 2. Near-zero added cost for operations on non-union paths in a
> > CONFIG_UNION_MOUNT kernel.  One or two flag tests is okay, but
> > grabbing a lock or doing a lookup operation to find out if a file
> > should be copied up is too expensive.
> 
> As I see it any file can be a part of union mount in two ways:
> 
> 1) You are just looking it up so you do pay the lookup costs and you
> will know where it is and if that is part of a union.

Yes, we just have to record and track that information after the
lookup.  This will require a new flag or structure or structure
member.  user_path_at() is an example where we need to retain this
information but throw it away and just return the path.

> 2) it's already open so you have  a handle to the file and this handle
> can be tagged in some way. It is probably associated with the
> originating filesystem anyway so  you just need to look that way.

In general, by the time you have an open file and you want to alter
it, you've already copied it up.  But I found two exceptions to this
rule: fchown() and fchmod() are legal on an fd opened O_RDONLY.  So
we'll have to close, copyup, and re-open the file in that case.

> > 3. A file should be copied up if the path (mnt + dentry) of its parent
> > is from a union top layer (MNT_UNION set in mnt->mnt_flags).  We can't
> > tell if a file can be copied up by looking at its inode, dentry, or
> > even path, we have to know what its parent path is.
> 
> Let's see what the lookup may be like:
> 
> 0) No lookup: The file is open. It's always top, no issue here.
> 
> 1) Perform lookup:
>   - if the path goes through non-union directory structure nothing
> interesting is going on
>   - if the path is currently being looked up inside an union mount it
> can happen that a bottom directory exists without a corresponding top
> directory. In this case the top directory has to be created. Otherwise
> users who have the bottom directory open would not see new files added
> (and removed) in the top layer.

The current implementation always copies up directories on lookup.

>   - the fact that lookup is (or is not) going on inside an union mount
> is a single boolean flag which may change only on mountpoint
> boundaries.
> 
> During the lookup it's easy to determine if sufficient permissions
> exist to open a file. Directory permissions are recorded on the top
> directories or are copied from the bottom directories to the top
> directories created during lookup. Additionally files which are only
> in the bottom layer can have permissions in there. If the file can be
> opened then the file can be copied and must be copied on open.
> 
> I guess that the union directories are a bit special in that they are
> top directories but have continuation in the bottom directory. Still
> this should not concern the user as they should not get the bottom
> directory handle nor the top directory handle but a union directory
> handle.
> 
> On Wed, Mar 17, 2010 at 07:51:31AM +0900, J. R. Okajima wrote:
> 
> > Valerie Aurora:
> > > 1. Don't copyup if the operation would fail (e.g., open(O_WRONLY) on a
> > > file with mode 444).  It's inefficient and a possible security hole to
> > > copy up a file if no write (or maybe even read) can occur anyway.
> 
> > Just a question.
> > How about this case?
> > When the file is writable (0644 or something) but its parent directory
> > is readonly (0555), do you think the file should be copied-up?
> 
> It must appear to be copied up in one way or another.
> If the mode can be changed and the FS code does not have some giant
> lock that allows only one open handle at a time it cannot rely on the
> directory not changing just because it does not appear to have any
> write permissions at the time.
> 
> Actually in a very primitive filesystem that locks the whole directory
> it might hold that the directory cannot be changed while its mode says
> it is readonly but it would still have to be quite non-reentrant for
> that as the directory mode and contents should be quite separate.
> 
> J. R. Okajima:
> > Valerie Aurora:
> > > I think what people will expect is that we copy up in that case.  I
> > > can think of ways this can go wrong, but perhaps that should be an
> > > explicit requirement on the top-layer file system, that it can handle
> > > create/unlink() in a directory without write permission.
> 
> > I am not sure such requirement is the right way.
> > How about delegating open() to keventd or some other workqueue which
> > will succeed to create files under a directory without write permission?
> > Of course, we should handle some error cases after creating a file.
> 
> This will fail anyway when the filesystem is actually readonly. I
> guess you need the equivalent of ramfs/tmpfs to store the directory
> structure. Although if you plan on supporting more than two layers in
> the future you may require the top layer to be writable and still
> achieve reasonable flexibility.

The current implementation stores the directory structure in the
topmost directory itself using fallthrus as placeholders for lower
level directory entries.  Several people have written in-memory
solutions, look for Miklos Szeredi, Bharata Rao, and Jan Blunck's
patches.

Thanks,

-VAL

  parent reply	other threads:[~2010-03-23 23:03 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-19  0:21 UnionMount status? Michal Suchanek
2010-03-19  6:10 ` J. R. Okajima
2010-03-19 20:28   ` Michal Suchanek
2010-03-19 20:59     ` J. R. Okajima
2010-03-19 18:03 ` Valerie Aurora
2010-03-19 21:47   ` Michal Suchanek
2010-03-20 16:41     ` Michal Suchanek
2010-03-23 23:02     ` Valerie Aurora [this message]
2010-04-01 15:36       ` Michal Suchanek
2010-07-14 13:12     ` Michal Suchanek
2010-07-14 19:30       ` Valerie Aurora
2010-07-19 16:06         ` Michal Suchanek
2010-08-10 16:52           ` Valerie Aurora
     [not found]             ` <AANLkTik+S2femtfOQvAYMayfN2N=PGvVL8J9=AD3_VCa@mail.gmail.com>
2010-08-16 18:52               ` Valerie Aurora
2010-08-17 12:03                 ` Michal Suchanek
2010-08-17 14:31                   ` Michal Suchanek
2010-08-17 19:51                     ` Valerie Aurora
2010-08-18 11:30                       ` Michal Suchanek
2010-08-18 19:06                         ` Valerie Aurora
2010-08-18 19:25                           ` Michal Suchanek
2010-07-12 13:01   ` Michal Suchanek
  -- strict thread matches above, loose matches on Subject: below --
2011-04-12 15:00 Unionmount status? Michal Suchanek
2011-04-12 20:31 ` Ric Wheeler
2011-04-12 21:36   ` Michal Suchanek
2011-04-13 14:18     ` Jiri Kosina
2011-04-13 15:13       ` Michal Suchanek
2011-04-14  8:38         ` Miklos Szeredi
2011-04-14  9:48           ` Sedat Dilek
2011-04-14  9:58             ` Miklos Szeredi
2011-04-15 11:22           ` Michal Suchanek
2011-04-15 11:31             ` Miklos Szeredi
2011-04-15 11:51               ` Michal Suchanek
2011-04-15 12:29                 ` Miklos Szeredi
2011-04-15 12:34                   ` Michal Suchanek
2011-04-15 12:48                     ` Miklos Szeredi
2011-04-15 21:48                   ` Hugh Dickins
2011-04-15 22:18                   ` Andreas Dilger
2011-04-18 13:31                     ` Michal Suchanek
2011-04-18 13:34                     ` Michal Suchanek
2011-04-18 13:37                     ` Michal Suchanek
2011-04-13 17:26     ` Ric Wheeler
2011-04-13 18:58       ` Michal Suchanek
2011-04-13 19:11         ` Ric Wheeler
2011-04-13 19:47           ` Michal Suchanek
2011-04-14  4:50             ` Ian Kent
2011-04-14  9:32               ` Michal Suchanek
2011-04-14  9:40                 ` Miklos Szeredi
2011-04-14 13:21                   ` Ric Wheeler
2011-04-14 14:54                     ` Michal Suchanek
2011-04-15 16:31                       ` Ric Wheeler
2011-04-14 19:14               ` David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100323230258.GA8432@shell \
    --to=vaurora@redhat.com \
    --cc=dmonakhov@openvz.org \
    --cc=hch@infradead.org \
    --cc=hramrach@centrum.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox