linux-fsdevel.vger.kernel.org archive mirror
* [RFC 0/7] [RFC] cramfs: fake write support
@ 2008-05-31 15:37 arnd
  2008-05-31 18:56 ` David Newall
  2008-06-01  3:19 ` Phillip Lougher
  0 siblings, 2 replies; 32+ messages in thread
From: arnd @ 2008-05-31 15:37 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch

Inspired by a discussion with Christoph Hellwig, I tried to
recreate a patch that he did a few years ago to add support
for writing to a mounted cramfs file system. It still has
known problems (and likely unknown ones), but should be
good enough for practical use. I've been able to boot
a full Ubuntu installation from a cramfs image and work with
it normally.

The intention is to use it for instance on read-only root
file systems like CD-ROM, or on compressed initrd images.
In either case, no data is written back to the medium, but
remains in the page/inode/dentry cache, like ramfs does.

Many existing systems currently use unionfs or aufs for this
purpose, by overlaying a tmpfs over a read-only file
system like cramfs, squashfs or iso9660. IMHO, it would
be a much nicer solution to not require unionfs for a simple
case like this, but rather have support for it in the file
system. If people find this useful, we can do the same in
other read-only file systems.

Writing to existing files is broken in at least two corner
cases, and I'm still looking for a solution here:

When you truncate an on-disk file to make it larger, reading
beyond the old end of the file will make cramfs try to
read from disk instead of filling with zeroes. I'm not sure
if this can be solved without adding additional members to
the inode structure (using a private inode cache) to remember
the end of the on-disk file.
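
A minimal userspace model of this corner case, not the patch code: the
member name disk_size, the method names and the 4096-byte page size are
all assumptions made for illustration. It shows why the in-memory inode
would have to remember where the on-disk data ends, so that reads past
that point are zero-filled instead of going to the medium.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class FakeWritableInode:
    """Toy model of an inode whose writes live only in the page cache."""

    def __init__(self, disk_data):
        self.disk_data = disk_data       # read-only image contents
        self.disk_size = len(disk_data)  # hypothetical member: on-disk EOF
        self.i_size = len(disk_data)     # file size as seen by user space
        self.page_cache = {}             # page index -> bytearray (dirty data)

    def truncate(self, new_size):
        if new_size < self.i_size:
            # Shrinking: on-disk bytes past new_size must never reappear.
            self.disk_size = min(self.disk_size, new_size)
            for idx in list(self.page_cache):
                start = idx * PAGE_SIZE
                if start >= new_size:
                    del self.page_cache[idx]
                elif start + PAGE_SIZE > new_size:
                    keep = new_size - start
                    self.page_cache[idx][keep:] = bytes(PAGE_SIZE - keep)
        self.i_size = new_size

    def read_page(self, idx):
        if idx in self.page_cache:
            return bytes(self.page_cache[idx])
        start = idx * PAGE_SIZE
        # Read from the medium only up to the remembered on-disk end,
        # then pad the rest of the page with zeroes.
        chunk = self.disk_data[start:min(start + PAGE_SIZE, self.disk_size)]
        return chunk + bytes(PAGE_SIZE - len(chunk))
```

Without the extra disk_size member, read_page() would have only i_size
to go by and could not tell an enlarged file from on-disk data.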

Deleting a preexisting file currently does not free the inode
and page cache for that file, which I assume is easy to fix.

Also, the i_nlink field of directories is always 1, and
has always been on cramfs. Getting the count right should
simplify the code a bit and make it more correct according
to posix, but will cost a bit of performance on 'stat'.

The patch series also lives on
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground.git cramfs

Comments?

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-05-31 15:37 [RFC 0/7] [RFC] cramfs: fake write support arnd
@ 2008-05-31 18:56 ` David Newall
  2008-05-31 20:40   ` Arnd Bergmann
  2008-06-01  3:19 ` Phillip Lougher
  1 sibling, 1 reply; 32+ messages in thread
From: David Newall @ 2008-05-31 18:56 UTC (permalink / raw)
  To: arnd; +Cc: linux-fsdevel, linux-kernel, hch

arnd@arndb.de wrote:
> Many existing systems currently use unionfs or aufs for this
> purpose, by overlaying a tmpfs over a read-only file
> system like cramfs, squashfs or iso9660. IMHO, it would
> be a much nicer solution to not require unionfs for a simple
> case like this, but rather have support for it in the file
> system. If people find this useful, we can do the same in
> other read-only file system.
>   

I don't agree that it is nicer to do this in cramfs.  I prefer the
technique of union of a tmpfs over some other fs because a single
solution that works with all filesystems is better than re-implementing
the same idea in multiple filesystems.  Multiple implementations is a
recipe for bugs and feature mismatch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-05-31 18:56 ` David Newall
@ 2008-05-31 20:40   ` Arnd Bergmann
  2008-06-01  3:54     ` Phillip Lougher
  2008-06-01  6:02     ` David Newall
  0 siblings, 2 replies; 32+ messages in thread
From: Arnd Bergmann @ 2008-05-31 20:40 UTC (permalink / raw)
  To: David Newall; +Cc: linux-fsdevel, linux-kernel, hch

On Saturday 31 May 2008, David Newall wrote:
> I don't agree that it is nicer to do this in cramfs.  I prefer the
> technique of union of a tmpfs over some other fs because a single
> solution that works with all filesystems is better than re-implementing
> the same idea in multiple filesystems.  Multiple implementations is a
> recipe for bugs and feature mismatch.

You're right in principle, but unfortunately there is to date no working
implementation of union mounts. Giving users the option of using an
existing file system with a few tweaks can only be better than
forcing them to use hacks like unionfs.

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-05-31 15:37 [RFC 0/7] [RFC] cramfs: fake write support arnd
  2008-05-31 18:56 ` David Newall
@ 2008-06-01  3:19 ` Phillip Lougher
  1 sibling, 0 replies; 32+ messages in thread
From: Phillip Lougher @ 2008-06-01  3:19 UTC (permalink / raw)
  To: arnd; +Cc: linux-fsdevel, linux-kernel, hch

arnd@arndb.de wrote:
> Many existing systems currently use unionfs or aufs for this
> purpose, by overlaying a tmpfs over a read-only file
> system like cramfs, squashfs or iso9660. IMHO, it would
> be a much nicer solution to not require unionfs for a simple
> case like this, but rather have support for it in the file
> system. If people find this useful, we can do the same in
> other read-only file system.

I think it's a good idea, and I have been thinking about adding 
something similar to Squashfs for quite a while (when I get time).

> Comments?

Patch 2 ([RFC 2/7] cramfs: create unique inode numbers) changes the 
inode number to be based on the dentry location rather than the file 
location.  This is a user-visible change, not only do empty directories, 
char, block, pipe, and sockets get real inode numbers rather than 1 (a 
good thing IMHO), but files that were hard-linked (in the original 
source directory) now get different inode numbers.  Obviously cramfs has 
never properly supported hard links, but the duplicate file check in 
cramfs did ensure hard linked files got the same inode number.

This change in behaviour may break some existing users of cramfs 
filesystems.  It may be worth sending the RFC and patches etc. to the 
new linux-embedded mailing list to get some feedback from the embedded 
folks who use cramfs.

Phillip



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-05-31 20:40   ` Arnd Bergmann
@ 2008-06-01  3:54     ` Phillip Lougher
  2008-06-01  8:52       ` Arnd Bergmann
  2008-06-01 12:28       ` Jamie Lokier
  2008-06-01  6:02     ` David Newall
  1 sibling, 2 replies; 32+ messages in thread
From: Phillip Lougher @ 2008-06-01  3:54 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: David Newall, linux-fsdevel, linux-kernel, hch

Arnd Bergmann wrote:
> On Saturday 31 May 2008, David Newall wrote:
>> I don't agree that it is nicer to do this in cramfs.  I prefer the
>> technique of union of a tmpfs over some other fs because a single
>> solution that works with all filesystems is better than re-implementing
>> the same idea in multiple filesystems.  Multiple implementations is a
>> recipe for bugs and feature mismatch.
> 
> You're right in principle, but unfortunately there is to date no working
> implementation of union mounts. Giving users the option of using an
> existing file system with a few tweaks can only be better than
> forcing them to use hacks like unionfs.
> 

I tend to agree with Arnd Bergmann.  While I prefer the aesthetic 
cleanliness of stackable filesystems, the lack of proper stacking 
support in the Linux VFS makes other techniques necessary.  Unionfs is 
complex and for many embedded systems with constrained resources Unionfs 
adds a lot of extra overhead.

If I read the patches correctly, when a file page is written to, only 
that page gets copied into the page cache and locked, the other pages 
continue to be read off disk from cramfs?  With Unionfs a page write 
causes the entire file to be copied up to the r/w tmpfs and locked into 
the page cache causing unnecessary RAM overhead.
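
To put rough numbers on that comparison, here is a toy accounting
sketch. It is not taken from either code base; both function names and
the 4096-byte page size are assumptions. It only counts how much RAM a
RAM-backed upper layer would hold after a few small writes to a large
file under each strategy.

```python
PAGE_SIZE = 4096  # assumed page size


def pinned_bytes_pagewise(file_size, written_offsets):
    """Per-page copy-on-write: only the touched pages stay in RAM."""
    dirty_pages = {off // PAGE_SIZE for off in written_offsets if off < file_size}
    return len(dirty_pages) * PAGE_SIZE


def pinned_bytes_copyup(file_size, written_offsets):
    """Whole-file copy-up to a RAM-backed fs: one write copies every page."""
    if not any(off < file_size for off in written_offsets):
        return 0
    pages = -(-file_size // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE
```

For a 100 MiB file with writes landing in two distinct pages, the
first function counts two pages and the second counts the whole file.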

Phillip


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-05-31 20:40   ` Arnd Bergmann
  2008-06-01  3:54     ` Phillip Lougher
@ 2008-06-01  6:02     ` David Newall
  2008-06-01  9:11       ` Jan Engelhardt
  2008-06-01 16:25       ` Jörn Engel
  1 sibling, 2 replies; 32+ messages in thread
From: David Newall @ 2008-06-01  6:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-fsdevel, linux-kernel, hch

Arnd Bergmann wrote:
> On Saturday 31 May 2008, David Newall wrote:
>   
>> I prefer the technique of union of a tmpfs over some other fs
>>     
>
> You're right in principle, but unfortunately there is to date no working
> implementation of union mounts. Giving users the option of using an
> existing file system with a few tweaks can only be better than
> forcing them to use hacks like unionfs.

I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
can say that it's the right kind of solution.  Rather than spend effort
implementing write support for read-only filesystems, why not put your
time into fixing whatever you see wrong with one or both of those?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01  3:54     ` Phillip Lougher
@ 2008-06-01  8:52       ` Arnd Bergmann
  2008-06-01 12:28       ` Jamie Lokier
  1 sibling, 0 replies; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-01  8:52 UTC (permalink / raw)
  To: Phillip Lougher; +Cc: David Newall, linux-fsdevel, linux-kernel, hch

On Sunday 01 June 2008, Phillip Lougher wrote:
> If I read the patches correctly, when a file page is written to, only 
> that page gets copied into the page cache and locked, the other pages 
> continue to be read off disk from cramfs?  With Unionfs a page write 
> causes the entire file to be copied up to the r/w tmpfs and locked into 
> the page cache causing unnecessary RAM overhead.

Yes, that's right.

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01  6:02     ` David Newall
@ 2008-06-01  9:11       ` Jan Engelhardt
  2008-06-01 16:25       ` Jörn Engel
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Engelhardt @ 2008-06-01  9:11 UTC (permalink / raw)
  To: David Newall; +Cc: Arnd Bergmann, linux-fsdevel, linux-kernel, hch


On Sunday 2008-06-01 08:02, David Newall wrote:
>>   
>>> I prefer the technique of union of a tmpfs over some other fs
>>
>> You're right in principle, but unfortunately there is to date no working
>> implementation of union mounts. Giving users the option of using an
>> existing file system with a few tweaks can only be better than
>> forcing them to use hacks like unionfs.
>
>I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
>can say that it's the right kind of solution.  Rather than spend effort
>implementing write support for read-only filesystems, why not put your
>time into fixing whatever you see wrong with one or both of those?

I have to join in. Unionfs and AUFS may be bigger in bytes than the
embedded developer wants to sacrifice, but that is what it takes for
a solid implementation that has to deal with things like NFS and
mmap. Even so, there is a fs called mini_fo you can try using if
you disagree with the size of unionfs/aufs, at the cost of not having
support for all corner cases.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01  3:54     ` Phillip Lougher
  2008-06-01  8:52       ` Arnd Bergmann
@ 2008-06-01 12:28       ` Jamie Lokier
  2008-06-01 21:49         ` Arnd Bergmann
  1 sibling, 1 reply; 32+ messages in thread
From: Jamie Lokier @ 2008-06-01 12:28 UTC (permalink / raw)
  To: Phillip Lougher
  Cc: Arnd Bergmann, David Newall, linux-fsdevel, linux-kernel, hch

Phillip Lougher wrote:
> If I read the patches correctly, when a file page is written to, only 
> that page gets copied into the page cache and locked, the other pages 
> continue to be read off disk from cramfs?  With Unionfs a page write 
> causes the entire file to be copied up to the r/w tmpfs and locked into 
> the page cache causing unnecessary RAM overhead.

Ok, so why not fix that in unionfs?  An option so that holes in the
overlay file let through data from the underlying file sounds like it
would be generally useful, and quite easy to implement.

If not unionfs, a "union-tmpfs" combination would be good.  Many
filesystems aren't well suited to being the overlay filesystem -
adding to the implementation's complexity - but a modified tmpfs could
be very well suited.

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01  6:02     ` David Newall
  2008-06-01  9:11       ` Jan Engelhardt
@ 2008-06-01 16:25       ` Jörn Engel
  1 sibling, 0 replies; 32+ messages in thread
From: Jörn Engel @ 2008-06-01 16:25 UTC (permalink / raw)
  To: David Newall; +Cc: Arnd Bergmann, linux-fsdevel, linux-kernel, hch

On Sun, 1 June 2008 15:32:50 +0930, David Newall wrote:
> 
> I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
> can say that it's the right kind of solution.  Rather than spend effort
> implementing write support for read-only filesystems, why not put your
> time into fixing whatever you see wrong with one or both of those?

There is a strong argument to be made for fixing some problem once
instead of N times.  But when that solution is M times more complicated,
with M being significantly larger than N, said argument becomes rather
weak.

And having looked at unionfs, I claim that your argument is paper-thin.

Jörn

-- 
/* Keep these two variables together */
int bar;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01 12:28       ` Jamie Lokier
@ 2008-06-01 21:49         ` Arnd Bergmann
  2008-06-02  2:48           ` hooanon05
  0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-01 21:49 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Phillip Lougher, David Newall, linux-fsdevel, linux-kernel, hch

On Sunday 01 June 2008, Jamie Lokier wrote:
> Ok, so why not fix that in unionfs?  An option so that holes in the
> overlay file let through data from the underlying file sounds like it
> would be generally useful, and quite easy to implement.

I can imagine a lot of unexpected effects with that. Think of e.g.
someone replacing the underlying file with a new one. Then enlarge
the file using truncate() and read from it -- suddenly you see
the old contents instead of zeroes. Probably fixable as well, but
certainly not in a nice way.
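
A toy model of that effect, assuming a deliberately naive overlay in
which any page missing from the upper file is read from the lower one.
The class and method names and the 4096-byte page size are
hypothetical; this is not how unionfs or aufs are implemented.

```python
PAGE_SIZE = 4096  # assumed page size


class HolePassThroughFile:
    """Toy overlay file: pages missing from the overlay are read from below."""

    def __init__(self, lower):
        self.lower = lower   # contents of the read-only underlying file
        self.upper = {}      # page index -> bytes written to the overlay
        self.size = len(lower)

    def write_page(self, idx, data):
        assert len(data) == PAGE_SIZE
        self.upper[idx] = data
        self.size = max(self.size, (idx + 1) * PAGE_SIZE)

    def truncate(self, new_size):
        # Naive: only the size changes. Nothing records that a newly
        # exposed range is supposed to read back as zeroes.
        self.upper = {i: p for i, p in self.upper.items()
                      if i * PAGE_SIZE < new_size}
        self.size = new_size

    def read_page(self, idx):
        if idx in self.upper:
            return self.upper[idx]
        start = idx * PAGE_SIZE
        chunk = self.lower[start:start + PAGE_SIZE]  # hole: fall through
        return chunk + bytes(PAGE_SIZE - len(chunk))


# Replace a two-page file with a one-page one, then enlarge it again.
f = HolePassThroughFile(b"O" * (2 * PAGE_SIZE))
f.truncate(0)
f.write_page(0, b"N" * PAGE_SIZE)
f.truncate(2 * PAGE_SIZE)
leaked = f.read_page(1)  # comes from the old lower file, not zero-filled
```

Fixing this needs extra state per file (for example the smallest size
the file ever had), which is the not-so-nice part.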

Besides, there are many more problems with unionfs, which have
all been mentioned in the previous review cycles. Aufs doesn't
address those either AFAIK, with the exception of at least
not making additional copies in the page cache when writing to
a file.

The real solution of course is VFS-based union mounts (think
'mount --union -t tmpfs none /'), but the patches for that
are not stable enough for inclusion in mainline yet.

> If not unionfs, a "union-tmpfs" combination would be good.  Many
> filesystems aren't well suited to being the overlay filesystem -
> adding to the implementation's complexity - but a modified tmpfs could
> be very well suited.

Yes, that is similar to one of my earlier ideas as well. Christoph
managed to convince me that it's not as easy as I thought, though
I can't remember the exact arguments any more. I'll try to think
about that some more.

One of the problems is certainly the complexity involved in tmpfs
to start with, which is the reason I based the code on ramfs instead.

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-01 21:49         ` Arnd Bergmann
@ 2008-06-02  2:48           ` hooanon05
  2008-06-02  3:25             ` Erez Zadok
                               ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: hooanon05 @ 2008-06-02  2:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch


Arnd Bergmann:
> Besides, there are many more problems with unionfs, which have
> all been mentioned in the previous review cycles. Aufs doesn't
> address those either AFAIK, with the exception of at least
> not making additional copies in the page cache when writing to
> a file.

Hello Arnd,

While I don't have particular objection to your idea and approach to
cramfs, I'd point out that modern LiveCDs tend to save their
modifications to disk.
And AUFS did address all known problems. If there is something left, please
let me know.


Junjiro Okajima

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  2:48           ` hooanon05
@ 2008-06-02  3:25             ` Erez Zadok
  2008-06-02  7:51               ` Arnd Bergmann
  2008-06-02  3:51             ` Erez Zadok
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Erez Zadok @ 2008-06-02  3:25 UTC (permalink / raw)
  To: Arnd Bergmann, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel

Arnd Bergmann:
> Besides, there are many more problems with unionfs, which have
> all been mentioned in the previous review cycles. Aufs doesn't
> address those either AFAIK, with the exception of at least
> not making additional copies in the page cache when writing to
> a file.

Correction: Unionfs doesn't make additional copies in the page cache.

Arnd, I favor a more generic approach, one that will work with the vast
majority of file systems that people use w/ unioning, preferably all of
them.  Supporting copy-on-write in cramfs will only help a small subset of
users.  Yes, it might be simple, but I fear it won't be useful enough to
convince existing users of unioning to switch over.  And I don't think we
should add CoW support in every file system -- the complexity will be much
more than using unionfs or some other VFS-based solution.

I can see some advantages (re: cache coherency) by hacking CoW support
directly into a f/s.  If you want to use a filesystem-specific solution,
then I suggest you don't modify a file system used as a source in a union,
but one used as a destination.  You'll have better coverage that way.  The
vast majority of times, unionfs users will either write to tmpfs or ext2;
but the source readonly f/s can be a lot of different ones (most popular are
ext*, nfs*, isofs, and cramfs/squashfs).

I find it somewhat ironic to hear the argument that "union mounts isn't
stable yet, so let's come up with a new solution inside cramfs."  Why should
your solution become stable much faster than union mounts (which also had
patches floating around for a long time already)?

If you have cycles to spare, why not help Bharata and Jan?

Cheers,
Erez.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  2:48           ` hooanon05
  2008-06-02  3:25             ` Erez Zadok
@ 2008-06-02  3:51             ` Erez Zadok
  2008-06-02 11:07               ` Jamie Lokier
  2008-06-02  4:37             ` Erez Zadok
  2008-06-02  7:12             ` Arnd Bergmann
  3 siblings, 1 reply; 32+ messages in thread
From: Erez Zadok @ 2008-06-02  3:51 UTC (permalink / raw)
  To: Arnd Bergmann, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel


> Jamie Lokier wrote:
> > Phillip Lougher wrote:
> > If I read the patches correctly, when a file page is written to, only 
> > that page gets copied into the page cache and locked, the other pages 
> > continue to be read off disk from cramfs?  With Unionfs a page write 
> > causes the entire file to be copied up to the r/w tmpfs and locked into 
> > the page cache causing unnecessary RAM overhead.

Yes, unionfs does copyup whole files, but it doesn't lock the entire file
into the page cache.  But I agree, that copying up large files to a tmpfs
partition adds more memory pressure, at least temporarily (until pdflush
kicks in).

> Ok, so why not fix that in unionfs?  An option so that holes in the
> overlay file let through data from the underlying file sounds like it
> would be generally useful, and quite easy to implement.

If I understand you right, you want to copyup one page at a time, right?
That's not nearly as easy as one might imagine.  First, you can't do it on
file systems which don't support holes.  Second, holes are a
file-system-specific implementation issue, and the knowledge of holes, AFAIC,
is hidden from the VFS (IIRC, FreeBSD has a specific "zfod" page flag, which
is turned on when the VM has a page that came out of a f/s hole).

You'll need a way to tell if a given page was copied up or not, and
distinguish b/t pages which are naturally filled with zeros vs. those which
came from f/s holes.

Copyup is also providing persistency: you can copyup to a persistent f/s
such as ext2.  So you'll need a bitmap or some sort of record that will
survive file system remount and system reboot; such a bitmap will have to
tell which pages of a file have been copied up or not.
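
A minimal sketch of such a record, assuming one bit per page and a
serialized form stored somewhere next to the upper file. The class
name, the helper read_source() and the storage format are all
hypothetical.

```python
class CopyupBitmap:
    """One bit per page: a set bit means the page lives in the upper file."""

    def __init__(self, npages, raw=b""):
        nbytes = (npages + 7) // 8
        self.npages = npages
        # Pad or trim a previously saved bitmap to the current page count.
        self.bits = bytearray(raw.ljust(nbytes, b"\0")[:nbytes])

    def mark_copied(self, page):
        self.bits[page // 8] |= 1 << (page % 8)

    def is_copied(self, page):
        return bool(self.bits[page // 8] & (1 << (page % 8)))

    def to_bytes(self):
        """Serialized form that would have to survive remount and reboot."""
        return bytes(self.bits)


def read_source(bitmap, page):
    """Pick the layer to read from, regardless of the page's contents."""
    return "upper" if bitmap.is_copied(page) else "lower"
```

Because the decision uses the bitmap rather than the page contents, a
copied-up page that happens to be all zeroes is still read from the
upper file and is not mistaken for a hole.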

I'm not saying it's not possible, but it's harder to do this page-wise caching
at a stackable layer than inside a native f/s such as ext2.  Now, if there was a
generic VFS op that allowed me to query a file system whether a page in a
given file is a hole or not, then unionfs would be able to do page-wise
copyup easily.

Frankly, I think something like support for a copied-up file, page-by-page,
should probably be supported by a block layer virtual driver (this might be
easier in a BSD-like geom layer.)

BTW, I believe FSCache has page-wise caching, right?  Caching is a
copy-on-read operation, and it doesn't take much to make it cache (read:
copy) on writes.  So FScache might be a good starting point for such an
effort.

> If not unionfs, a "union-tmpfs" combination would be good.  Many
> filesystems aren't well suited to being the overlay filesystem -
> adding to the implementation's complexity - but a modified tmpfs could
> be very well suited.

I think a union-tmpfs is a better solution than a cramfs-specific one, b/c
at least with union-tmpfs, many more users could use it.  Even if you
restrict yourself to using tmpfs as the r-w layer, and read-only from just
one other source f/s, that still will cover a large portion of unioning
users.

> -- Jamie

Cheers,
Erez.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  2:48           ` hooanon05
  2008-06-02  3:25             ` Erez Zadok
  2008-06-02  3:51             ` Erez Zadok
@ 2008-06-02  4:37             ` Erez Zadok
  2008-06-02  6:07               ` Bharata B Rao
  2008-06-02  7:17               ` Jan Engelhardt
  2008-06-02  7:12             ` Arnd Bergmann
  3 siblings, 2 replies; 32+ messages in thread
From: Erez Zadok @ 2008-06-02  4:37 UTC (permalink / raw)
  To: Arnd Bergmann, Jamie Lokier, Phillip Lougher, Jan Engelhardt,
	David Newall

> Jan Engelhardt wrote:
> > On Sunday 2008-06-01 08:02, David Newall wrote:
> >>   
> >>> I prefer the technique of union of a tmpfs over some other fs
> >>
> >> You're right in principle, but unfortunately there is to date no working
> >> implementation of union mounts. Giving users the option of using an
> >> existing file system with a few tweaks can only be better than
> >> forcing them to use hacks like unionfs.
> >
> >I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
> >can say that it's the right kind of solution.  Rather than spend effort
> >implementing write support for read-only filesystems, why not put your
> >time into fixing whatever you see wrong with one or both of those?
> 
> I have to join in. Unionfs and AUFS may be bigger in bytes than the
> embedded developer wants to sacrifice, but that is what it takes for
> a solid implementation that has to deal with things like NFS and
> mmap. Even so, there is a fs called mini_fo you can try using if
> you disagree with the size of unionfs/aufs, at the cost of not having
> support for all corner cases.

I agree w/ Jan E.

Folks, I've said it before: unioning is a deceptively simple idea in
principle, and &^@%*$&^@ hard in practice.  And anyone who thinks otherwise
is welcome to write a *versatile* unioning implementation on their own.  Once
you get through all corner cases and satisfy all the features which users
want, you have a complex large file system.

I believe that implementing unioning inside actual filesystems is totally the
wrong direction: going to lower layers is wrong, instead of going up to a
VFS-based solution.  Unioning is a namespace operation that should not be
done deep inside a lower f/s.

People often wonder why FScache is (reportedly) so complex and big.  It's
b/c in some part it has to deal with similar issues: unioning is
copy-on-write, whereas caching is copy-on-read.

Nevertheless, I can understand if the embedded community wants lightweight
unioning.  Union Mounts initially may not support everything that unionfs
does, but it should be smaller, and it should be enough I believe for the
basic unioning uses --- perhaps even for the embedded community.  If so,
then I suggest people offer to help Bharata and Jan Blunck's efforts, rather
than [sic] cramming unioning into a single file system.

Erez.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  4:37             ` Erez Zadok
@ 2008-06-02  6:07               ` Bharata B Rao
  2008-06-02  7:17               ` Jan Engelhardt
  1 sibling, 0 replies; 32+ messages in thread
From: Bharata B Rao @ 2008-06-02  6:07 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Arnd Bergmann, Jamie Lokier, Phillip Lougher, Jan Engelhardt,
	David Newall, linux-fsdevel, linux-kernel, hch

On Mon, Jun 2, 2008 at 10:07 AM, Erez Zadok <ezk@cs.sunysb.edu> wrote:
>
> Nevertheless, I can understand if the embedded community wants lightweight
> unioning.  Union Mounts initially may not support everything that unionfs
> does, but it should be smaller, and it should be enough I believe for the
> basic unioning uses --- perhaps even for the embedded community.  If so,
> then I suggest people offer to help Bharata and Jan Blunck's efforts, rather
> than [sic] cramming unioning into a single file system.
>

Though Union Mount effort has become slow and silent lately, some of
us are still working on it. While I worked on readdir support lately,
Jan Blunck and David Woodhouse are working on having a generic
whiteout support for Linux.

Talking about help, the Union Mount effort could use generous help in
getting the directory listing implementation right. We first tried to
handle duplicate elimination (during readdir) inside the kernel
entirely. The outcome was neither clean nor efficient.
(http://lkml.org/lkml/2007/12/5/147). Then there was a suggestion to
push the duplicate elimination to userspace. When that was tried out
(http://lkml.org/lkml/2008/4/29/248), we were told that NFS support is
going to be an issue. (BTW NFS support is going to be an issue
irrespective of where directory listing is implemented: kernel or
userspace). Some insights into the feasibility of supporting NFS with
Union Mount from people who understand NFS better would be very
helpful.
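
For readers unfamiliar with the problem, a minimal sketch of what
duplicate elimination during readdir has to do. It is not the Union
Mount code: the function name is hypothetical, and the ".wh." whiteout
prefix is an assumption borrowed from the unionfs-style convention
(Union Mount aims at a generic whiteout mechanism instead).

```python
WHITEOUT_PREFIX = ".wh."  # naming convention assumed for this sketch


def merged_readdir(layers):
    """Merge directory listings; layers[0] is the topmost layer."""
    seen = set()     # names already returned from a higher layer
    hidden = set()   # names whited out by a higher layer
    result = []
    for names in layers:
        layer_whiteouts = set()
        for name in names:
            if name.startswith(WHITEOUT_PREFIX):
                layer_whiteouts.add(name[len(WHITEOUT_PREFIX):])
                continue
            if name in seen or name in hidden:
                continue  # duplicate or deleted entry
            seen.add(name)
            result.append(name)
        # A whiteout only hides entries in the layers below it.
        hidden |= layer_whiteouts
    return result


listing = merged_readdir([["a", ".wh.b", "c"], ["b", "c", "d"], ["a", "b", "e"]])
```

The seen and hidden sets grow with the size of the directory, which
hints at why keeping this state inside the kernel across readdir calls
is awkward.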

Regards,
Bharata.
-- 
http://bharata.sulekha.com/blog/posts.htm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  2:48           ` hooanon05
                               ` (2 preceding siblings ...)
  2008-06-02  4:37             ` Erez Zadok
@ 2008-06-02  7:12             ` Arnd Bergmann
  2008-06-02 10:36               ` hooanon05
  2008-06-02 15:35               ` Erez Zadok
  3 siblings, 2 replies; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02  7:12 UTC (permalink / raw)
  To: hooanon05
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

On Monday 02 June 2008, hooanon05@yahoo.co.jp wrote:
> While I don't have particular objection to your idea and approach to
> cramfs, I'd point out that modern LiveCDs tend to save their
> modifications to disk.

Sure, and I wasn't trying to address those of course. I have a rather
specific setup in mind myself, and I figured the same would be useful
for others as well, while we are waiting for a generic union mount
implementation in the mainline kernel.

> And AUFS did address all known problems. If there left something, please
> let me know.

Ok, I'm sorry for not having looked at it myself. I saw an older version
and assumed it was not going to improve much. I'll have another look
when I find the time. Unionfs was suffering from severe feature creep
(multiple writable branches, runtime branch modification), and aufs
seemed to add even more features instead of removing them.

Without reading either again, the top problems in unionfs at the time were:
* data inconsistency problems when simultaneously accessing the underlying
  fs and the union.
* duplication of dentry and inode data structures in the union wastes
  memory and cpu cycles.
* whiteouts are in the same namespace as regular files, so conflicts are
  possible.
* mounting a large number of aufs on top of each other eventually
  overflows the kernel stack, e.g. in readdir.
* allowing multiple writable branches (instead of just stacking
  one rw copy on a number of ro file systems) is confusing to the user
  and complicates the implementation a lot.

With the exception of the last two, I assumed that these were all
unfixable with a file system based approach (including the hypothetical
union-tmpfs). If you have addressed them, how?

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  4:37             ` Erez Zadok
  2008-06-02  6:07               ` Bharata B Rao
@ 2008-06-02  7:17               ` Jan Engelhardt
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Engelhardt @ 2008-06-02  7:17 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Arnd Bergmann, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel, linux-kernel, hch


On Monday 2008-06-02 06:37, Erez Zadok wrote:
>> Jan Engelhardt wrote:
>> > On Sunday 2008-06-01 08:02, David Newall wrote:
>> >>   
>> >>> I prefer the technique of union of a tmpfs over some other fs
>> >>
>> >> You're right in principle, but unfortunately there is to date no working
>> >> implementation of union mounts. Giving users the option of using an
>> >> existing file system with a few tweaks can only be better than
>> >> forcing them to use hacks like unionfs.

>Folks, I've said it before: unioning is a deceptively simple idea in
>principle, and &^@%*$&^@ hard in practice.  And anyone who thinks otherwise
>is welcome to write a *versatile* unioning implementation on their own. Once
>you get through all corner cases and satisfy all the features which users
>want, you have a complex large file system.
>[...]

To the original posters:

I urge those who do believe {au,union}fs is too fat to go and build
their unioning into their on-disk filesystems, then let users run it
(remark: iff you can convince (or force) them why they should not be
using existing fs), let users report issues and iron it out for
perhaps 2-3 years, and then see how much your implementation has
grown. That is, if you actually added code (see remark 1).

About a year ago (June 2007), SLAX sought a solution that enhances
VFAT with UNIX permissions -- much like the old umsdosfs. A kernel
solution was initially preferred by Tomas (SLAX developer), yet I
(who got to write posixovl then) went for FUSE. It was about 20 KB
when it was moderately usable. The end result? Posixovl is a 46 KB C
file today. For userspace code. I bet it would be much more if it was
in-kernel.

Take that as a hint when developing your fs-specific unioning.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  3:25             ` Erez Zadok
@ 2008-06-02  7:51               ` Arnd Bergmann
  2008-06-02 18:13                 ` Erez Zadok
  0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02  7:51 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

On Monday 02 June 2008, Erez Zadok wrote:
> Correction: Unionfs doesn't make additional copies in the page cache.

Ok, I must have misunderstood something there. Sorry about that.

> Arnd, I favor a more generic approach, one that will work with the vast
> majority of file systems that people use w/ unioning, preferably all of
> them.  Supporting copy-on-write in cramfs will only help a small subset of
> users.  Yes, it might be simple, but I fear it won't be useful enough to
> convince existing users of unioning to switch over.  And I don't think we
> should add CoW support in every file system -- the complexity will be much
> more than using unionfs or some other VFS-based solution.

My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
that doing it in even a single writable file system would add far too
much complexity. I did not mean to start a fundamental discussion about
how to do it the right way, just noticed that there are half a dozen
implementations that have been around for years without getting close to
inclusion in the mainline kernel, while a much simpler approach gives
you sane semantics for a subset of users.

> I can see some advantages (re: cache coherency) by hacking CoW support
> directly into a f/s.  If you want to use a filesystem-specific solution,
> then I suggest you don't modify a file system used as a source in a union,
> but one used as a destination.  You'll have better coverage that way.  The
> vast majority of times, unionfs users will either write to tmpfs or ext2;
> but the source readonly f/s can be a lot of different ones (most popular are
> ext*, nfs*, isofs, and cramfs/squashfs).

Yes, that absolutely makes sense. I don't care much about persistent
storage for the overlay, so tmpfs (if not ramfs) should be the only place
to do it in. It does introduce some of the same old problems though,
because you could still write to a bind mounted copy of the underlying
file system (unlike cramfs, which is guaranteed to be read-only), which
forces you to either do a full copy-up, or can result in inconsistent
file contents. Also, stacking multiple union-tmpfs copies on top of each
other would be hard to do without the potential to overflow the kernel
stack.

I'll probably try implementing a '-o union' option for tmpfs anyway, just
to see how hard it is and what the problems are.

> I find it somewhat ironic to hear the argument that "union mounts isn't
> stable yet, so lets come up with a new solution inside cramfs."  Why should
> your solution become stable much faster than union mounts (which also had
> patches floating around for a long time already).

Because the patches are not trying to solve any of the hard problems at all:
Persistent storage of overlays, readdir traversal through more than two
layers, stable inode numbers, opening a file through two different overlays,
copyup, and so on. I'm sure you know more about these problems than I do,
but as long as I don't have to care about them, I don't see a problem
with my patches (other than the bugs I already described).

> If you have cycles to spare, why not help Bharata and Jan?

I spent a lot of time on discussing the initial implementation with Jan
years ago, and will keep reviewing their patches, but I have neither the
time nor the brains to really contribute much to them. As you mentioned
in your reply to Jan E., it's on an entirely different scale than doing
a small hack to cramfs or tmpfs.

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  7:12             ` Arnd Bergmann
@ 2008-06-02 10:36               ` hooanon05
  2008-06-02 11:15                 ` Arnd Bergmann
  2008-06-02 15:35               ` Erez Zadok
  1 sibling, 1 reply; 32+ messages in thread
From: hooanon05 @ 2008-06-02 10:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch


Arnd Bergmann:
> Without reading either again, the top problems in unionfs at the time were:
> * data inconsistency problems when simultaneously accessing the underlying
>   fs and the union.
> * duplication of dentry and inode data structures in the union wastes
>   memory and cpu cycles.
> * whiteouts are in the same namespace as regular files, so conflicts are
>   possible.
> * mounting a large number of aufs on top of each other eventually
>   overflows the kernel stack, e.g. in readdir.
> * allowing multiple writable branches (instead of just stacking
>   one rw copy on a number of ro file systems) is confusing to the user
>   and complicates the implementation a lot.
> 
> With the exception of the last two, I assumed that these were all
> unfixable with a file system based approach (including the hypothetical
> union-tmpfs). If you have addressed them, how?

I will try to explain each point individually.
Here is what I implemented in AUFS.
Any comments are welcome.

> * data inconsistency problems when simultaneously accessing the underlying
>   fs and the union.
Aufs has three levels of detecting direct access to the lower
(branch) filesystems (i.e. bypassing aufs). I guess the strictest level
is a good answer to your question. It is based on the inotify
feature. Aufs sets an inotify watch on every accessed directory on the
lower fs. While those inodes are cached, aufs receives the inotify events
for their children/files and marks the aufs data for the file as
obsolete. When the file is accessed later, aufs retrieves the latest
inode (or dentry) again.
The inotify watch is removed when the aufs dir inode is discarded
from cache.


> * duplication of dentry and inode data structures in the union wastes
>   memory and cpu cycles.

Aufs has its own dentry and inode object as normal fs has. And they have
pointers to the corresponding ones on the lower fs. If you make a union
from two real filesystems, then aufs inode will have (at most) two
pointers as its private data.
Do you mean that having pointers is a duplication?


> * whiteouts are in the same namespace as regular files, so conflicts are
>   possible.

Yes, that's right.
Aufs reserves ".wh." as a whiteout prefix, and prohibits users from handling
such filenames inside aufs. It might be a problem as you wrote, but users
can create/remove them directly on the lower fs and I have never
received a request about this reserved prefix.


> * mounting a large number of aufs on top of each other eventually
>   overflows the kernel stack, e.g. in readdir.

The aufs readdir operation consumes memory, but not stack. If it were
implemented as a recursive function, it might cause a stack
overflow. But actually it is a loop.
The memory is used for storing entry names and eliminating whited-out
ones, and the result is cached for a specified time. So memory
(other than stack) will be consumed.
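
As a rough userspace illustration of such a loop-based merge (this is not
aufs code; `union_readdir` and the list-of-lists branch model are
hypothetical, only the ".wh." prefix is taken from the description above),
a minimal sketch could look like this:

```python
WH_PREFIX = ".wh."  # whiteout prefix reserved by aufs, as described above


def union_readdir(branches):
    """Merge directory listings, topmost branch first.

    `branches` is a list of lists of entry names (a hypothetical model
    of per-branch directory contents, not a real kernel structure).
    """
    seen = set()       # names already emitted by an upper branch
    whiteouts = set()  # names hidden by a whiteout in an upper branch
    result = []
    for names in branches:  # plain loop over branches, no recursion
        local_wh = set()
        for name in names:
            if name.startswith(WH_PREFIX):
                # ".wh.foo" hides "foo" in the branches below this one
                local_wh.add(name[len(WH_PREFIX):])
                continue
            if name in seen or name in whiteouts:
                continue
            seen.add(name)
            result.append(name)
        whiteouts |= local_wh
    return result
```

The memory cost sits in `seen`, `whiteouts` and `result`, which grow with
the number of entries rather than with the call depth.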


> * allowing multiple writable branches (instead of just stacking
>   one rw copy on a number of ro file systems) is confusing to the user
>   and complicates the implementation a lot.

Probably you are right. Initially aufs had only one policy to select the
writable branch. But several users requested other policies such as
round-robin or most-free-space, and aufs has implemented them.
I don't think users will be confused by these policies. While I tried to
keep it simple, I guess some people will say it is complex.


Junjiro Okajima



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  3:51             ` Erez Zadok
@ 2008-06-02 11:07               ` Jamie Lokier
  0 siblings, 0 replies; 32+ messages in thread
From: Jamie Lokier @ 2008-06-02 11:07 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Arnd Bergmann, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

Erez Zadok wrote:
> 
> > Jamie Lokier wrote:
> > > Phillip Lougher wrote:
> > > If I read the patches correctly, when a file page is written to, only 
> > > that page gets copied into the page cache and locked, the other pages 
> > > continue to be read off disk from cramfs?  With Unionfs a page write 
> > > causes the entire file to be copied up to the r/w tmpfs and locked into 
> > > the page cache causing unnecessary RAM overhead.
> 
> Yes, unionfs does copyup whole files, but it doesn't lock the entire file
> into the page cache.  But I agree, that copying up large files to a tmpfs
> partition adds more memory pressure, at least temporarily (until pdflush
> kicks in).

1: I'm thinking systems which have union-over-cramfs probably don't have
swap at all...

2: It's a problem when you modify a very large file, even on a fast PC
with plenty of RAM.  LVM snapshots might be better for this sort of
thing.

> > Ok, so why not fix that in unionfs?  An option so that holes in the
> > overlay file let through data from the underlying file sounds like it
> > would be generally useful, and quite easy to implement.
> 
> If I understand you right, you want to copyup one page at a time, right?
> That's not nearly as easy as one might imagine.  First, you can't do it on
> file systems which don't support holes.  Second, holes is a file-systems
> specific implementation issue, and the knowledge of holes AFAIC, is hidden
> from the VFS (IIRC, FreeBSD has a specific "zfod" page flag, which is turned
> on when the VM has a page that came out of a f/s hole).

True, although the new FIEMAP ioctl is supposed to make holes more
filesystem independent, when they are supported.

> You'll need a way to tell if a given page was copied up or not, and
> distinguish b/t pages which are naturally filled with zeros vs. those which
> came from f/s holes.

Metadata.  Don't you have other metadata anyway, like whiteouts? :-)

> Copyup is also providing persistency: you can copyup to a persistent f/s
> such as ext2.  So you'll need a bitmap or some sort of record that will
> survive file system remount and system reboot; such a bitmap will have to
> tell which pages of a file have been copied up or not.

Yes.

> I'm not saying it's not possible, but it's harder to do this page-wise caching at a
> stackable layer than inside a native f/s such as ext2.  Now, if there was a
> generic VFS op that allowed me to query a file system whether a page in a
> given file is a hole or not, then unionfs would be able to do page-wise
> copyup easily.

See FIEMAP.  Is it any use?
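
To make the bitmap idea concrete, here is a small userspace model, not
unionfs code. `PagewiseOverlay`, the 4096-byte page size and the fixed
file size are all assumptions made for the illustration. The explicit
per-page flag is what lets a copied-up page full of zeroes be told apart
from a page that was never copied:

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class PagewiseOverlay:
    """Hypothetical model of per-page copy-up over a read-only lower file."""

    def __init__(self, lower: bytes):
        self.lower = lower
        npages = (len(lower) + PAGE_SIZE - 1) // PAGE_SIZE
        self.copied = [False] * npages  # the "copied up" bitmap
        self.upper = {}                 # page index -> bytearray

    def _copy_up(self, idx):
        # Copy a single page from the lower file on first write.
        if not self.copied[idx]:
            start = idx * PAGE_SIZE
            self.upper[idx] = bytearray(self.lower[start:start + PAGE_SIZE])
            self.copied[idx] = True
        return self.upper[idx]

    def write(self, offset, data):
        if offset + len(data) > len(self.lower):
            raise ValueError("this sketch does not model growing the file")
        for i, byte in enumerate(data):
            pos = offset + i
            self._copy_up(pos // PAGE_SIZE)[pos % PAGE_SIZE] = byte

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, min(offset + length, len(self.lower))):
            idx = pos // PAGE_SIZE
            if self.copied[idx]:
                out.append(self.upper[idx][pos % PAGE_SIZE])
            else:
                out.append(self.lower[pos])  # fall through to lower file
        return bytes(out)
```

For persistence, the `copied` list would have to be stored somewhere that
survives remount, which is the extra metadata discussed above.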

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 10:36               ` hooanon05
@ 2008-06-02 11:15                 ` Arnd Bergmann
  2008-06-02 12:56                   ` hooanon05
  2008-06-02 14:54                   ` Evgeniy Polyakov
  0 siblings, 2 replies; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02 11:15 UTC (permalink / raw)
  To: hooanon05
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

On Monday 02 June 2008, hooanon05@yahoo.co.jp wrote:
> > * data inconsistency problems when simultaneously accessing the underlying
> >   fs and the union.
> Aufs has three levels of detecting direct access to the lower
> (branch) filesystems (i.e. bypassing aufs). I guess the strictest level
> is a good answer to your question. It is based on the inotify
> feature. Aufs sets an inotify watch on every accessed directory on the
> lower fs. While those inodes are cached, aufs receives the inotify events
> for their children/files and marks the aufs data for the file as
> obsolete. When the file is accessed later, aufs retrieves the latest
> inode (or dentry) again.
> The inotify watch is removed when the aufs dir inode is discarded
> from cache.

This is a very complicated approach, and I'm not sure if it even addresses
the case where you have a shared mmap on both files. With VFS based union
mounts, they share one inode, so you don't need to use idiotify in the first
place, and it automatically works on shared mmaps.

> > * duplication of dentry and inode data structures in the union wastes
> >   memory and cpu cycles.
> 
> Aufs has its own dentry and inode object as normal fs has. And they have
> pointers to the corresponding ones on the lower fs. If you make a union
> from two real filesystems, then aufs inode will have (at most) two
> pointers as its private data.
> Do you mean that having pointers is a duplication?

I mean having your own dentry and inode object is duplication. The
underlying file system already has them, so if you have your own,
you need to keep them synchronized. I guess that in order to do
a lookup on a file, you need the steps of

1. lookup in aufs dentry cache -> fail
2. lookup in underlying dentry cache -> fail
3. try to read dentry from disk -> fail
4. repeat 2-3 until found, or arrive at lowest level 
5. create an inode in memory for the lower file system
6. create dentry in memory on lower file system, pointing
   to that
7. create an aufs specific inode pointing to the underlying
   inode
8. create an aufs specific dentry object to point to that
9. create a struct inode representing the aufs inode
10. create another VFS dentry to point to that

when you really should just return the dentry found by the
lower file system.

> > * whiteouts are in the same namespace as regular files, so conflicts are
> >   possible.
> 
> Yes, that's right.
> Aufs reserves ".wh." as a whiteout prefix, and prohibits users from handling
> such filenames inside aufs. It might be a problem as you wrote, but users
> can create/remove them directly on the lower fs and I have never
> received a request about this reserved prefix.

It's not so much a practical limitation as an exploitable feature.
E.g. an unprivileged user may use this to get an application into
an error condition by asking for an invalid file name.

Posix reserves a well-defined set of invalid file names, and
deviation from this means that you are not compliant, and that
in a potentially unexpected way.

> > * mounting a large number of aufs on top of each other eventually
> >   overflows the kernel stack, e.g. in readdir.
> 
> The aufs readdir operation consumes memory, but not stack. If it were
> implemented as a recursive function, it might cause a stack
> overflow. But actually it is a loop.
> The memory is used for storing entry names and eliminating whited-out
> ones, and the result is cached for a specified time. So memory
> (other than stack) will be consumed.

How does aufs know that one of its branches is an aufs itself?
If you detect this, do you fold it into a single aufs instance with
more branches?
In case you don't do it, I don't see how you get around the stack
overflow, but if you do it, you have again added a whole lot of
complexity for something that should be trivial when done right.

> > * allowing multiple writable branches (instead of just stacking
> >   one rw copy on a number of ro file systems) is confusing to the user
> >   and complicates the implementation a lot.
> 
> Probably you are right. Initially aufs had only one policy to select the
> writable branch. But several users requested other policies such as
> round-robin or most-free-space, and aufs has implemented them.
> I don't think users will be confused by these policies. While I tried to
> keep it simple, I guess some people will say it is complex.

I personally think that a policy other than writing to the top is crazy
enough, but randomly writing to multiple places is much worse, as it
becomes unpredictable what the file system does, not just unexpected.

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 11:15                 ` Arnd Bergmann
@ 2008-06-02 12:56                   ` hooanon05
  2008-06-02 14:13                     ` Arnd Bergmann
  2008-06-02 14:54                   ` Evgeniy Polyakov
  1 sibling, 1 reply; 32+ messages in thread
From: hooanon05 @ 2008-06-02 12:56 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch


Arnd Bergmann:
> This is a very complicated approach, and I'm not sure if it even addresses
> the case where you have a shared mmap on both files. With VFS based union
> mounts, they share one inode, so you don't need to use idiotify in the first
> place, and it automatically works on shared mmaps.

As you might know, aufs doesn't have its own file mapped pages. Aufs
overrides vm_operations and redirects the page fault to the lower file's
vm_operation. So the shared mmap has no problem.
I am afraid that I should have written "marks the attributes in aufs as
obsolete" instead of "marks the aufs data for the file is obsoleted" in
my previous mail.


> I mean having your own dentry and inode object is duplication. The

I see.
Then the solution must be union-mount.
Your 10 steps seem to be rather verbose. Generally, 'lookup' means to
create (or get) inode and dentry, and the fs inode and VFS inode are
allocated at the same time.
Aufs does 'lookup' for the lower dentry (yes, it must be repeated if
necessary), and sets it to the aufs dentry/inode private data.


> It's not so much a practical limitation as an exploitable feature.
> E.g. an unprivileged user may use this to get an application into
> an error condition by asking for an invalid file name.

If a user specifies the prohibited filename, then he will get an error.


> Posix reserves a well-defined set of invalid file names, and
> deviation from this means that you are not compliant, and that
> in a potentially unexpected way.

Yes, the whiteout prefix is a limitation (or a feature).


> How does aufs know that one of its branches is an aufs itself?
> If you detect this, do you fold it into a single aufs instance with
> more branches?
> In case you don't do it, I don't see how you get around the stack
> overflow, but if you do it, you have again added a whole lot of
> complexity for something that should be trivial when done right.

- To detect the filesystem type is easy. Aufs can know whether the
  branch is aufs or not by checking s_magic or s_type->name.
- aufs doesn't fold? expand? the nested aufs branch.

You might be pointing out a general matter of stacking filesystems.
When one of branches is a stacking fs, and it is nested deeper and
deeper,
- /aufs1 = /rw1 + /aufs2
- /aufs2 = /rw2 + /aufs3
- /aufs3 = /rw3 + /aufs4
	:::
then the stack-overflow may happen. It is not limited to readdir, it can
happen in every operation. Basically aufs rejects an 'aufs/unionfs branch',
in other words an aufs branch of another aufs mount.
But aufs has a configuration option to enable this. When a user enables it
and sets up deeply nested aufs branches, it could happen. But this is the
same thing even if you use union-mount (and if UnionMount supports such
branches).
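
The difference between looping over branches and nesting stacked mounts
can be shown with a toy model (purely illustrative; `PlainFS`, `StackedFS`
and the `allow_nested` flag are made-up names, not aufs interfaces). The
`depth` value stands in for the number of stack frames in use:

```python
class PlainFS:
    """A non-stacking branch: lookup costs one frame and goes no deeper."""

    def __init__(self, names):
        self.names = set(names)

    def lookup(self, name, depth=1):
        return (name in self.names, depth)


class StackedFS:
    """Hypothetical stacking fs whose branches may themselves be stacked."""

    def __init__(self, branches, allow_nested=False):
        # Mirrors the default policy described above: reject a stacked
        # branch unless the user explicitly enables it.
        if not allow_nested and any(isinstance(b, StackedFS) for b in branches):
            raise ValueError("stacked branch rejected")
        self.branches = branches

    def lookup(self, name, depth=1):
        deepest = depth
        for branch in self.branches:  # looping adds no frames
            found, d = branch.lookup(name, depth + 1)  # calling down adds one
            deepest = max(deepest, d)
            if found:
                return True, deepest
        return False, deepest
```

With /aufs1 = /rw1 + /aufs2 and so on, the deepest frame count grows by one
per nesting level, no matter how many branches each level has.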


> I personally think that a policy other than writing to the top is crazy
> enough, but randomly writing to multiple places is much worse, as it
> becomes unpredictable what the file system does, not just unexpected.

I don't want you to call aufs users who are using such policies crazy.
By the way, what do you think about link(2) or rename(2)? When the source
file exists on the lower writable branch, do you think copy-up is the best
way? Or do you think all lower branches should be readonly?
There is an exception in aufs's branch-select policy. That is the
link/rename case. When the source file exists on a writable branch, aufs
tries to link/rename it on that branch under every policy. Do you think it
is best to do it on the top branch only?


Junjiro Okajima

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 12:56                   ` hooanon05
@ 2008-06-02 14:13                     ` Arnd Bergmann
  2008-06-02 14:33                       ` hooanon05
  0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02 14:13 UTC (permalink / raw)
  To: hooanon05
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

On Monday 02 June 2008, hooanon05@yahoo.co.jp wrote:
> I don't want you to call aufs users who are using such policies crazy.
> By the way, what do you think about link(2) or rename(2)? When the source
> file exists on the lower writable branch, do you think copy-up is the best
> way? Or do you think all lower branches should be readonly?
> There is an exception in aufs's branch-select policy. That is the
> link/rename case. When the source file exists on a writable branch, aufs
> tries to link/rename it on that branch under every policy. Do you think it
> is best to do it on the top branch only?

Yes, I tend to consider the union case identical to the cross-mount
move or link, so I'd expect the kernel to return errno=EXDEV and user
space to handle this by doing the appropriate copy/unlink as it does
for other cases already.
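
For reference, the userspace fallback looks roughly like the following
sketch. It handles regular files only, and `move` is a hypothetical helper
name; `os.rename`, `shutil.copy2` and `errno.EXDEV` are the standard
library pieces:

```python
import errno
import os
import shutil


def move(src, dst):
    """Rename src to dst, falling back to copy + unlink on EXDEV."""
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise  # any other error is reported unchanged
        # Cross-device case: copy data and metadata, then drop the source.
        shutil.copy2(src, dst)
        os.unlink(src)
```

Unlike rename(2), the fallback path is not atomic, which is the usual
price of handling EXDEV in user space.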

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 14:13                     ` Arnd Bergmann
@ 2008-06-02 14:33                       ` hooanon05
  2008-06-02 15:01                         ` Arnd Bergmann
  0 siblings, 1 reply; 32+ messages in thread
From: hooanon05 @ 2008-06-02 14:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch


Arnd Bergmann:
> > By the way, what do you think about link(2) or rename(2)? When the source
> > file exists on the lower writable branch, do you think copy-up is the best
> > way? Or do you think all lower branches should be readonly?
> > There is an exception in aufs's branch-select policy. That is the
> > link/rename case. When the source file exists on a writable branch, aufs
> > tries to link/rename it on that branch under every policy. Do you think it
> > is best to do it on the top branch only?
> 
> Yes, I tend to consider the union case identical to the cross-mount
> move or link, so I'd expect the kernel to return errno=EXDEV and user
> space to handle this by doing the appropriate copy/unlink as it does
> for other cases already.

Aufs rename returns EXDEV when the target is a dir and it has child
entr(y|ies) on lower branch(es). And mv(1) handles this case.
My English might have been misunderstood. Do you think link(2) should return an
error when the target exists on a lower writable branch?


Junjiro Okajima

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 11:15                 ` Arnd Bergmann
  2008-06-02 12:56                   ` hooanon05
@ 2008-06-02 14:54                   ` Evgeniy Polyakov
  2008-06-02 17:42                     ` Arnd Bergmann
  1 sibling, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2008-06-02 14:54 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: hooanon05, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel, linux-kernel, hch

Hi Arnd.

On Mon, Jun 02, 2008 at 01:15:40PM +0200, Arnd Bergmann (arnd@arndb.de) wrote:
> This is a very complicated approach, and I'm not sure if it even addresses
> the case where you have a shared mmap on both files. With VFS based union
> mounts, they share one inode, so you don't need to use idiotify in the first
> place, and it automatically works on shared mmaps.

Inotify has nothing in common with that - it notifies about inode updates,
which is the only thing needed for unionfs. VM and aufs vmops will take care of
reads and writes, since there is no duplication of the data here.

> I mean having your own dentry and inode object is duplication. The
> underlying file system already has them, so if you have your own,
> you need to keep them synchronized. I guess that in order to do
> a lookup on a file, you need the steps of
> 
> 1. lookup in aufs dentry cache -> fail
> 2. lookup in underlying dentry cache -> fail
> 3. try to read dentry from disk -> fail
> 4. repeat 2-3 until found, or arrive at lowest level 
> 5. create an inode in memory for the lower file system
> 6. create dentry in memory on lower file system, pointing
>    to that
> 7. create an aufs specific inode pointing to the underlying
>    inode
> 8. create an aufs specific dentry object to point to that
> 9. create a struct inode representing the aufs inode
> 10. create another VFS dentry to point to that
> 
> when you really should just return the dentry found by the
> lower file system.

Or it is a feature, and you should not return dentry for lower file
system, when you can have different objects pointing to the
same object.

> It's not so much a practical limitation as an exploitable feature.
> E.g. an unprivileged user may use this to get an application into
> an error condition by asking for an invalid file name.

Hmm... I believe if an exploit wants to do bad things and the system prevents
it, that is actually the right decision? But since you asked, I'm not sure
anymore...

> Posix reserves a well-defined set of invalid file names, and
> deviation from this means that you are not compliant, and that
> in a potentially unexpected way.

Everything has its own limitations. 256 bytes per name is a much bigger
problem, but everyone works with that.
It is a limitation, but a rather insignificant one IMO.

> I personally think that a policy other than writing to the top is crazy
> enough, but randomly writing to multiple places is much worse, as it
> becomes unpredictable what the file system does, not just unexpected.

Is this a double rot13 encoded "people will never use computers with
more than 640 kb of ram" phrase? :)

While working VFS union mounting does not exist, AUFS does work.
It is just another filesystem, which works and has big userbase. Any VFS
approach (when implemented) will work on its own and its implementation
does not depend on this particular fs.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 14:33                       ` hooanon05
@ 2008-06-02 15:01                         ` Arnd Bergmann
  2008-06-03 11:04                           ` hooanon05
  0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02 15:01 UTC (permalink / raw)
  To: hooanon05
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch

On Monday 02 June 2008, hooanon05@yahoo.co.jp wrote:
> Aufs rename returns EXDEV when the target is a dir and it has child
> entr(y|ies) on lower branch(es). And mv(1) handles this case.
> My English might have been misunderstood. Do you think link(2) should return an
> error when the target exists on a lower writable branch?

Any writes should always just go to the top level. If the source file
for link() exists on the top level, link should succeed even if a target
exists on a lower level (given that the user has permissions to
unlink that file), but should return EXDEV if the source comes from
a lower level.
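
Spelled out as a decision function, the rule might look like the sketch
below. It is only a model: `link_policy`, the layer numbering (0 is the
top) and the EACCES case for a missing unlink permission are assumptions
for illustration, and EEXIST for a target already on the top level is
simply ordinary link(2) behaviour.

```python
import errno

TOP = 0  # layer index of the single writable top level (assumed numbering)


def link_policy(src_layer, dst_layers, may_unlink_dst=True):
    """Return 0 on success or an errno value for link(src, dst).

    src_layer:      layer where the source file lives
    dst_layers:     layers where a file with the target name already exists
    may_unlink_dst: whether the caller could unlink a lower-level target
    """
    if src_layer != TOP:
        return errno.EXDEV   # source below the top: treat like cross-mount
    if TOP in dst_layers:
        return errno.EEXIST  # ordinary link(2) behaviour
    if dst_layers and not may_unlink_dst:
        return errno.EACCES  # assumed error for the permission case
    return 0                 # link is created on the top level
```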

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  7:12             ` Arnd Bergmann
  2008-06-02 10:36               ` hooanon05
@ 2008-06-02 15:35               ` Erez Zadok
  1 sibling, 0 replies; 32+ messages in thread
From: Erez Zadok @ 2008-06-02 15:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: hooanon05, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel, linux-kernel, hch

In message <200806020912.49721.arnd@arndb.de>, Arnd Bergmann writes:
> On Monday 02 June 2008, hooanon05@yahoo.co.jp wrote:
[...]
> Ok, I'm sorry for not having looked at it myself. I saw an older version
> and assumed it was not going to improve much. I'll have another look
> when I find the time. Unionfs was suffering from severe feature creep
> (multiple writable branches, runtime branch modification), and aufs
> seemed to add even more features instead of removing them.

Re: feature creep.  Unionfs had more features initially, but we removed
those that users didn't seem to want/use.  The bottom line, we've been
maintaining unionfs publicly for 5+ years now, so the set of features we
have is based exactly on what users want.  If anyone can give the users what
they want/need in a different, more elegant way, that's great; if not, users
just won't switch to another solution.

> Without reading either again, the top problems in unionfs at the time were:
> * data inconsistency problems when simultaneously accessing the underlying
>   fs and the union.

That's not an issue when using vm_ops->fault for data.

There is still an issue wrt dentries and topology changes, as Al mentioned
here before.  Al suggested to me (at OLS 08) that the superblock struct
might need the same writers-count as has been done for vfsmounts recently;
then you can prevent topology changes during union'ed operations
(esp. copyup).

> * duplication of dentry and inode data structures in the union wastes
>   memory and cpu cycles.

Yes, but I don't think it's much more than any other layered solution will
have (including ecryptfs and union mounts).  This is a general problem in
stackable file systems.  Union Mounts, being in the VFS, has the chance to
use less memory indeed, but at a possible cost of increased VFS complexity.

> * whiteouts are in the same namespace as regular files, so conflicts are
>   possible.

Agreed.  We have a different version of unionfs, called unionfs-odf, which
moves the whiteouts and all unioning-related meta-data to a separate, small
persistent partition.

But the better long-term solution is to get WH support in every native f/s.
These patches had been floating around for a while now, and they seem simple
enough that I don't see why it had taken so long to get basic WH support
into mainline (or at least -mm).  (Bharata, can you ask akpm to add just the
WH support into -mm perhaps?)

> * mounting a large number of aufs on top of each other eventually
>   overflows the kernel stack, e.g. in readdir.

Yes.  That's a general problem with stackable file systems.  Each layer you
add increases the depth of the stack.  There are already known paths
(involving xfs/nfs/dm, etc.) which overrun the stack and the solution I've
heard was "don't do it."  That seems silly to me.  Instead, the kernel stack
should be growable dynamically, at the cost of performance.

However, the vast majority of unioning users use just one layer, so even for
us, blowing up the stack has been a rather rare user complaint.  And we've
been very mindful of stack usage (i.e., checking and optimizing based on
checkstack.pl).

> * allowing multiple writable branches (instead of just stacking
>   one rw copy on a number of ro file systems) is confusing to the user
>   and complicates the implementation a lot.

I agree that it does complicate the implementation, but again, this is
something that _some_ users really want: they want to merge multiple
"packages" together, and ensure that modifications to files/dirs of a given
package stay in their logical location.

I disagree with you that it's confusing to the user.  I've never had
complaints that people didn't know how to change the branch configurations
dynamically.  Heck, people come up with creative ways of using dynamic
branch configurations in all sorts of funky environments that make even my
head spin :-) -- chroot, pivot_root, nfs exports, etc.

> With the exception of the last two, I assumed that these were all
> unfixable with a file system based approach (including the hypothetical
> union-tmpfs). If you have addressed them, how?
> 
> 	Arnd <><

Erez.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 14:54                   ` Evgeniy Polyakov
@ 2008-06-02 17:42                     ` Arnd Bergmann
  0 siblings, 0 replies; 32+ messages in thread
From: Arnd Bergmann @ 2008-06-02 17:42 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: hooanon05, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel, linux-kernel, hch

On Monday 02 June 2008, Evgeniy Polyakov wrote:
> > I personally think that a policy other than writing to the top is crazy
> > enough, but randomly writing to multiple places is much worse, as it
> > becomes unpredictable what the file system does, not just unexpected.
> 
> Is this a double rot13 encoded "people will never use computers with
> more than 640 kb of ram" phrase? :)

No, it's more the "people don't need variable block size drives" argument.
They've been working fine for decades on mainframes, are incredibly
complicated to build and entirely pointless in practice ;-)

	Arnd <><

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02  7:51               ` Arnd Bergmann
@ 2008-06-02 18:13                 ` Erez Zadok
  2008-06-03  2:02                   ` Phillip Lougher
  0 siblings, 1 reply; 32+ messages in thread
From: Erez Zadok @ 2008-06-02 18:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Erez Zadok, Jamie Lokier, Phillip Lougher, David Newall,
	linux-fsdevel, linux-kernel, hch

In message <200806020951.26868.arnd@arndb.de>, Arnd Bergmann writes:
> On Monday 02 June 2008, Erez Zadok wrote:

> > Arnd, I favor a more generic approach, one that will work with the vast
> > majority of file systems that people use w/ unioning, preferably all of
> > them.  Supporting copy-on-write in cramfs will only help a small subset of
> > users.  Yes, it might be simple, but I fear it won't be useful enough to
> > convince existing users of unioning to switch over.  And I don't think we
> > should add CoW support in every file system -- the complexity will be much
> > more than using unionfs or some other VFS-based solution.
> 
> My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
[...]

Ah, ok.  Doing those 3 will get better coverage for existing users.  The
question may come to how much code complexity does it add to each, and
whether some common code can be excised into generic helpers?

Arnd, my concern is that it might take a long time to see those in mainline.
Look at the status of whiteouts support in native file systems (just
whiteouts, not duplicate elimination): after months of trials and several
posts, those patches aren't even in -mm.  And those are relatively simple
patches.  I can search for Viro's posting when he said he could hack it all
in one weekend; ok so maybe *he* can :-), but the point is that even with
Viro's tentative support of whiteouts, we're still not closer to having WH
support in mainline.

Who knows, maybe if you managed to get _something_ into mainline, it'll help
the overall effort move along; right now I fear there are too many strong
opinions on all sides that the effort is stuck.

[...]
> I'll probably try implementing a '-o union' option tmpfs anyway, just
> to see how hard it is and what the problems are.

And I'll be happy to test it for you (read: find bugs :-).  I've built a
large set of unioning-related regression tests over the years.

> 	Arnd <><

Erez.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 18:13                 ` Erez Zadok
@ 2008-06-03  2:02                   ` Phillip Lougher
  0 siblings, 0 replies; 32+ messages in thread
From: Phillip Lougher @ 2008-06-03  2:02 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Arnd Bergmann, Jamie Lokier, David Newall, linux-fsdevel,
	linux-kernel, hch

Erez Zadok wrote:
> In message <200806020951.26868.arnd@arndb.de>, Arnd Bergmann writes:
>> On Monday 02 June 2008, Erez Zadok wrote:
> 
>>> Arnd, I favor a more generic approach, one that will work with the vast
>>> majority of file systems that people use w/ unioning, preferably all of
>>> them.  Supporting copy-on-write in cramfs will only help a small subset of
>>> users.  Yes, it might be simple, but I fear it won't be useful enough to
>>> convince existing users of unioning to switch over.  And I don't think we
>>> should add CoW support in every file system -- the complexity will be much
>>> more than using unionfs or some other VFS-based solution.
>> My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
> [...]
> 
> Ah, ok.  Doing those 3 will get better coverage for existing users.  The
> question may come to how much code complexity does it add to each, and
> whether some common code can be excised into generic helpers?
> 

Yes, that's what I'm interested in.  From my reading of the patches, the 
general approach and a lot of the code should be directly useable in a 
fake-writable Squashfs.  The first step (a very big first step) is to 
get readonly Squashfs mainlined, which is what I'm working on at the 
moment.  After that I'll be very interested in looking at fake-write 
support and factoring any common code into generic helpers.

Phillip

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] [RFC] cramfs: fake write support
  2008-06-02 15:01                         ` Arnd Bergmann
@ 2008-06-03 11:04                           ` hooanon05
  0 siblings, 0 replies; 32+ messages in thread
From: hooanon05 @ 2008-06-03 11:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jamie Lokier, Phillip Lougher, David Newall, linux-fsdevel,
	linux-kernel, hch


Arnd Bergmann:
> Any writes should always just go to the top level. If the source file
> for link() exists on the top level, link should succeed even if a target
> exists on a lower level (given that the user has permissions to
> unlink that file), but should return EXDEV if the source comes from
> a lower level.

Then what will happen when a user builds a union by "empty tmpfs" +
"cramfs"? Following your design, link(2) becomes useless in stacking fs.
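
The quoted rule can be written down as a small user-space model, which
also shows the empty-top-layer case.  This is only a sketch of the rule as
quoted: the Union class and its fields are hypothetical, and permission
checks are left out.

```python
import errno


class Union:
    """Toy two-layer union: writes only ever touch the top layer."""

    def __init__(self, top, lower):
        self.top = set(top)      # names present in the writable top layer
        self.lower = set(lower)  # names present in the read-only lower layer

    def link(self, src, dst):
        """Return 0 on success, or a positive errno value on failure."""
        if src in self.top:
            # Source is on the top layer: the link may succeed even if
            # dst exists on the lower layer (permission checks omitted).
            if dst in self.top:
                return errno.EEXIST
            self.top.add(dst)
            return 0
        if src in self.lower:
            # Source exists only on the lower layer: refuse, as for a
            # link across devices.
            return errno.EXDEV
        return errno.ENOENT


# Empty tmpfs over a populated cramfs: every pre-existing file is a
# lower-layer source, so link() on it fails with EXDEV.
u = Union(top=[], lower=["bin/sh"])
print(u.link("bin/sh", "bin/sh2") == errno.EXDEV)
```

In this model only files created after mounting can be linked, since
nothing else ever lives on the top layer.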

You may be considering implementing a new dynamic link library for
stacking.
Hmm, that is interesting. It may be worth thinking about.


Junjiro Okajima

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2008-06-03 11:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-31 15:37 [RFC 0/7] [RFC] cramfs: fake write support arnd
2008-05-31 18:56 ` David Newall
2008-05-31 20:40   ` Arnd Bergmann
2008-06-01  3:54     ` Phillip Lougher
2008-06-01  8:52       ` Arnd Bergmann
2008-06-01 12:28       ` Jamie Lokier
2008-06-01 21:49         ` Arnd Bergmann
2008-06-02  2:48           ` hooanon05
2008-06-02  3:25             ` Erez Zadok
2008-06-02  7:51               ` Arnd Bergmann
2008-06-02 18:13                 ` Erez Zadok
2008-06-03  2:02                   ` Phillip Lougher
2008-06-02  3:51             ` Erez Zadok
2008-06-02 11:07               ` Jamie Lokier
2008-06-02  4:37             ` Erez Zadok
2008-06-02  6:07               ` Bharata B Rao
2008-06-02  7:17               ` Jan Engelhardt
2008-06-02  7:12             ` Arnd Bergmann
2008-06-02 10:36               ` hooanon05
2008-06-02 11:15                 ` Arnd Bergmann
2008-06-02 12:56                   ` hooanon05
2008-06-02 14:13                     ` Arnd Bergmann
2008-06-02 14:33                       ` hooanon05
2008-06-02 15:01                         ` Arnd Bergmann
2008-06-03 11:04                           ` hooanon05
2008-06-02 14:54                   ` Evgeniy Polyakov
2008-06-02 17:42                     ` Arnd Bergmann
2008-06-02 15:35               ` Erez Zadok
2008-06-01  6:02     ` David Newall
2008-06-01  9:11       ` Jan Engelhardt
2008-06-01 16:25       ` Jörn Engel
2008-06-01  3:19 ` Phillip Lougher
