* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Dave Chinner @ 2018-05-08 23:38 UTC
To: Mark Fasheh; +Cc: linux-fsdevel, linux-kernel, linux-btrfs

On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> Hi,
>
> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>
> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.

Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.

The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.

So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?

> In addition, it's currently impossible for a filesystem subvolume to have a different security context from its parent. If we could allow for subvolumes to optionally specify their own security context, we could use them as containers directly instead of having to go through an overlay.

Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes.

> I ran into this particular problem with respect to Btrfs some years ago and sent out a very naive set of patches which were (rightfully) not incorporated:
>
> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>
> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>
> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.

That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...

> - Writeback would also have to walk all of these superblocks - again not very good for system performance.

Background writeback is backing device focussed, not superblock focussed.
It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks.....

> - Anyone wanting to lock down I/O on a filesystem would have to freeze all the superblocks. This goes for most things related to I/O really - we simply can't afford to have the kernel walking thousands of superblocks to sync a single fs.

Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen.

So, again, I'm not sure there's a /subvolume/ problem here. There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation.

> It's far more efficient, then, to pull those fields we need for a subvolume namespace into their own structure.

I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Jeff Mahoney @ 2018-05-09 2:06 UTC
To: Dave Chinner, Mark Fasheh; +Cc: linux-fsdevel, linux-kernel, linux-btrfs

On 5/8/18 7:38 PM, Dave Chinner wrote:
> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>> Hi,
>>
>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>
>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>
> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>
> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.

That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do. There is no independence among btrfs subvolumes.
When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it. It's a single storage pool and a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.

> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.

In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags. It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock. The lifetime rules for e.g. the s_dev follow that middle layer, and a vfsmount can disappear well before the inode does.

> So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?

Those are two separate problems.
Using a vfsmount to export the btrfs_root is on my roadmap. I have a WIP patch set that automounts the subvolumes when stepping into a new one, but it's to fix a longstanding UX wart. Ultimately, vfsmounts are at the wrong level to solve the inode namespace problem. Again, there's the lifetime issue. There are also many places where we only have an inode and need the s_dev associated with it. Most of these sites are well removed from having access to a vfsmount, and pinning one and passing it around carries no other benefit.

>> In addition, it's currently impossible for a filesystem subvolume to have a different security context from its parent. If we could allow for subvolumes to optionally specify their own security context, we could use them as containers directly instead of having to go through an overlay.
>
> Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes.
>
>> I ran into this particular problem with respect to Btrfs some years ago and sent out a very naive set of patches which were (rightfully) not incorporated:
>>
>> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>>
>> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>>
>> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.
>
> That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...
Overlay mounts are independent of one another and don't need coordination among them. The memory usage is relatively unimportant. The important part is having a bunch of superblocks that all correspond to the same resources and coordinating them at the VFS level. Your assumptions below follow how your XFS subvolumes work, where there's a clear hierarchy between the subvolumes and the master filesystem with a mapping layer between them. Btrfs subvolumes have no such hierarchy. Everything is shared. So while we could use a writeback hierarchy to merge all of the inode lists before doing writeback on the 'master' superblock, we'd gain nothing by it. Handling anything involving s_umount with a superblock per subvolume would be a nightmare. Ultimately, it would be a ton of effort that amounts to working around the VFS instead of with it.

>> - Writeback would also have to walk all of these superblocks - again not very good for system performance.
>
> Background writeback is backing device focussed, not superblock focussed. It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks.....
>
>> - Anyone wanting to lock down I/O on a filesystem would have to freeze all the superblocks. This goes for most things related to I/O really - we simply can't afford to have the kernel walking thousands of superblocks to sync a single fs.
>
> Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen.
>
> So, again, I'm not sure there's a /subvolume/ problem here.
> There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation.

There's only a freeze hierarchy problem if we have to use multiple superblocks. Otherwise, we freeze the whole thing or we don't. Trying to freeze a single subvolume would be an illusion for the user since the underlying file system would still be active underneath it. Under the hood, things like relocation don't even look at what subvolume owns a particular extent until it must. So it could be coordinating thousands of superblocks to do what a single lock does now - and for what benefit?

>> It's far more efficient, then, to pull those fields we need for a subvolume namespace into their own structure.
>
> I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem....

It needs to be between the inode and the superblock. If there are multiple user-visible namespaces, each will still get the same underlying file system namespace.

-Jeff

--
Jeff Mahoney
SUSE Labs
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Dave Chinner @ 2018-05-09 6:41 UTC
To: Jeff Mahoney; +Cc: Mark Fasheh, linux-fsdevel, linux-kernel, linux-btrfs

On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
> On 5/8/18 7:38 PM, Dave Chinner wrote:
>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>>> Hi,
>>>
>>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>>
>>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>>
>> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>>
>> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.
>
> That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do.
Yup, you've just proved my point: this is not a "subvolume problem", but rather a "multiple namespace per root" problem.

> There is no independence among btrfs subvolumes. When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it.

I don't need btrfs 101 stuff explained to me. :/

> a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.

Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then let's see if we can make that work for btrfs too.

>> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
>
> In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags.

Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality.
How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality.

FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. overlay recently introduced the "xino" mount option to separate the user-presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent?

This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more.

My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave the same way. And in this case, the vfsmount maps exactly to the "fs_view" that Mark has proposed we add to the superblock.....

> It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock.

Which seems to me to be irrelevant if there's a vfsmount per subvolume that can hold per-subvolume information.

>> So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock.
>> To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?
>
> Those are two separate problems. Using a vfsmount to export the btrfs_root is on my roadmap. I have a WIP patch set that automounts the subvolumes when stepping into a new one, but it's to fix a longstanding UX wart.

IMO that's more than a UX wart - the lack of vfsmounts for internal subvolume mount point traversals could be considered the root cause of the issues we are discussing here. Extending the mounted namespace should trigger vfs mounts, not be hidden deep inside the filesystem. Hence I'd suggest this needs changing before anything else....

>>> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>>>
>>> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.
>>
>> That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...
>
> Overlay mounts are independent of one another and don't need coordination among them. The memory usage is relatively unimportant. The important part is having a bunch of superblocks that all correspond to the same resources and coordinating them at the VFS level. Your assumptions below follow how your XFS subvolumes work, where there's a clear hierarchy between the subvolumes and the master filesystem with a mapping layer between them. Btrfs subvolumes have no such hierarchy. Everything is shared.
Yup, that's the impedance mismatch between the VFS infrastructure and btrfs that I was talking about. What I'm trying to communicate is that I think the proposal is attacking the impedance mismatch from the wrong direction.

i.e. The proposal is to modify btrfs code to propagate stuff that btrfs needs to know to deal with its internal "everything is shared" problems up into the VFS, where it's probably not useful to anything other than btrfs. We already have the necessary construct in the VFS - I think we should be trying to use the information held by the generic VFS infrastructure to solve the specific btrfs issue at hand....

> So while we could use a writeback hierarchy to merge all of the inode lists before doing writeback on the 'master' superblock, we'd gain nothing by it. Handling anything involving s_umount with a superblock per subvolume would be a nightmare. Ultimately, it would be a ton of effort that amounts to working around the VFS instead of with it.

I'm not suggesting that btrfs needs to use a superblock per subvolume. Please don't confuse my statements along the lines of "this doesn't seem to be a problem for others" with "you must change btrfs to do this". I'm just saying that the problems arising from using a superblock per subvolume are not as dire as is being implied.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Jeff Mahoney @ 2018-06-05 20:17 UTC
To: Dave Chinner; +Cc: Mark Fasheh, linux-fsdevel, linux-kernel, linux-btrfs

Sorry, just getting back to this.

On 5/9/18 2:41 AM, Dave Chinner wrote:
> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
>> On 5/8/18 7:38 PM, Dave Chinner wrote:
>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>>>> Hi,
>>>>
>>>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>>>
>>>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>>>
>>> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>>>
>>> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.
>> That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do.
>
> Yup, you've just proved my point: this is not a "subvolume problem", but rather a "multiple namespace per root" problem.

Mark framed it as a namespace issue initially. The btrfs subvolume problem is the biggest one we're concerned with. More below.

>> There is no independence among btrfs subvolumes. When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it.
>
> I don't need btrfs 101 stuff explained to me. :/

No offense meant. This was crossposted to a few lists.

>> a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.
>
> Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then let's see if we can make that work for btrfs too.

I'm glad that the VFS provides what you need for XFS subvolumes, but you're effectively creating independent file systems using a backend storage remapping technique. That's pretty much exactly what the VFS is already designed to do. Overlayfs is definitely not just VFS-based CoW subvolumes. I ended up having to wade more deeply into overlayfs than I would've liked to respond to your email. I'd say it's better described as: overlayfs can abuse what the VFS offers just enough to work well in common cases.
It has a similar disconnect between the superblock and inode to btrfs, except that it's not as easily resolved since the dentry belongs to overlayfs and the inode belongs to the underlying file system. There are still places getting this wrong in the kernel. In order to reliably present the right inode/dev pair for overlayfs, we need the dentry to go with it. We don't always have it and, even if we did, we'd need both the dentry and inode. At any rate, anything with a ->getattr that modifies dev or ino so that it's different from inode->i_sb->s_dev or inode->i_ino will misbehave somehow. The question is just how badly, which depends on the context. It could be something simple like /proc/pid/maps not showing the right device/ino pair.

Many file systems that use 64-bit inode numbers already remap them on 32-bit systems, which can already cause collisions (even inside the kernel) when comparisons are made. There's no excuse to not use 64-bit inode numbers everywhere inside the kernel now, especially since we have the capability to export them as 64-bit inode numbers on 32-bit systems now via statx.

One thing is clear: if we want to solve the btrfs and overlayfs problems in the same way, the view approach with a simple static mapping doesn't work. Sticking something between the inode and superblock doesn't get the job done when the inode belongs to a different file system. Overlayfs needs a per-object remapper, which means a callback that takes a path. Suddenly the way we do things in the SUSE kernel doesn't seem so hacky anymore.

I'm not sure we need the same solution for btrfs and overlayfs. It's not the same problem. Every object in overlayfs has a unique mapping already. If we report s_dev and i_ino from the inode, it still maps to a unique user-visible object. It may not map back to the overlayfs name, but that's a separate issue that's more difficult to solve. The btrfs issue isn't one of presenting an alternative namespace to the user.
Btrfs has multiple internal namespaces and no way to publish them to the rest of the kernel.

>>> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
>>
>> In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags.
>
> Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality.
>
> How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality.
>
> FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. overlay recently introduced the "xino" mount option to separate the user-presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent?
>
> This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more.

Djalal's work? That rightfully belongs above the file system layer.
The only bit that's needed within the file system is the part that sets the super flag to enable it. The different namespaces have nothing to do with the file system below it.

> My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave the same way. And in this case, the vfsmount maps exactly to the "fs_view" that Mark has proposed we add to the superblock.....

It's proposed to be above the superblock, with a default view in the superblock. It would sit between the inode and the superblock so we have access to it anywhere we already have an inode. That's the main difference. We already have the inode everywhere it's needed. Plumbing a vfsmount everywhere needed means changing code that only requires an inode and doesn't need a vfsmount. The two biggest problem areas:

- Writeback tracepoints print a dev/inode pair. Do we want to plumb a vfsmount into __mark_inode_dirty, super_operations->write_inode, __writeback_single_inode, writeback_sb_inodes, etc?

- Audit. As it happens, most of audit has a path or file that can be used. We do run into problems with fsnotify. fsnotify_move is called from vfs_rename, which turns into a can of worms pretty quickly.

>> It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock.
>
> Which seems to me to be irrelevant if there's a vfsmount per subvolume that can hold per-subvolume information.

I disagree. There are a ton of places where we only have access to an inode and only need access to an inode. It also doesn't solve the overlayfs issue.
>>> So this kinda feels like there's still an impedance mismatch between >>> btrfs subvolumes being mounted as subtrees on the underlying root >>> vfsmount rather than being created as truly independent vfs >>> namespaces that share a superblock. To put that as a question: why >>> aren't btrfs subvolumes vfsmounts in their own right, and the unique >>> subvolume information gets stored in (or obtained from) >>> the vfsmount? >> >> Those are two separate problems. Using a vfsmount to export the >> btrfs_root is on my roadmap. I have a WIP patch set that automounts the >> subvolumes when stepping into a new one, but it's to fix a longstanding >> UX wart. > > IMO that's more than a UX wart - the lack of vfsmounts for internal > subvolume mount point traversals could be considered the root cause > of the issues we are discussing here. Extending the mounted > namespace should trigger vfs mounts, not be hidden deep inside the > filesystem. Hence I'd suggest this needs changing before anything > else.... I disagree. Yes, it would be nice if btrfs did this, but it doesn't get us the dev/ino pair to places where we need one, like the depths of audit. >>>> During the discussion, one question did come up - why can't >>>> filesystems like Btrfs use a superblock per subvolume? There's a >>>> couple of problems with that: >>>> >>>> - It's common for a single Btrfs filesystem to have thousands of >>>> subvolumes. So keeping a superblock for each subvol in memory would >>>> get prohibitively expensive - imagine having 8000 copies of struct >>>> super_block for a file system just because we wanted some separation >>>> of say, s_dev. >>> >>> That's no different to using individual overlay mounts for the >>> thousands of containers that are on the system. This doesn't seem to >>> be a major problem... >> >> Overlay mounts are independent of one another and don't need coordination >> among them. The memory usage is relatively unimportant. 
The important >> part is having a bunch of superblocks that all correspond to the same >> resources and coordinating them at the VFS level. Your assumptions >> below follow how your XFS subvolumes work, where there's a clear >> hierarchy between the subvolumes and the master filesystem with a >> mapping layer between them. Btrfs subvolumes have no such hierarchy. >> Everything is shared. > > Yup, that's the impedance mismatch between the VFS infrastructure > and btrfs that I was talking about. What I'm trying to communicate > is that I think the proposal is attacking the impedance mismatch > from the wrong direction. > > i.e. The proposal is to modify btrfs code to propagate stuff that > btrfs needs to know to deal with its internal "everything is > shared" problems up into the VFS where it's probably not useful to > anything other than btrfs. We already have the necessary construct > in the VFS - I think we should be trying to use the information held > by the generic VFS infrastructure to solve the specific > btrfs issue at hand.... The many namespaces to one I/O domain relationship isn't a btrfs-internal problem. It's visible to the user and it's visible inside the kernel. The view implementation gives the VFS a generic mechanism to allow it to represent that. The plan isn't to only use the view for that but to also allow separate security contexts for things like containers. Those have a similar issue to the inode availability discussed above. We have superblocks where we need them. We could have a view just as easily. Plumbing a vfsmount is the wrong approach. >> So while we could use a writeback hierarchy to >> merge all of the inode lists before doing writeback on the 'master' >> superblock, we'd gain nothing by it. Handling anything involving >> s_umount with a superblock per subvolume would be a nightmare. >> Ultimately, it would be a ton of effort that amounts to working around >> the VFS instead of with it. 
> > I'm not suggesting that btrfs needs to use a superblock per > subvolume. Please don't confuse my statements along the lines of > "this doesn't seem to be a problem for others" with "you must change > btrfs to do this". I'm just saying that the problems arising from > using a superblock per subvolume are not as dire as is being > implied. It's possible to solve the btrfs problem this way but it adds a bunch of complexity for... what benefit, exactly? As I said, it's working around the VFS instead of working with it. As Mark initially stated, we handle I/O and namespace domains. Btrfs has many namespaces and only one I/O domain. -Jeff -- Jeff Mahoney SUSE Labs [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-05 20:17 ` Jeff Mahoney @ 2018-06-06 9:49 ` Amir Goldstein 2018-06-06 20:42 ` Mark Fasheh 2018-06-06 21:19 ` Jeff Mahoney 0 siblings, 2 replies; 10+ messages in thread From: Amir Goldstein @ 2018-06-06 9:49 UTC (permalink / raw) To: Jeff Mahoney Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: > Sorry, just getting back to this. > > On 5/9/18 2:41 AM, Dave Chinner wrote: >> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: >>> On 5/8/18 7:38 PM, Dave Chinner wrote: >>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: >>>>> Hi, >>>>> >>>>> The VFS's super_block covers a variety of filesystem functionality. In >>>>> particular we have a single structure representing both I/O and >>>>> namespace domains. >>>>> >>>>> There are requirements to de-couple this functionality. For example, >>>>> filesystems with more than one root (such as btrfs subvolumes) can >>>>> have multiple inode namespaces. This starts to confuse userspace when >>>>> it notices multiple inodes with the same inode/device tuple on a >>>>> filesystem. >>>> Speaking as someone who joined this discussion late, maybe years after it started, it would help to get an overview of existing problems and how fs_view aims to solve them. I do believe that both Overlayfs and Btrfs can benefit from a layer of abstraction in the VFS, but I think it is best if we start with laying out all the common problems and then see what a solution would look like. Even the name of the abstraction (fs_view) doesn't make it clear to me what it is we are abstracting (security context? st_dev? what else?). Probably best to try to describe the abstraction from user POV rather than give sporadic examples of what MAY go into fs_view. 
While at it, need to see if this discussion has any intersections with David Howells' fs_context work, because if we consider adding subvolume support to VFS, we may want to leave some reserved bits in the API for it. [...] > One thing is clear: If we want to solve the btrfs and overlayfs problems > in the same way, the view approach with a simple static mapping doesn't > work. Sticking something between the inode and superblock doesn't get > the job done when the inode belongs to a different file system. Overlayfs > needs a per-object remapper, which means a callback that takes a path. > Suddenly the way we do things in the SUSE kernel doesn't seem so hacky > anymore. > And what is the SUSE way? > I'm not sure we need the same solution for btrfs and overlayfs. It's > not the same problem. Every object in overlayfs has a unique mapping > already. If we report s_dev and i_ino from the inode, it still maps to > a unique user-visible object. It may not map back to the overlayfs > name, but that's a separate issue that's more difficult to solve. The > btrfs issue isn't one of presenting an alternative namespace to the > user. Btrfs has multiple internal namespaces and no way to publish them > to the rest of the kernel. > FYI, the Overlayfs file/inode mapping is about to change with many VFS hacks queued for removal, so stay tuned. [...] 
Plumbing > a vfsmount everywhere needed means changing code that only requires an > inode and doesn't need a vfsmount. > > The two biggest problem areas: > - Writeback tracepoints print a dev/inode pair. Do we want to plumb a > vfsmount into __mark_inode_dirty, super_operations->write_inode, > __writeback_single_inode, writeback_sb_inodes, etc? > - Audit. As it happens, most of audit has a path or file that can be > used. We do run into problems with fsnotify. fsnotify_move is called > from vfs_rename which turns into a can of worms pretty quickly. > Can you please elaborate on that problem. Do you mean when watching a directory for changes, you need to be able to tell in which fs_view the watched directory inode resides? >>> It makes sense for that to be above the >>> superblock because the file system doesn't care about them. We're >>> interested in the inode namespace, which for every other file system can >>> be described using an inode and a superblock pair, but btrfs has another >>> layer in the middle: inode -> btrfs_root -> superblock. >> >> Which seems to me to be irrelevant if there's a vfsmount per >> subvolume that can hold per-subvolume information. > > I disagree. There are a ton of places where we only have access to an > inode and only need access to an inode. It also doesn't solve the > overlayfs issue. > I have an interest in solving another problem. In VFS operations where only the inode is available, I would like to be able to report fsnotify events (e.g. fsnotify_move()) only in directories under a certain subtree root. That could be achieved either by bind mounting the subtree root and passing the vfsmount into vfs_rename() or by defining an fs_view on the subtree and mounting that fs_view. Thanks, Amir. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 9:49 ` Amir Goldstein @ 2018-06-06 20:42 ` Mark Fasheh 2018-06-07 6:06 ` Amir Goldstein 2018-06-06 21:19 ` Jeff Mahoney 1 sibling, 1 reply; 10+ messages in thread From: Mark Fasheh @ 2018-06-06 20:42 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara Hi Amir, thanks for the comments! On Wed, Jun 06, 2018 at 12:49:48PM +0300, Amir Goldstein wrote: > On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: > > Sorry, just getting back to this. > > > > On 5/9/18 2:41 AM, Dave Chinner wrote: > >> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: > >>> On 5/8/18 7:38 PM, Dave Chinner wrote: > >>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > >>>>> Hi, > >>>>> > >>>>> The VFS's super_block covers a variety of filesystem functionality. In > >>>>> particular we have a single structure representing both I/O and > >>>>> namespace domains. > >>>>> > >>>>> There are requirements to de-couple this functionality. For example, > >>>>> filesystems with more than one root (such as btrfs subvolumes) can > >>>>> have multiple inode namespaces. This starts to confuse userspace when > >>>>> it notices multiple inodes with the same inode/device tuple on a > >>>>> filesystem. > >>>> > > Speaking as someone who joined this discussion late, maybe years after > it started, it would help to get an overview of existing problems and how > fs_view aims to solve them. > > I do believe that both Overlayfs and Btrfs can benefit from a layer of > abstraction > in the VFS, but I think it is best if we start with laying all the > common problems > and then see how a solution would look like. > > Even the name of the abstraction (fs_view) doesn't make it clear to me what > it is we are abstracting (security context? st_dev? what else?). 
probably best > to try to describe the abstraction from user POV rather than give sporadic > examples of what MAY go into fs_view. The first problem fs_view was intended to fix was the s_dev issue, where some filesystems return a different device from stat(2) than what the rest of the kernel exports. If you look at the e-mail record I referenced: https://marc.info/?l=linux-btrfs&m=130074451403261&w=2 you'll see how that can confuse some userspace software when it tries to take the ino/dev pair it gets from one portion of the kernel (for example, /proc/pid/maps) and then use it to resolve to an inode on the system. As it turns out, there are other places where we might want to decouple some superblock fields, hence why I mention security contexts. I am admittedly less familiar with that particular problem but as I understand it, containers on a btrfs subvolume have to go through an overlay to route around the default security context. With fs_view we could point subvolumes to their own security contexts. Btw, sorry that the name is confusing. I've never been good at picking them. That said, if you have a minute to check out the first patch or two you'll see that the patches are basically putting a struct in between the super block and the inode. One thing I'd like to politely suggest is that anyone now following this conversation please read at least the first patch. It's an easy read and I feel like the conversation overall would be much more clear if everyone understood what we're going for. I worry that I didn't do a particularly good job explaining the actual code changes. https://www.spinics.net/lists/linux-fsdevel/msg125492.html Regarding a layout of the problems, I have a more complete e-mail coming soon which should describe in detail the issues I've seen with respect to how the kernel is exporting ino/dev pairs (outside of statx). fs_view alone is not enough to solve that problem. I'd be happy to CC you on that one if you'd like. 
> While at it, need to see if this discussion has any intersections with David > Howell's fs_context work, because if we consider adding sub volume support > to VFS, we may want to leave some reserved bits in the API for it. > > [...] > > > One thing is clear: If we want to solve the btrfs and overlayfs problems > > in the same way, the view approach with a simple static mapping doesn't > > work. Sticking something between the inode and superblock doesn't get > > the job done when the belongs to a different file system. Overlayfs > > needs a per-object remapper, which means a callback that takes a path. > > Suddenly the way we do things in the SUSE kernel doesn't seem so hacky > > anymore. > > > > And what is the SUSE way? At SUSE, we carry a version of this patch: https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 Essentially a callback which was rejected for various reasons. The fs_view work was intended to replace that patch with an upstream solution. > > I'm not sure we need the same solution for btrfs and overlayfs. It's > > not the same problem. Every object in overlayfs as a unique mapping > > already. If we report s_dev and i_ino from the inode, it still maps to > > a unique user-visible object. It may not map back to the overlayfs > > name, but that's a separate issue that's more difficult to solve. The > > btrfs issue isn't one of presenting an alternative namespace to the > > user. Btrfs has multiple internal namespaces and no way to publish them > > to the rest of the kernel. > > > > FYI, the Overlayfs file/inode mapping is about to change with many > VFS hacks queued for removal, so stay tuned. > > [...] Actually, would you mind giving me a pointer to this work? I'd be very interested to see what exactly might be changing. 
> >> My point is that if we are talking about infrastructure to remap > >> what userspace sees from different mountpoint views into a > >> filesystem, then it should be done above the filesystem layers in > >> the VFS so all filesystems behave the same way. And in this case, > >> the vfsmount maps exactly to the "fs_view" that Mark has proposed we > >> add to the superblock..... > > > > It's proposed to be above the superblock with a default view in the > > superblock. It would sit between the inode and the superblock so we > > have access to it anywhere we already have an inode. That's the main > > difference. We already have the inode everywhere it's needed. Plumbing > > a vfsmount everywhere needed means changing code that only requires an > > inode and doesn't need a vfsmount. > > > > The two biggest problem areas: > > - Writeback tracepoints print a dev/inode pair. Do we want to plumb a > > vfsmount into __mark_inode_dirty, super_operations->write_inode, > > __writeback_single_inode, writeback_sb_inodes, etc? > > - Audit. As it happens, most of audit has a path or file that can be > > used. We do run into problems with fsnotify. fsnotify_move is called > > from vfs_rename which turns into a can of worms pretty quickly. > > > > Can you please elaborate on that problem. > Do you mean when watching a directory for changes, you need to > be able to tell in which fs_view the directory inode that is being watched? Here Jeff is responding to Dave's suggestion that we use a vfsmount instead of the fs_view. You cannot get to a vfsmount from an inode or even dentry alone, so this implies that we'd be passing a vfsmount down a ton of code that otherwise doesn't care about it. fs_view has the advantage that it is accessible from the inode so those places which ONLY have an inode can easily get to the correct device (the code changes make this pretty clear). > >>> It makes sense for that to be above the > >>> superblock because the file system doesn't care about them. 
We're > >>> interested in the inode namespace, which for every other file system can > >>> be described using an inode and a superblock pair, but btrfs has another > >>> layer in the middle: inode -> btrfs_root -> superblock. > >> > >> Which seems to me to be irrelevant if there's a vfsmount per > >> subvolume that can hold per-subvolume information. > > > > I disagree. There are a ton of places where we only have access to an > > inode and only need access to an inode. It also doesn't solve the > > overlayfs issue. > > > > I have an interest of solving another problem. > In VFS operations where only inode is available, I would like to be able to > report fsnotify events (e.g. fsnotify_move()) only in directories under a > certain subtree root. That could be achieved either by bind mount the subtree > root and passing vfsmount into vfs_rename() or by defining an fs_view on the > subtree and mounting that fs_view. I'm not sure whether fs_view works for this. Taking a quick look at fsnotify, the state is already on the inode? If there's a global fsnotify state that needs to be per subtree then yes we could move that to the fs_view and you'd simply deref it from the inode struct. I hope this helps, --Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 20:42 ` Mark Fasheh @ 2018-06-07 6:06 ` Amir Goldstein 2018-06-07 20:44 ` Mark Fasheh 0 siblings, 1 reply; 10+ messages in thread From: Amir Goldstein @ 2018-06-07 6:06 UTC (permalink / raw) To: Mark Fasheh Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Wed, Jun 6, 2018 at 11:42 PM, Mark Fasheh <mfasheh@suse.de> wrote: > Hi Amir, thanks for the comments! > Hi Mark, [...] > > Btw, sorry that the name is confusing. I've never been good at picking them. Didn't say that it was confusing. It might very well be the perfect name... if I only knew what sort of thing we are trying to name... > That said, if you have a minute to check out the first patch or two you'll > see that the patches are basically putting a struct in between the super > block and the inode. > > > One thing I'd like to politely suggest is that anyone now following this > conversation to please read the at least the first patch. It's an easy read > and I feel like the conversation overall would be much more clear if > everyone understood what we're going for. I worry that I didn't do a > particularly good job explaining the actual code changes. > > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > I did look at the patches. They look simple and clean and they solve A problem. All I'm saying is that we should look at the set of problems that we know of before we design an abstraction layer. > > Regarding a layout of the problems, I have a more complete e-mail coming > soon which should describe in detail the issues I've seen with respect to > how the kernel is exporting ino/dev pairs (outside of statx). fs_view alone > is not enough to solve that problem. I'd be happy to CC you on that one if > you'd like. > Sure. [...] >> >> And what is the SUSE way? 
> > At SUSE, we carry a version of this patch: > > > > https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 > > > > Essentially a callback which was rejected for various reasons. > Don't see a patch here. Wrong link? > The fs_view work was intended to replace that patch with an upstream > solution. > > [...] >> >> FYI, the Overlayfs file/inode mapping is about to change with many >> VFS hacks queued for removal, so stay tuned. >> >> [...] > > Actually, would you mind giving me a pointer to this work? I'd be very > interested to see what exactly might be changing. > Mostly, less VFS code is going to be exposed to the underlying inode: https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2 [...] >> I have an interest of solving another problem. >> In VFS operations where only inode is available, I would like to be able to >> report fsnotify events (e.g. fsnotify_move()) only in directories under a >> certain subtree root. That could be achieved either by bind mount the subtree >> root and passing vfsmount into vfs_rename() or by defining an fs_view on the >> subtree and mounting that fs_view. > > I'm not sure whether fs_view works for this. Taking a quick look at > fsnotify, the state is already on the inode? If there's a globabl fsnotify > state that needs to be per subtree than yes we could move that to the > fs_view and you'd simply deref it from the inode struct. > That was my thinking. I have patches to attach an fsnotify mark to the super block. If fs_view could have a root that is different from the super block's root and if the file system can guarantee that objects cannot be moved outside of the fs_view root, then the fsnotify mark could be attached to the fs_view. Thanks, Amir. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-07 6:06 ` Amir Goldstein @ 2018-06-07 20:44 ` Mark Fasheh 0 siblings, 0 replies; 10+ messages in thread From: Mark Fasheh @ 2018-06-07 20:44 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Thu, Jun 07, 2018 at 09:06:04AM +0300, Amir Goldstein wrote: > On Wed, Jun 6, 2018 at 11:42 PM, Mark Fasheh <mfasheh@suse.de> wrote: > > Hi Amir, thanks for the comments! > > > > Hi Mark, > > [...] > > > > Btw, sorry that the name is confusing. I've never been good at picking them. > > Didn't say that it was confusing. It might very well be the perfect name... > if I only knew what sort of thing we are trying to name... Fair enough :) > > That said, if you have a minute to check out the first patch or two you'll > > see that the patches are basically putting a struct in between the super > > block and the inode. > > > > > > One thing I'd like to politely suggest is that anyone now following this > > conversation to please read the at least the first patch. It's an easy read > > and I feel like the conversation overall would be much more clear if > > everyone understood what we're going for. I worry that I didn't do a > > particularly good job explaining the actual code changes. > > > > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > > > > > I did look at the patches. They look simple and clean and they solve A problem. > All I'm saying is that we should look at the set of problems that we know of > before we design an abstraction layer. Agreed, I think my more recent e-mail does a much better job of laying out the issues we're seeing here. > >> And what is the SUSE way? > > > > At SUSE, we carry a version of this patch: > > > > https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 > > > > Essentially a callback which was rejected for various reasons. 
> > > > Don't see a patch here. Wrong link? That was the 0/2 mail in the patch series. Here are the next two, which contain the actual patches: https://marc.info/?l=linux-btrfs&m=130532892525016&w=2 https://marc.info/?l=linux-btrfs&m=130532899325091&w=2 Keep in mind that the patch has evolved since then though it's essentially the same idea - use a callback where we need to get the correct device. > > The fs_view work was intended to replace that patch with an upstream > > solution. > > > > > > [...] > > >> > >> FYI, the Overlayfs file/inode mapping is about to change with many > >> VFS hacks queued for removal, so stay tuned. > >> > >> [...] > > > > Actually, would you mind giving me a pointer to this work? I'd be very > > interested to see what exactly might be changing. > > > > Mostly, less VFS code is going to be exposed to underlying inode: > > https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2 > > [...] Awesome, thanks for the link Amir. > > I'm not sure whether fs_view works for this. Taking a quick look at > > fsnotify, the state is already on the inode? If there's a globabl fsnotify > > state that needs to be per subtree than yes we could move that to the > > fs_view and you'd simply deref it from the inode struct. > > > > That was my thinking. I have patches to attach an fsnotify mark > to super block. If fs_view could have a root that is different than > super block's root and if file system can guaranty that objects > cannot be moved outside of fs_view root, then fsnotify mark > could be attached to fs_view. Ahh yeah so in that case we could have the mark on the fs_view instead of the superblock and you'd deref it from the inode to get to that inode's root. Thanks, --Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 9:49 ` Amir Goldstein 2018-06-06 20:42 ` Mark Fasheh @ 2018-06-06 21:19 ` Jeff Mahoney 2018-06-07 6:17 ` Amir Goldstein 1 sibling, 1 reply; 10+ messages in thread From: Jeff Mahoney @ 2018-06-06 21:19 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara [-- Attachment #1.1: Type: text/plain, Size: 7255 bytes --] Hi Amir - Mark's answered some of your questions already, so I'll skip those. On 6/6/18 5:49 AM, Amir Goldstein wrote: > On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: >> Sorry, just getting back to this. >> >> On 5/9/18 2:41 AM, Dave Chinner wrote: >>> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: >>>> On 5/8/18 7:38 PM, Dave Chinner wrote: >>>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: >>>>>> Hi, >>>>>> >>>>>> The VFS's super_block covers a variety of filesystem functionality. In >>>>>> particular we have a single structure representing both I/O and >>>>>> namespace domains. >>>>>> >>>>>> There are requirements to de-couple this functionality. For example, >>>>>> filesystems with more than one root (such as btrfs subvolumes) can >>>>>> have multiple inode namespaces. This starts to confuse userspace when >>>>>> it notices multiple inodes with the same inode/device tuple on a >>>>>> filesystem. >>>>> > > Speaking as someone who joined this discussion late, maybe years after > it started, it would help to get an overview of existing problems and how > fs_view aims to solve them. > > I do believe that both Overlayfs and Btrfs can benefit from a layer of > abstraction > in the VFS, but I think it is best if we start with laying all the > common problems > and then see how a solution would look like. 
> > Even the name of the abstraction (fs_view) doesn't make it clear to me what > it is we are abstracting (security context? st_dev? what else?). probably best > to try to describe the abstraction from user POV rather then give sporadic > examples of what MAY go into fs_view. > > While at it, need to see if this discussion has any intersections with David > Howell's fs_context work, because if we consider adding sub volume support > to VFS, we may want to leave some reserved bits in the API for it. > > [...] > >> One thing is clear: If we want to solve the btrfs and overlayfs problems >> in the same way, the view approach with a simple static mapping doesn't >> work. Sticking something between the inode and superblock doesn't get >> the job done when the belongs to a different file system. Overlayfs >> needs a per-object remapper, which means a callback that takes a path. >> Suddenly the way we do things in the SUSE kernel doesn't seem so hacky >> anymore. >> > > And what is the SUSE way? > >> I'm not sure we need the same solution for btrfs and overlayfs. It's >> not the same problem. Every object in overlayfs as a unique mapping >> already. If we report s_dev and i_ino from the inode, it still maps to >> a unique user-visible object. It may not map back to the overlayfs >> name, but that's a separate issue that's more difficult to solve. The >> btrfs issue isn't one of presenting an alternative namespace to the >> user. Btrfs has multiple internal namespaces and no way to publish them >> to the rest of the kernel. >> > > FYI, the Overlayfs file/inode mapping is about to change with many > VFS hacks queued for removal, so stay tuned. > > [...] I have to admit I'm curious how this will work. I've heard rumor of using overlayfs inodes and calling the underlying file system's inode_operations. If part of that work removes the danger of overlayfs inode numbers colliding with xino mode, I'm definitely interested. 
>>> My point is that if we are talking about infrastructure to remap >>> what userspace sees from different mountpoint views into a >>> filesystem, then it should be done above the filesystem layers in >>> the VFS so all filesystems behave the same way. And in this case, >>> the vfsmount maps exactly to the "fs_view" that Mark has proposed we >>> add to the superblock..... >> >> It's proposed to be above the superblock with a default view in the >> superblock. It would sit between the inode and the superblock so we >> have access to it anywhere we already have an inode. That's the main >> difference. We already have the inode everywhere it's needed. Plumbing >> a vfsmount everywhere needed means changing code that only requires an >> inode and doesn't need a vfsmount. >> >> The two biggest problem areas: >> - Writeback tracepoints print a dev/inode pair. Do we want to plumb a >> vfsmount into __mark_inode_dirty, super_operations->write_inode, >> __writeback_single_inode, writeback_sb_inodes, etc? >> - Audit. As it happens, most of audit has a path or file that can be >> used. We do run into problems with fsnotify. fsnotify_move is called >> from vfs_rename which turns into a can of worms pretty quickly. >> > > Can you please elaborate on that problem. > Do you mean when watching a directory for changes, you need to > be able to tell in which fs_view the directory inode that is being watched? I was investigating whether Dave's suggestion of using a vfsmount was feasible. When following the audit call graph up, I found fsnotify_move, called by vfs_rename. Piping a vfsmount into the vfs_* operations has historically been rejected by Al (see Apparmor discussions, among others), and with good reason. The file system implementation shouldn't care about where it's mounted. 
While piping it into vfs_rename doesn't seem too invasive, vfs_rename is
itself called from various file systems' ->rename callbacks, so then
we're stuck piping vfsmounts into inode_operations, which is what Al's
been wanting to avoid for years.

>>>> It makes sense for that to be above the
>>>> superblock because the file system doesn't care about them.  We're
>>>> interested in the inode namespace, which for every other file system
>>>> can be described using an inode and superblock pair, but btrfs has
>>>> another layer in the middle: inode -> btrfs_root -> superblock.
>>>
>>> Which seems to me to be irrelevant if there's a vfsmount per
>>> subvolume that can hold per-subvolume information.
>>
>> I disagree.  There are a ton of places where we only have access to an
>> inode and only need access to an inode.  It also doesn't solve the
>> overlayfs issue.
>>
>
> I have an interest in solving another problem.
> In VFS operations where only an inode is available, I would like to be
> able to report fsnotify events (e.g. fsnotify_move()) only in directories
> under a certain subtree root.  That could be achieved either by bind
> mounting the subtree root and passing a vfsmount into vfs_rename(), or
> by defining an fs_view on the subtree and mounting that fs_view.

I'm not sure there's a lot of overlap, but I expect that this will end
up running into the same review feedback that Al gave during the
AppArmor merge: vfsmounts have no business at the lower level and you
can get the same behavior by hooking in at a higher level.  See
security_path_* vs security_inode_* for how that was resolved.

What you're talking about isn't really what we had in mind for the
fs_view.  In our case, it sits between the inode and the superblock,
which would be at too low a level for determining whether an inode is
under a certain subtree.  In any event, wouldn't you need a path instead
of an inode to do what you're proposing?
-Jeff

-- 
Jeff Mahoney
SUSE Labs
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
  2018-06-06 21:19 ` Jeff Mahoney
@ 2018-06-07  6:17 ` Amir Goldstein
  0 siblings, 0 replies; 10+ messages in thread
From: Amir Goldstein @ 2018-06-07 6:17 UTC (permalink / raw)
To: Jeff Mahoney
Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs,
    Miklos Szeredi, David Howells, Jan Kara

On Thu, Jun 7, 2018 at 12:19 AM, Jeff Mahoney <jeffm@suse.com> wrote:
[...]
>>
>> FYI, the Overlayfs file/inode mapping is about to change with many
>> VFS hacks queued for removal, so stay tuned.
>>
>> [...]
>
> I have to admit I'm curious how this will work.  I've heard rumor of
> using overlayfs inodes and calling the underlying file system's
> inode_operations.  If part of that work removes the danger of overlayfs
> inode numbers colliding in xino mode, I'm definitely interested.

See https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2

It doesn't remove the need to maintain a unique and persistent inode
number namespace in overlayfs.  It just reduces the exposure of the
underlying inode to the VFS.

[...]
>>> - Audit.  As it happens, most of audit has a path or file that can be
>>>   used.  We do run into problems with fsnotify.  fsnotify_move is called
>>>   from vfs_rename, which turns into a can of worms pretty quickly.
>>>
>>
>> Can you please elaborate on that problem.
>> Do you mean that when watching a directory for changes, you need to
>> be able to tell in which fs_view the directory inode being watched lives?
>
> I was investigating whether Dave's suggestion of using a vfsmount was
> feasible.  When following the audit call graph up, I found
> fsnotify_move, called by vfs_rename.  Piping a vfsmount into the vfs_*
> operations has historically been rejected by Al (see the AppArmor
> discussions, among others), and with good reason: the file system
> implementation shouldn't care about where it's mounted.  While piping it
> into vfs_rename doesn't seem too invasive, vfs_rename is itself called
> from various file systems' ->rename callbacks, so then we're stuck
> piping vfsmounts into inode_operations, which is what Al's been wanting
> to avoid for years.
>

Yes, I've heard about this objection, though I didn't have a reference.

[...]
>>
>> I have an interest in solving another problem.
>> In VFS operations where only an inode is available, I would like to be
>> able to report fsnotify events (e.g. fsnotify_move()) only in directories
>> under a certain subtree root.  That could be achieved either by bind
>> mounting the subtree root and passing a vfsmount into vfs_rename(), or
>> by defining an fs_view on the subtree and mounting that fs_view.
>
> I'm not sure there's a lot of overlap, but I expect that this will end
> up running into the same review feedback that Al gave during the
> AppArmor merge: vfsmounts have no business at the lower level and you
> can get the same behavior by hooking in at a higher level.  See
> security_path_* vs security_inode_* for how that was resolved.
>
> What you're talking about isn't really what we had in mind for the
> fs_view.  In our case, it sits between the inode and the superblock,
> which would be at too low a level for determining whether an inode is
> under a certain subtree.  In any event, wouldn't you need a path
> instead of an inode to do what you're proposing?
>

Not if the fs_view could have a root that is different from the sb root;
then I could attach an fsnotify mark to the fs_view instead of, say, a
vfsmount, and I'd have the information I need inside fsnotify_move().

Thanks,
Amir.
Thread overview: 10+ messages
[not found] <20180508180436.716-1-mfasheh@suse.de>
2018-05-08 23:38 ` [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root Dave Chinner
2018-05-09 2:06 ` Jeff Mahoney
2018-05-09 6:41 ` Dave Chinner
2018-06-05 20:17 ` Jeff Mahoney
2018-06-06 9:49 ` Amir Goldstein
2018-06-06 20:42 ` Mark Fasheh
2018-06-07 6:06 ` Amir Goldstein
2018-06-07 20:44 ` Mark Fasheh
2018-06-06 21:19 ` Jeff Mahoney
2018-06-07 6:17 ` Amir Goldstein