* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Dave Chinner @ 2018-05-08 23:38 UTC
To: Mark Fasheh; +Cc: linux-fsdevel, linux-kernel, linux-btrfs

On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> Hi,
>
> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>
> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.

Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.

The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.

So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?

> In addition, it's currently impossible for a filesystem subvolume to have a different security context from its parent. If we could allow for subvolumes to optionally specify their own security context, we could use them as containers directly instead of having to go through an overlay.

Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes.

> I ran into this particular problem with respect to Btrfs some years ago and sent out a very naive set of patches which were (rightfully) not incorporated:
>
> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>
> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>
> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.

That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...

> - Writeback would also have to walk all of these superblocks - again not very good for system performance.

Background writeback is backing device focussed, not superblock focussed.
It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks.....

> - Anyone wanting to lock down I/O on a filesystem would have to freeze all the superblocks. This goes for most things related to I/O really - we simply can't afford to have the kernel walking thousands of superblocks to sync a single fs.

Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen.

So, again, I'm not sure there's a /subvolume/ problem here. There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation.

> It's far more efficient, then, to pull those fields we need for a subvolume namespace into their own structure.

I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Jeff Mahoney @ 2018-05-09 2:06 UTC
To: Dave Chinner, Mark Fasheh; +Cc: linux-fsdevel, linux-kernel, linux-btrfs

On 5/8/18 7:38 PM, Dave Chinner wrote:
> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>> Hi,
>>
>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>
>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>
> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>
> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.

That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do. There is no independence among btrfs subvolumes.
When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it. It's a single storage pool and a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.

> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.

In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags. It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock. The lifetime rules for e.g. the s_dev follow that middle layer, and a vfsmount can disappear well before the inode does.

> So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?

Those are two separate problems.
Using a vfsmount to export the btrfs_root is on my roadmap. I have a WIP patch set that automounts the subvolumes when stepping into a new one, but it's to fix a longstanding UX wart. Ultimately, vfsmounts are at the wrong level to solve the inode namespace problem. Again, there's the lifetime issue. There are also many places where we only have an inode and need the s_dev associated with it. Most of these sites are well removed from having access to a vfsmount, and pinning one and passing it around carries no other benefit.

>> In addition, it's currently impossible for a filesystem subvolume to have a different security context from its parent. If we could allow for subvolumes to optionally specify their own security context, we could use them as containers directly instead of having to go through an overlay.
>
> Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes.
>
>> I ran into this particular problem with respect to Btrfs some years ago and sent out a very naive set of patches which were (rightfully) not incorporated:
>>
>> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>>
>> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>>
>> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.
>
> That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...
Overlay mounts are independent of one another and don't need coordination among them. The memory usage is relatively unimportant. The important part is having a bunch of superblocks that all correspond to the same resources and coordinating them at the VFS level. Your assumptions below follow how your XFS subvolumes work, where there's a clear hierarchy between the subvolumes and the master filesystem with a mapping layer between them. Btrfs subvolumes have no such hierarchy. Everything is shared. So while we could use a writeback hierarchy to merge all of the inode lists before doing writeback on the 'master' superblock, we'd gain nothing by it. Handling anything involving s_umount with a superblock per subvolume would be a nightmare. Ultimately, it would be a ton of effort that amounts to working around the VFS instead of with it.

>> - Writeback would also have to walk all of these superblocks - again not very good for system performance.
>
> Background writeback is backing device focussed, not superblock focussed. It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks.....
>
>> - Anyone wanting to lock down I/O on a filesystem would have to freeze all the superblocks. This goes for most things related to I/O really - we simply can't afford to have the kernel walking thousands of superblocks to sync a single fs.
>
> Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen.
>
> So, again, I'm not sure there's a /subvolume/ problem here.
> There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation.

There's only a freeze hierarchy problem if we have to use multiple superblocks. Otherwise, we freeze the whole thing or we don't. Trying to freeze a single subvolume would be an illusion for the user since the underlying file system would still be active underneath it. Under the hood, things like relocation don't even look at what subvolume owns a particular extent until it must. So it could be coordinating thousands of superblocks to do what a single lock does now - and for what benefit?

>> It's far more efficient, then, to pull those fields we need for a subvolume namespace into their own structure.
>
> I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem....

It needs to be between the inode and the superblock. If there are multiple user-visible namespaces, each will still get the same underlying file system namespace.

-Jeff

--
Jeff Mahoney
SUSE Labs
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Dave Chinner @ 2018-05-09 6:41 UTC
To: Jeff Mahoney; +Cc: Mark Fasheh, linux-fsdevel, linux-kernel, linux-btrfs

On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
> On 5/8/18 7:38 PM, Dave Chinner wrote:
>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>>> Hi,
>>>
>>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>>
>>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>>
>> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>>
>> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.
>
> That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do.
Yup, you've just proved my point: this is not a "subvolume problem", but rather a "multiple namespace per root" problem.

> There is no independence among btrfs subvolumes. When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it.

I don't need btrfs 101 stuff explained to me. :/

> a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.

Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then let's see if we can make that work for btrfs too.

>> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
>
> In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags.

Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality.
How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality.

FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. overlay recently introduced the "xino" mount option to separate the user-presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent?

This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more.

My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave the same way. And in this case, the vfsmount maps exactly to the "fs_view" that Mark has proposed we add to the superblock.....

> It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock.

Which seems to me to be irrelevant if there's a vfsmount per subvolume that can hold per-subvolume information.

>> So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock.
>> To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount?
>
> Those are two separate problems. Using a vfsmount to export the btrfs_root is on my roadmap. I have a WIP patch set that automounts the subvolumes when stepping into a new one, but it's to fix a longstanding UX wart.

IMO that's more than a UX wart - the lack of vfsmounts for internal subvolume mount point traversals could be considered the root cause of the issues we are discussing here. Extending the mounted namespace should trigger vfs mounts, not be hidden deep inside the filesystem. Hence I'd suggest this needs changing before anything else....

>>> During the discussion, one question did come up - why can't filesystems like Btrfs use a superblock per subvolume? There's a couple of problems with that:
>>>
>>> - It's common for a single Btrfs filesystem to have thousands of subvolumes. So keeping a superblock for each subvol in memory would get prohibitively expensive - imagine having 8000 copies of struct super_block for a file system just because we wanted some separation of, say, s_dev.
>>
>> That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem...
>
> Overlay mounts are independent of one another and don't need coordination among them. The memory usage is relatively unimportant. The important part is having a bunch of superblocks that all correspond to the same resources and coordinating them at the VFS level. Your assumptions below follow how your XFS subvolumes work, where there's a clear hierarchy between the subvolumes and the master filesystem with a mapping layer between them. Btrfs subvolumes have no such hierarchy. Everything is shared.
Yup, that's the impedance mismatch between the VFS infrastructure and btrfs that I was talking about. What I'm trying to communicate is that I think the proposal is attacking the impedance mismatch from the wrong direction.

i.e. The proposal is to modify btrfs code to propagate stuff that btrfs needs to know to deal with its internal "everything is shared" problems up into the VFS, where it's probably not useful to anything other than btrfs. We already have the necessary construct in the VFS - I think we should be trying to use the information held by the generic VFS infrastructure to solve the specific btrfs issue at hand....

> So while we could use a writeback hierarchy to merge all of the inode lists before doing writeback on the 'master' superblock, we'd gain nothing by it. Handling anything involving s_umount with a superblock per subvolume would be a nightmare. Ultimately, it would be a ton of effort that amounts to working around the VFS instead of with it.

I'm not suggesting that btrfs needs to use a superblock per subvolume. Please don't confuse my statements along the lines of "this doesn't seem to be a problem for others" with "you must change btrfs to do this". I'm just saying that the problems arising from using a superblock per subvolume are not as dire as is being implied.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
From: Jeff Mahoney @ 2018-06-05 20:17 UTC
To: Dave Chinner; +Cc: Mark Fasheh, linux-fsdevel, linux-kernel, linux-btrfs

Sorry, just getting back to this.

On 5/9/18 2:41 AM, Dave Chinner wrote:
> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
>> On 5/8/18 7:38 PM, Dave Chinner wrote:
>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>>>> Hi,
>>>>
>>>> The VFS's super_block covers a variety of filesystem functionality. In particular we have a single structure representing both I/O and namespace domains.
>>>>
>>>> There are requirements to de-couple this functionality. For example, filesystems with more than one root (such as btrfs subvolumes) can have multiple inode namespaces. This starts to confuse userspace when it notices multiple inodes with the same inode/device tuple on a filesystem.
>>>
>>> Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here.
>>>
>>> The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock.
>> That's a fundamental difference between how your XFS subvolumes work and how btrfs subvolumes do.
>
> Yup, you've just proved my point: this is not a "subvolume problem", but rather a "multiple namespace per root" problem.

Mark framed it as a namespace issue initially. The btrfs subvolume problem is the biggest one we're concerned with. More below.

>> There is no independence among btrfs subvolumes. When a snapshot is created, it has a few new blocks but otherwise shares the metadata of the source subvolume. The metadata trees are shared across all of the subvolumes and there are several internal trees used to manage all of it.
>
> I don't need btrfs 101 stuff explained to me. :/

No offense meant. This was crossposted to a few lists.

>> a single transaction engine. There are housekeeping and maintenance tasks that operate across the entire file system internally. I understand that there are several problems you need to solve at the VFS layer to get your version of subvolumes up and running, but trying to shoehorn one into the other is bound to fail.
>
> Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then let's see if we can make that work for btrfs too.

I'm glad that the VFS provides what you need for XFS subvolumes, but you're effectively creating independent file systems using a backend storage remapping technique. That's pretty much exactly what the VFS is already designed to do. Overlayfs is definitely not just VFS-based CoW subvolumes. I ended up having to wade more deeply into overlayfs than I would've liked to respond to your email. I'd say it's better described as: overlayfs can abuse what the VFS offers just enough to work well in common cases.
It has a similar disconnect between the superblock and inode to btrfs, except that it's not as easily resolved since the dentry belongs to overlayfs and the inode belongs to the underlying file system. There are still places getting this wrong in the kernel. In order to reliably present the right inode/dev pair for overlayfs, we need the dentry to go with it. We don't always have it and, even if we did, we'd need both the dentry and inode. At any rate, anything with a ->getattr that modifies dev or ino so that it's different from inode->i_sb->s_dev or inode->i_ino will misbehave somehow. The question is just how badly, which depends on the context. It could be something simple like /proc/pid/maps not showing the right device/ino pair.

Many file systems that use 64-bit inode numbers already remap them on 32-bit systems, which can already cause collisions (even inside the kernel) when comparisons are made. There's no excuse to not use 64-bit inode numbers everywhere inside the kernel now, especially since we have the capability to export them as 64-bit inode numbers on 32-bit systems now via statx.

One thing is clear: if we want to solve the btrfs and overlayfs problems in the same way, the view approach with a simple static mapping doesn't work. Sticking something between the inode and superblock doesn't get the job done when the inode belongs to a different file system. Overlayfs needs a per-object remapper, which means a callback that takes a path. Suddenly the way we do things in the SUSE kernel doesn't seem so hacky anymore.

I'm not sure we need the same solution for btrfs and overlayfs. It's not the same problem. Every object in overlayfs has a unique mapping already. If we report s_dev and i_ino from the inode, it still maps to a unique user-visible object. It may not map back to the overlayfs name, but that's a separate issue that's more difficult to solve. The btrfs issue isn't one of presenting an alternative namespace to the user.
Btrfs has multiple internal namespaces and no way to publish them to the rest of the kernel.

>>> So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them.
>>
>> In this case, you're talking about the user-visible file system hierarchy namespace that has no bearing on the underlying file system outside of per-mount flags.
>
> Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality.
>
> How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality.
>
> FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. overlay recently introduced the "xino" mount option to separate the user-presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent?
>
> This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more.

Djalal's work? That rightfully belongs above the file system layer.
The only bit that's needed within the file system is the part that sets the super flag to enable it. The different namespaces have nothing to do with the file system below it.

> My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave the same way. And in this case, the vfsmount maps exactly to the "fs_view" that Mark has proposed we add to the superblock.....

It's proposed to be above the superblock, with a default view in the superblock. It would sit between the inode and the superblock so we have access to it anywhere we already have an inode. That's the main difference. We already have the inode everywhere it's needed. Plumbing a vfsmount everywhere needed means changing code that only requires an inode and doesn't need a vfsmount. The two biggest problem areas:

- Writeback tracepoints print a dev/inode pair. Do we want to plumb a vfsmount into __mark_inode_dirty, super_operations->write_inode, __writeback_single_inode, writeback_sb_inodes, etc?

- Audit. As it happens, most of audit has a path or file that can be used. We do run into problems with fsnotify. fsnotify_move is called from vfs_rename, which turns into a can of worms pretty quickly.

>> It makes sense for that to be above the superblock because the file system doesn't care about them. We're interested in the inode namespace, which for every other file system can be described using an inode and a superblock pair, but btrfs has another layer in the middle: inode -> btrfs_root -> superblock.
>
> Which seems to me to be irrelevant if there's a vfsmount per subvolume that can hold per-subvolume information.

I disagree. There are a ton of places where we only have access to an inode and only need access to an inode. It also doesn't solve the overlayfs issue.
>>> So this kinda feels like there's still an impedance mismatch between >>> btrfs subvolumes being mounted as subtrees on the underlying root >>> vfsmount rather than being created as truly independent vfs >>> namespaces that share a superblock. To put that as a question: why >>> aren't btrfs subvolumes vfsmounts in their own right, and the unique >>> subvolume information gets stored in (or obtained from) >>> the vfsmount? >> >> Those are two separate problems. Using a vfsmount to export the >> btrfs_root is on my roadmap. I have a WIP patch set that automounts the >> subvolumes when stepping into a new one, but it's to fix a longstanding >> UX wart. > > IMO that's more than a UX wart - the lack of vfsmounts for internal > subvolume mount point traversals could be considered the root cause > of the issues we are discussing here. Extending the mounted > namespace should trigger vfs mounts, not be hidden deep inside the > filesystem. Hence I'd suggest this needs changing before anything > else.... I disagree. Yes, it would be nice if btrfs did this, but it doesn't get us the dev/ino pair to places where we need one, like the depths of audit. >>>> During the discussion, one question did come up - why can't >>>> filesystems like Btrfs use a superblock per subvolume? There's a >>>> couple of problems with that: >>>> >>>> - It's common for a single Btrfs filesystem to have thousands of >>>> subvolumes. So keeping a superblock for each subvol in memory would >>>> get prohibitively expensive - imagine having 8000 copies of struct >>>> super_block for a file system just because we wanted some separation >>>> of say, s_dev. >>> >>> That's no different to using individual overlay mounts for the >>> thousands of containers that are on the system. This doesn't seem to >>> be a major problem... >> >> Overlay mounts are independent of one another and don't need coordination >> among them. The memory usage is relatively unimportant. 
The important >> part is having a bunch of superblocks that all correspond to the same >> resources and coordinating them at the VFS level. Your assumptions >> below follow how your XFS subvolumes work, where there's a clear >> hierarchy between the subvolumes and the master filesystem with a >> mapping layer between them. Btrfs subvolumes have no such hierarchy. >> Everything is shared. > > Yup, that's the impedance mismatch between the VFS infrastructure > and btrfs that I was talking about. What I'm trying to communicate > is that I think the proposal is attacking the impedance mismatch > from the wrong direction. > > i.e. The proposal is to modify btrfs code to propagate stuff that > btrfs needs to know to deal with its internal "everything is > shared" problems up into the VFS where it's probably not useful to > anything other than btrfs. We already have the necessary construct > in the VFS - I think we should be trying to use the information held > by the generic VFS infrastructure to solve the specific > btrfs issue at hand.... The many namespaces to one I/O domain relationship isn't a btrfs-internal problem. It's visible to the user and it's visible inside the kernel. The view implementation gives the VFS a generic mechanism to allow it to represent that. The plan isn't to only use the view for that but to also allow separate security contexts for things like containers. Those have a similar issue to the inode availability discussed above. We have superblocks where we need them. We could have a view just as easily. Plumbing a vfsmount is the wrong approach. >> So while we could use a writeback hierarchy to >> merge all of the inode lists before doing writeback on the 'master' >> superblock, we'd gain nothing by it. Handling anything involving >> s_umount with a superblock per subvolume would be a nightmare. >> Ultimately, it would be a ton of effort that amounts to working around >> the VFS instead of with it. 
> > I'm not suggesting that btrfs needs to use a superblock per > subvolume. Please don't confuse my statements along the lines of > "this doesn't seem to be a problem for others" with "you must change > btrfs to do this". I'm just saying that the problems arising from > using a superblock per subvolume are not as dire as is being > implied. It's possible to solve the btrfs problem this way but it adds a bunch of complexity for... what benefit, exactly? As I said, it's working around the VFS instead of working with it. As Mark initially stated, we handle I/O and namespace domains. Btrfs has many namespaces and only one I/O domain. -Jeff -- Jeff Mahoney SUSE Labs [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-05 20:17 ` Jeff Mahoney @ 2018-06-06 9:49 ` Amir Goldstein 2018-06-06 20:42 ` Mark Fasheh 2018-06-06 21:19 ` Jeff Mahoney 0 siblings, 2 replies; 10+ messages in thread From: Amir Goldstein @ 2018-06-06 9:49 UTC (permalink / raw) To: Jeff Mahoney Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: > Sorry, just getting back to this. > > On 5/9/18 2:41 AM, Dave Chinner wrote: >> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: >>> On 5/8/18 7:38 PM, Dave Chinner wrote: >>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: >>>>> Hi, >>>>> >>>>> The VFS's super_block covers a variety of filesystem functionality. In >>>>> particular we have a single structure representing both I/O and >>>>> namespace domains. >>>>> >>>>> There are requirements to de-couple this functionality. For example, >>>>> filesystems with more than one root (such as btrfs subvolumes) can >>>>> have multiple inode namespaces. This starts to confuse userspace when >>>>> it notices multiple inodes with the same inode/device tuple on a >>>>> filesystem. >>>> Speaking as someone who joined this discussion late, maybe years after it started, it would help to get an overview of existing problems and how fs_view aims to solve them. I do believe that both Overlayfs and Btrfs can benefit from a layer of abstraction in the VFS, but I think it is best if we start with laying out all the common problems and then see what a solution would look like. Even the name of the abstraction (fs_view) doesn't make it clear to me what it is we are abstracting (security context? st_dev? what else?). Probably best to try to describe the abstraction from user POV rather than give sporadic examples of what MAY go into fs_view. 
While at it, need to see if this discussion has any intersections with David Howells' fs_context work, because if we consider adding subvolume support to VFS, we may want to leave some reserved bits in the API for it. [...] > One thing is clear: If we want to solve the btrfs and overlayfs problems > in the same way, the view approach with a simple static mapping doesn't > work. Sticking something between the inode and superblock doesn't get > the job done when the inode belongs to a different file system. Overlayfs > needs a per-object remapper, which means a callback that takes a path. > Suddenly the way we do things in the SUSE kernel doesn't seem so hacky > anymore. > And what is the SUSE way? > I'm not sure we need the same solution for btrfs and overlayfs. It's > not the same problem. Every object in overlayfs has a unique mapping > already. If we report s_dev and i_ino from the inode, it still maps to > a unique user-visible object. It may not map back to the overlayfs > name, but that's a separate issue that's more difficult to solve. The > btrfs issue isn't one of presenting an alternative namespace to the > user. Btrfs has multiple internal namespaces and no way to publish them > to the rest of the kernel. > FYI, the Overlayfs file/inode mapping is about to change with many VFS hacks queued for removal, so stay tuned. [...] 
Plumbing > a vfsmount everywhere needed means changing code that only requires an > inode and doesn't need a vfsmount. > > The two biggest problem areas: > - Writeback tracepoints print a dev/inode pair. Do we want to plumb a > vfsmount into __mark_inode_dirty, super_operations->write_inode, > __writeback_single_inode, writeback_sb_inodes, etc? > - Audit. As it happens, most of audit has a path or file that can be > used. We do run into problems with fsnotify. fsnotify_move is called > from vfs_rename which turns into a can of worms pretty quickly. > Can you please elaborate on that problem. Do you mean when watching a directory for changes, you need to be able to tell in which fs_view the watched directory inode resides? >>> It makes sense for that to be above the >>> superblock because the file system doesn't care about them. We're >>> interested in the inode namespace, which for every other file system can >>> be described using an inode and a superblock pair, but btrfs has another >>> layer in the middle: inode -> btrfs_root -> superblock. >> >> Which seems to me to be irrelevant if there's a vfsmount per >> subvolume that can hold per-subvolume information. > > I disagree. There are a ton of places where we only have access to an > inode and only need access to an inode. It also doesn't solve the > overlayfs issue. > I have an interest in solving another problem. In VFS operations where only the inode is available, I would like to be able to report fsnotify events (e.g. fsnotify_move()) only in directories under a certain subtree root. That could be achieved either by bind mounting the subtree root and passing the vfsmount into vfs_rename() or by defining an fs_view on the subtree and mounting that fs_view. Thanks, Amir. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 9:49 ` Amir Goldstein @ 2018-06-06 20:42 ` Mark Fasheh 2018-06-07 6:06 ` Amir Goldstein 2018-06-06 21:19 ` Jeff Mahoney 1 sibling, 1 reply; 10+ messages in thread From: Mark Fasheh @ 2018-06-06 20:42 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara Hi Amir, thanks for the comments! On Wed, Jun 06, 2018 at 12:49:48PM +0300, Amir Goldstein wrote: > On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: > > Sorry, just getting back to this. > > > > On 5/9/18 2:41 AM, Dave Chinner wrote: > >> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: > >>> On 5/8/18 7:38 PM, Dave Chinner wrote: > >>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > >>>>> Hi, > >>>>> > >>>>> The VFS's super_block covers a variety of filesystem functionality. In > >>>>> particular we have a single structure representing both I/O and > >>>>> namespace domains. > >>>>> > >>>>> There are requirements to de-couple this functionality. For example, > >>>>> filesystems with more than one root (such as btrfs subvolumes) can > >>>>> have multiple inode namespaces. This starts to confuse userspace when > >>>>> it notices multiple inodes with the same inode/device tuple on a > >>>>> filesystem. > >>>> > > Speaking as someone who joined this discussion late, maybe years after > it started, it would help to get an overview of existing problems and how > fs_view aims to solve them. > > I do believe that both Overlayfs and Btrfs can benefit from a layer of > abstraction > in the VFS, but I think it is best if we start with laying all the > common problems > and then see how a solution would look like. > > Even the name of the abstraction (fs_view) doesn't make it clear to me what > it is we are abstracting (security context? st_dev? what else?). 
probably best > to try to describe the abstraction from user POV rather than give sporadic > examples of what MAY go into fs_view. The first problem fs_view was intended to fix was the s_dev issue, where some filesystems return a different device from stat(2) than what the rest of the kernel exports. If you look at the e-mail record I referenced: https://marc.info/?l=linux-btrfs&m=130074451403261&w=2 you'll see how that can confuse some userspace software when it tries to take the ino/dev pair it gets from one portion of the kernel (for example, /proc/pid/maps) and then use it to resolve to an inode on the system. As it turns out, there are other places where we might want to decouple some superblock fields, hence why I mention security contexts. I am admittedly less familiar with that particular problem but as I understand it, containers on a btrfs subvolume have to go through an overlay to route around the default security context. With fs_view we could point subvolumes to their own security contexts. Btw, sorry that the name is confusing. I've never been good at picking them. That said, if you have a minute to check out the first patch or two you'll see that the patches are basically putting a struct in between the super block and the inode. One thing I'd like to politely suggest is that anyone now following this conversation please read at least the first patch. It's an easy read and I feel like the conversation overall would be much more clear if everyone understood what we're going for. I worry that I didn't do a particularly good job explaining the actual code changes. https://www.spinics.net/lists/linux-fsdevel/msg125492.html Regarding a layout of the problems, I have a more complete e-mail coming soon which should describe in detail the issues I've seen with respect to how the kernel is exporting ino/dev pairs (outside of statx). fs_view alone is not enough to solve that problem. I'd be happy to CC you on that one if you'd like. 
> While at it, need to see if this discussion has any intersections with David > Howell's fs_context work, because if we consider adding sub volume support > to VFS, we may want to leave some reserved bits in the API for it. > > [...] > > > One thing is clear: If we want to solve the btrfs and overlayfs problems > > in the same way, the view approach with a simple static mapping doesn't > > work. Sticking something between the inode and superblock doesn't get > > the job done when the belongs to a different file system. Overlayfs > > needs a per-object remapper, which means a callback that takes a path. > > Suddenly the way we do things in the SUSE kernel doesn't seem so hacky > > anymore. > > > > And what is the SUSE way? At SUSE, we carry a version of this patch: https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 Essentially a callback which was rejected for various reasons. The fs_view work was intended to replace that patch with an upstream solution. > > I'm not sure we need the same solution for btrfs and overlayfs. It's > > not the same problem. Every object in overlayfs as a unique mapping > > already. If we report s_dev and i_ino from the inode, it still maps to > > a unique user-visible object. It may not map back to the overlayfs > > name, but that's a separate issue that's more difficult to solve. The > > btrfs issue isn't one of presenting an alternative namespace to the > > user. Btrfs has multiple internal namespaces and no way to publish them > > to the rest of the kernel. > > > > FYI, the Overlayfs file/inode mapping is about to change with many > VFS hacks queued for removal, so stay tuned. > > [...] Actually, would you mind giving me a pointer to this work? I'd be very interested to see what exactly might be changing. 
> >> My point is that if we are talking about infrastructure to remap > >> what userspace sees from different mountpoint views into a > >> filesystem, then it should be done above the filesystem layers in > >> the VFS so all filesystems behave the same way. And in this case, > >> the vfsmount maps exactly to the "fs_view" that Mark has proposed we > >> add to the superblock..... > > > > It's proposed to be above the superblock with a default view in the > > superblock. It would sit between the inode and the superblock so we > > have access to it anywhere we already have an inode. That's the main > > difference. We already have the inode everywhere it's needed. Plumbing > > a vfsmount everywhere needed means changing code that only requires an > > inode and doesn't need a vfsmount. > > > > The two biggest problem areas: > > - Writeback tracepoints print a dev/inode pair. Do we want to plumb a > > vfsmount into __mark_inode_dirty, super_operations->write_inode, > > __writeback_single_inode, writeback_sb_inodes, etc? > > - Audit. As it happens, most of audit has a path or file that can be > > used. We do run into problems with fsnotify. fsnotify_move is called > > from vfs_rename which turns into a can of worms pretty quickly. > > > > Can you please elaborate on that problem. > Do you mean when watching a directory for changes, you need to > be able to tell in which fs_view the directory inode that is being watched? Here Jeff is responding to Dave's suggestion that we use a vfsmount instead of the fs_view. You cannot get to a vfsmount from an inode or even dentry alone, so this implies that we'd be passing a vfsmount down a ton of code that otherwise doesn't care about it. fs_view has the advantage that it is accessible from the inode so those places which ONLY have an inode can easily get to the correct device (the code changes make this pretty clear). > >>> It makes sense for that to be above the > >>> superblock because the file system doesn't care about them. 
We're > >>> interested in the inode namespace, which for every other file system can > >>> be described using an inode and a superblock pair, but btrfs has another > >>> layer in the middle: inode -> btrfs_root -> superblock. > >> > >> Which seems to me to be irrelevant if there's a vfsmount per > >> subvolume that can hold per-subvolume information. > > > > I disagree. There are a ton of places where we only have access to an > > inode and only need access to an inode. It also doesn't solve the > > overlayfs issue. > > > > I have an interest of solving another problem. > In VFS operations where only inode is available, I would like to be able to > report fsnotify events (e.g. fsnotify_move()) only in directories under a > certain subtree root. That could be achieved either by bind mount the subtree > root and passing vfsmount into vfs_rename() or by defining an fs_view on the > subtree and mounting that fs_view. I'm not sure whether fs_view works for this. Taking a quick look at fsnotify, the state is already on the inode? If there's a global fsnotify state that needs to be per subtree then yes we could move that to the fs_view and you'd simply deref it from the inode struct. I hope this helps, --Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 20:42 ` Mark Fasheh @ 2018-06-07 6:06 ` Amir Goldstein 2018-06-07 20:44 ` Mark Fasheh 0 siblings, 1 reply; 10+ messages in thread From: Amir Goldstein @ 2018-06-07 6:06 UTC (permalink / raw) To: Mark Fasheh Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Wed, Jun 6, 2018 at 11:42 PM, Mark Fasheh <mfasheh@suse.de> wrote: > Hi Amir, thanks for the comments! > Hi Mark, [...] > > Btw, sorry that the name is confusing. I've never been good at picking them. Didn't say that it was confusing. It might very well be the perfect name... if I only knew what sort of thing we are trying to name... > That said, if you have a minute to check out the first patch or two you'll > see that the patches are basically putting a struct in between the super > block and the inode. > > > One thing I'd like to politely suggest is that anyone now following this > conversation to please read the at least the first patch. It's an easy read > and I feel like the conversation overall would be much more clear if > everyone understood what we're going for. I worry that I didn't do a > particularly good job explaining the actual code changes. > > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > I did look at the patches. They look simple and clean and they solve A problem. All I'm saying is that we should look at the set of problems that we know of before we design an abstraction layer. > > Regarding a layout of the problems, I have a more complete e-mail coming > soon which should describe in detail the issues I've seen with respect to > how the kernel is exporting ino/dev pairs (outside of statx). fs_view alone > is not enough to solve that problem. I'd be happy to CC you on that one if > you'd like. > Sure. [...] >> >> And what is the SUSE way? 
> > At SUSE, we carry a version of this patch: > > > > https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 > > > > Essentially a callback which was rejected for various reasons. > Don't see a patch here. Wrong link? > The fs_view work was intended to replace that patch with an upstream > solution. > > [...] >> >> FYI, the Overlayfs file/inode mapping is about to change with many >> VFS hacks queued for removal, so stay tuned. >> >> [...] > > Actually, would you mind giving me a pointer to this work? I'd be very > interested to see what exactly might be changing. > Mostly, less VFS code is going to be exposed to the underlying inode: https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2 [...] >> I have an interest of solving another problem. >> In VFS operations where only inode is available, I would like to be able to >> report fsnotify events (e.g. fsnotify_move()) only in directories under a >> certain subtree root. That could be achieved either by bind mount the subtree >> root and passing vfsmount into vfs_rename() or by defining an fs_view on the >> subtree and mounting that fs_view. > > I'm not sure whether fs_view works for this. Taking a quick look at > fsnotify, the state is already on the inode? If there's a globabl fsnotify > state that needs to be per subtree than yes we could move that to the > fs_view and you'd simply deref it from the inode struct. > That was my thinking. I have patches to attach an fsnotify mark to the super block. If fs_view could have a root that is different from the super block's root and if the file system can guarantee that objects cannot be moved outside of the fs_view root, then the fsnotify mark could be attached to the fs_view. Thanks, Amir. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-07 6:06 ` Amir Goldstein @ 2018-06-07 20:44 ` Mark Fasheh 0 siblings, 0 replies; 10+ messages in thread From: Mark Fasheh @ 2018-06-07 20:44 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Mahoney, Dave Chinner, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara On Thu, Jun 07, 2018 at 09:06:04AM +0300, Amir Goldstein wrote: > On Wed, Jun 6, 2018 at 11:42 PM, Mark Fasheh <mfasheh@suse.de> wrote: > > Hi Amir, thanks for the comments! > > > > Hi Mark, > > [...] > > > > Btw, sorry that the name is confusing. I've never been good at picking them. > > Didn't say that it was confusing. It might very well be the perfect name... > if I only knew what sort of thing we are trying to name... Fair enough :) > > That said, if you have a minute to check out the first patch or two you'll > > see that the patches are basically putting a struct in between the super > > block and the inode. > > > > > > One thing I'd like to politely suggest is that anyone now following this > > conversation to please read the at least the first patch. It's an easy read > > and I feel like the conversation overall would be much more clear if > > everyone understood what we're going for. I worry that I didn't do a > > particularly good job explaining the actual code changes. > > > > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > > > > > I did look at the patches. They look simple and clean and they solve A problem. > All I'm saying is that we should look at the set of problems that we know of > before we design an abstraction layer. Agreed, I think my more recent e-mail does a much better job of laying out the issues we're seeing here. > >> And what is the SUSE way? > > > > At SUSE, we carry a version of this patch: > > > > https://marc.info/?l=linux-btrfs&m=130532890824992&w=2 > > > > Essentially a callback which was rejected for various reasons. 
> > > > Don't see a patch here. Wrong link? That was the 0/2 mail in the patch series. Here are the next two, which contain the actual patches: https://marc.info/?l=linux-btrfs&m=130532892525016&w=2 https://marc.info/?l=linux-btrfs&m=130532899325091&w=2 Keep in mind that the patch has evolved since then though it's essentially the same idea - use a callback where we need to get the correct device. > > The fs_view work was intended to replace that patch with an upstream > > solution. > > > > > > [...] > > >> > >> FYI, the Overlayfs file/inode mapping is about to change with many > >> VFS hacks queued for removal, so stay tuned. > >> > >> [...] > > > > Actually, would you mind giving me a pointer to this work? I'd be very > > interested to see what exactly might be changing. > > > > Mostly, less VFS code is going to be exposed to underlying inode: > > https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2 > > [...] Awesome, thanks for the link Amir. > > I'm not sure whether fs_view works for this. Taking a quick look at > > fsnotify, the state is already on the inode? If there's a globabl fsnotify > > state that needs to be per subtree than yes we could move that to the > > fs_view and you'd simply deref it from the inode struct. > > > > That was my thinking. I have patches to attach an fsnotify mark > to super block. If fs_view could have a root that is different than > super block's root and if file system can guaranty that objects > cannot be moved outside of fs_view root, then fsnotify mark > could be attached to fs_view. Ahh yeah so in that case we could have the mark on the fs_view instead of the superblock and you'd deref it from the inode to get to that inode's root. Thanks, --Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root 2018-06-06 9:49 ` Amir Goldstein 2018-06-06 20:42 ` Mark Fasheh @ 2018-06-06 21:19 ` Jeff Mahoney 2018-06-07 6:17 ` Amir Goldstein 1 sibling, 1 reply; 10+ messages in thread From: Jeff Mahoney @ 2018-06-06 21:19 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs, Miklos Szeredi, David Howells, Jan Kara [-- Attachment #1.1: Type: text/plain, Size: 7255 bytes --] Hi Amir - Mark's answered some of your questions already, so I'll skip those. On 6/6/18 5:49 AM, Amir Goldstein wrote: > On Tue, Jun 5, 2018 at 11:17 PM, Jeff Mahoney <jeffm@suse.com> wrote: >> Sorry, just getting back to this. >> >> On 5/9/18 2:41 AM, Dave Chinner wrote: >>> On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: >>>> On 5/8/18 7:38 PM, Dave Chinner wrote: >>>>> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: >>>>>> Hi, >>>>>> >>>>>> The VFS's super_block covers a variety of filesystem functionality. In >>>>>> particular we have a single structure representing both I/O and >>>>>> namespace domains. >>>>>> >>>>>> There are requirements to de-couple this functionality. For example, >>>>>> filesystems with more than one root (such as btrfs subvolumes) can >>>>>> have multiple inode namespaces. This starts to confuse userspace when >>>>>> it notices multiple inodes with the same inode/device tuple on a >>>>>> filesystem. >>>>> > > Speaking as someone who joined this discussion late, maybe years after > it started, it would help to get an overview of existing problems and how > fs_view aims to solve them. > > I do believe that both Overlayfs and Btrfs can benefit from a layer of > abstraction > in the VFS, but I think it is best if we start with laying all the > common problems > and then see how a solution would look like. 
> > Even the name of the abstraction (fs_view) doesn't make it clear to me what > it is we are abstracting (security context? st_dev? what else?). probably best > to try to describe the abstraction from user POV rather then give sporadic > examples of what MAY go into fs_view. > > While at it, need to see if this discussion has any intersections with David > Howell's fs_context work, because if we consider adding sub volume support > to VFS, we may want to leave some reserved bits in the API for it. > > [...] > >> One thing is clear: If we want to solve the btrfs and overlayfs problems >> in the same way, the view approach with a simple static mapping doesn't >> work. Sticking something between the inode and superblock doesn't get >> the job done when the belongs to a different file system. Overlayfs >> needs a per-object remapper, which means a callback that takes a path. >> Suddenly the way we do things in the SUSE kernel doesn't seem so hacky >> anymore. >> > > And what is the SUSE way? > >> I'm not sure we need the same solution for btrfs and overlayfs. It's >> not the same problem. Every object in overlayfs as a unique mapping >> already. If we report s_dev and i_ino from the inode, it still maps to >> a unique user-visible object. It may not map back to the overlayfs >> name, but that's a separate issue that's more difficult to solve. The >> btrfs issue isn't one of presenting an alternative namespace to the >> user. Btrfs has multiple internal namespaces and no way to publish them >> to the rest of the kernel. >> > > FYI, the Overlayfs file/inode mapping is about to change with many > VFS hacks queued for removal, so stay tuned. > > [...] I have to admit I'm curious how this will work. I've heard rumor of using overlayfs inodes and calling the underlying file system's inode_operations. If part of that work removes the danger of overlayfs inode numbers colliding with xino mode, I'm definitely interested. 
>>> My point is that if we are talking about infrastructure to remap >>> what userspace sees from different mountpoint views into a >>> filesystem, then it should be done above the filesystem layers in >>> the VFS so all filesystems behave the same way. And in this case, >>> the vfsmount maps exactly to the "fs_view" that Mark has proposed we >>> add to the superblock..... >> >> It's proposed to be above the superblock with a default view in the >> superblock. It would sit between the inode and the superblock so we >> have access to it anywhere we already have an inode. That's the main >> difference. We already have the inode everywhere it's needed. Plumbing >> a vfsmount everywhere needed means changing code that only requires an >> inode and doesn't need a vfsmount. >> >> The two biggest problem areas: >> - Writeback tracepoints print a dev/inode pair. Do we want to plumb a >> vfsmount into __mark_inode_dirty, super_operations->write_inode, >> __writeback_single_inode, writeback_sb_inodes, etc? >> - Audit. As it happens, most of audit has a path or file that can be >> used. We do run into problems with fsnotify. fsnotify_move is called >> from vfs_rename which turns into a can of worms pretty quickly. >> > > Can you please elaborate on that problem. > Do you mean when watching a directory for changes, you need to > be able to tell in which fs_view the directory inode that is being watched? I was investigating whether Dave's suggestion of using a vfsmount was feasible. When following the audit call graph up, I found fsnotify_move, called by vfs_rename. Piping a vfsmount into the vfs_* operations has historically been rejected by Al (see Apparmor discussions, among others), and with good reason. The file system implementation shouldn't care about where it's mounted. 
While piping it into vfs_rename doesn't seem too invasive, vfs_rename is
itself called from various file systems' ->rename callbacks, so then
we're stuck piping vfsmounts into inode_operations, which is what Al's
been wanting to avoid for years.

>>>> It makes sense for that to be above the
>>>> superblock because the file system doesn't care about them.  We're
>>>> interested in the inode namespace, which for every other file system
>>>> can be described using an inode and superblock pair, but btrfs has
>>>> another layer in the middle: inode -> btrfs_root -> superblock.
>>>
>>> Which seems to me to be irrelevant if there's a vfsmount per
>>> subvolume that can hold per-subvolume information.
>>
>> I disagree.  There are a ton of places where we only have access to an
>> inode and only need access to an inode.  It also doesn't solve the
>> overlayfs issue.
>>
>
> I have an interest in solving another problem.
> In VFS operations where only an inode is available, I would like to be
> able to report fsnotify events (e.g. fsnotify_move()) only in directories
> under a certain subtree root.  That could be achieved either by bind
> mounting the subtree root and passing a vfsmount into vfs_rename(), or
> by defining an fs_view on the subtree and mounting that fs_view.

I'm not sure there's a lot of overlap, but I expect that this will end
up running into the same review feedback that Al gave during the
AppArmor merge: vfsmounts have no business at the lower level and you
can get the same behavior by hooking in at a higher level.  See
security_path_* vs security_inode_* for how that was resolved.

What you're talking about isn't really what we had in mind for the
fs_view.  In our case, it sits between the inode and the superblock,
which would be at too low a level for determining whether an inode is
under a certain subtree.  In any event, wouldn't you need a path instead
of an inode to do what you're proposing?
-Jeff

-- 
Jeff Mahoney
SUSE Labs
* Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
  2018-06-06 21:19 ` Jeff Mahoney
@ 2018-06-07  6:17 ` Amir Goldstein
  0 siblings, 0 replies; 10+ messages in thread
From: Amir Goldstein @ 2018-06-07 6:17 UTC (permalink / raw)
To: Jeff Mahoney
Cc: Dave Chinner, Mark Fasheh, linux-fsdevel, linux-kernel, Linux Btrfs,
    Miklos Szeredi, David Howells, Jan Kara

On Thu, Jun 7, 2018 at 12:19 AM, Jeff Mahoney <jeffm@suse.com> wrote:
[...]
>>
>> FYI, the Overlayfs file/inode mapping is about to change with many
>> VFS hacks queued for removal, so stay tuned.
>>
>> [...]
>
> I have to admit I'm curious how this will work.  I've heard rumor of
> using overlayfs inodes and calling the underlying file system's
> inode_operations.  If part of that work removes the danger of overlayfs
> inode numbers colliding in xino mode, I'm definitely interested.

See https://marc.info/?l=linux-fsdevel&m=152760014530531&w=2

It doesn't remove the need to maintain a unique and persistent inode
number namespace in overlayfs.  It just reduces the exposure of the
underlying inode to the VFS.

[...]
>>> - Audit.  As it happens, most of audit has a path or file that can be
>>>   used.  We do run into problems with fsnotify.  fsnotify_move is called
>>>   from vfs_rename, which turns into a can of worms pretty quickly.
>>>
>>
>> Can you please elaborate on that problem.
>> Do you mean that when watching a directory for changes, you need to
>> be able to tell in which fs_view the directory inode being watched lives?
>
> I was investigating whether Dave's suggestion of using a vfsmount was
> feasible.  When following the audit call graph up, I found
> fsnotify_move, called by vfs_rename.  Piping a vfsmount into the vfs_*
> operations has historically been rejected by Al (see the AppArmor
> discussions, among others), and with good reason: the file system
> implementation shouldn't care about where it's mounted.  While piping it
> into vfs_rename doesn't seem too invasive, vfs_rename is itself called
> from various file systems' ->rename callbacks, so then we're stuck
> piping vfsmounts into inode_operations, which is what Al's been wanting
> to avoid for years.
>

Yes, I've heard about this objection, though I didn't have a reference.

[...]
>>
>> I have an interest in solving another problem.
>> In VFS operations where only an inode is available, I would like to be
>> able to report fsnotify events (e.g. fsnotify_move()) only in directories
>> under a certain subtree root.  That could be achieved either by bind
>> mounting the subtree root and passing a vfsmount into vfs_rename(), or
>> by defining an fs_view on the subtree and mounting that fs_view.
>
> I'm not sure there's a lot of overlap, but I expect that this will end
> up running into the same review feedback that Al gave during the
> AppArmor merge: vfsmounts have no business at the lower level and you
> can get the same behavior by hooking in at a higher level.  See
> security_path_* vs security_inode_* for how that was resolved.
>
> What you're talking about isn't really what we had in mind for the
> fs_view.  In our case, it sits between the inode and the superblock,
> which would be at too low a level for determining whether an inode is
> under a certain subtree.  In any event, wouldn't you need a path
> instead of an inode to do what you're proposing?
>

Not if the fs_view could have a root that is different from the sb root;
then I could attach an fsnotify mark to the fs_view instead of, say, a
vfsmount, and I'd have the information I need inside fsnotify_move().

Thanks,
Amir.
Thread overview: 10+ messages
[not found] <20180508180436.716-1-mfasheh@suse.de>
2018-05-08 23:38 ` [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root Dave Chinner
2018-05-09 2:06 ` Jeff Mahoney
2018-05-09 6:41 ` Dave Chinner
2018-06-05 20:17 ` Jeff Mahoney
2018-06-06 9:49 ` Amir Goldstein
2018-06-06 20:42 ` Mark Fasheh
2018-06-07 6:06 ` Amir Goldstein
2018-06-07 20:44 ` Mark Fasheh
2018-06-06 21:19 ` Jeff Mahoney
2018-06-07 6:17 ` Amir Goldstein