linux-security-module.vger.kernel.org archive mirror
* lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
       [not found] ` <20241002014017.3801899-5-david@fromorbit.com>
@ 2024-10-03  7:23   ` Christoph Hellwig
  2024-10-03  7:38     ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module

On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
>  
>  /*
>   * Release the inodes used in a security policy.
> - *
> - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
>   */
> +static int release_inode_fn(struct inode *inode, void *data)

Looks like this is called from the sb_delete LSM hook, which
is only implemented by landlock, and only called from
generic_shutdown_super, separated from evict_inodes only by a call
to fsnotify_sb_delete.  Why did LSM not hook into that and instead
add another iteration of the per-sb inode list?

(Note that this is not trying to get Dave to fix this, I just noticed
it when reviewing this series.)
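For readers skimming the thread, the ordering Christoph refers to can be sketched as a stub program. The function names match the kernel's, but the bodies here are stand-ins that only record the call order; the real generic_shutdown_super() in fs/super.c does far more:

```c
/*
 * Stub model of the unmount ordering under discussion: in
 * generic_shutdown_super() the three sb->s_inodes walks happen
 * back to back.  Bodies are stand-ins, not kernel code.
 */

static const char *call_log[3];
static int call_count;

static void log_call(const char *name)
{
	if (call_count < 3)
		call_log[call_count] = name;
	call_count++;
}

static void evict_inodes_model(void)       { log_call("evict_inodes"); }
static void fsnotify_sb_delete_model(void) { log_call("fsnotify_sb_delete"); }
static void security_sb_delete_model(void) { log_call("security_sb_delete"); }

/* Each of these three steps iterates sb->s_inodes once. */
static void generic_shutdown_super_model(void)
{
	evict_inodes_model();
	fsnotify_sb_delete_model();	/* also walks s_inodes today */
	security_sb_delete_model();	/* -> landlock's hook_sb_delete() */
}
```

Each of the three calls walks the per-sb inode list separately today, which is the triple iteration the thread is about.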


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03  7:23   ` lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Christoph Hellwig
@ 2024-10-03  7:38     ` Christoph Hellwig
  2024-10-03 11:57       ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module, Jan Kara, Amir Goldstein

On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> >  
> >  /*
> >   * Release the inodes used in a security policy.
> > - *
> > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> >   */
> > +static int release_inode_fn(struct inode *inode, void *data)
> 
> Looks like this is called from the sb_delete LSM hook, which
> is only implemented by landlock, and only called from
> generic_shutdown_super, separated from evict_inodes only by a call
> to fsnotify_sb_delete.  Why did LSM not hook into that and instead

And the main thing that fsnotify_sb_delete does is yet another inode
iteration...

Any chance you all could get together and figure out how to get down
to a single sb inode iteration per unmount?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03  7:38     ` Christoph Hellwig
@ 2024-10-03 11:57       ` Jan Kara
  2024-10-03 12:11         ` Christoph Hellwig
  2024-10-07 20:37         ` Linus Torvalds
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Kara @ 2024-10-03 11:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Jan Kara,
	Amir Goldstein

On Thu 03-10-24 00:38:05, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > > --- a/security/landlock/fs.c
> > > +++ b/security/landlock/fs.c
> > > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> > >  
> > >  /*
> > >   * Release the inodes used in a security policy.
> > > - *
> > > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> > >   */
> > > +static int release_inode_fn(struct inode *inode, void *data)
> > 
> > Looks like this is called from the sb_delete LSM hook, which
> > is only implemented by landlock, and only called from
> > generic_shutdown_super, separated from evict_inodes only by a call
> > to fsnotify_sb_delete.  Why did LSM not hook into that and instead
> 
> And the main thing that fsnotify_sb_delete does is yet another inode
> iteration...
> 
> Any chance you all could get together and figure out how to get down
> to a single sb inode iteration per unmount?

Fair enough. If we go with the iterator variant I've suggested to Dave in
[1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
Landlock's hook_sb_delete() into a single iteration relatively easily. But
I'd wait with that conversion until this series lands.

								Honza

[1] https://lore.kernel.org/all/20241003114555.bl34fkqsja4s5tok@quack3
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 11:57       ` Jan Kara
@ 2024-10-03 12:11         ` Christoph Hellwig
  2024-10-03 12:26           ` Jan Kara
  2024-10-07 20:37         ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> Fair enough. If we go with the iterator variant I've suggested to Dave in
> [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> Landlock's hook_sb_delete() into a single iteration relatively easily. But
> I'd wait with that conversion until this series lands.

I don't see how that has anything to do with iterators or not.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:11         ` Christoph Hellwig
@ 2024-10-03 12:26           ` Jan Kara
  2024-10-03 12:39             ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2024-10-03 12:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > I'd wait with that conversion until this series lands.
> 
> I don't see how that has anything to do with iterators or not.

Well, the patches would obviously conflict, which seems pointless if we
could live with three iterations for a few years until somebody noticed :).
And with Dave's current version of the iterators it will not be possible to
integrate the evict_inodes() iteration with the other two without a layering
violation. Still, we could go from 3 to 2 iterations.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:26           ` Jan Kara
@ 2024-10-03 12:39             ` Christoph Hellwig
  2024-10-03 12:56               ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > I'd wait with that conversion until this series lands.
> > 
> > I don't see how that has anything to do with iterators or not.
> 
> Well, the patches would obviously conflict

Conflict with what?

> which seems pointless if we
> could live with three iterations for a few years until somebody noticed :).
> And with Dave's current version of the iterators it will not be possible to
> integrate the evict_inodes() iteration with the other two without a layering
> violation. Still, we could go from 3 to 2 iterations.

What layering violation?

Below is a quick compile-tested patch to do the fsnotify side and
get rid of the fsnotify iteration, which looks easily worth it.

landlock is a bit more complex because of the LSM hooks and the internal
underobj abstraction inside of landlock.  But looking at the
release-inode-data vs unmount synchronization it has and the duplicate
code, I think doing it this way is worth the effort even more; it'll just
need someone who knows landlock and the LSM layering to help with the work.

diff --git a/fs/inode.c b/fs/inode.c
index 3f335f78c5b228..7d5f8a09e4d29d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
  */
 static int evict_inode_fn(struct inode *inode, void *data)
 {
+	struct super_block *sb = inode->i_sb;
 	struct list_head *dispose = data;
+	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
 
 	spin_lock(&inode->i_lock);
-	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+	if (atomic_read(&inode->i_count)) {
+		spin_unlock(&inode->i_lock);
+
+		/* for each watch, send FS_UNMOUNT and then remove it */
+		if (post_unmount && fsnotify_sb_info(sb)) {
+			fsnotify_inode(inode, FS_UNMOUNT);
+			fsnotify_inode_delete(inode);
+		}
+		return INO_ITER_DONE;
+	}
+
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 		spin_unlock(&inode->i_lock);
 		return INO_ITER_DONE;
 	}
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 68c34ed9427190..cf89aa69e82c8d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -28,16 +28,6 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 	fsnotify_clear_marks_by_mount(mnt);
 }
 
-static int fsnotify_unmount_inode_fn(struct inode *inode, void *data)
-{
-	spin_unlock(&inode->i_lock);
-
-	/* for each watch, send FS_UNMOUNT and then remove it */
-	fsnotify_inode(inode, FS_UNMOUNT);
-	fsnotify_inode_delete(inode);
-	return INO_ITER_DONE;
-}
-
 void fsnotify_sb_delete(struct super_block *sb)
 {
 	struct fsnotify_sb_info *sbinfo = fsnotify_sb_info(sb);
@@ -46,19 +36,6 @@ void fsnotify_sb_delete(struct super_block *sb)
 	if (!sbinfo)
 		return;
 
-	/*
-	 * If i_count is zero, the inode cannot have any watches and
-	 * doing an __iget/iput with SB_ACTIVE clear would actually
-	 * evict all inodes with zero i_count from icache which is
-	 * unnecessarily violent and may in fact be illegal to do.
-	 * However, we should have been called /after/ evict_inodes
-	 * removed all zero refcount inodes, in any case. Hence we use
-	 * INO_ITER_REFERENCED to ensure zero refcount inodes are filtered
-	 * properly.
-	 */
-	super_iter_inodes(sb, fsnotify_unmount_inode_fn, NULL,
-			INO_ITER_REFERENCED);
-
 	fsnotify_clear_marks_by_sb(sb);
 	/* Wait for outstanding object references from connectors */
 	wait_var_event(fsnotify_sb_watched_objects(sb),
diff --git a/fs/super.c b/fs/super.c
index 971ad4e996e0ba..88dd1703fe73db 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -167,28 +167,17 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	wake_up_var(&sb->s_flags);
 }
 
-bool super_iter_iget(struct inode *inode, int flags)
+bool super_iter_iget(struct inode *inode)
 {
-	bool	ret = false;
+	bool ret = false;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
-		goto out_unlock;
-
-	/*
-	 * Skip over zero refcount inode if the caller only wants
-	 * referenced inodes to be iterated.
-	 */
-	if ((flags & INO_ITER_REFERENCED) &&
-	    !atomic_read(&inode->i_count))
-		goto out_unlock;
-
-	__iget(inode);
-	ret = true;
-out_unlock:
+	if (!(inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+		__iget(inode);
+		ret = true;
+	}
 	spin_unlock(&inode->i_lock);
 	return ret;
-
 }
 EXPORT_SYMBOL_GPL(super_iter_iget);
 
@@ -216,7 +205,7 @@ int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (!super_iter_iget(inode, flags))
+		if (!super_iter_iget(inode))
 			continue;
 		spin_unlock(&sb->s_inode_list_lock);
 		iput(old_inode);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ee544556cee728..5a174e690424fb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1654,8 +1654,7 @@ xfs_iter_vfs_igrab(
 	if (ip->i_flags & XFS_ITER_VFS_NOGRAB_IFLAGS)
 		goto out_unlock_noent;
 
-	if ((flags & INO_ITER_UNSAFE) ||
-	    super_iter_iget(inode, flags))
+	if ((flags & INO_ITER_UNSAFE) || super_iter_iget(inode))
 		ret = true;
 
 out_unlock_noent:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2aa335228b84bf..a3c682f0d94c1b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2224,7 +2224,7 @@ enum freeze_holder {
 typedef int (*ino_iter_fn)(struct inode *inode, void *priv);
 int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 		void *private_data, int flags);
-bool super_iter_iget(struct inode *inode, int flags);
+bool super_iter_iget(struct inode *inode);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:39             ` Christoph Hellwig
@ 2024-10-03 12:56               ` Jan Kara
  2024-10-03 13:04                 ` Christoph Hellwig
  2024-10-03 13:59                 ` Dave Chinner
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Kara @ 2024-10-03 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > > I'd wait with that conversion until this series lands.
> > > 
> > > I don't see how that has anything to do with iterators or not.
> > 
> > Well, the patches would obviously conflict
> 
> Conflict with what?

I thought you wanted the iterations to be unified in the current state of
the code. If you meant after Dave's series, then we are in agreement.

> > which seems pointless if we
> > could live with three iterations for a few years until somebody noticed :).
> > And with Dave's current version of the iterators it will not be possible to
> > integrate the evict_inodes() iteration with the other two without a layering
> > violation. Still, we could go from 3 to 2 iterations.
> 
> What layering violation?
> 
> Below is a quick compile-tested patch to do the fsnotify side and
> get rid of the fsnotify iteration, which looks easily worth it.

...

> @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
>   */
>  static int evict_inode_fn(struct inode *inode, void *data)
>  {
> +	struct super_block *sb = inode->i_sb;
>  	struct list_head *dispose = data;
> +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
>  
>  	spin_lock(&inode->i_lock);
> -	if (atomic_read(&inode->i_count) ||
> -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> +	if (atomic_read(&inode->i_count)) {
> +		spin_unlock(&inode->i_lock);
> +
> +		/* for each watch, send FS_UNMOUNT and then remove it */
> +		if (post_unmount && fsnotify_sb_info(sb)) {
> +			fsnotify_inode(inode, FS_UNMOUNT);
> +			fsnotify_inode_delete(inode);
> +		}

This will not work because you are in the unsafe iterator holding
sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
iget/iput dance and release s_inode_list_lock, which does not work
when a filesystem has its own inode iterator AFAICT... That's why I've
called it a layering violation.
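Jan's constraint can be modeled outside the kernel. The sketch below (stand-in types and names, not kernel code) shows the shape of the iget/iput dance: take a reference under the list lock, drop the lock to call out, then retake it and drop the reference:

```c
/*
 * Userspace model of the iget/iput dance (a sketch, all names are
 * stand-ins): the per-sb list lock is dropped before calling out of
 * the iterator, and a held reference keeps the current inode alive so
 * the walk can resume from it afterwards.
 */

struct inode {
	int i_count;		/* models inode->i_count */
	int notified;		/* set by the modeled fsnotify call */
	struct inode *next;	/* models the sb->s_inodes linkage */
};

static int list_locked;			/* models sb->s_inode_list_lock */
static int called_out_under_lock;	/* records the bug we must avoid */

static void fsnotify_inode_model(struct inode *inode)
{
	if (list_locked)
		called_out_under_lock = 1;	/* layering violation */
	inode->notified = 1;
}

static void iterate_inodes(struct inode *head)
{
	list_locked = 1;			/* spin_lock(list lock) */
	for (struct inode *i = head; i; i = i->next) {
		i->i_count++;			/* __iget(): pin 'i' */
		list_locked = 0;		/* drop the lock to call out */
		fsnotify_inode_model(i);
		list_locked = 1;		/* retake; 'i' is still pinned */
		i->i_count--;			/* iput() */
	}
	list_locked = 0;
}
```

A filesystem-private iterator cannot interleave this lock dropping with the VFS walk, which is the layering problem Jan points at.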

									Honza

> +		return INO_ITER_DONE;
> +	}
> +
> +	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  		spin_unlock(&inode->i_lock);
>  		return INO_ITER_DONE;
>  	}
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:56               ` Jan Kara
@ 2024-10-03 13:04                 ` Christoph Hellwig
  2024-10-03 13:59                 ` Dave Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-03 13:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in the unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget/iput dance and release s_inode_list_lock, which does not work
> when a filesystem has its own inode iterator AFAICT... That's why I've
> called it a layering violation.

Ah, yes.  So we'll need to special-case it somehow either way.  It still
feels saner to do it in one iteration and make the inode eviction not
use the unsafe version, but maybe that's indeed better postponed until
after Dave's series.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:56               ` Jan Kara
  2024-10-03 13:04                 ` Christoph Hellwig
@ 2024-10-03 13:59                 ` Dave Chinner
  2024-10-03 16:17                   ` Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2024-10-03 13:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> >   */
> >  static int evict_inode_fn(struct inode *inode, void *data)
> >  {
> > +	struct super_block *sb = inode->i_sb;
> >  	struct list_head *dispose = data;
> > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> >  
> >  	spin_lock(&inode->i_lock);
> > -	if (atomic_read(&inode->i_count) ||
> > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in the unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget/iput dance and release s_inode_list_lock, which does not work
> when a filesystem has its own inode iterator AFAICT... That's why I've
> called it a layering violation.

The whole point of the iget/iput dance is to stabilise the
s_inodes list iteration whilst it is unlocked - the actual fsnotify
calls don't need an inode reference to work correctly.

IOWs, we don't need to run the fsnotify stuff right here - we can
defer that like we do with the dispose list for all the inodes we
mark as I_FREEING here.

So if we pass a structure:

struct evict_inode_args {
	struct list_head	dispose;
	struct list_head	fsnotify;
};

If we use __iget() instead of requiring an inode state flag to keep
the inode off the LRU for the fsnotify cleanup, then the code
fragment above becomes:

	if (atomic_read(&inode->i_count)) {
		if (post_unmount && fsnotify_sb_info(sb)) {
			__iget(inode);
			inode_lru_list_del(inode);
			spin_unlock(&inode->i_lock);
			list_add(&inode->i_lru, &args->fsnotify);
		}
		return INO_ITER_DONE;
	}

And then once we return to evict_inodes(), we do this:

	while (!list_empty(&args->fsnotify)) {
		struct inode *inode;

		inode = list_first_entry(&args->fsnotify, struct inode, i_lru);
		list_del_init(&inode->i_lru);

		fsnotify_inode(inode, FS_UNMOUNT);
		fsnotify_inode_delete(inode);
		iput(inode);
		cond_resched();
	}

And so now all the fsnotify cleanup is done outside the traversal in
one large batch from evict_inodes().
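The two-phase scheme above can be modeled in userspace as follows (stand-in names and list handling; the real code uses list_head, i_lock and the dispose list, none of which are modeled here):

```c
/*
 * Userspace sketch of the two-phase batching: the locked scan only
 * collects referenced inodes onto a private list, and the fsnotify
 * calls plus the final reference drop happen later, outside the list
 * lock, in one batch.  All names are stand-ins, not kernel code.
 */

struct inode {
	int i_count;			/* models inode->i_count */
	int unmount_sent;		/* models fsnotify_inode(FS_UNMOUNT) */
	struct inode *next_sb;		/* models sb->s_inodes linkage */
	struct inode *next_defer;	/* models i_lru reused for the batch */
};

struct evict_inode_args {
	struct inode *fsnotify;		/* the deferred batch */
};

/* Phase 1: runs with the list "locked"; never calls out. */
static void scan_inodes(struct inode *sb_inodes, struct evict_inode_args *args)
{
	for (struct inode *i = sb_inodes; i; i = i->next_sb) {
		if (i->i_count) {
			i->i_count++;			/* __iget() */
			i->next_defer = args->fsnotify;
			args->fsnotify = i;
		}
		/* zero-refcount inodes would go to the dispose list here */
	}
}

/* Phase 2: runs after the list lock has been dropped. */
static void process_deferred(struct evict_inode_args *args)
{
	while (args->fsnotify) {
		struct inode *i = args->fsnotify;

		args->fsnotify = i->next_defer;
		i->unmount_sent = 1;		/* the fsnotify calls */
		i->i_count--;			/* iput() */
	}
}
```

The point of the split is that phase 2 is free to sleep and call into fsnotify, since it no longer touches the s_inodes list at all.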

As for the landlock code, I think it needs to have its own internal
tracking mechanism and not search the sb inode list for inodes that
it holds references to. LSM cleanup should be run before we
get to tearing down the inode cache, not after....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 13:59                 ` Dave Chinner
@ 2024-10-03 16:17                   ` Jan Kara
  2024-10-04  0:46                     ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2024-10-03 16:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> > >   */
> > >  static int evict_inode_fn(struct inode *inode, void *data)
> > >  {
> > > +	struct super_block *sb = inode->i_sb;
> > >  	struct list_head *dispose = data;
> > > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> > >  
> > >  	spin_lock(&inode->i_lock);
> > > -	if (atomic_read(&inode->i_count) ||
> > > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > > +	if (atomic_read(&inode->i_count)) {
> > > +		spin_unlock(&inode->i_lock);
> > > +
> > > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > > +			fsnotify_inode(inode, FS_UNMOUNT);
> > > +			fsnotify_inode_delete(inode);
> > > +		}
> > 
> > This will not work because you are in the unsafe iterator holding
> > sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> > iget/iput dance and release s_inode_list_lock, which does not work
> > when a filesystem has its own inode iterator AFAICT... That's why I've
> > called it a layering violation.
> 
> The whole point of the iget/iput dance is to stabilise the
> s_inodes list iteration whilst it is unlocked - the actual fsnotify
> calls don't need an inode reference to work correctly.
> 
> IOWs, we don't need to run the fsnotify stuff right here - we can
> defer that like we do with the dispose list for all the inodes we
> mark as I_FREEING here.
> 
> So if we pass a structure:
> 
> struct evict_inode_args {
> 	struct list_head	dispose;
> 	struct list_head	fsnotify;
> };
> 
> If we use __iget() instead of requiring an inode state flag to keep
> the inode off the LRU for the fsnotify cleanup, then the code
> fragment above becomes:
> 
> 	if (atomic_read(&inode->i_count)) {
> 		if (post_unmount && fsnotify_sb_info(sb)) {
> 			__iget(inode);
> 			inode_lru_list_del(inode);
> 			spin_unlock(&inode->i_lock);
> 			list_add(&inode->i_lru, &args->fsnotify);
> 		}

Nit: need to release i_lock in the else branch here.  Otherwise an
interesting idea. Yes, something like this could work even in the unsafe
iterator.

> 		return INO_ITER_DONE;
> 	}
> And then once we return to evict_inodes(), we do this:
> 
> 	while (!list_empty(&args->fsnotify)) {
> 		struct inode *inode;
> 
> 		inode = list_first_entry(&args->fsnotify, struct inode, i_lru);
> 		list_del_init(&inode->i_lru);
> 
> 		fsnotify_inode(inode, FS_UNMOUNT);
> 		fsnotify_inode_delete(inode);
> 		iput(inode);
> 		cond_resched();
> 	}
> 
> And so now all the fsnotify cleanup is done outside the traversal in
> one large batch from evict_inodes().

Yup.

> As for the landlock code, I think it needs to have its own internal
> tracking mechanism and not search the sb inode list for inodes that
> it holds references to. LSM cleanup should be run before we
> get to tearing down the inode cache, not after....

Well, I think LSM cleanup could in principle be handled together with the
fsnotify cleanup but I didn't check the details.


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 16:17                   ` Jan Kara
@ 2024-10-04  0:46                     ` Dave Chinner
  2024-10-04  7:21                       ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2024-10-04  0:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > As for the landlock code, I think it needs to have its own internal
> > tracking mechanism and not search the sb inode list for inodes that
> > it holds references to. LSM cleanup should be run before we
> > get to tearing down the inode cache, not after....
> 
> Well, I think LSM cleanup could in principle be handled together with the
> fsnotify cleanup but I didn't check the details.

I'm not sure how we tell if an inode potentially has an LSM-related
reference hanging off it. The landlock code seems to assume that the
only referenced inodes it sees will have a valid inode->i_security
pointer if landlock is enabled, i.e. it calls landlock_inode(inode)
and dereferences the returned value without ever checking whether
inode->i_security is NULL.

I mean, we could do a check for inode->i_security when the refcount
is elevated and replace the security_sb_delete hook with a
security_evict_inode hook similar to the proposed fsnotify eviction
from evict_inodes().

But screwing with the LSM infrastructure looks .... obnoxiously complex
from the outside...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  0:46                     ` Dave Chinner
@ 2024-10-04  7:21                       ` Christian Brauner
  2024-10-04 12:14                         ` Christoph Hellwig
  2024-10-04 22:57                         ` Dave Chinner
  0 siblings, 2 replies; 34+ messages in thread
From: Christian Brauner @ 2024-10-04  7:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > As for the landlock code, I think it needs to have its own internal
> > > tracking mechanism and not search the sb inode list for inodes that
> > > it holds references to. LSM cleanup should be run before we
> > > get to tearing down the inode cache, not after....
> > 
> > Well, I think LSM cleanup could in principle be handled together with the
> > fsnotify cleanup but I didn't check the details.
> 
> I'm not sure how we tell if an inode potentially has an LSM-related
> reference hanging off it. The landlock code seems to assume that the
> only referenced inodes it sees will have a valid inode->i_security
> pointer if landlock is enabled, i.e. it calls landlock_inode(inode)
> and dereferences the returned value without ever checking whether
> inode->i_security is NULL.
> 
> I mean, we could do a check for inode->i_security when the refcount
> is elevated and replace the security_sb_delete hook with a
> security_evict_inode hook similar to the proposed fsnotify eviction
> from evict_inodes().
> 
> But screwing with the LSM infrastructure looks .... obnoxiously complex
> from the outside...

Imho, please just focus on the immediate feedback and ignore all the
extra bells and whistles that we could or should do. I prefer all of
that to be done after this series lands.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  7:21                       ` Christian Brauner
@ 2024-10-04 12:14                         ` Christoph Hellwig
  2024-10-04 13:49                           ` Jan Kara
  2024-10-04 22:57                         ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > But screwing with the LSM infrastructure looks .... obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

For the LSM mess: absolutely.  For fsnotify it seems like Dave has
a good idea to integrate it, and it removes the somewhat awkward
need for the reffed flag.  So if that delayed notify idea works out
I'd prefer to see that in over the flag.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 12:14                         ` Christoph Hellwig
@ 2024-10-04 13:49                           ` Jan Kara
  2024-10-04 18:15                             ` Paul Moore
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2024-10-04 13:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Dave Chinner, Jan Kara, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > But screwing with LSM infrastructure looks ....  obnoxiously complex
> > > from the outside...
> > 
> > Imho, please just focus on the immediate feedback and ignore all the
> > extra bells and whistles that we could or should do. I prefer all of
> > that to be done after this series lands.
> 
> For the LSM mess: absolutely.  For fsnotify it seems like Dave has
> a good idea to integrate it, and it removes the somewhat awkward
> need for the reffed flag.  So if that delayed notify idea works out
> I'd prefer to see that in over the flag.

As I wrote in one of the emails in this (now huge) thread, I'm fine with
completely dropping that inode->i_refcount check from the
fsnotify_unmount_inodes(). It made sense when it was called before
evict_inodes() but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
evict_inodes") the usefulness of this check is rather doubtful. So we can
drop the awkward flag regardless whether we unify evict_inodes() with
fsnotify_unmount_inodes() or not. I have no strong preference whether the
unification happens as part of this patch set or later on so it's up to
Dave as far as I'm concerned.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 13:49                           ` Jan Kara
@ 2024-10-04 18:15                             ` Paul Moore
  0 siblings, 0 replies; 34+ messages in thread
From: Paul Moore @ 2024-10-04 18:15 UTC (permalink / raw)
  To: Mickaël Salaün, Günther Noack
  Cc: Jan Kara, Christoph Hellwig, Christian Brauner, Dave Chinner,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module, Amir Goldstein

On Fri, Oct 4, 2024 at 9:49 AM Jan Kara <jack@suse.cz> wrote:
> On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> > On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > > But screwing with LSM infrastructure looks ....  obnoxiously complex
> > > > from the outside...
> > >
> > > Imho, please just focus on the immediate feedback and ignore all the
> > > extra bells and whistles that we could or should do. I prefer all of
> > > that to be done after this series lands.
> >
> > For the LSM mess: absolutely.  For fsnotify it seems like Dave has
> > a good idea to integrate it, and it removes the somewhat awkward
> > need for the reffed flag.  So if that delayed notify idea works out
> > I'd prefer to see that in over the flag.
>
> As I wrote in one of the emails in this (now huge) thread, I'm fine with
> completely dropping that inode->i_refcount check from the
> fsnotify_unmount_inodes(). It made sense when it was called before
> evict_inodes() but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
> evict_inodes") the usefulness of this check is rather doubtful. So we can
> drop the awkward flag regardless whether we unify evict_inodes() with
> fsnotify_unmount_inodes() or not. I have no strong preference whether the
> unification happens as part of this patch set or later on so it's up to
> Dave as far as I'm concerned.

I didn't get a chance to look at this thread until just now, and I'm
noticing that the email used for Mickaël is likely not the best; I'm
adding the email he uses in MAINTAINERS as well as that of Günther
Noack, a designated Landlock reviewer.

Mickaël, Günther, the lore link for the full discussion is below:

https://lore.kernel.org/all/Zv5GfY1WS_aaczZM@infradead.org

-- 
paul-moore.com


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  7:21                       ` Christian Brauner
  2024-10-04 12:14                         ` Christoph Hellwig
@ 2024-10-04 22:57                         ` Dave Chinner
  2024-10-05 15:21                           ` Mickaël Salaün
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2024-10-04 22:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> > On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > > As for the landlock code, I think it needs to have its own internal
> > > > tracking mechanism and not search the sb inode list for inodes that
> > > > it holds references to. LSM cleanup should be run before we
> > > > get to tearing down the inode cache, not after....
> > > 
> > > Well, I think LSM cleanup could in principle be handled together with the
> > > fsnotify cleanup but I didn't check the details.
> > 
> > I'm not sure how we tell if an inode potentially has a LSM related
> > reference hanging off it. The landlock code looks to make an
> > assumption in that the only referenced inodes it sees will have a
> > valid inode->i_security pointer if landlock is enabled. i.e. it
> > calls landlock_inode(inode) and dereferences the returned value
> > without ever checking if inode->i_security is NULL or not.
> > 
> > I mean, we could do a check for inode->i_security when the refcount
> > is elevated and replace the security_sb_delete hook with an
> > security_evict_inode hook similar to the proposed fsnotify eviction
> > from evict_inodes().
> > 
> > But screwing with LSM infrastructure looks ....  obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

Actually, it's not as bad as I thought it was going to be. I've
already moved both fsnotify and LSM inode eviction to
evict_inodes() as preparatory patches...

Dave Chinner (2):
      vfs: move fsnotify inode eviction to evict_inodes()
      vfs, lsm: rework lsm inode eviction at unmount

 fs/inode.c                    |  52 +++++++++++++---
 fs/notify/fsnotify.c          |  60 -------------------
 fs/super.c                    |   8 +--
 include/linux/lsm_hook_defs.h |   2 +-
 include/linux/security.h      |   2 +-
 security/landlock/fs.c        | 134 ++++++++++--------------------------------
 security/security.c           |  31 ++++++----
7 files changed, 99 insertions(+), 190 deletions(-)

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 22:57                         ` Dave Chinner
@ 2024-10-05 15:21                           ` Mickaël Salaün
  2024-10-05 16:03                             ` Mickaël Salaün
  2024-10-05 16:03                             ` Paul Moore
  0 siblings, 2 replies; 34+ messages in thread
From: Mickaël Salaün @ 2024-10-05 15:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Paul Moore, Günther Noack

On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:
> On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> > > On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > > > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > > > As for the landlock code, I think it needs to have its own internal
> > > > > tracking mechanism and not search the sb inode list for inodes that
> > > > > it holds references to. LSM cleanup should be run before we
> > > > > get to tearing down the inode cache, not after....
> > > > 
> > > > Well, I think LSM cleanup could in principle be handled together with the
> > > > fsnotify cleanup but I didn't check the details.
> > > 
> > > I'm not sure how we tell if an inode potentially has a LSM related
> > > reference hanging off it. The landlock code looks to make an
> > > assumption in that the only referenced inodes it sees will have a
> > > valid inode->i_security pointer if landlock is enabled. i.e. it
> > > calls landlock_inode(inode) and dereferences the returned value
> > > without ever checking if inode->i_security is NULL or not.

Correct, i_security should always be valid when this hook is called:
the hook only runs when at least Landlock is enabled, and in that case
i_security refers to a valid LSM blob.
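
As a minimal sketch of that invariant (kernel-style pseudocode, not the
actual security/landlock/fs.c; landlock_inode() and WARN_ON_ONCE() are
real names, the rest is illustration and elides the real hook's RCU and
locking details):

```c
/*
 * Non-compilable sketch: the release callback relies on i_security
 * having been set up at inode allocation, so it is non-NULL whenever
 * this runs with Landlock enabled.
 */
static int release_inode_fn(struct inode *inode, void *data)
{
	/* Make the invariant explicit instead of implicit. */
	if (WARN_ON_ONCE(!inode->i_security))
		return 0;

	/* landlock_inode() just offsets into the i_security blob. */
	struct landlock_object *object = landlock_inode(inode)->object;

	/* ... drop object's inode reference as the real hook does ... */
	return 0;
}
```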

> > > 
> > > I mean, we could do a check for inode->i_security when the refcount
> > > is elevated and replace the security_sb_delete hook with an
> > > security_evict_inode hook similar to the proposed fsnotify eviction
> > > from evict_inodes().

That would be nice.

> > > 
> > > But screwing with LSM infrastructure looks ....  obnoxiously complex
> > > from the outside...
> > 
> > Imho, please just focus on the immediate feedback and ignore all the
> > extra bells and whistles that we could or should do. I prefer all of
> > that to be done after this series lands.
> 
> Actually, it's not as bad as I thought it was going to be. I've
> already moved both fsnotify and LSM inode eviction to
> evict_inodes() as preparatory patches...

Good, please Cc me and Günther on related patch series.

FYI, we have the two release_inodes tests to check this hook in
tools/testing/selftests/landlock/fs_test.c

> 
> Dave Chinner (2):
>       vfs: move fsnotify inode eviction to evict_inodes()
>       vfs, lsm: rework lsm inode eviction at unmount
> 
>  fs/inode.c                    |  52 +++++++++++++---
>  fs/notify/fsnotify.c          |  60 -------------------
>  fs/super.c                    |   8 +--
>  include/linux/lsm_hook_defs.h |   2 +-
>  include/linux/security.h      |   2 +-
>  security/landlock/fs.c        | 134 ++++++++++--------------------------------
>  security/security.c           |  31 ++++++----
> 7 files changed, 99 insertions(+), 190 deletions(-)
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-05 15:21                           ` Mickaël Salaün
@ 2024-10-05 16:03                             ` Mickaël Salaün
  2024-10-05 16:03                             ` Paul Moore
  1 sibling, 0 replies; 34+ messages in thread
From: Mickaël Salaün @ 2024-10-05 16:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Paul Moore, Günther Noack

On Sat, Oct 05, 2024 at 05:21:30PM +0200, Mickaël Salaün wrote:
> On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:

> > Actually, it's not as bad as I thought it was going to be. I've
> > already moved both fsnotify and LSM inode eviction to
> > evict_inodes() as preparatory patches...
> 
> Good, please Cc me and Günther on related patch series.
> 
> FYI, we have the two release_inodes tests to check this hook in
> tools/testing/selftests/landlock/fs_test.c
> 
> > 
> > Dave Chinner (2):
> >       vfs: move fsnotify inode eviction to evict_inodes()
> >       vfs, lsm: rework lsm inode eviction at unmount
> > 
> >  fs/inode.c                    |  52 +++++++++++++---
> >  fs/notify/fsnotify.c          |  60 -------------------
> >  fs/super.c                    |   8 +--
> >  include/linux/lsm_hook_defs.h |   2 +-
> >  include/linux/security.h      |   2 +-
> >  security/landlock/fs.c        | 134 ++++++++++--------------------------------

Please run clang-format -i security/landlock/fs.c

> >  security/security.c           |  31 ++++++----
> > 7 files changed, 99 insertions(+), 190 deletions(-)
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-05 15:21                           ` Mickaël Salaün
  2024-10-05 16:03                             ` Mickaël Salaün
@ 2024-10-05 16:03                             ` Paul Moore
  1 sibling, 0 replies; 34+ messages in thread
From: Paul Moore @ 2024-10-05 16:03 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Dave Chinner, Christian Brauner, Jan Kara, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein, Günther Noack

On Sat, Oct 5, 2024 at 11:21 AM Mickaël Salaün <mic@digikod.net> wrote:
> On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:

...

> > Actually, it's not as bad as I thought it was going to be. I've
> > already moved both fsnotify and LSM inode eviction to
> > evict_inodes() as preparatory patches...
>
> Good, please Cc me and Günther on related patch series.

As well as the LSM list since the LSM framework looks to have some
changes as well.

-- 
paul-moore.com


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 11:57       ` Jan Kara
  2024-10-03 12:11         ` Christoph Hellwig
@ 2024-10-07 20:37         ` Linus Torvalds
  2024-10-07 23:33           ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2024-10-07 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
>
> Fair enough. If we go with the iterator variant I've suggested to Dave in
> [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> Landlocks hook_sb_delete() into a single iteration relatively easily. But
I'd wait with that conversion until this series lands.

Honza, I looked at this a bit more, particularly with an eye of "what
happens if we just end up making the inode lifetimes subject to the
dentry lifetimes" as suggested by Dave elsewhere.

And honestly, the whole inode list use by the fsnotify layer seems to
kind of suck. But I may be entirely missing something, so maybe I'm
very wrong for some reason.

The reason I say it "seems to kind of suck" is that the whole final

                /* for each watch, send FS_UNMOUNT and then remove it */
                fsnotify_inode(inode, FS_UNMOUNT);

                fsnotify_inode_delete(inode);

sequence seems to be entirely timing-dependent, and largely pointless and wrong.

Why?

Because inodes with no users will get removed at completely arbitrary
times under memory pressure in evict() -> destroy_inode(), and
obviously with I_DONTCACHE that ends up happening even earlier when
the dentry is removed.

So the whole "send FS_UNMOUNT and then remove it" thing seems to be
entirely bogus, and depending on memory pressure, lots of inodes will
only see the fsnotify_inode_delete() at eviction time and never get
the FS_UNMOUNT notification anyway.

So I get the feeling that we'd be better off entirely removing the
sb->s_inodes use from fsnotify, and replace this "get rid of them at
umount" with something like this instead:

  diff --git a/fs/dcache.c b/fs/dcache.c
  index 0f6b16ba30d0..aa2558de8d1f 100644
  --- a/fs/dcache.c
  +++ b/fs/dcache.c
  @@ -406,6 +406,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
        spin_unlock(&inode->i_lock);
        if (!inode->i_nlink)
                fsnotify_inoderemove(inode);
  +     fsnotify_inode_delete(inode);
        if (dentry->d_op && dentry->d_op->d_iput)
                dentry->d_op->d_iput(dentry, inode);
        else
  diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
  index 278620e063ab..ea91cc216028 100644
  --- a/include/linux/fsnotify.h
  +++ b/include/linux/fsnotify.h
  @@ -261,7 +261,6 @@ static inline void
fsnotify_vfsmount_delete(struct vfsmount *mnt)
   static inline void fsnotify_inoderemove(struct inode *inode)
   {
        fsnotify_inode(inode, FS_DELETE_SELF);
  -     __fsnotify_inode_delete(inode);
   }

   /*

which makes the fsnotify_inode_delete() happen when the inode is
removed from the dentry.

Then at umount time, the dentry shrinking will deal with all live
dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
just the root dentry inodes?

Wouldn't that make things much cleaner, and remove at least *one* odd
use of the nasty s_inodes list?

I have this feeling that maybe we can just remove the other users too
using similar models. I think the LSM layer use (in landlock) is bogus
for exactly the same reason - there's really no reason to keep things
around for a random cached inode without a dentry.

And I wonder if the quota code (which uses the s_inodes list to enable
quotas on already mounted filesystems) could for all the same reasons
just walk the dentry tree instead (and remove_dquot_ref similarly
could just remove it at dentry_unlink_inode() time)?

It really feels like most (all?) of the s_inode list users are
basically historical, and shouldn't use that list at all. And there
aren't _that_ many of them. I think Dave was right in just saying that
this list should go away entirely (or was it somebody else who made
that comment?)

                   Linus


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 20:37         ` Linus Torvalds
@ 2024-10-07 23:33           ` Dave Chinner
  2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  8:57             ` Amir Goldstein
  0 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2024-10-07 23:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> >
> > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > I'd wait with that convertion until this series lands.
> 
> Honza, I looked at this a bit more, particularly with an eye of "what
> happens if we just end up making the inode lifetimes subject to the
> dentry lifetimes" as suggested by Dave elsewhere.

....

> which makes the fsnotify_inode_delete() happen when the inode is
> removed from the dentry.

There may be other inode references being held that make
the inode live longer than the dentry cache. When should the
fsnotify marks be removed from the inode in that case? Do they need
to remain until, e.g, writeback completes?

> Then at umount time, the dentry shrinking will deal with all live
> dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> just the root dentry inodes?

I don't think even that is necessary, because
shrink_dcache_for_umount() drops the sb->s_root dentry after
trimming the dentry tree. Hence the dcache drop would cleanup all
inode references, roots included.

> Wouldn't that make things much cleaner, and remove at least *one* odd
> use of the nasty s_inodes list?

Yes, it would, but someone who knows exactly when the fsnotify
marks can be removed needs to chime in here...

> I have this feeling that maybe we can just remove the other users too
> using similar models. I think the LSM layer use (in landlock) is bogus
> for exactly the same reason - there's really no reason to keep things
> around for a random cached inode without a dentry.

Perhaps, but I'm not sure what the landlock code is actually trying
to do. It seems to be trying to avoid races between syscalls
releasing inode references and unmount calling security_sb_delete()
to clean up inode references that it has leaked. This implies that
it's a) not tracking inodes itself, and b) not cleaning up internal
state early enough in unmount.

Hence, to me, the lifecycle and reference counting of inode related
objects in landlock doesn't seem quite right, and the use of the
security_sb_delete() callout appears to be papering over an internal
lifecycle issue.

I'd love to get rid of it altogether.

> And I wonder if the quota code (which uses the s_inodes list to enable
> quotas on already mounted filesystems) could for all the same reasons
> just walk the dentry tree instead (and remove_dquot_ref similarly
> could just remove it at dentry_unlink_inode() time)?

I don't think that will work because we have to be able to modify
quota in evict() processing. This is especially true for unlinked
inodes being evicted from cache, but also the dquots need to stay
attached until writeback completes.

Hence I don't think we can remove the quota refs from the inode
before we call iput_final(), and so I think quotaoff (at least)
still needs to iterate inodes...

> It really feels like most (all?) of the s_inode list users are
> basically historical, and shouldn't use that list at all. And there
> aren't _that_ many of them. I think Dave was right in just saying that
> this list should go away entirely (or was it somebody else who made
> that comment?)

Yeah, I said that it should go away entirely.

My view of this whole s_inodes list is that subsystems that are
taking references to inodes *must* track or manage the references to
the inodes themselves.

The canonical example is the VFS itself: evict_inodes() doesn't need
to iterate s_inodes at all. It can walk the inode LRU to purge all
the unreferenced cached inodes from memory. iput_final() guarantees
that all unreferenced inodes are either put on the LRU or torn down
immediately.

Hence I think that it is a poor architectural decision to require
superblock teardown to clean up inode references random subsystems
have *leaked* to prevent UAFs.  It forces the sb to track all
inodes whether the VFS actually needs to track them or not.
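
The ownership rule above can be modeled outside the kernel (plain
userspace C; every name below is invented for illustration, none of it
is kernel code): a subsystem that takes references keeps its own
private list of exactly the references it took, so teardown never
needs a global all-objects scan.

```c
#include <assert.h>
#include <stdlib.h>

/* An object with a simple reference count. */
struct obj {
	int refcount;
};

/* One node per reference this subsystem owns. */
struct tracked {
	struct obj *o;
	struct tracked *next;
};

/* Subsystem-private tracking list: the analogue of a subsystem
 * tracking its own inode references instead of leaning on s_inodes. */
static struct tracked *subsys_list;

static void subsys_take(struct obj *o)
{
	struct tracked *t = malloc(sizeof(*t));

	o->refcount++;		/* the reference this subsystem now owns */
	t->o = o;
	t->next = subsys_list;
	subsys_list = t;
}

/* Teardown drops only the references we took; no global scan. */
static void subsys_release_all(void)
{
	while (subsys_list) {
		struct tracked *t = subsys_list;

		subsys_list = t->next;
		t->o->refcount--;
		free(t);
	}
}
```

Under this scheme the object owner never has to walk anything on the
subsystem's behalf at teardown, which is the architectural point.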

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 23:33           ` Dave Chinner
@ 2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  0:54               ` Linus Torvalds
  2024-10-08 12:59               ` Mickaël Salaün
  2024-10-08  8:57             ` Amir Goldstein
  1 sibling, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2024-10-08  0:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
>
> There may be other inode references being held that make
> the inode live longer than the dentry cache. When should the
> fsnotify marks be removed from the inode in that case? Do they need
> to remain until, e.g, writeback completes?

Note that my idea is to just remove the fsnotify marks when the dentry
discards the inode.

That means that yes, the inode may still have a lifetime after the
dentry (because of other references, _or_ just because I_DONTCACHE
isn't set and we keep caching the inode).

BUT - fsnotify won't care. There won't be any fsnotify marks on that
inode any more, and without a dentry that points to it, there's no way
to add such marks.

(A new dentry may be re-attached to such an inode, and then fsnotify
could re-add new marks, but that doesn't change anything - the next
time the dentry is detached, the marks would go away again).

And yes, this changes the timing on when fsnotify events happen, but
what I'm actually hoping for is that Jan will agree that it doesn't
actually matter semantically.

> > Then at umount time, the dentry shrinking will deal with all live
> > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > just the root dentry inodes?
>
> I don't think even that is necessary, because
> shrink_dcache_for_umount() drops the sb->s_root dentry after
> trimming the dentry tree. Hence the dcache drop would cleanup all
> inode references, roots included.

Ahh - even better.

I didn't actually look very closely at the actual umount path, I was
looking just at the fsnotify_inoderemove() place in
dentry_unlink_inode() and went "couldn't we do _this_ instead?"

> > Wouldn't that make things much cleaner, and remove at least *one* odd
> > use of the nasty s_inodes list?
>
> Yes, it would, but someone who knows exactly when the fsnotify
> marks can be removed needs to chime in here...

Yup. Honza?

(Aside: I don't actually know if you prefer Jan or Honza, so I use
both randomly and interchangeably?)

> > I have this feeling that maybe we can just remove the other users too
> > using similar models. I think the LSM layer use (in landlock) is bogus
> > for exactly the same reason - there's really no reason to keep things
> > around for a random cached inode without a dentry.
>
> Perhaps, but I'm not sure what the landlock code is actually trying
> to do.

Yeah, I wouldn't be surprised if it's just confused - it's very odd.

But I'd be perfectly happy just removing one use at a time - even if
we keep the s_inodes list around because of other users, it would
still be "one less thing".

> Hence, to me, the lifecycle and reference counting of inode related
> objects in landlock doesn't seem quite right, and the use of the
> security_sb_delete() callout appears to be papering over an internal
> lifecycle issue.
>
> I'd love to get rid of it altogether.

Yeah, I think the inode lifetime is just so random these days that
anything that depends on it is questionable.

The quota case is probably the only thing where the inode lifetime
*really* makes sense, and that's the one where I looked at the code
and went "I *hope* this can be converted to traversing the dentry
tree", but at the same time it did look sensible to make it be about
inodes.

If we can convert the quota side to be based on dentry lifetimes, it
will almost certainly then have to react to the places that do
"d_add()" when re-connecting an inode to a dentry at lookup time.

So yeah, the quota code looks worse, but even if we could just remove
fsnotify and landlock, I'd still be much happier.

             Linus


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:28             ` Linus Torvalds
@ 2024-10-08  0:54               ` Linus Torvalds
  2024-10-09  9:49                 ` Jan Kara
  2024-10-08 12:59               ` Mickaël Salaün
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2024-10-08  0:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, 7 Oct 2024 at 17:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And yes, this changes the timing on when fsnotify events happen, but
> what I'm actually hoping for is that Jan will agree that it doesn't
> actually matter semantically.

.. and yes, I realize it might actually matter. fsnotify does do
'ihold()' to hold an inode ref, and with this that would actually be
more or less pointless, because the mark would be removed _despite_
such a ref.

So maybe it's not an option to do what I suggested. I don't know the
users well enough.

         Linus


* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 23:33           ` Dave Chinner
  2024-10-08  0:28             ` Linus Torvalds
@ 2024-10-08  8:57             ` Amir Goldstein
  2024-10-08 11:23               ` Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2024-10-08  8:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > >
> > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > I'd wait with that conversion until this series lands.
> >
> > Honza, I looked at this a bit more, particularly with an eye of "what
> > happens if we just end up making the inode lifetimes subject to the
> > dentry lifetimes" as suggested by Dave elsewhere.
>
> ....
>
> > which makes the fsnotify_inode_delete() happen when the inode is
> > removed from the dentry.
>
> There may be other inode references being held that make
> the inode live longer than the dentry cache. When should the
> fsnotify marks be removed from the inode in that case? Do they need
> to remain until, e.g, writeback completes?
>

fsnotify inode marks remain until explicitly removed or until sb
is unmounted (*), so other inode references are irrelevant to
inode mark removal.

(*) fanotify has "evictable" inode marks, which do not hold inode
reference and go away on inode evict, but those mark evictions
do not generate any event (i.e. there is no FAN_UNMOUNT).

> > Then at umount time, the dentry shrinking will deal with all live
> > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > just the root dentry inodes?
>
> I don't think even that is necessary, because
> shrink_dcache_for_umount() drops the sb->s_root dentry after
> trimming the dentry tree. Hence the dcache drop would cleanup all
> inode references, roots included.
>
> > Wouldn't that make things much cleaner, and remove at least *one* odd
> > use of the nasty s_inodes list?
>
> Yes, it would, but someone who knows exactly when the fsnotify
> marks can be removed needs to chime in here...
>
> > I have this feeling that maybe we can just remove the other users too
> > using similar models. I think the LSM layer use (in landlock) is bogus
> > for exactly the same reason - there's really no reason to keep things
> > around for a random cached inode without a dentry.
>
> Perhaps, but I'm not sure what the landlock code is actually trying
> to do. It seems to be trying to avoid races between syscalls
> releasing inode references and unmount calling security_sb_delete()
> to clean up inode references that it has leaked. This implies that
> it's a) not tracking inodes itself, and b) not cleaning up internal
> state early enough in unmount.
>
> Hence, to me, the lifecycle and reference counting of inode related
> objects in landlock doesn't seem quite right, and the use of the
> security_sb_delete() callout appears to be papering over an internal
> lifecycle issue.
>
> I'd love to get rid of it altogether.
>
> > And I wonder if the quota code (which uses the s_inodes list to enable
> > quotas on already mounted filesystems) could for all the same reasons
> > just walk the dentry tree instead (and remove_dquot_ref similarly
> > could just remove it at dentry_unlink_inode() time)?
>
> I don't think that will work because we have to be able to modify
> quota in evict() processing. This is especially true for unlinked
> inodes being evicted from cache, but also the dquots need to stay
> attached until writeback completes.
>
> Hence I don't think we can remove the quota refs from the inode
> before we call iput_final(), and so I think quotaoff (at least)
> still needs to iterate inodes...
>
> > It really feels like most (all?) of the s_inode list users are
> > basically historical, and shouldn't use that list at all. And there
> > aren't _that_ many of them. I think Dave was right in just saying that
> > this list should go away entirely (or was it somebody else who made
> > that comment?)
>
> Yeah, I said that it should go away entirely.
>
> My view of this whole s_inodes list is that subsystems that are
> taking references to inodes *must* track or manage the references to
> the inodes themselves.
>
> The canonical example is the VFS itself: evict_inodes() doesn't need
> to iterate s_inodes at all. It can walk the inode LRU to purge all
> the unreferenced cached inodes from memory. iput_final() guarantees
> that all unreferenced inodes are either put on the LRU or torn down
> immediately.
>
> Hence I think that it is a poor architectural decision to require
> superblock teardown to clean up inode references random subsystems
> have *leaked* to prevent UAFs.  It forces the sb to track all
> inodes whether the VFS actually needs to track them or not.
>

For fsnotify, I think we can/should maintain a list of marked inodes
inside sb->s_fsnotify_info, we can iterate this private list in
fsnotify_unmount_inodes() to remove the marks.
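The private per-sb list idea above can be sketched in plain C: connectors for marked inodes hang off a per-superblock info structure, and unmount walks only that list instead of every inode on sb->s_inodes. All names here (sb_fsnotify_info, mark_connector, etc.) are illustrative stand-ins, not the actual kernel structures:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive list, standing in for the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}
static void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	n->next = n->prev = n;
}

struct sb_fsnotify_info {		/* hypothetical sb->s_fsnotify_info */
	struct list_head marked_inodes;	/* connectors holding inode refs */
};

struct mark_connector {			/* hypothetical fsnotify connector */
	struct list_head sb_list;	/* links into marked_inodes */
	int inode_refs;			/* models the held inode reference */
};

/* Attaching a mark pins the inode and joins the per-sb list. */
static void connector_attach(struct sb_fsnotify_info *info,
			     struct mark_connector *c)
{
	c->inode_refs = 1;
	list_add(&c->sb_list, &info->marked_inodes);
}

/*
 * Unmount-time sweep: iterate only the private list of marked inodes,
 * dropping each held reference, instead of scanning all of s_inodes.
 */
static int sweep_marked_inodes(struct sb_fsnotify_info *info)
{
	int swept = 0;

	while (info->marked_inodes.next != &info->marked_inodes) {
		struct mark_connector *c = (struct mark_connector *)
			((char *)info->marked_inodes.next -
			 offsetof(struct mark_connector, sb_list));
		list_del(&c->sb_list);
		c->inode_refs = 0;	/* the iput() equivalent */
		swept++;
	}
	return swept;
}
```

The sweep cost is proportional to the number of marked inodes, not the number of cached inodes, which is the point of keeping a private list.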

TBH, I am not sure I understand the suggested change to inode
lifetime. An inode can have a reference from a dentry or from some
subsystem (e.g. fsnotify), which is responsible for putting its held
reference before unmount. What is the alternative?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  8:57             ` Amir Goldstein
@ 2024-10-08 11:23               ` Jan Kara
  2024-10-08 12:16                 ` Christian Brauner
  2024-10-08 23:44                 ` Dave Chinner
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Kara @ 2024-10-08 11:23 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Linus Torvalds, Jan Kara, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > I'd wait with that conversion until this series lands.
> > >
> > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > happens if we just end up making the inode lifetimes subject to the
> > > dentry lifetimes" as suggested by Dave elsewhere.
> >
> > ....
> >
> > > which makes the fsnotify_inode_delete() happen when the inode is
> > > removed from the dentry.
> >
> > There may be other inode references being held that make
> > the inode live longer than the dentry cache. When should the
> > fsnotify marks be removed from the inode in that case? Do they need
> > to remain until, e.g, writeback completes?
> >
> 
> fsnotify inode marks remain until explicitly removed or until sb
> is unmounted (*), so other inode references are irrelevant to
> inode mark removal.
> 
> (*) fanotify has "evictable" inode marks, which do not hold inode
> reference and go away on inode evict, but those mark evictions
> do not generate any event (i.e. there is no FAN_UNMOUNT).

Yes. Amir beat me with the response so let me just add that the FS_UNMOUNT
event is for inotify, which guarantees that if you place a mark on some inode
you either get an event about somebody unlinking the inode (e.g.
IN_DELETE_SELF) or an event about the filesystem being unmounted (IN_UNMOUNT).
I also don't see how we would maintain this behavior with what Linus proposes.
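That inotify guarantee is observable from user space: watch a file with IN_DELETE_SELF, unlink it, and the event arrives before the watch is dropped (on unmount you would get IN_UNMOUNT instead, which is hard to demo without a mount). A minimal Linux-only sketch:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Watch a file, unlink it, and report whether IN_DELETE_SELF (or
 * IN_UNMOUNT) was delivered. Returns 1 if seen, 0/-1 otherwise.
 */
static int watch_and_unlink(const char *path)
{
	char buf[4096] __attribute__((aligned(8)));
	struct inotify_event *ev;
	ssize_t len;
	char *p;
	int fd, wd, seen = 0;

	fd = inotify_init1(0);
	if (fd < 0)
		return -1;
	wd = inotify_add_watch(fd, path, IN_DELETE_SELF | IN_UNMOUNT);
	if (wd < 0) {
		close(fd);
		return -1;
	}
	unlink(path);
	/* The unlink queues e.g. IN_ATTRIB, then IN_DELETE_SELF, IN_IGNORED;
	 * one read() may return several packed events. */
	while (!seen && (len = read(fd, buf, sizeof(buf))) > 0) {
		for (p = buf; p < buf + len; p += sizeof(*ev) + ev->len) {
			ev = (struct inotify_event *)p;
			if (ev->mask & (IN_DELETE_SELF | IN_UNMOUNT))
				seen = 1;
		}
	}
	close(fd);
	return seen;
}
```

The guarantee being discussed is exactly that this function never silently returns 0 for a marked inode: either the delete-self event or the unmount event fires.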

> > > Then at umount time, the dentry shrinking will deal with all live
> > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > just the root dentry inodes?
> >
> > I don't think even that is necessary, because
> > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > trimming the dentry tree. Hence the dcache drop would cleanup all
> > inode references, roots included.
> >
> > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > use of the nasty s_inodes list?
> >
> > Yes, it would, but someone who knows exactly when the fsnotify
> > marks can be removed needs to chime in here...

So fsnotify needs a list of inodes for the superblock which have marks
attached and for which we hold inode reference. We can keep it inside
fsnotify code, although it would practically mean another list_head per
inode for this list (probably in our fsnotify_connector structure, which
connects the list of notification marks to the inode). If we actually get rid
of i_sb_list in struct inode, this will be a win for the overall system,
otherwise it is a net loss IMHO. So if we can figure out how to change
other s_inodes owners we can certainly do this fsnotify change.

> > > And I wonder if the quota code (which uses the s_inodes list to enable
> > > quotas on already mounted filesystems) could for all the same reasons
> > > just walk the dentry tree instead (and remove_dquot_ref similarly
> > > could just remove it at dentry_unlink_inode() time)?
> >
> > I don't think that will work because we have to be able to modify
> > quota in evict() processing. This is especially true for unlinked
> > inodes being evicted from cache, but also the dquots need to stay
> > attached until writeback completes.
> >
> > Hence I don't think we can remove the quota refs from the inode
> > before we call iput_final(), and so I think quotaoff (at least)
> > still needs to iterate inodes...

Yeah, I'm not sure how to get rid of the s_inodes use in quota code. One of
the things we need s_inodes list for is during quotaoff on a mounted
filesystem when we need to iterate all inodes which are referencing quota
structures and free them.  In theory we could keep a list of inodes
referencing quota structures but that would require adding list_head to
inode structure for filesystems that support quotas. Now for the sake of
full context I'll also say that enabling / disabling quotas on a mounted
filesystem is a legacy feature because it is quite easy for quota
accounting to go wrong with it. So ext4 and f2fs have for quite a few
years supported a mode where quota tracking is enabled on mount and disabled on
unmount (if appropriate fs feature is enabled) and you can only enable /
disable enforcement of quota limits during runtime.  So I could see us
deprecating this functionality altogether although jfs never adapted to
this new way we do quotas so we'd have to deal with that somehow.  But one
way or another it would take a significant amount of time before we can
completely remove this, so it is out of the question for this series.

I see one problem with the idea "whoever has a need to iterate inodes needs
to keep track of inodes it needs to iterate through". It is fine
conceptually but with s_inodes list we pay the cost only once and multiple
users benefit. With each subsystem tracking inodes we pay the cost for each
user (both in terms of memory and CPU). So if you don't use any of the
subsystems that need iteration, you win, but if you use two or more of
these subsystems, in particular those which need to track significant
portion of all inodes, you are losing.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 11:23               ` Jan Kara
@ 2024-10-08 12:16                 ` Christian Brauner
  2024-10-09  0:03                   ` Dave Chinner
  2024-10-08 23:44                 ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2024-10-08 12:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Dave Chinner, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 01:23:44PM GMT, Jan Kara wrote:
> On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > I'd wait with that conversion until this series lands.
> > > >
> > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > happens if we just end up making the inode lifetimes subject to the
> > > > dentry lifetimes" as suggested by Dave elsewhere.
> > >
> > > ....
> > >
> > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > removed from the dentry.
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > >
> > 
> > fsnotify inode marks remain until explicitly removed or until sb
> > is unmounted (*), so other inode references are irrelevant to
> > inode mark removal.
> > 
> > (*) fanotify has "evictable" inode marks, which do not hold inode
> > reference and go away on inode evict, but those mark evictions
> > do not generate any event (i.e. there is no FAN_UNMOUNT).
> 
> Yes. Amir beat me with the response so let me just add that the FS_UNMOUNT
> event is for inotify, which guarantees that if you place a mark on some inode
> you either get an event about somebody unlinking the inode (e.g.
> IN_DELETE_SELF) or an event about the filesystem being unmounted (IN_UNMOUNT).
> I also don't see how we would maintain this behavior with what Linus proposes.
> 
> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > >
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> 
> So fsnotify needs a list of inodes for the superblock which have marks
> attached and for which we hold inode reference. We can keep it inside
> fsnotify code although it would practically mean another list_head for the
> inode for this list (probably in our fsnotify_connector structure which
> connects list of notification marks to the inode). If we actually get rid
> of i_sb_list in struct inode, this will be a win for the overall system,
> otherwise it is a net loss IMHO. So if we can figure out how to change
> other s_inodes owners we can certainly do this fsnotify change.
> 
> > > > And I wonder if the quota code (which uses the s_inodes list to enable
> > > > quotas on already mounted filesystems) could for all the same reasons
> > > > just walk the dentry tree instead (and remove_dquot_ref similarly
> > > > could just remove it at dentry_unlink_inode() time)?
> > >
> > > I don't think that will work because we have to be able to modify
> > > quota in evict() processing. This is especially true for unlinked
> > > inodes being evicted from cache, but also the dquots need to stay
> > > attached until writeback completes.
> > >
> > > Hence I don't think we can remove the quota refs from the inode
> > > before we call iput_final(), and so I think quotaoff (at least)
> > > still needs to iterate inodes...
> 
> Yeah, I'm not sure how to get rid of the s_inodes use in quota code. One of
> the things we need s_inodes list for is during quotaoff on a mounted
> filesystem when we need to iterate all inodes which are referencing quota
> structures and free them.  In theory we could keep a list of inodes
> referencing quota structures but that would require adding list_head to
> inode structure for filesystems that support quotas. Now for the sake of
> full context I'll also say that enabling / disabling quotas on a mounted
> filesystem is a legacy feature because it is quite easy that quota
> accounting goes wrong with it. So ext4 and f2fs support for quite a few
> years a mode where quota tracking is enabled on mount and disabled on
> unmount (if appropriate fs feature is enabled) and you can only enable /
> disable enforcement of quota limits during runtime.  So I could see us
> deprecating this functionality altogether although jfs never adapted to
> this new way we do quotas so we'd have to deal with that somehow.  But one
> way or another it would take a significant amount of time before we can
> completely remove this so it is out of question for this series.

I still maintain that we don't need to solve the fsnotify and lsm rework
as part of this particular series.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  0:54               ` Linus Torvalds
@ 2024-10-08 12:59               ` Mickaël Salaün
  2024-10-09  0:21                 ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Mickaël Salaün @ 2024-10-08 12:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack

On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> >
> > There may be other inode references being held that make
> > the inode live longer than the dentry cache. When should the
> > fsnotify marks be removed from the inode in that case? Do they need
> > to remain until, e.g, writeback completes?
> 
> Note that my idea is to just remove the fsnotify marks when the dentry
> discards the inode.
> 
> That means that yes, the inode may still have a lifetime after the
> dentry (because of other references, _or_ just because I_DONTCACHE
> isn't set and we keep caching the inode).
> 
> BUT - fsnotify won't care. There won't be any fsnotify marks on that
> inode any more, and without a dentry that points to it, there's no way
> to add such marks.
> 
> (A new dentry may be re-attached to such an inode, and then fsnotify
> could re-add new marks, but that doesn't change anything - the next
> time the dentry is detached, the marks would go away again).
> 
> And yes, this changes the timing on when fsnotify events happen, but
> what I'm actually hoping for is that Jan will agree that it doesn't
> actually matter semantically.
> 
> > > Then at umount time, the dentry shrinking will deal with all live
> > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > just the root dentry inodes?
> >
> > I don't think even that is necessary, because
> > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > trimming the dentry tree. Hence the dcache drop would cleanup all
> > inode references, roots included.
> 
> Ahh - even better.
> 
> I didn't actually look very closely at the actual umount path, I was
> looking just at the fsnotify_inoderemove() place in
> dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> 
> > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > use of the nasty s_inodes list?
> >
> > Yes, it would, but someone who knows exactly when the fsnotify
> > marks can be removed needs to chime in here...
> 
> Yup. Honza?
> 
> (Aside: I don't actually know if you prefer Jan or Honza, so I use
> both randomly and interchangeably?)
> 
> > > I have this feeling that maybe we can just remove the other users too
> > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > for exactly the same reason - there's really no reason to keep things
> > > around for a random cached inode without a dentry.
> >
> > Perhaps, but I'm not sure what the landlock code is actually trying
> > to do.

In Landlock, inodes (see landlock_object) may be referenced by several
rulesets, either tied to a task's cred or a ruleset's file descriptor.
A ruleset may outlive its referenced inodes, and this should not block
related umounts.  security_sb_delete() is used to gracefully release
such references.

> 
> Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> 
> But I'd be perfectly happy just removing one use at a time - even if
> we keep the s_inodes list around because of other users, it would
> still be "one less thing".
> 
> > Hence, to me, the lifecycle and reference counting of inode related
> > objects in landlock doesn't seem quite right, and the use of the
> > security_sb_delete() callout appears to be papering over an internal
> > lifecycle issue.
> >
> > I'd love to get rid of it altogether.

I'm not sure I fully understand the implications yet, but it would
definitely be good to simplify this lifetime management.  The only
requirement for Landlock is that inode references should live as long
as the related inodes are accessible by user space or already in use.
The sooner these references are removed from the related ruleset, the
better.

> 
> Yeah, I think the inode lifetime is just so random these days that
> anything that depends on it is questionable.
> 
> The quota case is probably the only thing where the inode lifetime
> *really* makes sense, and that's the one where I looked at the code
> and went "I *hope* this can be converted to traversing the dentry
> tree", but at the same time it did look sensible to make it be about
> inodes.
> 
> If we can convert the quota side to be based on dentry lifetimes, it
> will almost certainly then have to react to the places that do
> "d_add()" when re-connecting an inode to a dentry at lookup time.
> 
> So yeah, the quota code looks worse, but even if we could just remove
> fsnotify and landlock, I'd still be much happier.
> 
>              Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 11:23               ` Jan Kara
  2024-10-08 12:16                 ` Christian Brauner
@ 2024-10-08 23:44                 ` Dave Chinner
  2024-10-09  6:10                   ` Amir Goldstein
  2024-10-09 14:18                   ` Jan Kara
  1 sibling, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2024-10-08 23:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Linus Torvalds, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > I'd wait with that conversion until this series lands.
> > > >
> > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > happens if we just end up making the inode lifetimes subject to the
> > > > dentry lifetimes" as suggested by Dave elsewhere.
> > >
> > > ....
> > >
> > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > removed from the dentry.
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > >
> > 
> > fsnotify inode marks remain until explicitly removed or until sb
> > is unmounted (*), so other inode references are irrelevant to
> > inode mark removal.
> > 
> > (*) fanotify has "evictable" inode marks, which do not hold inode
> > reference and go away on inode evict, but those mark evictions
> > do not generate any event (i.e. there is no FAN_UNMOUNT).
> 
> Yes. Amir beat me with the response so let me just add that the FS_UNMOUNT
> event is for inotify, which guarantees that if you place a mark on some inode
> you either get an event about somebody unlinking the inode (e.g.
> IN_DELETE_SELF) or an event about the filesystem being unmounted (IN_UNMOUNT).
> I also don't see how we would maintain this behavior with what Linus proposes.

Thanks. I didn't respond last night when I read Amir's description
because I wanted to think it over. Knowing where the unmount event
requirement comes from certainly helps.

I am probably missing something important, but it really seems to me
that the object reference counting model is back to
front.  Currently the mark is attached to the inode and then
the inode pinned by a reference count to make the mark attached
to the inode persistent until unmount. This then requires the inodes
to be swept by unmount because fsnotify has effectively leaked them
as it isn't tracking such inodes itself.

[ Keep in mind that I'm not saying this was a bad or wrong thing to
do because the s_inodes list was there to be able to do this sort of
lazy cleanup. But now that we want to remove the s_inodes list if at
all possible, it is a problem we need to solve differently. ]

AFAICT, inotify does not appear to require the inode itself to send events
- it only requires access to the inode mark. Hence it does not need
the inode in cache to generate IN_UNMOUNT events, it just
needs the mark itself to be findable at unmount.  Do any of the
other backends that require unmount notifications need
special access to the inode itself?

If not, and the fsnotify sb info is tracking these persistent marks,
then we don't need to iterate inodes at unmount. This means we don't
need to pin inodes when they have marks attached, and so the
dependency on the s_inodes list goes away.

With this inverted model, we need the first fsnotify event callout
after the inode is instantiated to look for a persistent mark for
the inode. We know how to do this efficiently - it's exactly the
same caching model we use for ACLs. On the first lookup, we check
the inode for ACL data and set the ACL pointer appropriately to
indicate that a lookup has been done and there are no ACLs
associated with the inode.

At this point, the fsnotify inode marks can all be removed from the
inode when it is being evicted and there's no need for fsnotify to
pin inodes at all.
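The ACL-style caching described above boils down to a sentinel pointer meaning "not looked up yet" versus NULL meaning "looked up, nothing there" (cf. ACL_NOT_CACHED). A generic sketch of that pattern, with all names hypothetical:

```c
#include <assert.h>

/* Sentinel for "no lookup performed yet", like ACL_NOT_CACHED. */
#define MARK_NOT_CACHED ((void *)-1)

struct mark { int dummy; };

struct demo_inode {
	struct mark *cached_mark;	/* MARK_NOT_CACHED until first event */
};

static int lookups;	/* counts expensive persistent-store lookups */

/* Stand-in for searching the persistent per-sb mark store. */
static struct mark *lookup_persistent_mark(struct demo_inode *inode)
{
	(void)inode;
	lookups++;
	return 0;	/* this inode has no persistent mark */
}

/*
 * The first event after the inode is instantiated pays for the lookup
 * and caches the result; later events hit the cached pointer.
 */
static struct mark *get_mark(struct demo_inode *inode)
{
	if (inode->cached_mark == MARK_NOT_CACHED)
		inode->cached_mark = lookup_persistent_mark(inode);
	return inode->cached_mark;
}
```

With this in place, eviction can simply discard the cached pointer: nothing about the mark's persistence depends on the inode staying in cache.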

> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > >
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> 
> So fsnotify needs a list of inodes for the superblock which have marks
> attached and for which we hold inode reference. We can keep it inside
> fsnotify code although it would practically mean another list_head for the
> inode for this list (probably in our fsnotify_connector structure which
> connects list of notification marks to the inode).

I don't think that is necessary. We need to get rid of the inode
reference, not move where we track inode references. The persistent
object is the fsnotify mark, not the cached inode. It's the mark
that needs to be persistent, and that's what the fsnotify code
should be tracking.

The fsnotify marks are much smaller than inodes, and there are going to
be fewer cached marks than inodes, especially once inode pinning is
removed. Hence I think this will result in a net reduction in memory
footprint for "marked-until-unmount" configurations as we won't pin
nearly as many inodes in cache...

> If we actually get rid
> of i_sb_list in struct inode, this will be a win for the overall system,
> otherwise it is a net loss IMHO. So if we can figure out how to change
> other s_inodes owners we can certainly do this fsnotify change.

Yes, I am exploring what it would take to get rid of i_sb_list
altogether right now. That said, I don't think this is a concern given
the difference in memory footprint of the same number of persistent
marks. i.e. "persistent mark, reclaimable inode" will always have a
significantly lower memory footprint than "persistent inode and
mark" under memory pressure....

> > > > And I wonder if the quota code (which uses the s_inodes list
> > > > to enable quotas on already mounted filesystems) could for
> > > > all the same reasons just walk the dentry tree instead (and
> > > > remove_dquot_ref similarly could just remove it at
> > > > dentry_unlink_inode() time)?
> > >
> > > I don't think that will work because we have to be able to
> > > modify quota in evict() processing. This is especially true
> > > for unlinked inodes being evicted from cache, but also the
> > > dquots need to stay attached until writeback completes.
> > >
> > > Hence I don't think we can remove the quota refs from the
> > > inode before we call iput_final(), and so I think quotaoff (at
> > > least) still needs to iterate inodes...
> 
> Yeah, I'm not sure how to get rid of the s_inodes use in quota
> code. One of the things we need s_inodes list for is during
> quotaoff on a mounted filesystem when we need to iterate all
> inodes which are referencing quota structures and free them.  In
> theory we could keep a list of inodes referencing quota structures
> but that would require adding list_head to inode structure for
> filesystems that support quotas.

I don't think that's quite true. Quota is not modular, so we can
lazily free quota objects even when quota is turned off. All we need
to ensure is that code checks whether quota is enabled, not for the
existence of quota objects attached to the inode.

Hence quota-off simply turns off all the quota operations in memory,
and normal inode eviction cleans up the stale quota objects
naturally.
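The lazy-cleanup idea reduces to a simple pattern: operations test an "enabled" flag rather than the existence of dquot objects, so quota-off only flips the flag and eviction reaps whatever is still attached. A sketch with purely illustrative names:

```c
#include <assert.h>
#include <stdlib.h>

struct dquot { long usage; };

struct demo_sb {
	int quota_enabled;	/* the only thing quota-off touches */
};

struct demo_inode {
	struct demo_sb *sb;
	struct dquot *dq;	/* may be stale after quota-off */
};

/* Operations check the flag, not dq != NULL. */
static void account_blocks(struct demo_inode *inode, long nblocks)
{
	if (!inode->sb->quota_enabled)
		return;		/* a stale dq, if any, is ignored */
	inode->dq->usage += nblocks;
}

/* quota-off: flip the flag; no s_inodes iteration needed. */
static void quota_off(struct demo_sb *sb)
{
	sb->quota_enabled = 0;
}

/* Normal eviction reaps whatever quota object is still attached. */
static void evict_inode(struct demo_inode *inode)
{
	free(inode->dq);
	inode->dq = NULL;
}
```

The correctness condition is exactly the one stated above: every code path must gate on the flag, never on object existence.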

My main question is why the quota-on add_dquot_ref() pass is
required. AFAICT all of the filesystem operations that will modify
quota call dquot_initialize() directly to attach the required dquots
to the inode before the operation is started. If that's true, then
why does quota-on need to do this for all the inodes that are
already in cache?

i.e. I'm not sure I understand why we need quota to do these
iterations at all...

> Now for the sake of
> full context I'll also say that enabling / disabling quotas on a mounted
> filesystem is a legacy feature because it is quite easy that quota
> accounting goes wrong with it. So ext4 and f2fs support for quite a few
> years a mode where quota tracking is enabled on mount and disabled on
> unmount (if appropriate fs feature is enabled) and you can only enable /
> disable enforcement of quota limits during runtime.

Sure, this is how XFS works, too. But I think this behaviour is
largely irrelevant because there are still filesystems out there
that do stuff the old way...

> So I could see us
> deprecating this functionality altogether although jfs never adapted to
> this new way we do quotas so we'd have to deal with that somehow.  But one
> way or another it would take a significant amount of time before we can
> completely remove this so it is out of question for this series.

I'm not sure that matters, though it adds to the reasons why we
should be removing old, unmaintained filesystems from the tree
and old, outdated formats from maintained filesystems....

> I see one problem with the idea "whoever has a need to iterate inodes needs
> to keep track of inodes it needs to iterate through". It is fine
> conceptually but with s_inodes list we pay the cost only once and multiple
> users benefit. With each subsystem tracking inodes we pay the cost for each
> user (both in terms of memory and CPU). So if you don't use any of the
> subsystems that need iteration, you win, but if you use two or more of
> these subsystems, in particular those which need to track significant
> portion of all inodes, you are losing.

AFAICT, most of the subsystems don't need to track inodes directly.

We don't need s_inodes for evict_inodes() - we have the inode LRU
tracking all unreferenced inodes on the superblock. The GFS2 use
case can probably walk the inode LRU directly, too.

It looks to me that we can avoid needing unmount iteration for
fsnotify, and I suspect landlock can likely use the same persistence
inversion as fsnotify (same persistent ruleset model).

The bdev superblock can implement its own internal list using
inode->i_devices as this list_head is only used by chardev
inodes.

All that then remains is the page cache dropping code, and that's
not really critical to have exacting behaviour. We certainly
shouldn't be taking a runtime penalty just to optimise the rare
case of dropping caches..

IOWs, there aren't that many users, and I think there are ways to
make all these iterations go away without adding new per-inode
list heads to track inodes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 12:16                 ` Christian Brauner
@ 2024-10-09  0:03                   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2024-10-09  0:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 02:16:04PM +0200, Christian Brauner wrote:
> I still maintain that we don't need to solve the fsnotify and lsm rework
> as part of this particular series.

Sure, I heard you the first time. :)

However, the patchset I posted was just a means to start the
discussion with a concrete proposal. Now I'm trying to work out how
all the pieces of the bigger puzzle go together as people think
about what it means, not polish the first little step.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 12:59               ` Mickaël Salaün
@ 2024-10-09  0:21                 ` Dave Chinner
  2024-10-09  9:23                   ` Mickaël Salaün
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2024-10-09  0:21 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack

On Tue, Oct 08, 2024 at 02:59:07PM +0200, Mickaël Salaün wrote:
> On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> > On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > 
> > Note that my idea is to just remove the fsnotify marks when the dentry
> > discards the inode.
> > 
> > That means that yes, the inode may still have a lifetime after the
> > dentry (because of other references, _or_ just because I_DONTCACHE
> > isn't set and we keep caching the inode).
> > 
> > BUT - fsnotify won't care. There won't be any fsnotify marks on that
> > inode any more, and without a dentry that points to it, there's no way
> > to add such marks.
> > 
> > (A new dentry may be re-attached to such an inode, and then fsnotify
> > could re-add new marks, but that doesn't change anything - the next
> > time the dentry is detached, the marks would go away again).
> > 
> > And yes, this changes the timing on when fsnotify events happen, but
> > what I'm actually hoping for is that Jan will agree that it doesn't
> > actually matter semantically.
> > 
> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > 
> > Ahh - even better.
> > 
> > I didn't actually look very closely at the actual umount path, I was
> > looking just at the fsnotify_inoderemove() place in
> > dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> > 
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> > 
> > Yup. Honza?
> > 
> > (Aside: I don't actually know if you prefer Jan or Honza, so I use
> > both randomly and interchangeably?)
> > 
> > > > I have this feeling that maybe we can just remove the other users too
> > > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > > for exactly the same reason - there's really no reason to keep things
> > > > around for a random cached inode without a dentry.
> > >
> > > Perhaps, but I'm not sure what the landlock code is actually trying
> > > to do.
> 
> In Landlock, inodes (see landlock_object) may be referenced by several
> rulesets, either tied to a task's cred or a ruleset's file descriptor.
> A ruleset may outlive its referenced inodes, and this should not block
> related umounts.  security_sb_delete() is used to gracefully release
> such references.

Ah, there's the problem. The ruleset is persistent, not the inode.
Like fsnotify, the life cycle and reference counting are upside down.
The inode should cache the ruleset rather than the ruleset pinning
the inode.

See my reply to Jan about fsnotify.

> > Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> > 
> > But I'd be perfectly happy just removing one use at a time - even if
> > we keep the s_inodes list around because of other users, it would
> > still be "one less thing".
> > 
> > > Hence, to me, the lifecycle and reference counting of inode related
> > > objects in landlock doesn't seem quite right, and the use of the
> > > security_sb_delete() callout appears to be papering over an internal
> > > lifecycle issue.
> > >
> > > I'd love to get rid of it altogether.
> 
> I'm not sure to fully understand the implications for now, but it would
> definitely be good to simplify this lifetime management.  The only
> requirement for Landlock is that inodes references should live as long
> as the related inodes are accessible by user space or already in use.
> The sooner these references are removed from related ruleset, the
> better.

I'm missing something.  Inodes are accessible to users even when
they are not in cache - we just read them from disk and instantiate
a new VFS inode.

So how do you attach the correct ruleset to a newly instantiated
inode?

i.e. If you can find the ruleset for any given inode that is brought
into cache (e.g. opening an existing, uncached file), then why do
you need to take inode references so they are never evicted?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 23:44                 ` Dave Chinner
@ 2024-10-09  6:10                   ` Amir Goldstein
  2024-10-09 14:18                   ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2024-10-09  6:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Linus Torvalds, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Wed, Oct 9, 2024 at 1:44 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > > I'd wait with that conversion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > >
> > >
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > >
> > > (*) fanotify has "evictable" inode marks, which do not hold inode
> > > reference and go away on inode evict, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> >
> > Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> > is for inotify which guarantees that either you get an event about somebody
> > unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> > unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> > we would maintain this behavior with what Linus proposes.
>
Thanks. I didn't respond last night when I read Amir's description
because I wanted to think it over. Knowing where the unmount event
requirement comes from certainly helps.
>
> I am probably missing something important, but it really seems to me
> that the object reference counting model is back to
> front.  Currently the mark is being attached to the inode and then
> the inode pinned by a reference count to make the mark attached
> to the inode persistent until unmount. This then requires the inodes
> to be swept by unmount because fsnotify has effectively leaked them
> as it isn't tracking such inodes itself.
>
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]
>
> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not need the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount.  Do any of the
> other backends that require unmount notifications need
> special access to the inode itself?
>

No other backend supports IN_UNMOUNT/FS_UNMOUNT.
We want to add unmount events support to fanotify, but those are
only going to be possible for watching a mount or an sb, not inodes.

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
>
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer appropriately to
> indicate that a lookup has been done and there are no ACLs
> associated with the inode.
>
> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
>
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> >
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for the
> > inode for this list (probably in our fsnotify_connector structure which
> > connects list of notification marks to the inode).
>
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.
>
> The fsnotify marks are much smaller than inodes, and there are going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...
>

It is a feasible design which has all the benefits that you listed.
But it is a big change, just to get away from s_inodes
(much easier to maintain a private list of pinned inodes).

inotify (recursive tree watches for that matter) has been
inefficient that way for a long time, and users now have less
memory hogging solutions like fanotify mount and sb marks.
Granted, not for unprivileged users, but still.

So there needs to be a good justification to make this design change.
One such justification would be to provide the infrastructure to
the feature that Jan referred to as the "holy grail" in his LPC talk,
namely, subtree watches.

If we introduce code that looks up persistent "mark rules" on
inode instantiation, then we could use it to "reconnect" inotify
persistent inode marks (by ino/fid) or to establish automatic
marks based on subtree/path based rules.

audit code has something that resembles this and I suspect that
this Landlock is doing something similar (?), but I didn't check.
path based rules are always going to be elusive and tricky and
Al is always going to hate them ;)

Bottom line - good idea, not easy, requires allocating development resources.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-09  0:21                 ` Dave Chinner
@ 2024-10-09  9:23                   ` Mickaël Salaün
  0 siblings, 0 replies; 34+ messages in thread
From: Mickaël Salaün @ 2024-10-09  9:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack, Christian Brauner

On Wed, Oct 09, 2024 at 11:21:10AM +1100, Dave Chinner wrote:
> On Tue, Oct 08, 2024 at 02:59:07PM +0200, Mickaël Salaün wrote:
> > On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> > > On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > 
> > > Note that my idea is to just remove the fsnotify marks when the dentry
> > > discards the inode.
> > > 
> > > That means that yes, the inode may still have a lifetime after the
> > > dentry (because of other references, _or_ just because I_DONTCACHE
> > > isn't set and we keep caching the inode).
> > > 
> > > BUT - fsnotify won't care. There won't be any fsnotify marks on that
> > > inode any more, and without a dentry that points to it, there's no way
> > > to add such marks.
> > > 
> > > (A new dentry may be re-attached to such an inode, and then fsnotify
> > > could re-add new marks, but that doesn't change anything - the next
> > > time the dentry is detached, the marks would go away again).
> > > 
> > > And yes, this changes the timing on when fsnotify events happen, but
> > > what I'm actually hoping for is that Jan will agree that it doesn't
> > > actually matter semantically.
> > > 
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > 
> > > Ahh - even better.
> > > 
> > > I didn't actually look very closely at the actual umount path, I was
> > > looking just at the fsnotify_inoderemove() place in
> > > dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> > > 
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> > > 
> > > Yup. Honza?
> > > 
> > > (Aside: I don't actually know if you prefer Jan or Honza, so I use
> > > both randomly and interchangeably?)
> > > 
> > > > > I have this feeling that maybe we can just remove the other users too
> > > > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > > > for exactly the same reason - there's really no reason to keep things
> > > > > around for a random cached inode without a dentry.
> > > >
> > > > Perhaps, but I'm not sure what the landlock code is actually trying
> > > > to do.
> > 
> > In Landlock, inodes (see landlock_object) may be referenced by several
> > rulesets, either tied to a task's cred or a ruleset's file descriptor.
> > A ruleset may outlive its referenced inodes, and this should not block
> > related umounts.  security_sb_delete() is used to gracefully release
> > such references.
> 
> Ah, there's the problem. The ruleset is persistent, not the inode.
> Like fsnotify, the life cycle and reference counting are upside down.
> The inode should cache the ruleset rather than the ruleset pinning
> the inode.

A ruleset needs to take a reference to the inode, as for an opened file,
and keep it "alive" as long as it may be re-used by user space (i.e. as
long as the superblock exists).  One of the goals of a ruleset is to
identify inodes as long as they are accessible.  When a sandboxed
process requests to open a file, its sandbox's ruleset checks it against
the referenced inodes (in a nutshell).

In practice, rulesets reference a set of struct landlock_object which
references an inode or not (if it vanished).  There is only one
landlock_object referenced per inode.  This makes it possible to have a
dynamic N:M mapping between rulesets and inodes which enables a ruleset
to be deleted before its referenced inodes, or the other way around.

> 
> See my reply to Jan about fsnotify.
> 
> > > Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> > > 
> > > But I'd be perfectly happy just removing one use at a time - even if
> > > we keep the s_inodes list around because of other users, it would
> > > still be "one less thing".
> > > 
> > > > Hence, to me, the lifecycle and reference counting of inode related
> > > > objects in landlock doesn't seem quite right, and the use of the
> > > > security_sb_delete() callout appears to be papering over an internal
> > > > lifecycle issue.
> > > >
> > > > I'd love to get rid of it altogether.
> > 
> > I'm not sure to fully understand the implications for now, but it would
> > definitely be good to simplify this lifetime management.  The only
> > requirement for Landlock is that inodes references should live as long
> > as the related inodes are accessible by user space or already in use.
> > The sooner these references are removed from related ruleset, the
> > better.
> 
> I'm missing something.  Inodes are accessible to users even when
> they are not in cache - we just read them from disk and instantiate
> a new VFS inode.
> 
> So how do you attach the correct ruleset to a newly instantiated
> inode?

We can see a Landlock ruleset as a set of weakly opened files/inodes.  A
Landlock ruleset calls iget() to keep the related VFS inodes alive, which
means that when user space opens a file pointing to the same inode, the
same VFS inode will be re-used and then we can match it against a ruleset.

> 
> i.e. If you can find the ruleset for any given inode that is brought
> into cache (e.g. opening an existing, uncached file), then why do
> you need to take inode references so they are never evicted?

A landlock_object only keeps a reference to an inode, not to the rulesets
pointing to it:
* inode -> 1 landlock_object or NULL
* landlock_object -> 1 inode or NULL
* ruleset -> N landlock_object

There are mainly two different operations:
1. Match 1 inode against a set of N inode references (i.e. a ruleset).
2. Drop the references of N rulesets (in practice 1 intermediate
   landlock_object) pointing to 1 inode.

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:54               ` Linus Torvalds
@ 2024-10-09  9:49                 ` Jan Kara
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2024-10-09  9:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Mon 07-10-24 17:54:16, Linus Torvalds wrote:
> On Mon, 7 Oct 2024 at 17:28, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And yes, this changes the timing on when fsnotify events happen, but
> > what I'm actually hoping for is that Jan will agree that it doesn't
> > actually matter semantically.
> 
> .. and yes, I realize it might actually matter. fsnotify does do
> 'ihold()' to hold an inode ref, and with this that would actually be
> more or less pointless, because the mark would be removed _despite_
> such a ref.
> 
> So maybe it's not an option to do what I suggested. I don't know the
> users well enough.

Yeah, we need to keep the notification mark alive either until the inode is
deleted or until the filesystem is unmounted to maintain behavior of
inotify and fanotify APIs.

That being said we could rework lifetime rules inside fsnotify subsystem as
Dave suggests, so that fsnotify would not pin inodes, detach its structures
from inodes on inode reclaim and associate notification marks with inodes
when they are loaded from disk.  But it's a relatively big overhaul.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 23:44                 ` Dave Chinner
  2024-10-09  6:10                   ` Amir Goldstein
@ 2024-10-09 14:18                   ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2024-10-09 14:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Wed 09-10-24 10:44:12, Dave Chinner wrote:
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > > I'd wait with that conversion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > >
> > > 
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > > 
> > > (*) fanotify has "evictable" inode marks, which do not hold inode
> > > reference and go away on inode evict, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> > 
> > Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> > is for inotify which guarantees that either you get an event about somebody
> > unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> > unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> > we would maintain this behavior with what Linus proposes.
> 
> Thanks. I didn't respond last night when I read Amir's description
> because I wanted to think it over. Knowing where the unmount event
> requirement comes from certainly helps.
> 
> I am probably missing something important, but it really seems to me
> that the object reference counting model is back to
> front.  Currently the mark is being attached to the inode and then
> the inode pinned by a reference count to make the mark attached
> to the inode persistent until unmount. This then requires the inodes
> to be swept by unmount because fsnotify has effectively leaked them
> as it isn't tracking such inodes itself.
> 
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]

Yes, agreed.

> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not need the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount.  Do any of the
> other backends that require unmount notifications need
> special access to the inode itself?

No, I don't think unmount notification requires looking at the inode and it
is an inotify-specific thing, as Amir wrote. We do require inode access when
generating fanotify events (to open fd where event happened) but that gets
handled separately by creating struct path when event happens and using it
for dentry_open() later when reporting to userspace so that carries its own
set on dentry + mnt references while the event is waiting in the queue.

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
> 
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer appropriately to
> indicate that a lookup has been done and there are no ACLs
> associated with the inode.

Yes, I agree such a scheme should be possible, although a small snag I see is
that we need to keep in the fsnotify mark enough info so that it can be
associated with an inode when it is read from the disk. And this info is
filesystem-specific, with uncertain size for filesystems which use iget5().
So I suspect we'll need some support from individual filesystems, which is
always tedious.

> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
> 
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> > 
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for the
> > inode for this list (probably in our fsnotify_connector structure which
> > connects list of notification marks to the inode).
> 
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.

Right, I was not precise here. We don't need a list of tracked inodes. We
are fine with a list of all marks for inodes on a superblock which we could
crawl on umount.

> The fsnotify marks are much smaller than inodes, and there are going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...

I agree. If fsnotify marks stop pinning inodes, we'll probably win much
more memory by keeping inodes reclaimable than we loose by extra overhead
of the mark tracking.

> > > > > And I wonder if the quota code (which uses the s_inodes list
> > > > > to enable quotas on already mounted filesystems) could for
> > > > > all the same reasons just walk the dentry tree instead (and
> > > > > remove_dquot_ref similarly could just remove it at
> > > > > dentry_unlink_inode() time)?
> > > >
> > > > I don't think that will work because we have to be able to
> > > > modify quota in evict() processing. This is especially true
> > > > for unlinked inodes being evicted from cache, but also the
> > > > dquots need to stay attached until writeback completes.
> > > >
> > > > Hence I don't think we can remove the quota refs from the
> > > > inode before we call iput_final(), and so I think quotaoff (at
> > > > least) still needs to iterate inodes...
> > 
> > Yeah, I'm not sure how to get rid of the s_inodes use in quota
> > code. One of the things we need s_inodes list for is during
> > quotaoff on a mounted filesystem when we need to iterate all
> > inodes which are referencing quota structures and free them.  In
> > theory we could keep a list of inodes referencing quota structures
> > but that would require adding list_head to inode structure for
> > filesystems that support quotas.
> 
> I don't think that's quite true. Quota is not modular, so we can
> lazily free quota objects even when quota is turned off. All we need
> to ensure is that code checks whether quota is enabled, not for the
> existence of quota objects attached to the inode.
> 
> Hence quota-off simply turns off all the quota operations in memory,
> and normal inode eviction cleans up the stale quota objects
> naturally.

Ho, hum, possibly yes. I need to think a bit more about this.

> My main question is why the quota-on add_dquot_ref() pass is
> required. AFAICT all of the filesystem operations that will modify
> quota call dquot_initialize() directly to attach the required dquots
> to the inode before the operation is started. If that's true, then
> why does quota-on need to do this for all the inodes that are
> already in cache?

This is again for handling quotaon on an already mounted filesystem. We
initialize quotas for an inode when opening a file, so if some files are
already open when quotaon runs, we want to attach quota structures to
those inodes as well. I think this was important to limit the mismatch
between real usage and accounted usage when old-style quotas were used
e.g. for the root filesystem, but to be fair this code was already there
when I became quota maintainer in 1999 and I never dared to remove it :)

> > Now for the sake of
> > full context I'll also say that enabling / disabling quotas on a mounted
> > filesystem is a legacy feature because it is quite easy that quota
> > accounting goes wrong with it. So ext4 and f2fs support for quite a few
> > years a mode where quota tracking is enabled on mount and disabled on
> > unmount (if appropriate fs feature is enabled) and you can only enable /
> > disable enforcement of quota limits during runtime.
> 
> Sure, this is how XFS works, too. But I think this behaviour is
> largely irrelevant because there are still filesystems out there
> that do stuff the old way...
> 
> > So I could see us
> > deprecating this functionality altogether, although jfs never adapted to
> > this new way of doing quotas, so we'd have to deal with that somehow.  But one
> > way or another it would take a significant amount of time before we could
> > completely remove this, so it is out of the question for this series.
> 
> I'm not sure that matters, though it adds to the reasons why we
> should be removing old, unmaintained filesystems from the tree
> and old, outdated formats from maintained filesystems....

True.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
