* [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
@ 2021-03-28 15:56 Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-03-28 15:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, linux-api
Add a high level hook fsnotify_path_create() which is called from
syscall context where mount context is available, so that FAN_CREATE
event can be added to a mount mark mask.
This high level hook is called in addition to fsnotify_create(),
fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
context is not available.
In the context where fsnotify_path_create() will be called, a dentry flag
flag is set on the new dentry the suppress the FS_CREATE event in the vfs
level hooks.
This functionality was requested by Christian Brauner to replace
recursive inotify watches for detecting when some path was created under
an idmapped mount without having to monitor FAN_CREATE events in the
entire filesystem.
In combination with more changes to allow unprivileged fanotify listener
to watch an idmapped mount, this functionality would be usable also by
nested container managers.
Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
Jan,
After trying several different approaches, I finally realized that
making FAN_CREATE available for mount marks is not that hard and it could
be very useful IMO.
Adding support for other "inode events" with mount mark, such as
FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
for FAN_CREATE was really easy due to the fact that all call sites are
already surrounded by filename_creat()/done_path_create() calls.
Also, there is an inherent a-symetry between FAN_CREATE and other
events. All the rest of the events may be set when watching a postive
path, for example, to know when a path of a bind mount that was
"injected" to a container was moved or deleted, it is possible to start
watching that directory before injecting the bind mount.
It is not possible to do the same with a "negative" path to know when
a positive dentry was instantiated at that path.
This patch provides functionality that is independant of other changes,
but I also tested it along with other changes that demonstrate how it
would be utilized in userns setups [1][2].
As can be seen in dcache.h patch, this patch comes on top a revert patch
to reclaim an unused dentry flag. If you accept this proposal, I will
post the full series.
Thanks,
Amir.
[1] https://github.com/amir73il/linux/commits/fanotify_userns
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
 fs/namei.c               | 21 ++++++++++++++++++++-
 include/linux/dcache.h   |  2 +-
 include/linux/fanotify.h |  8 ++++----
 include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 216f16e74351..cf979e956938 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
-		fsnotify_create(dir->d_inode, dentry);
+		fsnotify_path_create(&nd->path, dentry);
 	if (open_flag & O_CREAT)
 		inode_unlock(dir->d_inode);
 	else
@@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
+static void d_set_path_create(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_PATH_CREATE;
+	spin_unlock(&dentry->d_lock);
+}
+
+static void d_clear_path_create(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags &= ~DCACHE_PATH_CREATE;
+	spin_unlock(&dentry->d_lock);
+}
+
 static struct dentry *filename_create(int dfd, struct filename *name,
 				struct path *path, unsigned int lookup_flags)
 {
@@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 		goto fail;
 	}
 	putname(name);
+	/* Start "path create" context that ends in done_path_create() */
+	d_set_path_create(dentry);
 	return dentry;
 fail:
 	dput(dentry);
@@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
 
 void done_path_create(struct path *path, struct dentry *dentry)
 {
+	if (d_inode(dentry))
+		fsnotify_path_create(path, dentry);
+	d_clear_path_create(dentry);
 	dput(dentry);
 	inode_unlock(path->dentry->d_inode);
 	mnt_drop_write(path->mnt);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4225caa8cf02..d153793d5b95 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -213,7 +213,7 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
 
 #define DCACHE_MAY_FREE			0x00800000
-/* Was #define DCACHE_FALLTHRU			0x01000000 */
+#define DCACHE_PATH_CREATE		0x01000000 /* "path_create" context */
 #define DCACHE_NOKEY_NAME		0x02000000 /* Encrypted name encoded without key */
 #define DCACHE_OP_REAL			0x04000000
 
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index bad41bcb25df..f0c5a4a82b6e 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -65,10 +65,10 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 /*
  * Events that can be reported with data type FSNOTIFY_EVENT_PATH.
- * Note that FAN_MODIFY can also be reported with data type
+ * Note that FAN_MODIFY and FAN_CREATE can also be reported with data type
  * FSNOTIFY_EVENT_INODE.
  */
-#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | \
+#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | FAN_CREATE | \
 				 FAN_CLOSE | FAN_OPEN | FAN_OPEN_EXEC)
 
 /*
@@ -78,8 +78,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 #define FANOTIFY_DIRENT_EVENTS	(FAN_MOVE | FAN_CREATE | FAN_DELETE)
 
 /* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
-#define FANOTIFY_INODE_EVENTS	(FANOTIFY_DIRENT_EVENTS | \
-				 FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
+#define FANOTIFY_INODE_EVENTS	(FAN_MOVE | FAN_DELETE | FAN_ATTRIB | \
+				 FAN_MOVE_SELF | FAN_DELETE_SELF)
 
 /* Events that user can request to be notified on */
 #define FANOTIFY_EVENTS		(FANOTIFY_PATH_EVENTS | \
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f8acddcf54fb..9a3d9f7beeb2 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -179,6 +179,30 @@ static inline void fsnotify_inoderemove(struct inode *inode)
 	__fsnotify_inode_delete(inode);
 }
 
+/*
+ * fsnotify_path_create - an inode was linked to namespace
+ *
+ * This higher level hook is called in addition to fsnotify_create(),
+ * fsnotify_mkdir() and fsnotify_link() vfs hooks when the mount context is
+ * available, so that FS_CREATE event can be added to a mount mark mask.
+ *
+ * In that case the, DCACHE_PATH_CREATE flag is set to suppress the FS_CREATE
+ * event in the lower level vfs hooks.
+ */
+static inline void fsnotify_path_create(struct path *path,
+					struct dentry *child)
+{
+	struct inode *dir = path->dentry->d_inode;
+	__u32 mask = FS_CREATE;
+
+	WARN_ON_ONCE(!inode_is_locked(dir));
+
+	if (S_ISDIR(d_inode(child)->i_mode))
+		mask |= FS_ISDIR;
+
+	fsnotify(mask, path, FSNOTIFY_EVENT_PATH, dir, &child->d_name, NULL, 0);
+}
+
 /*
  * fsnotify_create - 'name' was linked in
  */
@@ -186,6 +210,10 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
 {
 	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_dirent(inode, dentry, FS_CREATE);
 }
 
@@ -200,6 +228,10 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode,
 	fsnotify_link_count(inode);
 	audit_inode_child(dir, new_dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (new_dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_name(dir, FS_CREATE, inode, &new_dentry->d_name, 0);
 }
 
@@ -223,6 +255,10 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
 {
 	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_dirent(inode, dentry, FS_CREATE | FS_ISDIR);
 }
 
-- 
2.30.0
^ permalink raw reply related	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
@ 2021-03-30  7:31 ` Christian Brauner
  2021-03-30  9:31   ` Amir Goldstein
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
  2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2021-03-30  7:31 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-api
On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
> 
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
> 
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.
> 
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
> 
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
> 
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
Was about to look at this. Does this require preliminary patches since
it doesn't apply to current master. If so, can you just give me a link
to a branch so I can pull from that? :)
Christian
> 
> Jan,
> 
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
> 
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
> 
> Also, there is an inherent a-symetry between FAN_CREATE and other
> events. All the rest of the events may be set when watching a postive
> path, for example, to know when a path of a bind mount that was
> "injected" to a container was moved or deleted, it is possible to start
> watching that directory before injecting the bind mount.
> 
> It is not possible to do the same with a "negative" path to know when
> a positive dentry was instantiated at that path.
> 
> This patch provides functionality that is independant of other changes,
> but I also tested it along with other changes that demonstrate how it
> would be utilized in userns setups [1][2].
> 
> As can be seen in dcache.h patch, this patch comes on top a revert patch
> to reclaim an unused dentry flag. If you accept this proposal, I will
> post the full series.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> 
>  fs/namei.c               | 21 ++++++++++++++++++++-
>  include/linux/dcache.h   |  2 +-
>  include/linux/fanotify.h |  8 ++++----
>  include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..cf979e956938 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
>  		inode_lock_shared(dir->d_inode);
>  	dentry = lookup_open(nd, file, op, got_write);
>  	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
> -		fsnotify_create(dir->d_inode, dentry);
> +		fsnotify_path_create(&nd->path, dentry);
>  	if (open_flag & O_CREAT)
>  		inode_unlock(dir->d_inode);
>  	else
> @@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>  	return file;
>  }
>  
> +static void d_set_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags |= DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
> +static void d_clear_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags &= ~DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  static struct dentry *filename_create(int dfd, struct filename *name,
>  				struct path *path, unsigned int lookup_flags)
>  {
> @@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  		goto fail;
>  	}
>  	putname(name);
> +	/* Start "path create" context that ends in done_path_create() */
> +	d_set_path_create(dentry);
>  	return dentry;
>  fail:
>  	dput(dentry);
> @@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
>  
>  void done_path_create(struct path *path, struct dentry *dentry)
>  {
> +	if (d_inode(dentry))
> +		fsnotify_path_create(path, dentry);
> +	d_clear_path_create(dentry);
>  	dput(dentry);
>  	inode_unlock(path->dentry->d_inode);
>  	mnt_drop_write(path->mnt);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 4225caa8cf02..d153793d5b95 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -213,7 +213,7 @@ struct dentry_operations {
>  #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
>  
>  #define DCACHE_MAY_FREE			0x00800000
> -/* Was #define DCACHE_FALLTHRU			0x01000000 */
> +#define DCACHE_PATH_CREATE		0x01000000 /* "path_create" context */
>  #define DCACHE_NOKEY_NAME		0x02000000 /* Encrypted name encoded without key */
>  #define DCACHE_OP_REAL			0x04000000
>  
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index bad41bcb25df..f0c5a4a82b6e 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -65,10 +65,10 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>  
>  /*
>   * Events that can be reported with data type FSNOTIFY_EVENT_PATH.
> - * Note that FAN_MODIFY can also be reported with data type
> + * Note that FAN_MODIFY and FAN_CREATE can also be reported with data type
>   * FSNOTIFY_EVENT_INODE.
>   */
> -#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | \
> +#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | FAN_CREATE | \
>  				 FAN_CLOSE | FAN_OPEN | FAN_OPEN_EXEC)
>  
>  /*
> @@ -78,8 +78,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>  #define FANOTIFY_DIRENT_EVENTS	(FAN_MOVE | FAN_CREATE | FAN_DELETE)
>  
>  /* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
> -#define FANOTIFY_INODE_EVENTS	(FANOTIFY_DIRENT_EVENTS | \
> -				 FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
> +#define FANOTIFY_INODE_EVENTS	(FAN_MOVE | FAN_DELETE | FAN_ATTRIB | \
> +				 FAN_MOVE_SELF | FAN_DELETE_SELF)
>  
>  /* Events that user can request to be notified on */
>  #define FANOTIFY_EVENTS		(FANOTIFY_PATH_EVENTS | \
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index f8acddcf54fb..9a3d9f7beeb2 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -179,6 +179,30 @@ static inline void fsnotify_inoderemove(struct inode *inode)
>  	__fsnotify_inode_delete(inode);
>  }
>  
> +/*
> + * fsnotify_path_create - an inode was linked to namespace
> + *
> + * This higher level hook is called in addition to fsnotify_create(),
> + * fsnotify_mkdir() and fsnotify_link() vfs hooks when the mount context is
> + * available, so that FS_CREATE event can be added to a mount mark mask.
> + *
> + * In that case the, DCACHE_PATH_CREATE flag is set to suppress the FS_CREATE
> + * event in the lower level vfs hooks.
> + */
> +static inline void fsnotify_path_create(struct path *path,
> +					struct dentry *child)
> +{
> +	struct inode *dir = path->dentry->d_inode;
> +	__u32 mask = FS_CREATE;
> +
> +	WARN_ON_ONCE(!inode_is_locked(dir));
> +
> +	if (S_ISDIR(d_inode(child)->i_mode))
> +		mask |= FS_ISDIR;
> +
> +	fsnotify(mask, path, FSNOTIFY_EVENT_PATH, dir, &child->d_name, NULL, 0);
> +}
> +
>  /*
>   * fsnotify_create - 'name' was linked in
>   */
> @@ -186,6 +210,10 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
>  {
>  	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_dirent(inode, dentry, FS_CREATE);
>  }
>  
> @@ -200,6 +228,10 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode,
>  	fsnotify_link_count(inode);
>  	audit_inode_child(dir, new_dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (new_dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_name(dir, FS_CREATE, inode, &new_dentry->d_name, 0);
>  }
>  
> @@ -223,6 +255,10 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
>  {
>  	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_dirent(inode, dentry, FS_CREATE | FS_ISDIR);
>  }
>  
> -- 
> 2.30.0
> 
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30  7:31 ` Christian Brauner
@ 2021-03-30  9:31   ` Amir Goldstein
  2021-03-30 16:24     ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30  9:31 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > Add a high level hook fsnotify_path_create() which is called from
> > syscall context where mount context is available, so that FAN_CREATE
> > event can be added to a mount mark mask.
> >
> > This high level hook is called in addition to fsnotify_create(),
> > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > context is not available.
> >
> > In the context where fsnotify_path_create() will be called, a dentry flag
> > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > level hooks.
> >
> > This functionality was requested by Christian Brauner to replace
> > recursive inotify watches for detecting when some path was created under
> > an idmapped mount without having to monitor FAN_CREATE events in the
> > entire filesystem.
> >
> > In combination with more changes to allow unprivileged fanotify listener
> > to watch an idmapped mount, this functionality would be usable also by
> > nested container managers.
> >
> > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
>
> Was about to look at this. Does this require preliminary patches since
> it doesn't apply to current master. If so, can you just give me a link
> to a branch so I can pull from that? :)
>
The patch is less useful on its own.
Better take the entire work for the demo which includes this patch:
[1] https://github.com/amir73il/linux/commits/fanotify_userns
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
@ 2021-03-30 12:12 ` Christian Brauner
  2021-03-30 12:33   ` Amir Goldstein
  2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2021-03-30 12:12 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-api
On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
> 
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
> 
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.
Ok, just to make sure this scheme would also work for overlay-style
filesystems like ecryptfs where you possible generate two notify events:
- in the ecryptfs layer
- in the lower fs layer
at least when you set a regular inode watch.
If you set a mount watch you ideally would generate two events in both
layers too, right? But afaict that wouldn't work.
Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
flag will be set on the new ecryptfs dentry and so no notify event will
be generated for the ecryptfs layer again. Then ecryptfs calls
vfs_link() to create a new dentry in the lower layer. The new dentry in
the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
But since vfs_link() doesn't have access to the mnt context itself you
can't generate a notify event for the mount associated with the lower
fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
mount to not get notified about creation events going through the
ecryptfs layer. Is that right?  Seems like this could be a problem.
Christian
> 
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
> 
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
> 
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Jan,
> 
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
> 
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
> 
> Also, there is an inherent a-symetry between FAN_CREATE and other
> events. All the rest of the events may be set when watching a postive
> path, for example, to know when a path of a bind mount that was
> "injected" to a container was moved or deleted, it is possible to start
> watching that directory before injecting the bind mount.
> 
> It is not possible to do the same with a "negative" path to know when
> a positive dentry was instantiated at that path.
> 
> This patch provides functionality that is independant of other changes,
> but I also tested it along with other changes that demonstrate how it
> would be utilized in userns setups [1][2].
> 
> As can be seen in dcache.h patch, this patch comes on top a revert patch
> to reclaim an unused dentry flag. If you accept this proposal, I will
> post the full series.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> 
>  fs/namei.c               | 21 ++++++++++++++++++++-
>  include/linux/dcache.h   |  2 +-
>  include/linux/fanotify.h |  8 ++++----
>  include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..cf979e956938 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
>  		inode_lock_shared(dir->d_inode);
>  	dentry = lookup_open(nd, file, op, got_write);
>  	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
> -		fsnotify_create(dir->d_inode, dentry);
> +		fsnotify_path_create(&nd->path, dentry);
>  	if (open_flag & O_CREAT)
>  		inode_unlock(dir->d_inode);
>  	else
> @@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>  	return file;
>  }
>  
> +static void d_set_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags |= DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
> +static void d_clear_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags &= ~DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  static struct dentry *filename_create(int dfd, struct filename *name,
>  				struct path *path, unsigned int lookup_flags)
>  {
> @@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  		goto fail;
>  	}
>  	putname(name);
> +	/* Start "path create" context that ends in done_path_create() */
> +	d_set_path_create(dentry);
>  	return dentry;
>  fail:
>  	dput(dentry);
> @@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
>  
>  void done_path_create(struct path *path, struct dentry *dentry)
>  {
> +	if (d_inode(dentry))
> +		fsnotify_path_create(path, dentry);
> +	d_clear_path_create(dentry);
>  	dput(dentry);
>  	inode_unlock(path->dentry->d_inode);
>  	mnt_drop_write(path->mnt);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 4225caa8cf02..d153793d5b95 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -213,7 +213,7 @@ struct dentry_operations {
>  #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
>  
>  #define DCACHE_MAY_FREE			0x00800000
> -/* Was #define DCACHE_FALLTHRU			0x01000000 */
Indeed, that seems completely unused.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
@ 2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30 12:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, Linux API
On Sun, Mar 28, 2021 at 6:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
>
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
>
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.
>
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
>
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
>
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Jan,
>
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
>
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
>
FWIW, adding support for FAN_DELETE and FAN_MOVE_SELF was not
so hard. The move event at least will also be needed for the use case
where watching when a negative path is instantiated.
https://github.com/amir73il/linux/commits/fsnotify_path_hooks
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
@ 2021-03-30 12:33   ` Amir Goldstein
  2021-03-30 12:53     ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30 12:33 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > Add a high level hook fsnotify_path_create() which is called from
> > syscall context where mount context is available, so that FAN_CREATE
> > event can be added to a mount mark mask.
> >
> > This high level hook is called in addition to fsnotify_create(),
> > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > context is not available.
> >
> > In the context where fsnotify_path_create() will be called, a dentry flag
> > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > level hooks.
>
> Ok, just to make sure this scheme would also work for overlay-style
> filesystems like ecryptfs where you possible generate two notify events:
> - in the ecryptfs layer
> - in the lower fs layer
> at least when you set a regular inode watch.
>
> If you set a mount watch you ideally would generate two events in both
> layers too, right? But afaict that wouldn't work.
>
> Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> flag will be set on the new ecryptfs dentry and so no notify event will
> be generated for the ecryptfs layer again. Then ecryptfs calls
> vfs_link() to create a new dentry in the lower layer. The new dentry in
> the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
>
> But since vfs_link() doesn't have access to the mnt context itself you
> can't generate a notify event for the mount associated with the lower
> fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> mount to not get notified about creation events going through the
> ecryptfs layer. Is that right?  Seems like this could be a problem.
>
Not sure I follow what the problem might be.
FAN_MARK_MOUNT subscribes to get only events that were
generated via that vfsmount - that has been that way forever.
A listener may subscribe to (say) FAN_CREATE on a certain
mount AND also also on a specific parent directory.
If the listener is watching the entire ecryptfs mount and the
specific lower directory where said vfs_link() happens, both
events will be reported. One from fsnotify_create_path() and
the lower from fsnotify_create().
If one listener is watching the ecryptfs mount and another
listener is watching the specific ecryptfs directory, both
listeners will get a single event each. They will both get
the event that is emitted from fsnotify_path_create().
Besides I am not sure about ecryptfs, but overlayfs uses
private mount clone for accessing lower layer, so by definition
users cannot watch the underlying overlayfs operations using
a mount mark. Also, overlayfs suppresses fsnotify events on
underlying files intentionally with FMODE_NONOTIFY.
Hope that answers your question.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:33   ` Amir Goldstein
@ 2021-03-30 12:53     ` Christian Brauner
  2021-03-30 12:55       ` Christian Brauner
  2021-03-30 13:54       ` Amir Goldstein
  0 siblings, 2 replies; 34+ messages in thread
From: Christian Brauner @ 2021-03-30 12:53 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > Add a high level hook fsnotify_path_create() which is called from
> > > syscall context where mount context is available, so that FAN_CREATE
> > > event can be added to a mount mark mask.
> > >
> > > This high level hook is called in addition to fsnotify_create(),
> > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > context is not available.
> > >
> > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > level hooks.
> >
> > Ok, just to make sure this scheme would also work for overlay-style
> > filesystems like ecryptfs where you possible generate two notify events:
> > - in the ecryptfs layer
> > - in the lower fs layer
> > at least when you set a regular inode watch.
> >
> > If you set a mount watch you ideally would generate two events in both
> > layers too, right? But afaict that wouldn't work.
> >
> > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > flag will be set on the new ecryptfs dentry and so no notify event will
> > be generated for the ecryptfs layer again. Then ecryptfs calls
> > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> >
> > But since vfs_link() doesn't have access to the mnt context itself you
> > can't generate a notify event for the mount associated with the lower
> > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > mount to not get notified about creation events going through the
> > ecryptfs layer. Is that right?  Seems like this could be a problem.
> >
> 
> Not sure I follow what the problem might be.
> 
> FAN_MARK_MOUNT subscribes to get only events that were
> generated via that vfsmount - that has been that way forever.
> 
> A listener may subscribe to (say) FAN_CREATE on a certain
> mount AND also also on a specific parent directory.
> 
> If the listener is watching the entire ecryptfs mount and the
> specific lower directory where said vfs_link() happens, both
> events will be reported. One from fsnotify_create_path() and
> the lower from fsnotify_create().
> 
> If one listener is watching the ecryptfs mount and another
> listener is watching the specific ecryptfs directory, both
> listeners will get a single event each. They will both get
> the event that is emitted from fsnotify_path_create().
> 
> Besides I am not sure about ecryptfs, but overlayfs uses
> private mount clone for accessing lower layer, so by definition
I know. That's why I was using ecryptfs as an example which doesn't do
that (And I think it should be switched tbh.). It simply uses
kern_path() and then stashes that path.
My example probably would be something like:
mount -t ext4 /dev/sdb /A
1. FAN_MARK_MOUNT(/A)
mount --bind /A /B
2. FAN_MARK_MOUNT(/B)
mount -t ecryptfs /B /C
3. FAN_MARK_MOUNT(/C)
let's say I now do
touch /unencrypted/bla
I may be way off here but intuitively it seems both 1. and 2. should get
a creation event but not 3., right?
But with your proposal would both 1. and 2. still get a creation event?
> users cannot watch the underlying overlayfs operations using
> a mount mark. Also, overlayfs suppresses fsnotify events on
> underlying files intentionally with FMODE_NONOTIFY.
Probably ecryptfs should too?
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:53     ` Christian Brauner
@ 2021-03-30 12:55       ` Christian Brauner
  2021-03-30 13:54       ` Amir Goldstein
  1 sibling, 0 replies; 34+ messages in thread
From: Christian Brauner @ 2021-03-30 12:55 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 02:53:36PM +0200, Christian Brauner wrote:
> On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > >
> > > Ok, just to make sure this scheme would also work for overlay-style
> > > filesystems like ecryptfs where you possible generate two notify events:
> > > - in the ecryptfs layer
> > > - in the lower fs layer
> > > at least when you set a regular inode watch.
> > >
> > > If you set a mount watch you ideally would generate two events in both
> > > layers too, right? But afaict that wouldn't work.
> > >
> > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > >
> > > But since vfs_link() doesn't have access to the mnt context itself you
> > > can't generate a notify event for the mount associated with the lower
> > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > mount to not get notified about creation events going through the
> > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > >
> > 
> > Not sure I follow what the problem might be.
> > 
> > FAN_MARK_MOUNT subscribes to get only events that were
> > generated via that vfsmount - that has been that way forever.
> > 
> > A listener may subscribe to (say) FAN_CREATE on a certain
> > mount AND also also on a specific parent directory.
> > 
> > If the listener is watching the entire ecryptfs mount and the
> > specific lower directory where said vfs_link() happens, both
> > events will be reported. One from fsnotify_create_path() and
> > the lower from fsnotify_create().
> > 
> > If one listener is watching the ecryptfs mount and another
> > listener is watching the specific ecryptfs directory, both
> > listeners will get a single event each. They will both get
> > the event that is emitted from fsnotify_path_create().
> > 
> > Besides I am not sure about ecryptfs, but overlayfs uses
> > private mount clone for accessing lower layer, so by definition
> 
> I know. That's why I was using ecryptfs as an example which doesn't do
> that (And I think it should be switched tbh.). It simply uses
> kern_path() and then stashes that path.
> 
> My example probably would be something like:
> 
> mount -t ext4 /dev/sdb /A
> 
> 1. FAN_MARK_MOUNT(/A)
> 
> mount --bind /A /B
> 
> 2. FAN_MARK_MOUNT(/B)
> 
> mount -t ecryptfs /B /C
> 
> 3. FAN_MARK_MOUNT(/C)
> 
> let's say I now do
> 
> touch /unencrypted/bla
touch /C/bla
> 
> I may be way off here but intuitively it seems both 1. and 2. should get
> a creation event but not 3., right?
> 
> But with your proposal would both 1. and 2. still get a creation event?
> 
> > users cannot watch the underlying overlayfs operations using
> > a mount mark. Also, overlayfs suppresses fsnotify events on
> > underlying files intentionally with FMODE_NONOTIFY.
> 
> Probably ecryptfs should too?
> 
> Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:53     ` Christian Brauner
  2021-03-30 12:55       ` Christian Brauner
@ 2021-03-30 13:54       ` Amir Goldstein
  2021-03-30 14:17         ` Christian Brauner
  1 sibling, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30 13:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi
On Tue, Mar 30, 2021 at 3:53 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > >
> > > Ok, just to make sure this scheme would also work for overlay-style
> > > filesystems like ecryptfs where you possible generate two notify events:
> > > - in the ecryptfs layer
> > > - in the lower fs layer
> > > at least when you set a regular inode watch.
> > >
> > > If you set a mount watch you ideally would generate two events in both
> > > layers too, right? But afaict that wouldn't work.
> > >
> > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > >
> > > But since vfs_link() doesn't have access to the mnt context itself you
> > > can't generate a notify event for the mount associated with the lower
> > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > mount to not get notified about creation events going through the
> > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > >
> >
> > Not sure I follow what the problem might be.
> >
> > FAN_MARK_MOUNT subscribes to get only events that were
> > generated via that vfsmount - that has been that way forever.
> >
> > A listener may subscribe to (say) FAN_CREATE on a certain
> > mount AND also also on a specific parent directory.
> >
> > If the listener is watching the entire ecryptfs mount and the
> > specific lower directory where said vfs_link() happens, both
> > events will be reported. One from fsnotify_create_path() and
> > the lower from fsnotify_create().
> >
> > If one listener is watching the ecryptfs mount and another
> > listener is watching the specific ecryptfs directory, both
> > listeners will get a single event each. They will both get
> > the event that is emitted from fsnotify_path_create().
> >
> > Besides I am not sure about ecryptfs, but overlayfs uses
> > private mount clone for accessing lower layer, so by definition
>
> I know. That's why I was using ecryptfs as an example which doesn't do
> that (And I think it should be switched tbh.). It simply uses
> kern_path() and then stashes that path.
>
> My example probably would be something like:
>
> mount -t ext4 /dev/sdb /A
>
> 1. FAN_MARK_MOUNT(/A)
>
> mount --bind /A /B
>
> 2. FAN_MARK_MOUNT(/B)
>
> mount -t ecryptfs /B /C
>
> 3. FAN_MARK_MOUNT(/C)
>
> let's say I now do
>
> touch /C/bla
>
> I may be way off here but intuitively it seems both 1. and 2. should get
> a creation event but not 3., right?
>
Why not 3?
You explicitly set a mark on /C requesting to be notified when
objects are created via /C.
> But with your proposal would both 1. and 2. still get a creation event?
>
They would not get an event, because fsnotify() looks for CREATE event
subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
and does not find any.
The vfs_create() -> fsnotify_create() hook passes data_type inode to
fsnotify() so there is no fsnotify_data_path() to extract mnt event
subscribers from.
The same fate would be to files created by overlayfs, nfsd and cachefiles.
Only the create event on /C/bla from the syscall context would
call fsnoity_path_create() and result with path data in fsnotify(), so
the mnt event subscriber would be found.
> > users cannot watch the underlying overlayfs operations using
> > a mount mark. Also, overlayfs suppresses fsnotify events on
> > underlying files intentionally with FMODE_NONOTIFY.
>
> Probably ecryptfs should too?
>
<shrug> :)
FMODE_NONOTIFY is there not because there was a requirement
not to send events, but because the path of the internal file is
"fake", so it has a weird looking path. After all there are many
other events that would be sent (not on open files).
At least I think that's the reason...
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 13:54       ` Amir Goldstein
@ 2021-03-30 14:17         ` Christian Brauner
  2021-03-30 14:56           ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2021-03-30 14:17 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi
On Tue, Mar 30, 2021 at 04:54:02PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 3:53 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > > Add a high level hook fsnotify_path_create() which is called from
> > > > > syscall context where mount context is available, so that FAN_CREATE
> > > > > event can be added to a mount mark mask.
> > > > >
> > > > > This high level hook is called in addition to fsnotify_create(),
> > > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > > context is not available.
> > > > >
> > > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > > level hooks.
> > > >
> > > > Ok, just to make sure this scheme would also work for overlay-style
> > > > filesystems like ecryptfs where you possible generate two notify events:
> > > > - in the ecryptfs layer
> > > > - in the lower fs layer
> > > > at least when you set a regular inode watch.
> > > >
> > > > If you set a mount watch you ideally would generate two events in both
> > > > layers too, right? But afaict that wouldn't work.
> > > >
> > > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > > >
> > > > But since vfs_link() doesn't have access to the mnt context itself you
> > > > can't generate a notify event for the mount associated with the lower
> > > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > > mount to not get notified about creation events going through the
> > > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > > >
> > >
> > > Not sure I follow what the problem might be.
> > >
> > > FAN_MARK_MOUNT subscribes to get only events that were
> > > generated via that vfsmount - that has been that way forever.
> > >
> > > A listener may subscribe to (say) FAN_CREATE on a certain
> > > mount AND also also on a specific parent directory.
> > >
> > > If the listener is watching the entire ecryptfs mount and the
> > > specific lower directory where said vfs_link() happens, both
> > > events will be reported. One from fsnotify_create_path() and
> > > the lower from fsnotify_create().
> > >
> > > If one listener is watching the ecryptfs mount and another
> > > listener is watching the specific ecryptfs directory, both
> > > listeners will get a single event each. They will both get
> > > the event that is emitted from fsnotify_path_create().
> > >
> > > Besides I am not sure about ecryptfs, but overlayfs uses
> > > private mount clone for accessing lower layer, so by definition
> >
> > I know. That's why I was using ecryptfs as an example which doesn't do
> > that (And I think it should be switched tbh.). It simply uses
> > kern_path() and then stashes that path.
> >
> > My example probably would be something like:
> >
> > mount -t ext4 /dev/sdb /A
> >
> > 1. FAN_MARK_MOUNT(/A)
> >
> > mount --bind /A /B
> >
> > 2. FAN_MARK_MOUNT(/B)
> >
> > mount -t ecryptfs /B /C
> >
> > 3. FAN_MARK_MOUNT(/C)
> >
> > let's say I now do
> >
> > touch /C/bla
> >
> > I may be way off here but intuitively it seems both 1. and 2. should get
> > a creation event but not 3., right?
> >
> 
> Why not 3?
> You explicitly set a mark on /C requesting to be notified when
> objects are created via /C.
Sorry, that was a typo. I meant to write, both 2. and 3. should get a
creation event but not 1.
> 
> > But with your proposal would both 1. and 2. still get a creation event?
> >
Same obvious typo. The correct question would be: with your proposal do
2. and 3. both get an event?
Because it feels like they both should since /C is mounted on top of /B
and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
FAN_MARK_MOUNT(/C) should get a creation event after all both will have
mnt->mnt_fsnotify_marks set.
> 
> They would not get an event, because fsnotify() looks for CREATE event
> subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> and does not find any.
Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
_should_ look at
	    (!mnt || !mnt->mnt_fsnotify_marks) &&
and see that there are subscribers and should notify the subscribers in
/B even if the file is created through /C.
My point is with your solution this can't be handled and I want to make
sure that this is ok. Because right now you'd not be notified about a
new file having been created in /B even though mnt->mnt_fsnotify_marks
is set and the creation went through /B via /C.
_Unless_ we switch to an argument like overlayfs and say "This is a
private mount which is opaque and so we don't need to generate events.".
Overlayfs handles this cleanly due to clone_private_mount() which will
shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
argument we follow, no?
> 
> The vfs_create() -> fsnotify_create() hook passes data_type inode to
> fsnotify() so there is no fsnotify_data_path() to extract mnt event
> subscribers from.
Right, that was my point. You don't have the mnt context for the
underlying fs at a time when e.g. call vfs_link() which ultimately calls
fsnotify_create/link() which I'm saying might be a problem.
> 
> The same fate would be to files created by overlayfs, nfsd and cachefiles.
> 
> Only the create event on /C/bla from the syscall context would
> call fsnoity_path_create() and result with path data in fsnotify(), so
> the mnt event subscriber would be found.
> 
> > > users cannot watch the underlying overlayfs operations using
> > > a mount mark. Also, overlayfs suppresses fsnotify events on
> > > underlying files intentionally with FMODE_NONOTIFY.
> >
> > Probably ecryptfs should too?
It really feels like ecryptfs should do clone_private_mnt() and probably
cachefiles too. I mentioned this to David just a few weeks ago actually.
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 14:17         ` Christian Brauner
@ 2021-03-30 14:56           ` Amir Goldstein
  2021-03-31  9:46             ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30 14:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi
> > > My example probably would be something like:
> > >
> > > mount -t ext4 /dev/sdb /A
> > >
> > > 1. FAN_MARK_MOUNT(/A)
> > >
> > > mount --bind /A /B
> > >
> > > 2. FAN_MARK_MOUNT(/B)
> > >
> > > mount -t ecryptfs /B /C
> > >
> > > 3. FAN_MARK_MOUNT(/C)
> > >
> > > let's say I now do
> > >
> > > touch /C/bla
> > >
> > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > a creation event but not 3., right?
> > >
> >
> > Why not 3?
> > You explicitly set a mark on /C requesting to be notified when
> > objects are created via /C.
>
> Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> creation event but not 1.
>
> >
> > > But with your proposal would both 1. and 2. still get a creation event?
> > >
>
> Same obvious typo. The correct question would be: with your proposal do
> 2. and 3. both get an event?
>
> Because it feels like they both should since /C is mounted on top of /B
> and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> mnt->mnt_fsnotify_marks set.
>
Right.
There are two ways to address this inconsistency:
1. Change internal callers of vfs_ helpers to use a private mount,
    as you yourself suggested for ecryptfs and cachefiles
2. Add fsnotify_path_ hooks at caller site - that would be the
    correct thing to do for nfsd IMO
> >
> > They would not get an event, because fsnotify() looks for CREATE event
> > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > and does not find any.
>
> Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> _should_ look at
>             (!mnt || !mnt->mnt_fsnotify_marks) &&
> and see that there are subscribers and should notify the subscribers in
> /B even if the file is created through /C.
>
> My point is with your solution this can't be handled and I want to make
> sure that this is ok. Because right now you'd not be notified about a
> new file having been created in /B even though mnt->mnt_fsnotify_marks
> is set and the creation went through /B via /C.
>
If you are referring to the ecryptfs use case specifically, then I think it is
ok. After all, whether ecryptfs uses a private mount clone or not is not
something the user can know.
> _Unless_ we switch to an argument like overlayfs and say "This is a
> private mount which is opaque and so we don't need to generate events.".
> Overlayfs handles this cleanly due to clone_private_mount() which will
> shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> argument we follow, no?
>
There is simply no way that the user can infer from the documentation
of FAN_MARK_MOUNT that the event on /B is expected when /B is
underlying layer of ecryptfs or overlayfs.
It requires deep internal knowledge of the stacked fs implementation.
In best case, the user can infer that she MAY get an event on /B.
Some users MAY also expect to get an event on /A because they do not
understand the concept of bind mounts...
Clone a mount ns and you will get more lost users...
> >
> > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > subscribers from.
>
> Right, that was my point. You don't have the mnt context for the
> underlying fs at a time when e.g. call vfs_link() which ultimately calls
> fsnotify_create/link() which I'm saying might be a problem.
>
It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
It would be a problem if people think that the FAN_MOUNT_MARK
is a subtree mark, which it certainly is not. And I have no doubt that
as Jan said, people really do want a subtree mark.
My question to you with this RFC is: Does the ability to subscribe to
CREATE/DELETE/MOVE events on a mount help any of your use
cases? With or without the property that mount marks are allowed
inside userns for idmapped mounts.
Note that if we think the semantics of this are useful for container
managers, but too complex for most mortals, we may decide to
restrict the ability to subscribe to those events to idmapped mounts(?).
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30  9:31   ` Amir Goldstein
@ 2021-03-30 16:24     ` Amir Goldstein
  2021-03-31 10:08       ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-30 16:24 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > Add a high level hook fsnotify_path_create() which is called from
> > > syscall context where mount context is available, so that FAN_CREATE
> > > event can be added to a mount mark mask.
> > >
> > > This high level hook is called in addition to fsnotify_create(),
> > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > context is not available.
> > >
> > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > level hooks.
> > >
> > > This functionality was requested by Christian Brauner to replace
> > > recursive inotify watches for detecting when some path was created under
> > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > entire filesystem.
> > >
> > > In combination with more changes to allow unprivileged fanotify listener
> > > to watch an idmapped mount, this functionality would be usable also by
> > > nested container managers.
> > >
> > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> >
> > Was about to look at this. Does this require preliminary patches since
> > it doesn't apply to current master. If so, can you just give me a link
> > to a branch so I can pull from that? :)
> >
>
> The patch is less useful on its own.
> Better take the entire work for the demo which includes this patch:
>
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
>
Christian,
Apologies for the fast moving target.
I just force force the kernel+demo branches to include support for
the two extra events (delete and move) on mount mark.
3dd3d7f02717...6cfe8f7a9148 fanotify_userns -> fanotify_userns (forced update)
The same demo I described before with a mix of:
- CREATE event on MOUNT mark
- Rest of events on recursive inode marks
Still work when running demo script with --fanotify --recursive
on idmapped mount
But now the demo also works with --global on idmapped mount
by setting up only a mount mark to watch most events
excluding the unsupported events (see below).
Thanks,
Amir.
# ./test_demo.sh /mnt 0
+ WD=/mnt
+ ID=0
...
+ inotifywatch --global -w --timeout -2 /mnt
Establishing filesystem global watch...
...
total  modify  close_write  move_self  create  delete  filename
3      0       1            0          1       1       /mnt/a/1 (deleted)
2      0       1            0          1       0       /mnt/a/0 (deleted)
2      0       1            0          1       0       /mnt/a/2
2      0       1            0          1       0       /mnt/a/3
2      0       0            0          1       1       /mnt/a/dir1 (deleted)
2      0       1            0          1       0       /mnt/a/dir2/A/B/C/file2
1      0       0            0          1       0       /mnt/a/dir0 (deleted)
1      0       0            0          1       0       /mnt/a/dir2
1      0       0            0          1       0       /mnt/a/dir2/A
1      0       0            0          1       0       /mnt/a/dir2/A/B
1      0       0            0          1       0       /mnt/a/dir2/A/B/C
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 14:56           ` Amir Goldstein
@ 2021-03-31  9:46             ` Christian Brauner
  2021-03-31 11:29               ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2021-03-31  9:46 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi
On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > My example probably would be something like:
> > > >
> > > > mount -t ext4 /dev/sdb /A
> > > >
> > > > 1. FAN_MARK_MOUNT(/A)
> > > >
> > > > mount --bind /A /B
> > > >
> > > > 2. FAN_MARK_MOUNT(/B)
> > > >
> > > > mount -t ecryptfs /B /C
> > > >
> > > > 3. FAN_MARK_MOUNT(/C)
> > > >
> > > > let's say I now do
> > > >
> > > > touch /C/bla
> > > >
> > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > a creation event but not 3., right?
> > > >
> > >
> > > Why not 3?
> > > You explicitly set a mark on /C requesting to be notified when
> > > objects are created via /C.
> >
> > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > creation event but not 1.
> >
> > >
> > > > But with your proposal would both 1. and 2. still get a creation event?
> > > >
> >
> > Same obvious typo. The correct question would be: with your proposal do
> > 2. and 3. both get an event?
> >
> > Because it feels like they both should since /C is mounted on top of /B
> > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > mnt->mnt_fsnotify_marks set.
> >
> 
> Right.
> 
> There are two ways to address this inconsistency:
> 1. Change internal callers of vfs_ helpers to use a private mount,
>     as you yourself suggested for ecryptfs and cachefiles
I feel like this is he correct thing to do independently of the fanotify
considerations. I think I'll send an RFC for this today or later this
week.
> 2. Add fsnotify_path_ hooks at caller site - that would be the
>     correct thing to do for nfsd IMO
I do not have an informed opinion yet on nfsd so I simply need to trust
you here. :)
> 
> > >
> > > They would not get an event, because fsnotify() looks for CREATE event
> > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > and does not find any.
> >
> > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > _should_ look at
> >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > and see that there are subscribers and should notify the subscribers in
> > /B even if the file is created through /C.
> >
> > My point is with your solution this can't be handled and I want to make
> > sure that this is ok. Because right now you'd not be notified about a
> > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > is set and the creation went through /B via /C.
> >
> 
> If you are referring to the ecryptfs use case specifically, then I think it is
> ok. After all, whether ecryptfs uses a private mount clone or not is not
> something the user can know.
> 
> > _Unless_ we switch to an argument like overlayfs and say "This is a
> > private mount which is opaque and so we don't need to generate events.".
> > Overlayfs handles this cleanly due to clone_private_mount() which will
> > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > argument we follow, no?
> >
> 
> There is simply no way that the user can infer from the documentation
> of FAN_MARK_MOUNT that the event on /B is expected when /B is
> underlying layer of ecryptfs or overlayfs.
> It requires deep internal knowledge of the stacked fs implementation.
> In best case, the user can infer that she MAY get an event on /B.
> Some users MAY also expect to get an event on /A because they do not
> understand the concept of bind mounts...
> Clone a mount ns and you will get more lost users...
I disagree to some extent. For example, a user might remount /B
read-only at which point /C is effectively read-only too which makes it
plain obvious to the user that /C piggy-backs on /B.
But leaving that aside my questioning is more concerned with whether the
implementation we're aiming for is consistent and intuitive and that
stacking example came to my mind pretty quickly.
> 
> > >
> > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > subscribers from.
> >
> > Right, that was my point. You don't have the mnt context for the
> > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > fsnotify_create/link() which I'm saying might be a problem.
> >
> 
> It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> 
> It would be a problem if people think that the FAN_MOUNT_MARK
> is a subtree mark, which it certainly is not. And I have no doubt that
I don't think subtree marks figure into the example above. But we
digress.
> as Jan said, people really do want a subtree mark.
> 
> My question to you with this RFC is: Does the ability to subscribe to
> CREATE/DELETE/MOVE events on a mount help any of your use
> cases? With or without the property that mount marks are allowed
Since I explicitly pointed on in a prior mail that it would be great to
have the same events as for a regular fanotify watch I think I already
answered that question. :)
> inside userns for idmapped mounts.
But if it helps then I'll do it once: yes, both would indeed be very
useful.
> 
> Note that if we think the semantics of this are useful for container
> managers, but too complex for most mortals, we may decide to
> restrict the ability to subscribe to those events to idmapped mounts(?).
I don't think it's too complex for most users. But supervisors and
container managers are prime users of a feature like this.
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 16:24     ` Amir Goldstein
@ 2021-03-31 10:08       ` Christian Brauner
  2021-03-31 10:57         ` Amir Goldstein
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
  0 siblings, 2 replies; 34+ messages in thread
From: Christian Brauner @ 2021-03-31 10:08 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API
On Tue, Mar 30, 2021 at 07:24:02PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > > >
> > > > This functionality was requested by Christian Brauner to replace
> > > > recursive inotify watches for detecting when some path was created under
> > > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > > entire filesystem.
> > > >
> > > > In combination with more changes to allow unprivileged fanotify listener
> > > > to watch an idmapped mount, this functionality would be usable also by
> > > > nested container managers.
> > > >
> > > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > >
> > > Was about to look at this. Does this require preliminary patches since
> > > it doesn't apply to current master. If so, can you just give me a link
> > > to a branch so I can pull from that? :)
> > >
> >
> > The patch is less useful on its own.
> > Better take the entire work for the demo which includes this patch:
> >
> > [1] https://github.com/amir73il/linux/commits/fanotify_userns
> > [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> >
> 
> Christian,
> 
> Apologies for the fast moving target.
No problem.
> I just force force the kernel+demo branches to include support for
> the two extra events (delete and move) on mount mark.
Sounds good.
One thing your patch
commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Sat Mar 20 12:58:12 2021 +0200
    fs: allow open by file handle inside userns
    open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
    where most filesystems are mounted.
    Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
    inside userns to open by file handle in filesystems that were
    mounted inside that userns.
    In addition, also allow open by handle in an idmapped mount, which is
    mapped to the userns while verifying that the returned open file path
    is under the root of the idmapped mount.
    This is going to be needed for setting an fanotify mark on a filesystem
    and watching events inside userns.
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
following and other callchains:
exportfs_decode_fh()
-> exportfs_decode_fh_raw()
   -> lookup_one_len()
      -> inode_permission(mnt_userns, ...)
That's not a huge problem though I did all these changes for the
overlayfs support for idmapped mounts I have in a branch from an earlier
version of the idmapped mounts patchset. Basically lookup_one_len(),
lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
the mnt_userns into account. I can rebase my change and send it for
consideration next cycle. If you can live without the
open_by_handle_at() support for now in this patchset (Which I think you
said you could.) then it's not a blocker either. Sorry for the
inconvenience.
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 10:08       ` Christian Brauner
@ 2021-03-31 10:57         ` Amir Goldstein
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
  1 sibling, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-03-31 10:57 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API
On Wed, Mar 31, 2021 at 1:08 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 07:24:02PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > > Add a high level hook fsnotify_path_create() which is called from
> > > > > syscall context where mount context is available, so that FAN_CREATE
> > > > > event can be added to a mount mark mask.
> > > > >
> > > > > This high level hook is called in addition to fsnotify_create(),
> > > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > > context is not available.
> > > > >
> > > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > > level hooks.
> > > > >
> > > > > This functionality was requested by Christian Brauner to replace
> > > > > recursive inotify watches for detecting when some path was created under
> > > > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > > > entire filesystem.
> > > > >
> > > > > In combination with more changes to allow unprivileged fanotify listener
> > > > > to watch an idmapped mount, this functionality would be usable also by
> > > > > nested container managers.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > ---
> > > >
> > > > Was about to look at this. Does this require preliminary patches since
> > > > it doesn't apply to current master. If so, can you just give me a link
> > > > to a branch so I can pull from that? :)
> > > >
> > >
> > > The patch is less useful on its own.
> > > Better take the entire work for the demo which includes this patch:
> > >
> > > [1] https://github.com/amir73il/linux/commits/fanotify_userns
> > > [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> > >
> >
> > Christian,
> >
> > Apologies for the fast moving target.
>
> No problem.
>
> > I just force force the kernel+demo branches to include support for
> > the two extra events (delete and move) on mount mark.
>
> Sounds good.
>
> One thing your patch
>
> commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> Author: Amir Goldstein <amir73il@gmail.com>
> Date:   Sat Mar 20 12:58:12 2021 +0200
>
>     fs: allow open by file handle inside userns
>
>     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
>     where most filesystems are mounted.
>
>     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
>     inside userns to open by file handle in filesystems that were
>     mounted inside that userns.
>
>     In addition, also allow open by handle in an idmapped mount, which is
>     mapped to the userns while verifying that the returned open file path
>     is under the root of the idmapped mount.
>
>     This is going to be needed for setting an fanotify mark on a filesystem
>     and watching events inside userns.
>
>     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> following and other callchains:
>
> exportfs_decode_fh()
> -> exportfs_decode_fh_raw()
>    -> lookup_one_len()
>       -> inode_permission(mnt_userns, ...)
>
> That's not a huge problem though I did all these changes for the
> overlayfs support for idmapped mounts I have in a branch from an earlier
> version of the idmapped mounts patchset. Basically lookup_one_len(),
> lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> the mnt_userns into account. I can rebase my change and send it for
> consideration next cycle. If you can live without the
> open_by_handle_at() support for now in this patchset (Which I think you
> said you could.) then it's not a blocker either. Sorry for the
> inconvenience.
>
My preference is that you take this patch through your branch.
I think you will do a much better job than me selling it and arguing for
its correctness.
The ability to setup fanotify marks on idmapped mounts does not
depend on it in any way.
I've only added it to my branch to facilitate the demo, which also
independently resolves the reported fids to paths.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31  9:46             ` Christian Brauner
@ 2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
                                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-03-31 11:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > My example probably would be something like:
> > > > >
> > > > > mount -t ext4 /dev/sdb /A
> > > > >
> > > > > 1. FAN_MARK_MOUNT(/A)
> > > > >
> > > > > mount --bind /A /B
> > > > >
> > > > > 2. FAN_MARK_MOUNT(/B)
> > > > >
> > > > > mount -t ecryptfs /B /C
> > > > >
> > > > > 3. FAN_MARK_MOUNT(/C)
> > > > >
> > > > > let's say I now do
> > > > >
> > > > > touch /C/bla
> > > > >
> > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > a creation event but not 3., right?
> > > > >
> > > >
> > > > Why not 3?
> > > > You explicitly set a mark on /C requesting to be notified when
> > > > objects are created via /C.
> > >
> > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > creation event but not 1.
> > >
> > > >
> > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > >
> > >
> > > Same obvious typo. The correct question would be: with your proposal do
> > > 2. and 3. both get an event?
> > >
> > > Because it feels like they both should since /C is mounted on top of /B
> > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > mnt->mnt_fsnotify_marks set.
> > >
> >
> > Right.
> >
> > There are two ways to address this inconsistency:
> > 1. Change internal callers of vfs_ helpers to use a private mount,
> >     as you yourself suggested for ecryptfs and cachefiles
>
> I feel like this is he correct thing to do independently of the fanotify
> considerations. I think I'll send an RFC for this today or later this
> week.
>
> > 2. Add fsnotify_path_ hooks at caller site - that would be the
> >     correct thing to do for nfsd IMO
>
> I do not have an informed opinion yet on nfsd so I simply need to trust
> you here. :)
>
As long as "exp_export: export of idmapped mounts not yet supported.\n"
I don't think it matters much.
It feels like adding idmapped mounts to nfsd is on your roadmap.
When you get to that we can discuss adding fsnotify path hooks to nfsd
if Jan agrees to the fsnotify path hooks concept.
> >
> > > >
> > > > They would not get an event, because fsnotify() looks for CREATE event
> > > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > > and does not find any.
> > >
> > > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > > _should_ look at
> > >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > > and see that there are subscribers and should notify the subscribers in
> > > /B even if the file is created through /C.
> > >
> > > My point is with your solution this can't be handled and I want to make
> > > sure that this is ok. Because right now you'd not be notified about a
> > > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > > is set and the creation went through /B via /C.
> > >
> >
> > If you are referring to the ecryptfs use case specifically, then I think it is
> > ok. After all, whether ecryptfs uses a private mount clone or not is not
> > something the user can know.
> >
> > > _Unless_ we switch to an argument like overlayfs and say "This is a
> > > private mount which is opaque and so we don't need to generate events.".
> > > Overlayfs handles this cleanly due to clone_private_mount() which will
> > > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > > argument we follow, no?
> > >
> >
> > There is simply no way that the user can infer from the documentation
> > of FAN_MARK_MOUNT that the event on /B is expected when /B is
> > underlying layer of ecryptfs or overlayfs.
> > It requires deep internal knowledge of the stacked fs implementation.
> > In best case, the user can infer that she MAY get an event on /B.
> > Some users MAY also expect to get an event on /A because they do not
> > understand the concept of bind mounts...
> > Clone a mount ns and you will get more lost users...
>
> I disagree to some extent. For example, a user might remount /B
> read-only at which point /C is effectively read-only too which makes it
> plain obvious to the user that /C piggy-backs on /B.
Yes, but that is a bug. /C should not become read-only. It should use
a private clone of /B, so I don't see where this is going.
> But leaving that aside my questioning is more concerned with whether the
> implementation we're aiming for is consistent and intuitive and that
> stacking example came to my mind pretty quickly.
>
This implementation is a compromise for not having clear user mount
context in all places that call for an event.
For every person you find that thinks it is intuitive to get an event on /B
for touch C/bla, you will find another person that thinks it is not intuitive
to get an event. I think we are way beyond the stage with mount
namespaces where intuition alone suffice.
w.r.t consistent, you gave a few examples and I suggested how IMO
they should be fixed to behave consistently.
If you have other examples of alleged inconsistencies, please list them.
> >
> > > >
> > > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > > subscribers from.
> > >
> > > Right, that was my point. You don't have the mnt context for the
> > > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > > fsnotify_create/link() which I'm saying might be a problem.
> > >
> >
> > It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> >
> > It would be a problem if people think that the FAN_MOUNT_MARK
> > is a subtree mark, which it certainly is not. And I have no doubt that
>
> I don't think subtree marks figure into the example above. But we
> digress.
>
> > as Jan said, people really do want a subtree mark.
> >
> > My question to you with this RFC is: Does the ability to subscribe to
> > CREATE/DELETE/MOVE events on a mount help any of your use
> > cases? With or without the property that mount marks are allowed
>
> Since I explicitly pointed on in a prior mail that it would be great to
> have the same events as for a regular fanotify watch I think I already
> answered that question. :)
>
> > inside userns for idmapped mounts.
>
> But if it helps then I'll do it once: yes, both would indeed be very
> useful.
>
OK. I understand that the "promise" of those abilities is very useful.
Please also confirm once you tested the demo code that the new
events on an idmapped mount will "actually" be useful to container
managers. If you can work my demo code into a demo branch for
the bind mount injection or something that would be best.
The reason I am asking this is because while I was working on
enabling sb/mount marks inside userns I found several other issues
(e.g. open_by_handle_at()) without fixing them the demo would have
been much less impressive and much less useful in practice.
So I would like to know that we really have all the pieces needed for
a useful solution, before proposing the fanotify patches.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
@ 2021-03-31 12:17                 ` Christian Brauner
  2021-03-31 12:59                   ` Amir Goldstein
  2021-03-31 12:54                 ` Jan Kara
  2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
  2 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2021-03-31 12:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
On Wed, Mar 31, 2021 at 02:29:04PM +0300, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > > My example probably would be something like:
> > > > > >
> > > > > > mount -t ext4 /dev/sdb /A
> > > > > >
> > > > > > 1. FAN_MARK_MOUNT(/A)
> > > > > >
> > > > > > mount --bind /A /B
> > > > > >
> > > > > > 2. FAN_MARK_MOUNT(/B)
> > > > > >
> > > > > > mount -t ecryptfs /B /C
> > > > > >
> > > > > > 3. FAN_MARK_MOUNT(/C)
> > > > > >
> > > > > > let's say I now do
> > > > > >
> > > > > > touch /C/bla
> > > > > >
> > > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > > a creation event but not 3., right?
> > > > > >
> > > > >
> > > > > Why not 3?
> > > > > You explicitly set a mark on /C requesting to be notified when
> > > > > objects are created via /C.
> > > >
> > > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > > creation event but not 1.
> > > >
> > > > >
> > > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > > >
> > > >
> > > > Same obvious typo. The correct question would be: with your proposal do
> > > > 2. and 3. both get an event?
> > > >
> > > > Because it feels like they both should since /C is mounted on top of /B
> > > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > > mnt->mnt_fsnotify_marks set.
> > > >
> > >
> > > Right.
> > >
> > > There are two ways to address this inconsistency:
> > > 1. Change internal callers of vfs_ helpers to use a private mount,
> > >     as you yourself suggested for ecryptfs and cachefiles
> >
[1]:
> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.
> It feels like adding idmapped mounts to nfsd is on your roadmap.
> When you get to that we can discuss adding fsnotify path hooks to nfsd
> if Jan agrees to the fsnotify path hooks concept.
> 
> > >
> > > > >
> > > > > They would not get an event, because fsnotify() looks for CREATE event
> > > > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > > > and does not find any.
> > > >
> > > > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > > > _should_ look at
> > > >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > > > and see that there are subscribers and should notify the subscribers in
> > > > /B even if the file is created through /C.
> > > >
> > > > My point is with your solution this can't be handled and I want to make
> > > > sure that this is ok. Because right now you'd not be notified about a
> > > > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > > > is set and the creation went through /B via /C.
> > > >
> > >
> > > If you are referring to the ecryptfs use case specifically, then I think it is
> > > ok. After all, whether ecryptfs uses a private mount clone or not is not
> > > something the user can know.
> > >
> > > > _Unless_ we switch to an argument like overlayfs and say "This is a
> > > > private mount which is opaque and so we don't need to generate events.".
> > > > Overlayfs handles this cleanly due to clone_private_mount() which will
> > > > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > > > argument we follow, no?
> > > >
> > >
> > > There is simply no way that the user can infer from the documentation
> > > of FAN_MARK_MOUNT that the event on /B is expected when /B is
> > > underlying layer of ecryptfs or overlayfs.
> > > It requires deep internal knowledge of the stacked fs implementation.
> > > In best case, the user can infer that she MAY get an event on /B.
> > > Some users MAY also expect to get an event on /A because they do not
> > > understand the concept of bind mounts...
> > > Clone a mount ns and you will get more lost users...
> >
> > I disagree to some extent. For example, a user might remount /B
> > read-only at which point /C is effectively read-only too which makes it
> > plain obvious to the user that /C piggy-backs on /B.
> 
> Yes, but that is a bug. /C should not become read-only. It should use
I agree. That's why I said they should use one. :)
I just pointed out how it is today for two filesystems.
> a private clone of /B, so I don't see where this is going.
It's going to [1] above. :)
> 
> > But leaving that aside my questioning is more concerned with whether the
> > implementation we're aiming for is consistent and intuitive and that
> > stacking example came to my mind pretty quickly.
> >
> 
> This implementation is a compromise for not having clear user mount
> context in all places that call for an event.
> For every person you find that thinks it is intuitive to get an event on /B
> for touch C/bla, you will find another person that thinks it is not intuitive
And I think here we disagree. The technical implementation currently
requires this since the two mounts are both clearly marked and the first
mount creates objects by going through the other mount and they don't
have a private mount. All I was saying is that the current patchset
can't handle this case and asked whether we are ok with that and if not
what we do to fix it.
My proposal two or three mails ago and then picked up by you is: make
them both use a private clone mount which is - as I said in earlier
mails - the correct solution anyway and falls in line with overlayfs
too.
> to get an event. I think we are way beyond the stage with mount
> namespaces where intuition alone suffice.
> 
> w.r.t consistent, you gave a few examples and I suggested how IMO
> they should be fixed to behave consistently.
> If you have other examples of alleged inconsistencies, please list them.
It feels like I somehow upset you with this. Again, I agree with the
push of the patchset in general and I'm grateful you're doing that work
and we agree on the fix for the two filesystems. As I said, I'll try to
get an RFC fix out for both.
> 
> > >
> > > > >
> > > > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > > > subscribers from.
> > > >
> > > > Right, that was my point. You don't have the mnt context for the
> > > > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > > > fsnotify_create/link() which I'm saying might be a problem.
> > > >
> > >
> > > It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> > >
> > > It would be a problem if people think that the FAN_MOUNT_MARK
> > > is a subtree mark, which it certainly is not. And I have no doubt that
> >
> > I don't think subtree marks figure into the example above. But we
> > digress.
> >
> > > as Jan said, people really do want a subtree mark.
> > >
> > > My question to you with this RFC is: Does the ability to subscribe to
> > > CREATE/DELETE/MOVE events on a mount help any of your use
> > > cases? With or without the property that mount marks are allowed
> >
> > Since I explicitly pointed on in a prior mail that it would be great to
> > have the same events as for a regular fanotify watch I think I already
> > answered that question. :)
> >
> > > inside userns for idmapped mounts.
> >
> > But if it helps then I'll do it once: yes, both would indeed be very
> > useful.
> >
> 
> OK. I understand that the "promise" of those abilities is very useful.
> Please also confirm once you tested the demo code that the new
> events on an idmapped mount will "actually" be useful to container
> managers. If you can work my demo code into a demo branch for
> the bind mount injection or something that would be best.
> 
> The reason I am asking this is because while I was working on
> enabling sb/mount marks inside userns I found several other issues
> (e.g. open_by_handle_at()) without fixing them the demo would have
> been much less impressive and much less useful in practice.
> 
> So I would like to know that we really have all the pieces needed for
> a useful solution, before proposing the fanotify patches.
Sure, if you think that you have your branch in the shape that you want
to. So far it has been evolving quite rapidly as you said yourself. :)
I can probably test this soon early next week seems most likely since I
need to find a timeslot to actually do the work you're asking. Hope that
works.
Thanks!
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
@ 2021-03-31 12:54                 ` Jan Kara
  2021-03-31 14:06                   ` Amir Goldstein
  2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2021-03-31 12:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields
On Wed 31-03-21 14:29:04, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > > My example probably would be something like:
> > > > > >
> > > > > > mount -t ext4 /dev/sdb /A
> > > > > >
> > > > > > 1. FAN_MARK_MOUNT(/A)
> > > > > >
> > > > > > mount --bind /A /B
> > > > > >
> > > > > > 2. FAN_MARK_MOUNT(/B)
> > > > > >
> > > > > > mount -t ecryptfs /B /C
> > > > > >
> > > > > > 3. FAN_MARK_MOUNT(/C)
> > > > > >
> > > > > > let's say I now do
> > > > > >
> > > > > > touch /C/bla
> > > > > >
> > > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > > a creation event but not 3., right?
> > > > > >
> > > > >
> > > > > Why not 3?
> > > > > You explicitly set a mark on /C requesting to be notified when
> > > > > objects are created via /C.
> > > >
> > > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > > creation event but not 1.
> > > >
> > > > >
> > > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > > >
> > > >
> > > > Same obvious typo. The correct question would be: with your proposal do
> > > > 2. and 3. both get an event?
> > > >
> > > > Because it feels like they both should since /C is mounted on top of /B
> > > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > > mnt->mnt_fsnotify_marks set.
> > > >
> > >
> > > Right.
> > >
> > > There are two ways to address this inconsistency:
> > > 1. Change internal callers of vfs_ helpers to use a private mount,
> > >     as you yourself suggested for ecryptfs and cachefiles
> >
> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.
> It feels like adding idmapped mounts to nfsd is on your roadmap.
> When you get to that we can discuss adding fsnotify path hooks to nfsd
> if Jan agrees to the fsnotify path hooks concept.
I was looking at the patch and thinking about it for a few days already. I
think that generating fsnotify event later (higher up the stack where we
have mount information) is fine and a neat idea. I just dislike the hackery
with dentry flags. Also I'm somewhat uneasy that it is random (from
userspace POV) when path event is generated and when not (at least that's
my impression from the patch - maybe I'm wrong). How difficult would it be
to get rid of it? I mean what if we just moved say fsnotify_create() call
wholly up the stack? It would mean more explicit calls to fsnotify_create()
from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
we would probably have to lift generation of all directory events like
that. That would be already notable churn but maybe doable... I know you've
been looking at similar things in the past so if you are aware why this
won't fly, please tell me.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 12:17                 ` Christian Brauner
@ 2021-03-31 12:59                   ` Amir Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-03-31 12:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
> > This implementation is a compromise for not having clear user mount
> > context in all places that call for an event.
> > For every person you find that thinks it is intuitive to get an event on /B
> > for touch C/bla, you will find another person that thinks it is not intuitive
>
> And I think here we disagree. The technical implementation currently
> requires this since the two mounts are both clearly marked and the first
> mount creates objects by going through the other mount and they don't
> have a private mount. All I was saying is that the current patchset
> can't handle this case and asked whether we are ok with that and if not
> what we do to fix it.
> My proposal two or three mails ago and then picked up by you is: make
> them both use a private clone mount which is - as I said in earlier
> mails - the correct solution anyway and falls in line with overlayfs
> too.
>
As long as we agree on the solution ;-)
> > to get an event. I think we are way beyond the stage with mount
> > namespaces where intuition alone suffice.
> >
> > w.r.t consistent, you gave a few examples and I suggested how IMO
> > they should be fixed to behave consistently.
> > If you have other examples of alleged inconsistencies, please list them.
>
> It feels like I somehow upset you with this.
You do not upset me.
I just didn't find a better way to address "consistent and intuitive" concern
without asking for more concrete examples, after we eliminated the
ecryptfs example, which we already agreed(?), is a non issue.
My claim about "intuitive" is that there is a limit to how intuitive
this could be.
I do not see myself explaining in the man page why FAN_DELETE_SELF
cannot be requested for a mount mark. It's just too low level.
So the best I can do is document the events that are available to inode/sb
mark and not available to mount mark.
Currently, the fanotify_mark.2 man page reads:
"...The events which require that filesystem objects are identified by
file handles,
 such as FAN_CREATE, FAN_ATTRIB,  FAN_MOVE,  and FAN_DELETE_SELF,
 cannot  be provided as a mask when flags contains FAN_MARK_MOUNT..."
I will change that to:
"...The events FAN_ATTRIB,  FAN_MOVE,  and FAN_DELETE_SELF,
 cannot  be provided as a mask when flags contain FAN_MARK_MOUNT..."
Without providing a rationale to the list of forbidden events.
BTW, there is an undocumented fact about FAN_MODIFY -
This event is allowed in a mount mark mask, but it only reports the events
generated by fsnotify_modify() on file writes. It does not report to a mount
mark, the FAN_MODIFY event generated by fsnotify_change() from
truncate() and utimensat(), because of the missing mount context.
So yeh, I do understand where the "inconsistent" feeling is coming from... ;-)
[...]
> > So I would like to know that we really have all the pieces needed for
> > a useful solution, before proposing the fanotify patches.
>
> Sure, if you think that you have your branch in the shape that you want
> to. So far it has been evolving quite rapidly as you said yourself. :)
> I can probably test this soon early next week seems most likely since I
> need to find a timeslot to actually do the work you're asking. Hope that
> works.
>
No plans to make any more changes to those branches and no rush
as to when to post tham. This is not v4.13-rc1 material anyway.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
  2021-03-31 12:54                 ` Jan Kara
@ 2021-03-31 13:06                 ` J. Bruce Fields
  2 siblings, 0 replies; 34+ messages in thread
From: J. Bruce Fields @ 2021-03-31 13:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API,
	Miklos Szeredi
On Wed, Mar 31, 2021 at 02:29:04PM +0300, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.
Last I looked, nfsd support for idmapped mounts looked like it would be
a pretty straightforward job, for what it's worth.
--b.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 12:54                 ` Jan Kara
@ 2021-03-31 14:06                   ` Amir Goldstein
  2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-31 14:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
> > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > I don't think it matters much.
> > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > if Jan agrees to the fsnotify path hooks concept.
>
> I was looking at the patch and thinking about it for a few days already. I
> think that generating fsnotify event later (higher up the stack where we
> have mount information) is fine and a neat idea. I just dislike the hackery
> with dentry flags.
Me as well. I used this hack for fast POC.
If we stick with the dual hooks approach, we will have to either pass a new
argument to vfs helpers or use another trick:
Convert all the many calls sites that were converted by Christian to:
   vfs_XXX(&init_user_ns, ...
because they do not have mount context, to:
   vfs_XXX(NULL, ...
Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
A non-NULL mnt_userns arg means "path_notify" context.
I have already POC code for passing mnt_userns to fsnotify hooks [1].
I did not check if this assumption always works, but there seems to
be a large overlap between idmapped aware callers and use cases
that will require sending events to a mount mark.
> Also I'm somewhat uneasy that it is random (from
> userspace POV) when path event is generated and when not (at least that's
> my impression from the patch - maybe I'm wrong). How difficult would it be
> to get rid of it? I mean what if we just moved say fsnotify_create() call
> wholly up the stack? It would mean more explicit calls to fsnotify_create()
> from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
1. I don't think we can do that for all the fsnotify_create() hooks, such as
    debugfs for example
2. It is useless to pass the mount from overlayfs to fsnotify, its a private
    mount that users cannot set a mark on anyway and Christian has
    promised to propose the same change for cachefiles and ecryptfs,
    so I think it's not worth the churn in those call sites
3. I am uneasy with removing the fsnotify hooks from vfs helpers and
    trusting that new callers of vfs_create() will remember to add the high
    level hooks, so I prefer the existing behavior remains for such callers
> we would probably have to lift generation of all directory events like
> that. That would be already notable churn but maybe doable... I know you've
> been looking at similar things in the past so if you are aware why this
> won't fly, please tell me.
I agree with that and since I posted this RFC patch, I have already added
support for FAN_DELETE and FAN_MOVE_SELF [2].
This was easy - not much churn at all.
FAN_MOVED_FROM I dropped because of the old name snapshot.
FAN_MOVED_TO I dropped because it needs the cookie to be in sync with
that of the FAN_MOVED_FROM event.
Besides, this event pair is "inotify legacy" as far as I am concerned.
FAN_MOVE_SELF can provide most of the needed functionality.
The rest of the functionality should be provided by a new event pair
IMO, FAN_LINK/FAN_UNLINK, as described in this proposal [3].
Which leaves us with two events: FAN_DELETE_SELF and FAN_ATTRIB.
FAN_DELETE_SELF is not appropriate for a mount mark IMO.
FAN_ATTRIB would be useful on mount mark IMO.
It would incur a bit more churn to add it, but I think it's certainly doable.
Just need to decide if we stay with the "dual hooks" approach and if so
on the technique to pass the "notify_path" state into vfs helpers and
existing fsnotify hooks.
Thanks,
Amir.
[1] https://github.com/amir73il/linux/commits/fanotify_in_userns
[2] https://github.com/amir73il/linux/commits/fsnotify_path_hooks
[3] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhEsbfA5+sW4XPnUKgCkXtwoDA-BR3iRO34Nx5c4y7Nug@mail.gmail.com/
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: fsnotify path hooks
  2021-03-31 14:06                   ` Amir Goldstein
@ 2021-03-31 20:59                     ` Amir Goldstein
  2021-04-01 10:29                       ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-03-31 20:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > I don't think it matters much.
> > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > if Jan agrees to the fsnotify path hooks concept.
> >
> > I was looking at the patch and thinking about it for a few days already. I
> > think that generating fsnotify event later (higher up the stack where we
> > have mount information) is fine and a neat idea. I just dislike the hackery
> > with dentry flags.
>
> Me as well. I used this hack for fast POC.
>
> If we stick with the dual hooks approach, we will have to either pass a new
> argument to vfs helpers or use another trick:
>
> Convert all the many calls sites that were converted by Christian to:
>    vfs_XXX(&init_user_ns, ...
> because they do not have mount context, to:
>    vfs_XXX(NULL, ...
>
> Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> A non-NULL mnt_userns arg means "path_notify" context.
> I have already POC code for passing mnt_userns to fsnotify hooks [1].
>
> I did not check if this assumption always works, but there seems to
> be a large overlap between idmapped aware callers and use cases
> that will require sending events to a mount mark.
>
The above "trick" is pretty silly as I believe Christian intends
to fix all those call sites that pass init_user_ns.
> > Also I'm somewhat uneasy that it is random (from
> > userspace POV) when path event is generated and when not (at least that's
> > my impression from the patch - maybe I'm wrong). How difficult would it be
> > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
>
> 1. I don't think we can do that for all the fsnotify_create() hooks, such as
>     debugfs for example
> 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
>     mount that users cannot set a mark on anyway and Christian has
>     promised to propose the same change for cachefiles and ecryptfs,
>     so I think it's not worth the churn in those call sites
> 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
>     trusting that new callers of vfs_create() will remember to add the high
>     level hooks, so I prefer the existing behavior remains for such callers
>
So I read your proposal the wrong way.
You meant move fsnotify_create() up *without* passing mount context
from overlayfs and friends.
So yeh, I do think it is manageable. I think the best solution would be
something along the lines of wrappers like the following:
static inline int vfs_mkdir(...)
{
        int error = __vfs_mkdir_nonotify(...);
        if (!error)
                fsnotify_mkdir(dir, dentry);
        return error;
}
And then the few call sites that call the fsnotify_path_ hooks
(i.e. in syscalls and perhaps later in nfsd) will call the
__vfs_xxx_nonotify() variant.
I suppose that with this approach I could make all the relevant events
available for mount mark with relatively little churn.
I will try it out.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: fsnotify path hooks
  2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
@ 2021-04-01 10:29                       ` Jan Kara
  2021-04-01 14:18                         ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2021-04-01 10:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields
On Wed 31-03-21 23:59:27, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > > I don't think it matters much.
> > > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > > if Jan agrees to the fsnotify path hooks concept.
> > >
> > > I was looking at the patch and thinking about it for a few days already. I
> > > think that generating fsnotify event later (higher up the stack where we
> > > have mount information) is fine and a neat idea. I just dislike the hackery
> > > with dentry flags.
> >
> > Me as well. I used this hack for fast POC.
> >
> > If we stick with the dual hooks approach, we will have to either pass a new
> > argument to vfs helpers or use another trick:
> >
> > Convert all the many calls sites that were converted by Christian to:
> >    vfs_XXX(&init_user_ns, ...
> > because they do not have mount context, to:
> >    vfs_XXX(NULL, ...
> >
> > Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> > but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> > A non-NULL mnt_userns arg means "path_notify" context.
> > I have already POC code for passing mnt_userns to fsnotify hooks [1].
> >
> > I did not check if this assumption always works, but there seems to
> > be a large overlap between idmapped aware callers and use cases
> > that will require sending events to a mount mark.
> >
> 
> The above "trick" is pretty silly as I believe Christian intends
> to fix all those call sites that pass init_user_ns.
If he does that we also should have the mountpoint there to use for
fsnotify, shouldn't we? :)
> > > Also I'm somewhat uneasy that it is random (from
> > > userspace POV) when path event is generated and when not (at least that's
> > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> >
> > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> >     debugfs for example
> > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> >     mount that users cannot set a mark on anyway and Christian has
> >     promised to propose the same change for cachefiles and ecryptfs,
> >     so I think it's not worth the churn in those call sites
> > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> >     trusting that new callers of vfs_create() will remember to add the high
> >     level hooks, so I prefer the existing behavior remains for such callers
> >
> 
> So I read your proposal the wrong way.
> You meant move fsnotify_create() up *without* passing mount context
> from overlayfs and friends.
Well, I was thinking that we could find appropriate mount context for
overlayfs or ecryptfs (which just shows how little I know about these
filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
mountpoint marks work for directory events at least for most filesystems, I
think that is OK as well. However it would be then needed to detect whether
a given filesystem actually supports mount marks for dir events and if not,
report error from fanotify_mark() instead of silently not generating
events.
> So yeh, I do think it is manageable. I think the best solution would be
> something along the lines of wrappers like the following:
> 
> static inline int vfs_mkdir(...)
> {
>         int error = __vfs_mkdir_nonotify(...);
>         if (!error)
>                 fsnotify_mkdir(dir, dentry);
>         return error;
> }
> 
> And then the few call sites that call the fsnotify_path_ hooks
> (i.e. in syscalls and perhaps later in nfsd) will call the
> __vfs_xxx_nonotify() variant.
Yes, that is OK with me. Or we could have something like:
static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
				   struct dentry *dentry, __u32 mask)
{
	if (!mnt) {
		fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
			 &dentry->d_name, NULL, 0);
	} else {
		struct path path = {
			.mnt = mnt,
			.dentry = d_find_any_alias(dir)
		};
		fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
			 &dentry->d_name, NULL, 0);
	}
}
static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
				  struct dentry *dentry)
{
        audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
        fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
}
static inline int vfs_mkdir(mnt, ...)
{
	int error = __vfs_mkdir_nonotify(...);
	if (!error)
		fsnotify_mkdir(mnt, dir, dentry);
}
And pass mnt to vfs_mkdir() for filesystems where we have it...
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: fsnotify path hooks
  2021-04-01 10:29                       ` Jan Kara
@ 2021-04-01 14:18                         ` Amir Goldstein
  2021-04-02  8:20                           ` Amir Goldstein
  2021-04-06  8:35                           ` Jan Kara
  0 siblings, 2 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-04-01 14:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields
On Thu, Apr 1, 2021 at 1:29 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 31-03-21 23:59:27, Amir Goldstein wrote:
> > On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > > > I don't think it matters much.
> > > > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > > > if Jan agrees to the fsnotify path hooks concept.
> > > >
> > > > I was looking at the patch and thinking about it for a few days already. I
> > > > think that generating fsnotify event later (higher up the stack where we
> > > > have mount information) is fine and a neat idea. I just dislike the hackery
> > > > with dentry flags.
> > >
> > > Me as well. I used this hack for fast POC.
> > >
> > > If we stick with the dual hooks approach, we will have to either pass a new
> > > argument to vfs helpers or use another trick:
> > >
> > > Convert all the many calls sites that were converted by Christian to:
> > >    vfs_XXX(&init_user_ns, ...
> > > because they do not have mount context, to:
> > >    vfs_XXX(NULL, ...
> > >
> > > Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> > > but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> > > A non-NULL mnt_userns arg means "path_notify" context.
> > > I have already POC code for passing mnt_userns to fsnotify hooks [1].
> > >
> > > I did not check if this assumption always works, but there seems to
> > > be a large overlap between idmapped aware callers and use cases
> > > that will require sending events to a mount mark.
> > >
> >
> > The above "trick" is pretty silly as I believe Christian intends
> > to fix all those call sites that pass init_user_ns.
>
> If he does that we also should have the mountpoint there to use for
> fsnotify, shouldn't we? :)
>
Yes, but that's not going to be hard for us anyway.
nfsd has mount context available via fhp for any access
and for overlayfs/ecryptfs we don't want the mount mark event.
I will explain why...
> > > > Also I'm somewhat uneasy that it is random (from
> > > > userspace POV) when path event is generated and when not (at least that's
> > > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> > >
> > > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> > >     debugfs for example
> > > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> > >     mount that users cannot set a mark on anyway and Christian has
> > >     promised to propose the same change for cachefiles and ecryptfs,
> > >     so I think it's not worth the churn in those call sites
> > > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> > >     trusting that new callers of vfs_create() will remember to add the high
> > >     level hooks, so I prefer the existing behavior remains for such callers
> > >
> >
> > So I read your proposal the wrong way.
> > You meant move fsnotify_create() up *without* passing mount context
> > from overlayfs and friends.
>
> Well, I was thinking that we could find appropriate mount context for
> overlayfs or ecryptfs (which just shows how little I know about these
> filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
> mountpoint marks work for directory events at least for most filesystems, I
> think that is OK as well. However it would be then needed to detect whether
> a given filesystem actually supports mount marks for dir events and if not,
> report error from fanotify_mark() instead of silently not generating
> events.
>
It's not about "filesystems that support mount marks".
mount marks will work perfectly well on overlayfs.
The thing is if you place a mount mark on the underlying store of
overlayfs (say xfs) and then files are created/deleted by the
overlayfs driver (in xfs) you wont get any events, because
overlayfs uses a private mount clone to perform underlying operations.
So while we CAN get the overlayfs underlying layer mount context
it is irrelevant because no user can setup a mount mark on that
private mount, so no need to bother calling the path hooks.
This is not the case with nfsd IMO.
With nfsd, when "exporting" a path to clients, nfsd is really exporting
a specific mount (and keeping that mount busy too).
It can even export whole mount topologies.
But then again, getting the mount context in every nfsd operation
is easy, there is an export context to client requests and the export
context has the exported path.
Therefore, nfsd is my only user using the vfs helpers that is expected
to call the fsnotify path hooks (other than syscalls).
> > So yeh, I do think it is manageable. I think the best solution would be
> > something along the lines of wrappers like the following:
> >
> > static inline int vfs_mkdir(...)
> > {
> >         int error = __vfs_mkdir_nonotify(...);
> >         if (!error)
> >                 fsnotify_mkdir(dir, dentry);
> >         return error;
> > }
> >
> > And then the few call sites that call the fsnotify_path_ hooks
> > (i.e. in syscalls and perhaps later in nfsd) will call the
> > __vfs_xxx_nonotify() variant.
>
> Yes, that is OK with me. Or we could have something like:
>
> static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
>                                    struct dentry *dentry, __u32 mask)
> {
>         if (!mnt) {
>                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
>                          &dentry->d_name, NULL, 0);
>         } else {
>                 struct path path = {
>                         .mnt = mnt,
>                         .dentry = d_find_any_alias(dir)
>                 };
>                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
>                          &dentry->d_name, NULL, 0);
>         }
> }
>
> static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
>                                   struct dentry *dentry)
> {
>         audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>
>         fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
> }
>
> static inline int vfs_mkdir(mnt, ...)
> {
>         int error = __vfs_mkdir_nonotify(...);
>         if (!error)
>                 fsnotify_mkdir(mnt, dir, dentry);
> }
>
I've done something similar to that. I think it's a bit cleaner,
but we can debate on the details later.
Pushed POC to branch fsnotify_path_hooks.
At the moment, create, delete, move and move_self are supported
for syscalls and helpers are ready for nfsd.
The method I used for rename hook is a bit different than
for other hooks, because other hooks are very easy to open code
while rename is complex so I create a helper for nfsd to call.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: fsnotify path hooks
  2021-04-01 14:18                         ` Amir Goldstein
@ 2021-04-02  8:20                           ` Amir Goldstein
  2021-04-06  8:35                           ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2021-04-02  8:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields, Tyler Hicks, Mimi Zohar, James Morris,
	Serge E. Hallyn
> This is not the case with nfsd IMO.
> With nfsd, when "exporting" a path to clients, nfsd is really exporting
> a specific mount (and keeping that mount busy too).
> It can even export whole mount topologies.
>
> But then again, getting the mount context in every nfsd operation
> is easy, there is an export context to client requests and the export
> context has the exported path.
>
> Therefore, nfsd is my only user using the vfs helpers that is expected
> to call the fsnotify path hooks (other than syscalls).
>
[...]
>
> I've done something similar to that. I think it's a bit cleaner,
> but we can debate on the details later.
> Pushed POC to branch fsnotify_path_hooks.
>
> At the moment, create, delete, move and move_self are supported
> for syscalls and helpers are ready for nfsd.
>
> The method I used for rename hook is a bit different than
> for other hooks, because other hooks are very easy to open code
> while rename is complex so I create a helper for nfsd to call.
>
I pushed the nfsd example code as well (only compile tested):
https://github.com/amir73il/linux/commits/fsnotify_path_hooks
Now all that is left is dealing with notify_change() and with
vfs_{set,remove}xattr().
Nice thing about vfs_{set,remove}xattr() is that they already have
several levels of __vfs_ helpers and nfsd already calls those, so
we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
helpers to the common vfs_xxx helpers and add fsnotify hooks to
the very few callers of __vfs_ helpers.
nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
__vfs_removexattr(), which does not generate event and does not
check permissions - it looks like an oversight.
The thing is, right now __vfs_setxattr_noperm() generates events,
but looking at all the security/* callers, it feels to me like those are
very internal operations and that "noperm" should also imply "nonotify".
To prove my point, all those callers call __vfs_removexattr() which
does NOT generate an event.
Also, I *think* the EVM setxattr is something that usually follows
another file data/metadata change, so some event would have been
generated by the original change anyway.
Mimi,
Do you have an opinion on that?
The question is if you think it is important for an inotify/fanotify watcher
that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
event when the IMA security blob changes.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: fsnotify path hooks
  2021-04-01 14:18                         ` Amir Goldstein
  2021-04-02  8:20                           ` Amir Goldstein
@ 2021-04-06  8:35                           ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2021-04-06  8:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields
On Thu 01-04-21 17:18:05, Amir Goldstein wrote:
> > > > > Also I'm somewhat uneasy that it is random (from
> > > > > userspace POV) when path event is generated and when not (at least that's
> > > > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> > > >
> > > > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> > > >     debugfs for example
> > > > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> > > >     mount that users cannot set a mark on anyway and Christian has
> > > >     promised to propose the same change for cachefiles and ecryptfs,
> > > >     so I think it's not worth the churn in those call sites
> > > > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> > > >     trusting that new callers of vfs_create() will remember to add the high
> > > >     level hooks, so I prefer the existing behavior remains for such callers
> > > >
> > >
> > > So I read your proposal the wrong way.
> > > You meant move fsnotify_create() up *without* passing mount context
> > > from overlayfs and friends.
> >
> > Well, I was thinking that we could find appropriate mount context for
> > overlayfs or ecryptfs (which just shows how little I know about these
> > filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
> > mountpoint marks work for directory events at least for most filesystems, I
> > think that is OK as well. However it would be then needed to detect whether
> > a given filesystem actually supports mount marks for dir events and if not,
> > report error from fanotify_mark() instead of silently not generating
> > events.
> >
> 
> It's not about "filesystems that support mount marks".
> mount marks will work perfectly well on overlayfs.
> 
> The thing is if you place a mount mark on the underlying store of
> overlayfs (say xfs) and then files are created/deleted by the
> overlayfs driver (in xfs) you wont get any events, because
> overlayfs uses a private mount clone to perform underlying operations.
OK, understood.
> So while we CAN get the overlayfs underlying layer mount context
> it is irrelevant because no user can setup a mount mark on that
> private mount, so no need to bother calling the path hooks.
> 
> This is not the case with nfsd IMO.
> With nfsd, when "exporting" a path to clients, nfsd is really exporting
> a specific mount (and keeping that mount busy too).
> It can even export whole mount topologies.
> 
> But then again, getting the mount context in every nfsd operation
> is easy, there is an export context to client requests and the export
> context has the exported path.
> 
> Therefore, nfsd is my only user using the vfs helpers that is expected
> to call the fsnotify path hooks (other than syscalls).
I agree.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-03-31 10:08       ` Christian Brauner
  2021-03-31 10:57         ` Amir Goldstein
@ 2021-04-08 11:44         ` Amir Goldstein
  2021-04-08 12:55           ` Christian Brauner
  1 sibling, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-04-08 11:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API
> One thing your patch
>
> commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> Author: Amir Goldstein <amir73il@gmail.com>
> Date:   Sat Mar 20 12:58:12 2021 +0200
>
>     fs: allow open by file handle inside userns
>
>     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
>     where most filesystems are mounted.
>
>     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
>     inside userns to open by file handle in filesystems that were
>     mounted inside that userns.
>
>     In addition, also allow open by handle in an idmapped mount, which is
>     mapped to the userns while verifying that the returned open file path
>     is under the root of the idmapped mount.
>
>     This is going to be needed for setting an fanotify mark on a filesystem
>     and watching events inside userns.
>
>     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> following and other callchains:
>
> exportfs_decode_fh()
> -> exportfs_decode_fh_raw()
>    -> lookup_one_len()
>       -> inode_permission(mnt_userns, ...)
>
> That's not a huge problem though I did all these changes for the
> overlayfs support for idmapped mounts I have in a branch from an earlier
> version of the idmapped mounts patchset. Basically lookup_one_len(),
> lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> the mnt_userns into account. I can rebase my change and send it for
> consideration next cycle. If you can live without the
> open_by_handle_at() support for now in this patchset (Which I think you
> said you could.) then it's not a blocker either. Sorry for the
> inconvenience.
>
Christian,
I think making exportfs_decode_fh() idmapped mount aware is not
enough, because when a dentry alias is found in dcache, none of
those lookup functions are called.
I think we will also need something like this:
https://github.com/amir73il/linux/commits/fhandle_userns
I factored-out a helper from nfsd_apcceptable() which implements
the "subtree_check" nfsd logic and uses it for open_by_handle_at().
I've also added a small patch to name_to_handle_at() with a UAPI
change that could make these changes usable by userspace nfs
server inside userns, but I have no demo nor tests for that and frankly,
I have little incentive to try and promote this UAPI change without
anybody asking for it...
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
@ 2021-04-08 12:55           ` Christian Brauner
  2021-04-08 14:15             ` J. Bruce Fields
  2021-04-08 15:34             ` Amir Goldstein
  0 siblings, 2 replies; 34+ messages in thread
From: Christian Brauner @ 2021-04-08 12:55 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, bfields
On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > One thing your patch
> >
> > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > Author: Amir Goldstein <amir73il@gmail.com>
> > Date:   Sat Mar 20 12:58:12 2021 +0200
> >
> >     fs: allow open by file handle inside userns
> >
> >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> >     where most filesystems are mounted.
> >
> >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> >     inside userns to open by file handle in filesystems that were
> >     mounted inside that userns.
> >
> >     In addition, also allow open by handle in an idmapped mount, which is
> >     mapped to the userns while verifying that the returned open file path
> >     is under the root of the idmapped mount.
> >
> >     This is going to be needed for setting an fanotify mark on a filesystem
> >     and watching events inside userns.
> >
> >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > following and other callchains:
> >
> > exportfs_decode_fh()
> > -> exportfs_decode_fh_raw()
> >    -> lookup_one_len()
> >       -> inode_permission(mnt_userns, ...)
> >
> > That's not a huge problem though I did all these changes for the
> > overlayfs support for idmapped mounts I have in a branch from an earlier
> > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > the mnt_userns into account. I can rebase my change and send it for
> > consideration next cycle. If you can live without the
> > open_by_handle_at() support for now in this patchset (Which I think you
> > said you could.) then it's not a blocker either. Sorry for the
> > inconvenience.
> >
> 
> Christian,
> 
> I think making exportfs_decode_fh() idmapped mount aware is not
> enough, because when a dentry alias is found in dcache, none of
> those lookup functions are called.
> 
> I think we will also need something like this:
> https://github.com/amir73il/linux/commits/fhandle_userns
> 
> I factored-out a helper from nfsd_apcceptable() which implements
> the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> 
> I've also added a small patch to name_to_handle_at() with a UAPI
> change that could make these changes usable by userspace nfs
> server inside userns, but I have no demo nor tests for that and frankly,
> I have little incentive to try and promote this UAPI change without
> anybody asking for it...
Ah, at first I was confused about why this would matter but it matters
because nfsd already implements a check of that sort directly in nfsd
independent of idmapped mounts:
https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
Afaict, an nfs server can't be mounted inside of userns right now. That
is something that folks from Netflix and from Kinvolk have been
interested in enabling. They also want the ability to use idmapped
mounts + nfs. Understandable that you don't want to drive this of
course. I'll sync with them about this.
Independent of that, I thought our last understanding was that you
wouldn't need to handle open_by_handle_at() for now.
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 12:55           ` Christian Brauner
@ 2021-04-08 14:15             ` J. Bruce Fields
  2021-04-08 15:54               ` Amir Goldstein
  2021-04-08 15:34             ` Amir Goldstein
  1 sibling, 1 reply; 34+ messages in thread
From: J. Bruce Fields @ 2021-04-08 14:15 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Linux API
On Thu, Apr 08, 2021 at 02:55:30PM +0200, Christian Brauner wrote:
> On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > One thing your patch
> > >
> > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > Author: Amir Goldstein <amir73il@gmail.com>
> > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > >
> > >     fs: allow open by file handle inside userns
> > >
> > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > >     where most filesystems are mounted.
> > >
> > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > >     inside userns to open by file handle in filesystems that were
> > >     mounted inside that userns.
> > >
> > >     In addition, also allow open by handle in an idmapped mount, which is
> > >     mapped to the userns while verifying that the returned open file path
> > >     is under the root of the idmapped mount.
> > >
> > >     This is going to be needed for setting an fanotify mark on a filesystem
> > >     and watching events inside userns.
> > >
> > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > following and other callchains:
> > >
> > > exportfs_decode_fh()
> > > -> exportfs_decode_fh_raw()
> > >    -> lookup_one_len()
> > >       -> inode_permission(mnt_userns, ...)
> > >
> > > That's not a huge problem though I did all these changes for the
> > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > the mnt_userns into account. I can rebase my change and send it for
> > > consideration next cycle. If you can live without the
> > > open_by_handle_at() support for now in this patchset (Which I think you
> > > said you could.) then it's not a blocker either. Sorry for the
> > > inconvenience.
> > >
> > 
> > Christian,
> > 
> > I think making exportfs_decode_fh() idmapped mount aware is not
> > enough, because when a dentry alias is found in dcache, none of
> > those lookup functions are called.
> > 
> > I think we will also need something like this:
> > https://github.com/amir73il/linux/commits/fhandle_userns
> > 
> > I factored-out a helper from nfsd_apcceptable() which implements
> > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > 
> > I've also added a small patch to name_to_handle_at() with a UAPI
> > change that could make these changes usable by userspace nfs
> > server inside userns, but I have no demo nor tests for that and frankly,
> > I have little incentive to try and promote this UAPI change without
> > anybody asking for it...
> 
> Ah, at first I was confused about why this would matter but it matters
> because nfsd already implements a check of that sort directly in nfsd
> independent of idmapped mounts:
> https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
Only in the NFSEXP_NOSUBTREECHECK case.  Taking a quick look, I think
Amir's not proposing a check like that by default, so, fine.  (I assume
problems with e.g. subtreechecking and cross-directory renames are
understood....)
> Afaict, an nfs server can't be mounted inside of userns right now. That
> is something that folks from Netflix and from Kinvolk have been
> interested in enabling. They also want the ability to use idmapped
> mounts + nfs. Understandable that you don't want to drive this of
> course. I'll sync with them about this.
I think those would both be reasonable things to do.
--b.
> Independent of that, I thought our last understanding was that you
> wouldn't need to handle open_by_handle_at() for now.
> 
> Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 12:55           ` Christian Brauner
  2021-04-08 14:15             ` J. Bruce Fields
@ 2021-04-08 15:34             ` Amir Goldstein
  2021-04-08 15:41               ` Christian Brauner
  1 sibling, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-04-08 15:34 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, J. Bruce Fields
On Thu, Apr 8, 2021 at 3:55 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > One thing your patch
> > >
> > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > Author: Amir Goldstein <amir73il@gmail.com>
> > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > >
> > >     fs: allow open by file handle inside userns
> > >
> > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > >     where most filesystems are mounted.
> > >
> > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > >     inside userns to open by file handle in filesystems that were
> > >     mounted inside that userns.
> > >
> > >     In addition, also allow open by handle in an idmapped mount, which is
> > >     mapped to the userns while verifying that the returned open file path
> > >     is under the root of the idmapped mount.
> > >
> > >     This is going to be needed for setting an fanotify mark on a filesystem
> > >     and watching events inside userns.
> > >
> > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > following and other callchains:
> > >
> > > exportfs_decode_fh()
> > > -> exportfs_decode_fh_raw()
> > >    -> lookup_one_len()
> > >       -> inode_permission(mnt_userns, ...)
> > >
> > > That's not a huge problem though I did all these changes for the
> > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > the mnt_userns into account. I can rebase my change and send it for
> > > consideration next cycle. If you can live without the
> > > open_by_handle_at() support for now in this patchset (Which I think you
> > > said you could.) then it's not a blocker either. Sorry for the
> > > inconvenience.
> > >
> >
> > Christian,
> >
> > I think making exportfs_decode_fh() idmapped mount aware is not
> > enough, because when a dentry alias is found in dcache, none of
> > those lookup functions are called.
> >
> > I think we will also need something like this:
> > https://github.com/amir73il/linux/commits/fhandle_userns
> >
> > I factored-out a helper from nfsd_apcceptable() which implements
> > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> >
> > I've also added a small patch to name_to_handle_at() with a UAPI
> > change that could make these changes usable by userspace nfs
> > server inside userns, but I have no demo nor tests for that and frankly,
> > I have little incentive to try and promote this UAPI change without
> > anybody asking for it...
>
> Ah, at first I was confused about why this would matter but it matters
> because nfsd already implements a check of that sort directly in nfsd
> independent of idmapped mounts:
> https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
>
The check is needed for slightly different reasons.
nfsd "subtree_check" feature explicitly meant to forbid access in case
file was moved "out of reach", for example, out of the export path.
Note the nfsd "subtree_check" affects both file handle encoding
(i.e. "connectable") and file handle decoding (i.e. nfsd_acceptable()).
open_by_handle_at() in idmapped mount needs to verify that the ancestry
inode owners can be mapped to the userns, because we already checked
that user has CAP_DAC_READ_SEARCH in userns, but it's nicer to do
the full inode_permission() check IMO.
> Afaict, an nfs server can't be mounted inside of userns right now. That
> is something that folks from Netflix and from Kinvolk have been
> interested in enabling. They also want the ability to use idmapped
> mounts + nfs. Understandable that you don't want to drive this of
> course. I'll sync with them about this.
>
> Independent of that, I thought our last understanding was that you
> wouldn't need to handle open_by_handle_at() for now.
>
I don't need it. But I realized that the fanotify_userns demo branch
I provided you is buggy in terms of security, so I wanted to give you
(or whoever wants to pursue this) a better reference.
It was one of those things that are easier to code than to explain ;-)
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 15:34             ` Amir Goldstein
@ 2021-04-08 15:41               ` Christian Brauner
  0 siblings, 0 replies; 34+ messages in thread
From: Christian Brauner @ 2021-04-08 15:41 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, J. Bruce Fields
On Thu, Apr 08, 2021 at 06:34:01PM +0300, Amir Goldstein wrote:
> On Thu, Apr 8, 2021 at 3:55 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > > One thing your patch
> > > >
> > > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > > Author: Amir Goldstein <amir73il@gmail.com>
> > > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > > >
> > > >     fs: allow open by file handle inside userns
> > > >
> > > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > > >     where most filesystems are mounted.
> > > >
> > > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > > >     inside userns to open by file handle in filesystems that were
> > > >     mounted inside that userns.
> > > >
> > > >     In addition, also allow open by handle in an idmapped mount, which is
> > > >     mapped to the userns while verifying that the returned open file path
> > > >     is under the root of the idmapped mount.
> > > >
> > > >     This is going to be needed for setting an fanotify mark on a filesystem
> > > >     and watching events inside userns.
> > > >
> > > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > > following and other callchains:
> > > >
> > > > exportfs_decode_fh()
> > > > -> exportfs_decode_fh_raw()
> > > >    -> lookup_one_len()
> > > >       -> inode_permission(mnt_userns, ...)
> > > >
> > > > That's not a huge problem though I did all these changes for the
> > > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > > the mnt_userns into account. I can rebase my change and send it for
> > > > consideration next cycle. If you can live without the
> > > > open_by_handle_at() support for now in this patchset (Which I think you
> > > > said you could.) then it's not a blocker either. Sorry for the
> > > > inconvenience.
> > > >
> > >
> > > Christian,
> > >
> > > I think making exportfs_decode_fh() idmapped mount aware is not
> > > enough, because when a dentry alias is found in dcache, none of
> > > those lookup functions are called.
> > >
> > > I think we will also need something like this:
> > > https://github.com/amir73il/linux/commits/fhandle_userns
> > >
> > > I factored-out a helper from nfsd_apcceptable() which implements
> > > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > >
> > > I've also added a small patch to name_to_handle_at() with a UAPI
> > > change that could make these changes usable by userspace nfs
> > > server inside userns, but I have no demo nor tests for that and frankly,
> > > I have little incentive to try and promote this UAPI change without
> > > anybody asking for it...
> >
> > Ah, at first I was confused about why this would matter but it matters
> > because nfsd already implements a check of that sort directly in nfsd
> > independent of idmapped mounts:
> > https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
> >
> 
> The check is needed for slightly different reasons.
> nfsd "subtree_check" feature explicitly meant to forbid access in case
> file was moved "out of reach", for example, out of the export path.
> Note the nfsd "subtree_check" affects both file handle encoding
> (i.e. "connectable") and file handle decoding (i.e. nfsd_acceptable()).
> 
> open_by_handle_at() in idmapped mount needs to verify that the ancestry
> inode owners can be mapped to the userns, because we already checked
> that user has CAP_DAC_READ_SEARCH in userns, but it's nicer to do
> the full inode_permission() check IMO.
> 
> > Afaict, an nfs server can't be mounted inside of userns right now. That
> > is something that folks from Netflix and from Kinvolk have been
> > interested in enabling. They also want the ability to use idmapped
> > mounts + nfs. Understandable that you don't want to drive this of
> > course. I'll sync with them about this.
> >
> > Independent of that, I thought our last understanding was that you
> > wouldn't need to handle open_by_handle_at() for now.
> >
> 
> I don't need it. But I realized that the fanotify_userns demo branch
> I provided you is buggy in terms of security, so I wanted to give you
> (or whoever wants to pursue this) a better reference.
> It was one of those things that are easier to code than to explain ;-)
And I highly appreciate it! :)
Please keep those branches for reference somewhere. I've noted them down
to make sure to look at them before I do anything!
Christian
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 14:15             ` J. Bruce Fields
@ 2021-04-08 15:54               ` Amir Goldstein
  2021-04-08 16:08                 ` J. Bruce Fields
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2021-04-08 15:54 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API,
	Miklos Szeredi
On Thu, Apr 8, 2021 at 5:15 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Apr 08, 2021 at 02:55:30PM +0200, Christian Brauner wrote:
> > On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > > One thing your patch
> > > >
> > > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > > Author: Amir Goldstein <amir73il@gmail.com>
> > > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > > >
> > > >     fs: allow open by file handle inside userns
> > > >
> > > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > > >     where most filesystems are mounted.
> > > >
> > > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > > >     inside userns to open by file handle in filesystems that were
> > > >     mounted inside that userns.
> > > >
> > > >     In addition, also allow open by handle in an idmapped mount, which is
> > > >     mapped to the userns while verifying that the returned open file path
> > > >     is under the root of the idmapped mount.
> > > >
> > > >     This is going to be needed for setting an fanotify mark on a filesystem
> > > >     and watching events inside userns.
> > > >
> > > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > > following and other callchains:
> > > >
> > > > exportfs_decode_fh()
> > > > -> exportfs_decode_fh_raw()
> > > >    -> lookup_one_len()
> > > >       -> inode_permission(mnt_userns, ...)
> > > >
> > > > That's not a huge problem though I did all these changes for the
> > > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > > the mnt_userns into account. I can rebase my change and send it for
> > > > consideration next cycle. If you can live without the
> > > > open_by_handle_at() support for now in this patchset (Which I think you
> > > > said you could.) then it's not a blocker either. Sorry for the
> > > > inconvenience.
> > > >
> > >
> > > Christian,
> > >
> > > I think making exportfs_decode_fh() idmapped mount aware is not
> > > enough, because when a dentry alias is found in dcache, none of
> > > those lookup functions are called.
> > >
> > > I think we will also need something like this:
> > > https://github.com/amir73il/linux/commits/fhandle_userns
> > >
> > > I factored-out a helper from nfsd_apcceptable() which implements
> > > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > >
> > > I've also added a small patch to name_to_handle_at() with a UAPI
> > > change that could make these changes usable by userspace nfs
> > > server inside userns, but I have no demo nor tests for that and frankly,
> > > I have little incentive to try and promote this UAPI change without
> > > anybody asking for it...
> >
> > Ah, at first I was confused about why this would matter but it matters
> > because nfsd already implements a check of that sort directly in nfsd
> > independent of idmapped mounts:
> > https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
>
> Only in the NFSEXP_NOSUBTREECHECK case.  Taking a quick look, I think
> Amir's not proposing a check like that by default, so, fine.  (I assume
> problems with e.g. subtreechecking and cross-directory renames are
> understood....)
>
They are understood to me :) but I didn't want to get into it, because it is
complicated to explain and I wasn't sure if anyone cared...
I started working on open_by_handle_at() in userns for fanotify and fanotify
mostly reports directory fhandle, so no issues with cross-directory renames.
In any case, fanotify never reports "connectable" non-dir file handles.
Because my proposed change ALSO makes it possible to start talking about
userspace nfs server inside userns (in case anyone cares), I wanted to lay
out the path towards a userspace "subtree_check" like solution.
Another thing I am contemplating is, if and when idmapped mount support
is added to overlayfs, we can store an additional "connectable" file handle
in the overlayfs index (whose key is the non-connectable fhandle) and
fix ovl_acceptable() similar to nfsd_acceptable() and then we will be able
to mount an overlayfs inside userns with nfs_export support.
I've included a two liner patch on the fhandle_userns branch to allow
overlayfs inside userns with nfs_export support in the case that
underlying filesystem was mounted inside userns, but that is not such
an interesting use case IMO.
Thanks,
Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: open_by_handle_at() in userns
  2021-04-08 15:54               ` Amir Goldstein
@ 2021-04-08 16:08                 ` J. Bruce Fields
  2021-04-08 16:48                   ` Frank Filz
  0 siblings, 1 reply; 34+ messages in thread
From: J. Bruce Fields @ 2021-04-08 16:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API,
	Miklos Szeredi
On Thu, Apr 08, 2021 at 06:54:52PM +0300, Amir Goldstein wrote:
> They are understood to me :) but I didn't want to get into it, because it is
> complicated to explain and I wasn't sure if anyone cared...
> 
> I started working on open_by_handle_at() in userns for fanotify and fanotify
> mostly reports directory fhandle, so no issues with cross-directory renames.
> In any case, fanotify never reports "connectable" non-dir file handles.
> 
> Because my proposed change ALSO makes it possible to start talking about
> userspace nfs server inside userns (in case anyone cares), I wanted to lay
> out the path towards a userspace "subtree_check" like solution.
We have to support subdirectory exports and subtree checking because we
already have, but, FWIW, if I were writing a new NFS server from
scratch, I don't think I would.  It's poorly understood, and the effort
would be better spent on more flexible storage management.
--b.
> Another thing I am contemplating is, if and when idmapped mount support
> is added to overlayfs, we can store an additional "connectable" file handle
> in the overlayfs index (whose key is the non-connectable fhandle) and
> fix ovl_acceptable() similar to nfsd_acceptable() and then we will be able
> to mount an overlayfs inside userns with nfs_export support.
> 
> I've included a two liner patch on the fhandle_userns branch to allow
> overlayfs inside userns with nfs_export support in the case that
> underlying filesystem was mounted inside userns, but that is not such
> an interesting use case IMO.
> 
> Thanks,
> Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
* RE: open_by_handle_at() in userns
  2021-04-08 16:08                 ` J. Bruce Fields
@ 2021-04-08 16:48                   ` Frank Filz
  0 siblings, 0 replies; 34+ messages in thread
From: Frank Filz @ 2021-04-08 16:48 UTC (permalink / raw)
  To: 'J. Bruce Fields', 'Amir Goldstein'
  Cc: 'Christian Brauner', 'Jan Kara',
	'linux-fsdevel', 'Linux API',
	'Miklos Szeredi'
> On Thu, Apr 08, 2021 at 06:54:52PM +0300, Amir Goldstein wrote:
> > They are understood to me :) but I didn't want to get into it, because
> > it is complicated to explain and I wasn't sure if anyone cared...
> >
> > I started working on open_by_handle_at() in userns for fanotify and
> > fanotify mostly reports directory fhandle, so no issues with
cross-directory
> renames.
> > In any case, fanotify never reports "connectable" non-dir file handles.
> >
> > Because my proposed change ALSO makes it possible to start talking
> > about userspace nfs server inside userns (in case anyone cares), I
> > wanted to lay out the path towards a userspace "subtree_check" like
solution.
> 
> We have to support subdirectory exports and subtree checking because we
> already have, but, FWIW, if I were writing a new NFS server from scratch,
I don't
> think I would.  It's poorly understood, and the effort would be better
spent on
> more flexible storage management.
Yea, nfs-ganesha does not attempt to support subtree checking. It will allow
subtree exports, but it makes no assurance that they are secure. One option
though that turns out to work well for them is btrfs subvols since each
subvol has its own st_dev device ID, it's really as if it's a separate
filesystem (and nfs-ganesha treats it as such).
I'm curious about the userns solution. I'm not familiar with it, but we have
issues with running nfs-ganesha inside containers due to the privileges it
requires to properly run.
> > Another thing I am contemplating is, if and when idmapped mount
> > support is added to overlayfs, we can store an additional
> > "connectable" file handle in the overlayfs index (whose key is the
> > non-connectable fhandle) and fix ovl_acceptable() similar to
> > nfsd_acceptable() and then we will be able to mount an overlayfs inside
userns
> with nfs_export support.
> >
> > I've included a two liner patch on the fhandle_userns branch to allow
> > overlayfs inside userns with nfs_export support in the case that
> > underlying filesystem was mounted inside userns, but that is not such
> > an interesting use case IMO.
> >
> > Thanks,
> > Amir.
^ permalink raw reply	[flat|nested] 34+ messages in thread
end of thread, other threads:[~2021-04-08 17:10 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
2021-03-30  7:31 ` Christian Brauner
2021-03-30  9:31   ` Amir Goldstein
2021-03-30 16:24     ` Amir Goldstein
2021-03-31 10:08       ` Christian Brauner
2021-03-31 10:57         ` Amir Goldstein
2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
2021-04-08 12:55           ` Christian Brauner
2021-04-08 14:15             ` J. Bruce Fields
2021-04-08 15:54               ` Amir Goldstein
2021-04-08 16:08                 ` J. Bruce Fields
2021-04-08 16:48                   ` Frank Filz
2021-04-08 15:34             ` Amir Goldstein
2021-04-08 15:41               ` Christian Brauner
2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
2021-03-30 12:33   ` Amir Goldstein
2021-03-30 12:53     ` Christian Brauner
2021-03-30 12:55       ` Christian Brauner
2021-03-30 13:54       ` Amir Goldstein
2021-03-30 14:17         ` Christian Brauner
2021-03-30 14:56           ` Amir Goldstein
2021-03-31  9:46             ` Christian Brauner
2021-03-31 11:29               ` Amir Goldstein
2021-03-31 12:17                 ` Christian Brauner
2021-03-31 12:59                   ` Amir Goldstein
2021-03-31 12:54                 ` Jan Kara
2021-03-31 14:06                   ` Amir Goldstein
2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
2021-04-01 10:29                       ` Jan Kara
2021-04-01 14:18                         ` Amir Goldstein
2021-04-02  8:20                           ` Amir Goldstein
2021-04-06  8:35                           ` Jan Kara
2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
2021-03-30 12:20 ` Amir Goldstein
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).