* [RFC PATCH v2 0/3] fanotify HSM events for directories
@ 2025-06-04 16:09 Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 1/3] fanotify: allow creating FAN_PRE_ACCESS events on directories Amir Goldstein
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-04 16:09 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
Jan,
In v1 there was only patch 1 [1] to allow FAN_PRE_ACCESS events
on readdir (with FAN_ONDIR).
Following your feedback on v1, v2 adds support for FAN_PATH_ACCESS
event so that a non-populated directory could be populted either on
first readdir or on first lookup.
I am still tagging this as RFC for two semi-related reasons:
1) In my original draft of man-page for FAN_PATH_ACCESS [2],
I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
requires and is defined as:
"Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
FAN_REPORT_DFID_NAME to report the names of the looked up files along
with O_PATH file descriptos in the new path lookup events."
I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
your opinion.
The basic HSM (as implemented in my POC) does not need to get the lookup
name in the event - it populates dir on first readdir or lookup access.
So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
could be added later per demand.
2) Current code does not generate FAN_PRE_ACCESS from vfs internal
lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
This is related to the API of reporting an O_PATH event->fd for
FAN_PATH_ACCESS event, which requires a mount.
If we decide that we want to support FAN_PATH_ACCESS from all the
path-less lookup_one*() helpers, then we need to support reporting
FAN_PATH_ACCESS event with directory fid.
If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
on its "work" mount - the path via which directories are populated.
There may be a middle ground:
- Pass optional path arg to __lookup_slow() (i.e. from walk_component())
- Move fsnotify hook into __lookup_slow()
- fsnotify_lookup_perm() passes optional path data to fsnotify()
- fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
path data
This way, if HSM is enabled on an sb and not ignored on specific dir
after it was populated, path lookup from syscall will trigger
FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
non-populated directories.
Supporting populate events from overalyfs/nfsd could be implemented
later per demand by reporting directory fid instead of O_PATH fd.
If you think that is worth checking, I can prepare a patch for the above
so we can expose it to performance regression bots.
Better yet, if you have no issues with the implementation in this
patch set, maybe let it soak in for_next/for_testing as is to make
sure that it does not already introduce any performance regressions.
Thoughts?
Amir.
Changes since v1:
- Jan's rewrite of patch 1
- Add support for O_PATH event->fd
- Add FAN_PATH_ACCESS event
[1] https://lore.kernel.org/all/20250402062707.1637811-1-amir73il@gmail.com/
[2] https://github.com/amir73il/man-pages/commits/fan_pre_path
[3] https://github.com/amir73il/ltp/commits/fan_hsm/
Amir Goldstein (2):
fanotify: allow O_PATH flag in event_f_flags
fanotify: introduce FAN_PATH_ACCESS event
Jan Kara (1):
fanotify: allow creating FAN_PRE_ACCESS events on directories
fs/namei.c | 70 +++++++++++++++++++++++++++---
fs/notify/fanotify/fanotify.c | 11 +++--
fs/notify/fanotify/fanotify_user.c | 11 +----
fs/notify/fsnotify.c | 2 +-
fs/open.c | 4 +-
include/linux/fanotify.h | 2 +-
include/linux/fs.h | 10 +++--
include/linux/fsnotify.h | 28 +++++++++++-
include/linux/fsnotify_backend.h | 4 +-
include/linux/namei.h | 3 ++
include/uapi/linux/fanotify.h | 2 +
11 files changed, 116 insertions(+), 31 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH v2 1/3] fanotify: allow creating FAN_PRE_ACCESS events on directories
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
@ 2025-06-04 16:09 ` Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 2/3] fanotify: allow O_PATH flag in event_f_flags Amir Goldstein
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-04 16:09 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
From: Jan Kara <jack@suse.cz>
Like files, a FAN_PRE_ACCESS event will be generated before every
read access to directory, that is on readdir(3).
Unlike files, there will be no range info record following a
FAN_PRE_ACCESS event, because the range of access on a directory
is not well defined.
FAN_PRE_ACCESS events on readdir are only generated when user opts-in
with FAN_ONDIR request in event mask and the FAN_PRE_ACCESS events on
readdir report the FAN_ONDIR flag, so user can differentiate them from
event on read.
An HSM service is expected to use those events to populate directories
from slower tier on first readdir access. Having to range info means
that the entire directory will need to be populated on the first
readdir() call.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250402062707.1637811-1-amir73il@gmail.com
---
fs/notify/fanotify/fanotify.c | 8 +++++---
fs/notify/fanotify/fanotify_user.c | 9 ---------
2 files changed, 5 insertions(+), 12 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 3083643b864b..7c9a2614e715 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -303,8 +303,7 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
struct inode *dir)
{
__u32 marks_mask = 0, marks_ignore_mask = 0;
- __u32 test_mask, user_mask = FANOTIFY_OUTGOING_EVENTS |
- FANOTIFY_EVENT_FLAGS;
+ __u32 test_mask, user_mask = FANOTIFY_OUTGOING_EVENTS;
const struct path *path = fsnotify_data_path(data, data_type);
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
struct fsnotify_mark *mark;
@@ -356,6 +355,9 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
* the child entry name information, we report FAN_ONDIR for mkdir/rmdir
* so user can differentiate them from creat/unlink.
*
+ * For pre-content events we report FAN_ONDIR for readdir, so user can
+ * differentiate them from read.
+ *
* For backward compatibility and consistency, do not report FAN_ONDIR
* to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
* to user in fid mode for all event types.
@@ -364,7 +366,7 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
* fanotify_alloc_event() when group is reporting fid as indication
* that event happened on child.
*/
- if (fid_mode) {
+ if (fid_mode || test_mask & FANOTIFY_PRE_CONTENT_EVENTS) {
/* Do not report event flags without any event */
if (!(test_mask & ~FANOTIFY_EVENT_FLAGS))
return 0;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index b192ee068a7a..9d7b3a610b4a 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1411,11 +1411,6 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
return -EEXIST;
- /* For now pre-content events are not generated for directories */
- mask |= fsn_mark->mask;
- if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
- return -EEXIST;
-
return 0;
}
@@ -1951,10 +1946,6 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
if (mask & FAN_RENAME && !(fid_mode & FAN_REPORT_NAME))
return -EINVAL;
- /* Pre-content events are not currently generated for directories. */
- if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
- return -EINVAL;
-
if (mark_cmd == FAN_MARK_FLUSH) {
fsnotify_clear_marks_by_group(group, obj_type);
return 0;
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [RFC PATCH v2 2/3] fanotify: allow O_PATH flag in event_f_flags
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 1/3] fanotify: allow creating FAN_PRE_ACCESS events on directories Amir Goldstein
@ 2025-06-04 16:09 ` Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 3/3] fanotify: introduce FAN_PATH_ACCESS event Amir Goldstein
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-04 16:09 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
Many of the use cases for fanotify events only use the event->fd
to resolve the path of the event target object.
Using O_PATH for this purpose is more efficient and prevents
exposing a readable fd of the object if that is not required.
To be able to distinguish a user opened O_PATH fd, from fanotify
provided O_PATH event->fd, do not explicitly set FMODE_NONOTIFY
on open of all O_PATH fds, do not override FMODE_NONOTIFY when setting
FMODE_PATH in do_dentry_open() and check explicitly for FMODE_PATH in
fsnotify_file() to suppress FAN_CLOSE events on close of O_PATH fds.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify_user.c | 2 +-
fs/open.c | 4 ++--
include/linux/fs.h | 10 ++++++----
include/linux/fsnotify.h | 2 +-
4 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9d7b3a610b4a..fd2906a8a15e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -105,7 +105,7 @@ static void __init fanotify_sysctls_init(void)
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
O_ACCMODE | O_APPEND | O_NONBLOCK | \
__O_SYNC | O_DSYNC | O_CLOEXEC | \
- O_LARGEFILE | O_NOATIME )
+ O_LARGEFILE | O_NOATIME | O_PATH)
extern const struct fsnotify_ops fanotify_fsnotify_ops;
diff --git a/fs/open.c b/fs/open.c
index 7828234a7caa..4664240f4c5e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -913,8 +913,8 @@ static int do_dentry_open(struct file *f,
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
- file_set_fsnotify_mode(f, FMODE_NONOTIFY);
+ f->f_mode = FMODE_PATH | FMODE_OPENED |
+ FMODE_FSNOTIFY(f->f_mode);
f->f_op = &empty_fops;
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aad2fb940a45..098456235cb5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -210,14 +210,16 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_FSNOTIFY_MASK \
(FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)
+#define FMODE_FSNOTIFY(mode) \
+ (mode & FMODE_FSNOTIFY_MASK)
#define FMODE_FSNOTIFY_NONE(mode) \
- ((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
+ (FMODE_FSNOTIFY(mode) == FMODE_NONOTIFY)
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
#define FMODE_FSNOTIFY_PERM(mode) \
- ((mode & FMODE_FSNOTIFY_MASK) == 0 || \
- (mode & FMODE_FSNOTIFY_MASK) == (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM))
+ (FMODE_FSNOTIFY(mode) == 0 || \
+ FMODE_FSNOTIFY(mode) == (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM))
#define FMODE_FSNOTIFY_HSM(mode) \
- ((mode & FMODE_FSNOTIFY_MASK) == 0)
+ (FMODE_FSNOTIFY(mode) == 0)
#else
#define FMODE_FSNOTIFY_PERM(mode) 0
#define FMODE_FSNOTIFY_HSM(mode) 0
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 454d8e466958..175149167642 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -121,7 +121,7 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
* FMODE_PATH fds (involves open & close events) as they are just
* handle creation / destruction events and not "real" file events.
*/
- if (FMODE_FSNOTIFY_NONE(file->f_mode))
+ if (FMODE_FSNOTIFY_NONE(file->f_mode) || (file->f_mode & FMODE_PATH))
return 0;
return fsnotify_path(&file->f_path, mask);
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [RFC PATCH v2 3/3] fanotify: introduce FAN_PATH_ACCESS event
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 1/3] fanotify: allow creating FAN_PRE_ACCESS events on directories Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 2/3] fanotify: allow O_PATH flag in event_f_flags Amir Goldstein
@ 2025-06-04 16:09 ` Amir Goldstein
2025-06-04 16:35 ` [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
2025-06-10 13:49 ` Jan Kara
4 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-04 16:09 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
The exiting FAN_PRE_ACCESS pre-content event on a directory can be used
to populate a directory before ian application is reading its content.
The new FAN_PATH_ACCESS pre-content event adds the ability to populate
a directory before lookup in the directory index.
The hook is called from lookup_slow_notify() when travesing a directory
during path lookup walk. Because it is only relevant for directories,
the FAN_PATH_ACCESS event is generated regardless of FAN_ONDIR flag
in mark mask.
The permission hook cannot be sent to usersapce while lookup is in
RCU walk, so if dcache entries already exist, it is assumed that
FAN_PATH_ACCESS events are not needed, because an event was already
generated when dcache was populated.
To prevent a deadlock, when a directory access monitoring process uses
the event->fd obtained in a FAN_PATH_ACCESS event to lookup a path,
relative to event->fd, this lookup does not generate recursive
FAN_PATH_ACCESS events
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/namei.c | 70 ++++++++++++++++++++++++++++----
fs/notify/fanotify/fanotify.c | 3 +-
fs/notify/fsnotify.c | 2 +-
include/linux/fanotify.h | 2 +-
include/linux/fsnotify.h | 26 ++++++++++++
include/linux/fsnotify_backend.h | 4 +-
include/linux/namei.h | 3 ++
include/uapi/linux/fanotify.h | 2 +
8 files changed, 101 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 4bb889fc980b..b41fedc0f11f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -657,6 +657,14 @@ struct nameidata {
#define ND_ROOT_PRESET 1
#define ND_ROOT_GRABBED 2
#define ND_JUMPED 4
+#define ND_NONOTIFY 8
+
+static void nd_set_jumped(struct nameidata *nd)
+{
+ nd->state |= ND_JUMPED;
+ /* Maybe crossed sb/mount so need to re-test sb/mount watches */
+ nd->state &= ~ND_NONOTIFY;
+}
static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
@@ -1051,7 +1059,7 @@ static int nd_jump_root(struct nameidata *nd)
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
return 0;
}
@@ -1079,7 +1087,7 @@ int nd_jump_link(const struct path *path)
path_put(&nd->path);
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
return 0;
err:
@@ -1594,7 +1602,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
if (mounted) {
path->mnt = &mounted->mnt;
dentry = path->dentry = mounted->mnt.mnt_root;
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
nd->next_seq = read_seqcount_begin(&dentry->d_seq);
flags = dentry->d_flags;
// makes sure that non-RCU pathwalk could reach
@@ -1634,7 +1642,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
if (unlikely(nd->flags & LOOKUP_NO_XDEV))
ret = -EXDEV;
else
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
}
if (unlikely(ret)) {
dput(path->dentry);
@@ -1836,6 +1844,47 @@ static struct dentry *lookup_slow(const struct qstr *name,
return res;
}
+static bool lookup_should_notify(struct nameidata *nd)
+{
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+ /* Was dirfd obtained from fanotify event->fd? */
+ if (unlikely(nd->flags & LOOKUP_NONOTIFY))
+ return false;
+
+ /*
+ * An open coded version of the fsnotify mask checks in fsnotify()
+ * optimized to check the sb/mount marks only once per jump.
+ */
+ if (unlikely(nd->inode->i_fsnotify_mask & FS_PATH_ACCESS))
+ return true;
+
+ if (likely(nd->state & ND_NONOTIFY))
+ return false;
+
+ if (unlikely(nd->inode->i_sb->s_fsnotify_mask & FS_PATH_ACCESS) ||
+ unlikely(real_mount(nd->path.mnt)->mnt_fsnotify_mask &
+ FS_PATH_ACCESS))
+ return true;
+
+ /* Avoid testing sb/mount masks again until nd_set_jumped() */
+ nd->state |= ND_NONOTIFY;
+#endif
+ return false;
+}
+
+
+static struct dentry *lookup_slow_notify(struct nameidata *nd)
+{
+ if (lookup_should_notify(nd)) {
+ int err = fsnotify_lookup_perm(&nd->path, nd->inode, &nd->last);
+
+ if (unlikely(err))
+ return ERR_PTR(err);
+ }
+
+ return lookup_slow(&nd->last, nd->path.dentry, nd->flags);
+}
+
static inline int may_lookup(struct mnt_idmap *idmap,
struct nameidata *restrict nd)
{
@@ -2135,7 +2184,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
- dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
+ dentry = lookup_slow_notify(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
}
@@ -2461,7 +2510,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
switch(lastword) {
case LAST_WORD_IS_DOTDOT:
nd->last_type = LAST_DOTDOT;
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
break;
case LAST_WORD_IS_DOT:
@@ -2541,7 +2590,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->seq = nd->next_seq = 0;
nd->flags = flags;
- nd->state |= ND_JUMPED;
+ nd_set_jumped(nd);
nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
@@ -2608,6 +2657,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (*s && unlikely(!d_can_lookup(dentry)))
return ERR_PTR(-ENOTDIR);
+ /*
+ * dfd was opened by fanotify and lookup_slow_notify()
+ * shouldn't generate fanotify events.
+ */
+ if (FMODE_FSNOTIFY_NONE(fd_file(f)->f_mode))
+ nd->flags |= LOOKUP_NONOTIFY;
+
nd->path = fd_file(f)->f_path;
if (flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 7c9a2614e715..df9b10d717d2 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -952,8 +952,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
+ BUILD_BUG_ON(FAN_PATH_ACCESS != FS_PATH_ACCESS);
- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 24);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 25);
mask = fanotify_group_event_mask(group, iter_info, &match_mask,
mask, data, data_type, dir);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index e2b4f17a48bb..004a24036a12 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -745,7 +745,7 @@ static __init int fsnotify_init(void)
{
int ret;
- BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 26);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 27);
ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 879cff5eccd4..0eaae7a46a33 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -93,7 +93,7 @@
#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
FAN_ACCESS_PERM)
/* Pre-content events can be used to fill file content */
-#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
+#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS | FAN_PATH_ACCESS)
/* Events that require a permission response from user */
#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 175149167642..44b8496730a3 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -227,6 +227,25 @@ static inline int fsnotify_open_perm(struct file *file)
return fsnotify_path(&file->f_path, FS_OPEN_PERM);
}
+/*
+ * fsnotify_lookup_perm - permission hook before path access
+ */
+static inline int fsnotify_lookup_perm(struct path *path,
+ struct inode *dir,
+ const struct qstr *name)
+{
+ if (!(dir->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
+ !fsnotify_sb_has_priority_watchers(dir->i_sb,
+ FSNOTIFY_PRIO_PRE_CONTENT))
+ return 0;
+
+ if (WARN_ON_ONCE(!S_ISDIR(dir->i_mode)))
+ return 0;
+
+ return fsnotify(FS_PATH_ACCESS, path, FSNOTIFY_EVENT_PATH,
+ dir, name, NULL, 0);
+}
+
#else
static inline void file_set_fsnotify_mode_from_watchers(struct file *file)
{
@@ -258,6 +277,13 @@ static inline int fsnotify_open_perm(struct file *file)
{
return 0;
}
+
+static inline int fsnotify_lookup_perm(struct path *path,
+ struct inode *dir,
+ const struct qstr *name)
+{
+ return 0;
+}
#endif
/*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index d4034ddaf392..c9a71cb97f3b 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -43,6 +43,7 @@
#define FS_OPEN_EXEC 0x00001000 /* File was opened for exec */
#define FS_UNMOUNT 0x00002000 /* inode on umount fs */
+
#define FS_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
#define FS_ERROR 0x00008000 /* Filesystem Error (fanotify) */
@@ -58,6 +59,7 @@
/* #define FS_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
#define FS_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+#define FS_PATH_ACCESS 0x00200000 /* Pre-path access hook */
#define FS_MNT_ATTACH 0x01000000 /* Mount was attached */
#define FS_MNT_DETACH 0x02000000 /* Mount was detached */
@@ -91,7 +93,7 @@
#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
FS_ACCESS_PERM)
/* Pre-content events can be used to fill file content */
-#define FSNOTIFY_PRE_CONTENT_EVENTS (FS_PRE_ACCESS)
+#define FSNOTIFY_PRE_CONTENT_EVENTS (FS_PRE_ACCESS | FS_PATH_ACCESS)
#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
FSNOTIFY_PRE_CONTENT_EVENTS)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5d085428e471..f084b5493f02 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -49,6 +49,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
/* 3 spare bits for scoping */
+/* dirfd was opened by fanotify and lookup shouldn't generate fanotify events */
+#define LOOKUP_NONOTIFY 0x4000000
+
extern int path_pts(struct path *path);
extern int user_path_at(int, const char __user *, unsigned, struct path *);
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index e710967c7c26..190c2258f623 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -28,6 +28,8 @@
/* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
#define FAN_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+#define FAN_PATH_ACCESS 0x00200000 /* Pre-path access hook */
+
#define FAN_MNT_ATTACH 0x01000000 /* Mount was attached */
#define FAN_MNT_DETACH 0x02000000 /* Mount was detached */
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
` (2 preceding siblings ...)
2025-06-04 16:09 ` [RFC PATCH v2 3/3] fanotify: introduce FAN_PATH_ACCESS event Amir Goldstein
@ 2025-06-04 16:35 ` Amir Goldstein
2025-06-10 13:49 ` Jan Kara
4 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-04 16:35 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
On Wed, Jun 4, 2025 at 6:09 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Jan,
>
> In v1 there was only patch 1 [1] to allow FAN_PRE_ACCESS events
> on readdir (with FAN_ONDIR).
>
> Following your feedback on v1, v2 adds support for FAN_PATH_ACCESS
> event so that a non-populated directory could be populted either on
> first readdir or on first lookup.
>
> I am still tagging this as RFC for two semi-related reasons:
>
> 1) In my original draft of man-page for FAN_PATH_ACCESS [2],
> I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
> requires and is defined as:
> "Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
> FAN_REPORT_DFID_NAME to report the names of the looked up files along
> with O_PATH file descriptos in the new path lookup events."
>
> I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
> your opinion.
>
> The basic HSM (as implemented in my POC) does not need to get the lookup
> name in the event - it populates dir on first readdir or lookup access.
> So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
> could be added later per demand.
>
> 2) Current code does not generate FAN_PRE_ACCESS from vfs internal
> lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
> This is related to the API of reporting an O_PATH event->fd for
> FAN_PATH_ACCESS event, which requires a mount.
>
> If we decide that we want to support FAN_PATH_ACCESS from all the
> path-less lookup_one*() helpers, then we need to support reporting
> FAN_PATH_ACCESS event with directory fid.
>
> If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> on its "work" mount - the path via which directories are populated.
>
> There may be a middle ground:
> - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> - Move fsnotify hook into __lookup_slow()
> - fsnotify_lookup_perm() passes optional path data to fsnotify()
> - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> path data
>
> This way, if HSM is enabled on an sb and not ignored on specific dir
> after it was populated, path lookup from syscall will trigger
> FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> non-populated directories.
>
> Supporting populate events from overalyfs/nfsd could be implemented
> later per demand by reporting directory fid instead of O_PATH fd.
>
> If you think that is worth checking, I can prepare a patch for the above
> so we can expose it to performance regression bots.
>
> Better yet, if you have no issues with the implementation in this
> patch set, maybe let it soak in for_next/for_testing as is to make
> sure that it does not already introduce any performance regressions.
>
> Thoughts?
>
> Amir.
>
> Changes since v1:
> - Jan's rewrite of patch 1
> - Add support for O_PATH event->fd
> - Add FAN_PATH_ACCESS event
>
> [1] https://lore.kernel.org/all/20250402062707.1637811-1-amir73il@gmail.com/
> [2] https://github.com/amir73il/man-pages/commits/fan_pre_path
Forgot to mention of course that I wrote an LTP test for both events on
lookup and readdir:
> [3] https://github.com/amir73il/ltp/commits/fan_hsm/
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
` (3 preceding siblings ...)
2025-06-04 16:35 ` [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
@ 2025-06-10 13:49 ` Jan Kara
2025-06-10 15:25 ` Amir Goldstein
4 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-06-10 13:49 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel
Hi Amir!
On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> In v1 there was only patch 1 [1] to allow FAN_PRE_ACCESS events
> on readdir (with FAN_ONDIR).
>
> Following your feedback on v1, v2 adds support for FAN_PATH_ACCESS
> event so that a non-populated directory could be populted either on
> first readdir or on first lookup.
OK, it's good that now we have a bit more wider context for the discussion
:). First, when reading this I've started wondering whether we need both
FAN_PRE_ACCESS on directories and FAN_PATH_ACCESS (only on directories).
Firstly, I don't love adding more use to the FAN_ONDIR flag when creating
marks because you can only specify you want FAN_PRE_ACCESS on files,
FAN_PRE_ACCESS on files & dirs but there's no way to tell you care only
about FAN_PRE_ACCESS on dirs. You have to filter that when receiving
events. Secondly, the distinction between FAN_PRE_ACCESS and
FAN_PATH_ACCESS is somewhat weak - it's kind of similar to the situation
with regular files when we notify about access to the whole file vs only to
a specific range. So what if we had an event like FAN_PRE_DIR_ACCESS that
would report looked up name on lookup and nothing on readdir meaning you
need to fetch everything?
> I am still tagging this as RFC for two semi-related reasons:
>
> 1) In my original draft of man-page for FAN_PATH_ACCESS [2],
> I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
> requires and is defined as:
> "Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
> FAN_REPORT_DFID_NAME to report the names of the looked up files along
> with O_PATH file descriptos in the new path lookup events."
>
> I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
> your opinion.
>
> The basic HSM (as implemented in my POC) does not need to get the lookup
> name in the event - it populates dir on first readdir or lookup access.
> So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
> could be added later per demand.
The question here is what a real user is going to do? I know Meta guys
don't care about directory events for their usecase. You seem to care about
them so I presume you have some production use in mind? How's that going to
work? Because if I should guess I'd think that someone asks for the name
being looked up sooner rather than later because for large dirs not having
to fetch everything on lookup would be a noticeable win... And if that's
the case then IMHO we should design (but not necessarily fully implement)
API that's the simplest and most logical when this is added.
This ties to the discussion how the FAN_PATH_ACCESS / FAN_PRE_DIR_ACCESS
event is going to report the name.
> 2) Current code does not generate FAN_PRE_ACCESS from vfs internal
> lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
> This is related to the API of reporting an O_PATH event->fd for
> FAN_PATH_ACCESS event, which requires a mount.
AFAIU this means that you could not NFS export a filesystem that is HSM
managed and you could not use HSM managed filesystem to compose overlayfs.
I don't find either of those a critical feature but OTOH it would be nice
if the API didn't restrict us from somehow implementing this in the future.
> If we decide that we want to support FAN_PATH_ACCESS from all the
> path-less lookup_one*() helpers, then we need to support reporting
> FAN_PATH_ACCESS event with directory fid.
>
> If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> on its "work" mount - the path via which directories are populated.
>
> There may be a middle ground:
> - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> - Move fsnotify hook into __lookup_slow()
> - fsnotify_lookup_perm() passes optional path data to fsnotify()
> - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> path data
>
> This way, if HSM is enabled on an sb and not ignored on specific dir
> after it was populated, path lookup from syscall will trigger
> FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> non-populated directories.
OK, but how will this manifest from the user POV? If we have say nfs
exported filesystem that is HSM managed then there would have to be some
knowledge in nfsd to know how to access needed files so that HSM can pull
them? I guess I'm missing the advantage of this middle-ground solution...
> Supporting populate events from overalyfs/nfsd could be implemented
> later per demand by reporting directory fid instead of O_PATH fd.
>
> If you think that is worth checking, I can prepare a patch for the above
> so we can expose it to performance regression bots.
>
> Better yet, if you have no issues with the implementation in this
> patch set, maybe let it soak in for_next/for_testing as is to make
> sure that it does not already introduce any performance regressions.
>
> Thoughts?
If I should summarize the API situation: If we ever want to support HSM +
NFS export / overlayfs, we must implement support for pre-content events
with FID (DFID + name to be precise). If we want to support HSM events on
lookup with looked up name, we don't have to go for full DFID + name but we
must at least add additional info with the name to the event. Also if we go
for reporting directory pre-content events with standard events, you want
to add support for returning O_PATH fds for better efficiency and the code
to handle FMODE_NONOTIFY directory fds in path lookup.
Frankly seeing all this I think that going for DFID + name events for
directory HSM events from the start may be the cleanest choice long term
because then we'll have one way how to access the directory HSM
functionality with possibility of extensions without having to add
different APIs for it. We'd just have to implement replying to FID events
because we won't have fd to identify event we reply to so that will need
some thought.
What are your thoughts? Am I missing something?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-10 13:49 ` Jan Kara
@ 2025-06-10 15:25 ` Amir Goldstein
2025-06-14 17:03 ` Amir Goldstein
2025-06-16 9:07 ` Jan Kara
0 siblings, 2 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-10 15:25 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
Hi Jan!
Thanks for taking the time to read my long email ;)
> On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > In v1 there was only patch 1 [1] to allow FAN_PRE_ACCESS events
> > on readdir (with FAN_ONDIR).
> >
> > Following your feedback on v1, v2 adds support for FAN_PATH_ACCESS
> > event so that a non-populated directory could be populted either on
> > first readdir or on first lookup.
>
> OK, it's good that now we have a bit more wider context for the discussion
> :). First, when reading this I've started wondering whether we need both
> FAN_PRE_ACCESS on directories and FAN_PATH_ACCESS (only on directories).
> Firstly, I don't love adding more use to the FAN_ONDIR flag when creating
> marks because you can only specify you want FAN_PRE_ACCESS on files,
> FAN_PRE_ACCESS on files & dirs but there's no way to tell you care only
> about FAN_PRE_ACCESS on dirs. You have to filter that when receiving
> events. Secondly, the distinction between FAN_PRE_ACCESS and
> FAN_PATH_ACCESS is somewhat weak - it's kind of similar to the situation
> with regular files when we notify about access to the whole file vs only to
> a specific range. So what if we had an event like FAN_PRE_DIR_ACCESS that
> would report looked up name on lookup and nothing on readdir meaning you
> need to fetch everything?
>
This makes a lot of sense to me. and I also like the suggested event name.
Another advantage is that FAN_PRE_ACCESS can always expect a range
(as documented)
> > I am still tagging this as RFC for two semi-related reasons:
> >
> > 1) In my original draft of man-page for FAN_PATH_ACCESS [2],
> > I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
> > requires and is defined as:
> > "Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
> > FAN_REPORT_DFID_NAME to report the names of the looked up files along
> > with O_PATH file descriptos in the new path lookup events."
> >
> > I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
> > your opinion.
> >
> > The basic HSM (as implemented in my POC) does not need to get the lookup
> > name in the event - it populates dir on first readdir or lookup access.
> > So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
> > could be added later per demand.
>
> The question here is what a real user is going to do? I know Meta guys
> don't care about directory events for their usecase. You seem to care about
> them so I presume you have some production use in mind?
Yes we have had it in production for a long time -
You can instantaneously create a lazy clone of the entire "cloud fs" locally.
Directories get populated with sparse files on first access.
Sparse files get populated with data on first IO access.
It is essentially the same use case as Meta and many other similar users.
The need for directory populate is just a question of scale -
How much does it take to create a "metadata clone" or the remote fs copy
and create all the sparse files.
At some point, it becomes too heavy to not do it lazily.
> How's that going to
> work? Because if I should guess I'd think that someone asks for the name
> being looked up sooner rather than later because for large dirs not having
> to fetch everything on lookup would be a noticeable win...
Populating a single sparse file on lookup would be a large win, but also pretty
hard to implement this "partially populated dir" state correctly. For
that reason,
We have not implemented this so far, but one can imagine (as you wrote) that
someone else may want to make use of that in the future.
> And if that's
> the case then IMHO we should design (but not necessarily fully implement)
> API that's the simplest and most logical when this is added.
>
> This ties to the discussion how the FAN_PATH_ACCESS / FAN_PRE_DIR_ACCESS
> event is going to report the name.
>
Absolutely. The FAN_PRE_DIR_ACCESS should support reporting name info
in the future. Naturally, this could be an opt-in with FAN_REPORT_NAME,
because for most implementations (like ours) the name will not be needed.
I do not see an immediate reason to implement FAN_REPORT_NAME from
the start, but OTOH, if we do want to implement FAN_REPORT_DIR_FID from
the start, implementing FAN_REPORT_NAME would be a no-brainer.
> > 2) Current code does not generate FAN_PRE_ACCESS from vfs internal
> > lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
> > This is related to the API of reporting an O_PATH event->fd for
> > FAN_PATH_ACCESS event, which requires a mount.
>
> AFAIU this means that you could not NFS export a filesystem that is HSM
> managed and you could not use HSM managed filesystem to compose overlayfs.
> I don't find either of those a critical feature but OTOH it would be nice
> if the API didn't restrict us from somehow implementing this in the future.
>
Right.
There are a few ways to address this.
FAN_REPORT_DFID_NAME is one of them.
Actually, the two cases, overlayfs and nfsd are different
in the aspect that the overlayfs layer uses a private mount clone
while nfsd actually exports a specific user visible mount.
So at least in theory nfsd could report lookup events with a path
as demonstrated with commit from my WIP FAN_PRE_MODIFY patches
https://github.com/amir73il/linux/commit/4a8b6401e64d8dbe0721e5aaa496f0ad59208560
Another way is to say that event->fd does not need to indicate the
mount where the event happened.
Especially if event->fd is O_PATH fd, then it could simply refer to a
directory dentry using some arbitrary mount that the listener has access to.
For example, we can allow an opt-in flag to say that the listener keeps
an O_PATH fd for the path provided in fanotify_mark() (i.e. for an sb mark)
and let fanotify report event->fd based on the listener's mount regardless
of the event generator's mount.
There is no real concern about the listener keeping the fs mount busy because:
1. lsof will show this reference to the mount
2. A proper listener with FAN_REPORT_DFID_NAME has to keep open
mount_fd mapped to fsid anyway to be able to repose paths from events
(for example: fsnotifywatch implementation in inotify-tools)
Then functionally, FAN_REPORT_DIR_FID and FAN_REPORT_DIR_FD
would be similar, except that the latter keeps a reference to the object while
in the event queue and the former does not.
> > If we decide that we want to support FAN_PATH_ACCESS from all the
> > path-less lookup_one*() helpers, then we need to support reporting
> > FAN_PATH_ACCESS event with directory fid.
> >
> > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > on its "work" mount - the path via which directories are populated.
> >
> > There may be a middle ground:
> > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > - Move fsnotify hook into __lookup_slow()
> > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > path data
> >
> > This way, if HSM is enabled on an sb and not ignored on specific dir
> > after it was populated, path lookup from syscall will trigger
> > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > non-populated directories.
>
> OK, but how will this manifest from the user POV? If we have say nfs
> exported filesystem that is HSM managed then there would have to be some
> knowledge in nfsd to know how to access needed files so that HSM can pull
> them? I guess I'm missing the advantage of this middle-ground solution...
>
The advantage is that an admin is able to set up a "lazy populated fs"
with the guarantee that:
1. Non-populated objects can never be accessed
2. If the remote fetch service is up and the objects are accessed
from a supported path (i.e. not overlayfs layer) then the objects
will be populated on access
This is stronger and more useful than silently serving invalid content IMO.
This is related to the discussion about persistent marks and how to protect
against access to non-populated objects while service is down, but since
we have at least one case that can result in an EIO error (service down)
then another case (access from overlayfs) maybe is not a game changer(?)
> > Supporting populate events from overalyfs/nfsd could be implemented
> > later per demand by reporting directory fid instead of O_PATH fd.
> >
> > If you think that is worth checking, I can prepare a patch for the above
> > so we can expose it to performance regression bots.
> >
> > Better yet, if you have no issues with the implementation in this
> > patch set, maybe let it soak in for_next/for_testing as is to make
> > sure that it does not already introduce any performance regressions.
> >
> > Thoughts?
>
> If I should summarize the API situation: If we ever want to support HSM +
> NFS export / overlayfs, we must implement support for pre-content events
> with FID (DFID + name to be precise).
Yes, but there may be alternatives to FID.
> If we want to support HSM events on
> lookup with looked up name, we don't have to go for full DFID + name but we
> must at least add additional info with the name to the event.
Yes, reporting name is really a feature that could be opt-in.
And if we report name, it is no effort to also report FID,
regardless if we also report event->fd or not.
> Also if we go
> for reporting directory pre-content events with standard events, you want
> to add support for returning O_PATH fds for better efficiency and the code
> to handle FMODE_NONOTIFY directory fds in path lookup.
>
Yes. technically, O_PATH fd itself could be used to perform the populate of
dir in a kin way to event->fd being used to populate a file, so it is
elegant IMO.
> Frankly seeing all this I think that going for DFID + name events for
> directory HSM events from the start may be the cleanest choice long term
> because then we'll have one way how to access the directory HSM
> functionality with possibility of extensions without having to add
> different APIs for it.
I see the appeal in that.
I definitely considered that when we planned the API
just wanted to consult with you before going forward with implementation.
> We'd just have to implement replying to FID events
> because we won't have fd to identify event we reply to so that will need
> some thought.
I keep forgetting about that :-D
My suggestion for FAN_REPORT_DIR_FD could work around this
problem ellegantly.
>
> What are your thoughts? Am I missing something?
>
My thoughts are that FAN_REPORT_DIR_FD and
FAN_REPORT_DFID_NAME may both be valid solutions and
they are not even conflicting.
In fact, there is no clear reason to deny mixing them together.
If you do not have any objection to the FAN_REPORT_DIR_FD
solution, then we need to decide if we want to do them both?
one at a time? both from the start?
My gut feeling is that FAN_REPORT_DIR_FD is going to be
more easy to implement and to users to use for first version
and then whether or not we need to extend to report name
we can deal with later.
WDYT?
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-10 15:25 ` Amir Goldstein
@ 2025-06-14 17:03 ` Amir Goldstein
2025-06-16 9:07 ` Jan Kara
1 sibling, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-06-14 17:03 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
On Tue, Jun 10, 2025 at 5:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
>
> Hi Jan!
>
> Thanks for taking the time to read my long email ;)
>
> > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > In v1 there was only patch 1 [1] to allow FAN_PRE_ACCESS events
> > > on readdir (with FAN_ONDIR).
> > >
> > > Following your feedback on v1, v2 adds support for FAN_PATH_ACCESS
> > > event so that a non-populated directory could be populted either on
> > > first readdir or on first lookup.
> >
> > OK, it's good that now we have a bit more wider context for the discussion
> > :). First, when reading this I've started wondering whether we need both
> > FAN_PRE_ACCESS on directories and FAN_PATH_ACCESS (only on directories).
> > Firstly, I don't love adding more use to the FAN_ONDIR flag when creating
> > marks because you can only specify you want FAN_PRE_ACCESS on files,
> > FAN_PRE_ACCESS on files & dirs but there's no way to tell you care only
> > about FAN_PRE_ACCESS on dirs. You have to filter that when receiving
> > events. Secondly, the distinction between FAN_PRE_ACCESS and
> > FAN_PATH_ACCESS is somewhat weak - it's kind of similar to the situation
> > with regular files when we notify about access to the whole file vs only to
> > a specific range. So what if we had an event like FAN_PRE_DIR_ACCESS that
> > would report looked up name on lookup and nothing on readdir meaning you
> > need to fetch everything?
> >
>
> This makes a lot of sense to me. and I also like the suggested event name.
> Another advantage is that FAN_PRE_ACCESS can always expect a range
> (as documented)
>
Hi Jan,
I started looking at combining readdir() and lookup() to generate
FAN_PRE_DIR_ACCESS and I hit this problem:
Currently, FAN_PRE_ACCESS is an event that is FS_EVENTS_POSS_ON_CHILD,
so watching a parent with FAN_EVENT_ON_CHILD can report this event
on files.
The same is true for FAN_OPEN, FAN_ACCESS and FAN_OPEN_PERM events,
but in that case, also true for an opened/access directory.
I do not think it is right to generate pre-content lookup events in
subdir when watching
a parent directory. I don't think that generating pre-content readdir
events on subdir
when watching a parent dir is very useful, but if you do not allow
that, we deviate
from the behavior of the event FAN_ACCESS | FAN_ONDIR which also happens
on readdir.
Honestly, I always found it a bit confusing that when reporting DFID_NAME of
events FAN_OPEN | FAN_ONDIR and FAN_ACCESS | FAN_ONDIR, when
watching the parent, we do not report the name of the subdir (like
inotify does),
but I still think it was the right thing to do.
Do you understand my dilemma?
Do you think it is fine for FAN_PRE_DIR_ACCESS to break out of this
confusing pattern and not be reported for subdirs on a watches parent?
Do you think we should report pre-content lookup events in subdir
with a watched parent?
Do you have an idea how to make this less confusing to users?
Or should we drop the idea of unifying the readdir/lookup events
and keep the legacy semantics for pre-readdir.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-10 15:25 ` Amir Goldstein
2025-06-14 17:03 ` Amir Goldstein
@ 2025-06-16 9:07 ` Jan Kara
2025-06-16 17:00 ` Amir Goldstein
1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-06-16 9:07 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel
Hi Amir!
On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > I am still tagging this as RFC for two semi-related reasons:
> > >
> > > 1) In my original draft of man-page for FAN_PATH_ACCESS [2],
> > > I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
> > > requires and is defined as:
> > > "Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
> > > FAN_REPORT_DFID_NAME to report the names of the looked up files along
> > > with O_PATH file descriptos in the new path lookup events."
> > >
> > > I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
> > > your opinion.
> > >
> > > The basic HSM (as implemented in my POC) does not need to get the lookup
> > > name in the event - it populates dir on first readdir or lookup access.
> > > So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
> > > could be added later per demand.
> >
> > The question here is what a real user is going to do? I know Meta guys
> > don't care about directory events for their usecase. You seem to care about
> > them so I presume you have some production use in mind?
>
> Yes we have had it in production for a long time -
> You can instantaneously create a lazy clone of the entire "cloud fs" locally.
> Directories get populated with sparse files on first access.
> Sparse files get populated with data on first IO access.
> It is essentially the same use case as Meta and many other similar users.
> The need for directory populate is just a question of scale -
> How much does it take to create a "metadata clone" or the remote fs copy
> and create all the sparse files.
> At some point, it becomes too heavy to not do it lazily.
>
> > How's that going to
> > work? Because if I should guess I'd think that someone asks for the name
> > being looked up sooner rather than later because for large dirs not having
> > to fetch everything on lookup would be a noticeable win...
>
> Populating a single sparse file on lookup would be a large win, but also pretty
> hard to implement this "partially populated dir" state correctly. For
> that reason,
> We have not implemented this so far, but one can imagine (as you wrote) that
> someone else may want to make use of that in the future.
OK, thanks for clarification.
> > > 2) Current code does not generate FAN_PRE_ACCESS from vfs internal
> > > lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
> > > This is related to the API of reporting an O_PATH event->fd for
> > > FAN_PATH_ACCESS event, which requires a mount.
> >
> > AFAIU this means that you could not NFS export a filesystem that is HSM
> > managed and you could not use HSM managed filesystem to compose overlayfs.
> > I don't find either of those a critical feature but OTOH it would be nice
> > if the API didn't restrict us from somehow implementing this in the future.
> >
>
> Right.
> There are a few ways to address this.
> FAN_REPORT_DFID_NAME is one of them.
>
> Actually, the two cases, overlayfs and nfsd are different
> in the aspect that the overlayfs layer uses a private mount clone
> while nfsd actually exports a specific user visible mount.
> So at least in theory nfsd could report lookup events with a path
> as demonstrated with commit from my WIP FAN_PRE_MODIFY patches
> https://github.com/amir73il/linux/commit/4a8b6401e64d8dbe0721e5aaa496f0ad59208560
OK, I agree that for nfsd reporting the event with the mount that nfsd exports
would make sense.
> Another way is to say that event->fd does not need to indicate the
> mount where the event happened.
> Especially if event->fd is O_PATH fd, then it could simply refer to a
> directory dentry using some arbitrary mount that the listener has access to.
> For example, we can allow an opt-in flag to say that the listener keeps
> an O_PATH fd for the path provided in fanotify_mark() (i.e. for an sb mark)
> and let fanotify report event->fd based on the listener's mount regardless
> of the event generator's mount.
So this would be controversial I think. Mounts can have different
properties (like different read-only settings, different id mappings, ...),
can reveal different parts of the filesystem and generally will be
differently placed in mount hierarchy. So in particular with sb marks the
implications of arbitrarily combining mount of a sb with some random dentry
(which need not even be accesible through the mount) could lead to surprising
results.
> There is no real concern about the listener keeping the fs mount busy because:
> 1. lsof will show this reference to the mount
> 2. A proper listener with FAN_REPORT_DFID_NAME has to keep open
> mount_fd mapped to fsid anyway to be able to repose paths from events
> (for example: fsnotifywatch implementation in inotify-tools)
>
> Then functionally, FAN_REPORT_DIR_FID and FAN_REPORT_DIR_FD
> would be similar, except that the latter keeps a reference to the object while
> in the event queue and the former does not.
Yeah, I'm not that concerned about keeping the fs busy. After all we
currently grab inode references and drop them on umount and we could do the
same with whatever other references we have to the fs.
> > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > path-less lookup_one*() helpers, then we need to support reporting
> > > FAN_PATH_ACCESS event with directory fid.
> > >
> > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > on its "work" mount - the path via which directories are populated.
> > >
> > > There may be a middle ground:
> > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > - Move fsnotify hook into __lookup_slow()
> > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > path data
> > >
> > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > after it was populated, path lookup from syscall will trigger
> > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > non-populated directories.
> >
> > OK, but how will this manifest from the user POV? If we have say nfs
> > exported filesystem that is HSM managed then there would have to be some
> > knowledge in nfsd to know how to access needed files so that HSM can pull
> > them? I guess I'm missing the advantage of this middle-ground solution...
>
> The advantage is that an admin is able to set up a "lazy populated fs"
> with the guarantee that:
> 1. Non-populated objects can never be accessed
> 2. If the remote fetch service is up and the objects are accessed
> from a supported path (i.e. not overlayfs layer) then the objects
> will be populated on access
>
> This is stronger and more useful than silently serving invalid content IMO.
>
> This is related to the discussion about persistent marks and how to protect
> against access to non-populated objects while service is down, but since
> we have at least one case that can result in an EIO error (service down)
> then another case (access from overlayfs) maybe is not a game changer(?)
Yes, reporting error for unpopulated content would be acceptable behavior.
I just don't see this would be all that useful.
> > > Supporting populate events from overalyfs/nfsd could be implemented
> > > later per demand by reporting directory fid instead of O_PATH fd.
> > >
> > > If you think that is worth checking, I can prepare a patch for the above
> > > so we can expose it to performance regression bots.
> > >
> > > Better yet, if you have no issues with the implementation in this
> > > patch set, maybe let it soak in for_next/for_testing as is to make
> > > sure that it does not already introduce any performance regressions.
> > >
> > > Thoughts?
> >
> > If I should summarize the API situation: If we ever want to support HSM +
> > NFS export / overlayfs, we must implement support for pre-content events
> > with FID (DFID + name to be precise).
>
> Yes, but there may be alternatives to FID.
>
> > If we want to support HSM events on
> > lookup with looked up name, we don't have to go for full DFID + name but we
> > must at least add additional info with the name to the event.
>
> Yes, reporting name is really a feature that could be opt-in.
> And if we report name, it is no effort to also report FID,
> regardless if we also report event->fd or not.
>
> > Also if we go
> > for reporting directory pre-content events with standard events, you want
> > to add support for returning O_PATH fds for better efficiency and the code
> > to handle FMODE_NONOTIFY directory fds in path lookup.
> >
>
> Yes. technically, O_PATH fd itself could be used to perform the populate of
> dir in a kin way to event->fd being used to populate a file, so it is
> elegant IMO.
I agree it is kind of elegant. But I find reporting DFID and leaving upto
userspace to provide "ignored" mount to fill in the contents elegant as
well and there's no need to define behavior of O_PATH dir fds with NONOTIFY
flags.
> > Frankly seeing all this I think that going for DFID + name events for
> > directory HSM events from the start may be the cleanest choice long term
> > because then we'll have one way how to access the directory HSM
> > functionality with possibility of extensions without having to add
> > different APIs for it.
>
> I see the appeal in that.
> I definitely considered that when we planned the API
> just wanted to consult with you before going forward with implementation.
>
> > We'd just have to implement replying to FID events
> > because we won't have fd to identify event we reply to so that will need
> > some thought.
>
> I keep forgetting about that :-D
>
> My suggestion for FAN_REPORT_DIR_FD could work around this
> problem ellegantly.
I agree FAN_REPORT_DIR_FD doesn't have this problem and that certainly adds
some appeal to it:). OTOH it is not that hard to solve - you'd need some
idr allocator in the notification group and pass the identifier together
with the event.
> > What are your thoughts? Am I missing something?
>
> My thoughts are that FAN_REPORT_DIR_FD and
> FAN_REPORT_DFID_NAME may both be valid solutions and
> they are not even conflicting.
> In fact, there is no clear reason to deny mixing them together.
>
> If you do not have any objection to the FAN_REPORT_DIR_FD
> solution, then we need to decide if we want to do them both?
> one at a time? both from the start?
>
> My gut feeling is that FAN_REPORT_DIR_FD is going to be
> more easy to implement and to users to use for first version
> and then whether or not we need to extend to report name
> we can deal with later.
As I wrote in my first email what I'd like to avoid is having part of the
functionality accessible in one way (say through FAN_REPORT_DIR_FD) and
having to switch to different way (FAN_REPORT_DFID_NAME) for full
functionality. That is in my opinion confusing to users and makes the api
messy in the long term. So I'd lean more towards implementing fid-based
events from the start. I don't think implementation-wise it's going to be
much higher effort than FAN_REPORT_DIR_FD. I agree that for users it is
somewhat more effort - they have to keep the private mount, open fhandle to
get to the dir so that they can fill it in. But that doesn't seem to be
that high bar either?
We even have some precedens that events for regular files support both fd
and fid events and for directory operations only fid events are supported.
We could do it similarly for HSM events...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-16 9:07 ` Jan Kara
@ 2025-06-16 17:00 ` Amir Goldstein
2025-06-17 9:43 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2025-06-16 17:00 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel
On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > I am still tagging this as RFC for two semi-related reasons:
> > > >
> > > > 1) In my original draft of man-page for FAN_PATH_ACCESS [2],
> > > > I had introduced a new class FAN_CLASS_PRE_PATH, which FAN_PATH_ACCESS
> > > > requires and is defined as:
> > > > "Unlike FAN_CLASS_PRE_CONTENT, this class can be used along with
> > > > FAN_REPORT_DFID_NAME to report the names of the looked up files along
> > > > with O_PATH file descriptos in the new path lookup events."
> > > >
> > > > I am not sure if we really need FAN_CLASS_PRE_PATH, so wanted to ask
> > > > your opinion.
> > > >
> > > > The basic HSM (as implemented in my POC) does not need to get the lookup
> > > > name in the event - it populates dir on first readdir or lookup access.
> > > > So I think that support for (FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME)
> > > > could be added later per demand.
> > >
> > > The question here is what a real user is going to do? I know Meta guys
> > > don't care about directory events for their usecase. You seem to care about
> > > them so I presume you have some production use in mind?
> >
> > Yes we have had it in production for a long time -
> > You can instantaneously create a lazy clone of the entire "cloud fs" locally.
> > Directories get populated with sparse files on first access.
> > Sparse files get populated with data on first IO access.
> > It is essentially the same use case as Meta and many other similar users.
> > The need for directory populate is just a question of scale -
> > How much does it take to create a "metadata clone" or the remote fs copy
> > and create all the sparse files.
> > At some point, it becomes too heavy to not do it lazily.
> >
> > > How's that going to
> > > work? Because if I should guess I'd think that someone asks for the name
> > > being looked up sooner rather than later because for large dirs not having
> > > to fetch everything on lookup would be a noticeable win...
> >
> > Populating a single sparse file on lookup would be a large win, but also pretty
> > hard to implement this "partially populated dir" state correctly. For
> > that reason,
> > We have not implemented this so far, but one can imagine (as you wrote) that
> > someone else may want to make use of that in the future.
>
> OK, thanks for clarification.
>
> > > > 2) Current code does not generate FAN_PRE_ACCESS from vfs internal
> > > > lookup helpers such as lookup_one*() helpers from overalyfs and nfsd.
> > > > This is related to the API of reporting an O_PATH event->fd for
> > > > FAN_PATH_ACCESS event, which requires a mount.
> > >
> > > AFAIU this means that you could not NFS export a filesystem that is HSM
> > > managed and you could not use HSM managed filesystem to compose overlayfs.
> > > I don't find either of those a critical feature but OTOH it would be nice
> > > if the API didn't restrict us from somehow implementing this in the future.
> > >
> >
> > Right.
> > There are a few ways to address this.
> > FAN_REPORT_DFID_NAME is one of them.
> >
> > Actually, the two cases, overlayfs and nfsd are different
> > in the aspect that the overlayfs layer uses a private mount clone
> > while nfsd actually exports a specific user visible mount.
> > So at least in theory nfsd could report lookup events with a path
> > as demonstrated with commit from my WIP FAN_PRE_MODIFY patches
> > https://github.com/amir73il/linux/commit/4a8b6401e64d8dbe0721e5aaa496f0ad59208560
>
> OK, I agree that for nfsd reporting the event with the mount that nfsd exports
> would make sense.
>
> > Another way is to say that event->fd does not need to indicate the
> > mount where the event happened.
> > Especially if event->fd is O_PATH fd, then it could simply refer to a
> > directory dentry using some arbitrary mount that the listener has access to.
> > For example, we can allow an opt-in flag to say that the listener keeps
> > an O_PATH fd for the path provided in fanotify_mark() (i.e. for an sb mark)
> > and let fanotify report event->fd based on the listener's mount regardless
> > of the event generator's mount.
>
> So this would be controversial I think. Mounts can have different
> properties (like different read-only settings, different id mappings, ...),
> can reveal different parts of the filesystem and generally will be
> differently placed in mount hierarchy. So in particular with sb marks the
> implications of arbitrarily combining mount of a sb with some random dentry
> (which need not even be accesible through the mount) could lead to surprising
> results.
>
<nod>
> > There is no real concern about the listener keeping the fs mount busy because:
> > 1. lsof will show this reference to the mount
> > 2. A proper listener with FAN_REPORT_DFID_NAME has to keep open
> > mount_fd mapped to fsid anyway to be able to repose paths from events
> > (for example: fsnotifywatch implementation in inotify-tools)
> >
> > Then functionally, FAN_REPORT_DIR_FID and FAN_REPORT_DIR_FD
> > would be similar, except that the latter keeps a reference to the object while
> > in the event queue and the former does not.
>
> Yeah, I'm not that concerned about keeping the fs busy. After all we
> currently grab inode references and drop them on umount and we could do the
> same with whatever other references we have to the fs.
>
> > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > FAN_PATH_ACCESS event with directory fid.
> > > >
> > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > on its "work" mount - the path via which directories are populated.
> > > >
> > > > There may be a middle ground:
> > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > - Move fsnotify hook into __lookup_slow()
> > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > path data
> > > >
> > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > after it was populated, path lookup from syscall will trigger
> > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > non-populated directories.
> > >
> > > OK, but how will this manifest from the user POV? If we have say nfs
> > > exported filesystem that is HSM managed then there would have to be some
> > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > them? I guess I'm missing the advantage of this middle-ground solution...
> >
> > The advantage is that an admin is able to set up a "lazy populated fs"
> > with the guarantee that:
> > 1. Non-populated objects can never be accessed
> > 2. If the remote fetch service is up and the objects are accessed
> > from a supported path (i.e. not overlayfs layer) then the objects
> > will be populated on access
> >
> > This is stronger and more useful than silently serving invalid content IMO.
> >
> > This is related to the discussion about persistent marks and how to protect
> > against access to non-populated objects while service is down, but since
> > we have at least one case that can result in an EIO error (service down)
> > then another case (access from overlayfs) maybe is not a game changer(?)
>
> Yes, reporting error for unpopulated content would be acceptable behavior.
> I just don't see this would be all that useful.
>
Regarding overlayfs, I think there is an even bigger problem.
There is the promise that we are not calling the blocking pre-content hook
with freeze protection held.
In overlayfs it is very common to take the upper layer freeze protection
for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
and perform lookups on upper fs or lower fs within this scope.
I am afraid that cleaning that up is not going to be realistic.
IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
will be mutually exclusive features.
This is quite similar to overlayfs resulting in EIO if lower fs has an
auto mount point.
Is it quite common for users to want overlayfs mounted over
/var/lib/docker/overlay2
on the root fs.
HSM is not likely to be running on / and /etc, but likely on a very
distinct lazy populated source dir or something.
We can easily document and deny mounting overlayfs over subtrees where
HSM is enabled (or just pre-path events).
This way we can provide HSM lazy dir populate to the users that do not care
about overlayfs without having to solve very hard to unsolvable issues.
I will need to audit all the other users of vfs lookup helpers other than
overlayfs and nfsd, to estimate how many of them are pre-content event
safe and how many are a hopeless case.
On the top of my head, trying to make a cachefilesd directory an HSM
directory is absolutely insane, so not every user of vfs lookup helpers
should be able to populate HSM content - should should simply fail
(with a meaningful kmsg log).
> > > > Supporting populate events from overalyfs/nfsd could be implemented
> > > > later per demand by reporting directory fid instead of O_PATH fd.
> > > >
> > > > If you think that is worth checking, I can prepare a patch for the above
> > > > so we can expose it to performance regression bots.
> > > >
> > > > Better yet, if you have no issues with the implementation in this
> > > > patch set, maybe let it soak in for_next/for_testing as is to make
> > > > sure that it does not already introduce any performance regressions.
> > > >
> > > > Thoughts?
> > >
> > > If I should summarize the API situation: If we ever want to support HSM +
> > > NFS export / overlayfs, we must implement support for pre-content events
> > > with FID (DFID + name to be precise).
> >
> > Yes, but there may be alternatives to FID.
> >
> > > If we want to support HSM events on
> > > lookup with looked up name, we don't have to go for full DFID + name but we
> > > must at least add additional info with the name to the event.
> >
> > Yes, reporting name is really a feature that could be opt-in.
> > And if we report name, it is no effort to also report FID,
> > regardless if we also report event->fd or not.
> >
> > > Also if we go
> > > for reporting directory pre-content events with standard events, you want
> > > to add support for returning O_PATH fds for better efficiency and the code
> > > to handle FMODE_NONOTIFY directory fds in path lookup.
> > >
> >
> > Yes. technically, O_PATH fd itself could be used to perform the populate of
> > dir in a kin way to event->fd being used to populate a file, so it is
> > elegant IMO.
>
> I agree it is kind of elegant. But I find reporting DFID and leaving upto
> userspace to provide "ignored" mount to fill in the contents elegant as
> well and there's no need to define behavior of O_PATH dir fds with NONOTIFY
> flags.
>
ok.
> > > Frankly seeing all this I think that going for DFID + name events for
> > > directory HSM events from the start may be the cleanest choice long term
> > > because then we'll have one way how to access the directory HSM
> > > functionality with possibility of extensions without having to add
> > > different APIs for it.
> >
> > I see the appeal in that.
> > I definitely considered that when we planned the API
> > just wanted to consult with you before going forward with implementation.
> >
> > > We'd just have to implement replying to FID events
> > > because we won't have fd to identify event we reply to so that will need
> > > some thought.
> >
> > I keep forgetting about that :-D
> >
> > My suggestion for FAN_REPORT_DIR_FD could work around this
> > problem ellegantly.
>
> I agree FAN_REPORT_DIR_FD doesn't have this problem and that certainly adds
> some appeal to it:). OTOH it is not that hard to solve - you'd need some
> idr allocator in the notification group and pass the identifier together
> with the event.
>
ok.
> > > What are your thoughts? Am I missing something?
> >
> > My thoughts are that FAN_REPORT_DIR_FD and
> > FAN_REPORT_DFID_NAME may both be valid solutions and
> > they are not even conflicting.
> > In fact, there is no clear reason to deny mixing them together.
> >
> > If you do not have any objection to the FAN_REPORT_DIR_FD
> > solution, then we need to decide if we want to do them both?
> > one at a time? both from the start?
> >
> > My gut feeling is that FAN_REPORT_DIR_FD is going to be
> > more easy to implement and to users to use for first version
> > and then whether or not we need to extend to report name
> > we can deal with later.
>
> As I wrote in my first email what I'd like to avoid is having part of the
> functionality accessible in one way (say through FAN_REPORT_DIR_FD) and
> having to switch to different way (FAN_REPORT_DFID_NAME) for full
> functionality. That is in my opinion confusing to users and makes the api
> messy in the long term. So I'd lean more towards implementing fid-based
> events from the start. I don't think implementation-wise it's going to be
> much higher effort than FAN_REPORT_DIR_FD. I agree that for users it is
> somewhat more effort - they have to keep the private mount, open fhandle to
> get to the dir so that they can fill it in. But that doesn't seem to be
> that high bar either?
>
ok.
> We even have some precedens that events for regular files support both fd
> and fid events and for directory operations only fid events are supported.
> We could do it similarly for HSM events...
That's true.
Another advantage is that FAN_REPORT_FID | FAN_CLASS_PRE_CONTENT
has not been allowed so far, so we can use it to set new semantics
that do not allow FAN_ONDIR and FAN_EVENT_ON_CHILD at all.
The two would be fully implied from the event type, unlike today
where we ignore them for some event types and use different meanings
to other event types.
I'll see what I can come up with, but first we need to reach an agreement
about the overlayfs case.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-16 17:00 ` Amir Goldstein
@ 2025-06-17 9:43 ` Jan Kara
2025-07-03 19:14 ` Amir Goldstein
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-06-17 9:43 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel
On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > FAN_PATH_ACCESS event with directory fid.
> > > > >
> > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > on its "work" mount - the path via which directories are populated.
> > > > >
> > > > > There may be a middle ground:
> > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > - Move fsnotify hook into __lookup_slow()
> > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > path data
> > > > >
> > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > after it was populated, path lookup from syscall will trigger
> > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > non-populated directories.
> > > >
> > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > exported filesystem that is HSM managed then there would have to be some
> > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > >
> > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > with the guarantee that:
> > > 1. Non-populated objects can never be accessed
> > > 2. If the remote fetch service is up and the objects are accessed
> > > from a supported path (i.e. not overlayfs layer) then the objects
> > > will be populated on access
> > >
> > > This is stronger and more useful than silently serving invalid content IMO.
> > >
> > > This is related to the discussion about persistent marks and how to protect
> > > against access to non-populated objects while service is down, but since
> > > we have at least one case that can result in an EIO error (service down)
> > > then another case (access from overlayfs) maybe is not a game changer(?)
> >
> > Yes, reporting error for unpopulated content would be acceptable behavior.
> > I just don't see this would be all that useful.
> >
>
> Regarding overlayfs, I think there is an even bigger problem.
> There is the promise that we are not calling the blocking pre-content hook
> with freeze protection held.
> In overlayfs it is very common to take the upper layer freeze protection
> for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> and perform lookups on upper fs or lower fs within this scope.
> I am afraid that cleaning that up is not going to be realistic.
>
> IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> will be mutually exclusive features.
>
> This is quite similar to overlayfs resulting in EIO if lower fs has an
> auto mount point.
>
> Is it quite common for users to want overlayfs mounted over
> /var/lib/docker/overlay2
> on the root fs.
> HSM is not likely to be running on / and /etc, but likely on a very
> distinct lazy populated source dir or something.
> We can easily document and deny mounting overlayfs over subtrees where
> HSM is enabled (or just pre-path events).
>
> This way we can provide HSM lazy dir populate to the users that do not care
> about overlayfs without having to solve very hard to unsolvable issues.
>
> I will need to audit all the other users of vfs lookup helpers other than
> overlayfs and nfsd, to estimate how many of them are pre-content event
> safe and how many are a hopeless case.
>
> On the top of my head, trying to make a cachefilesd directory an HSM
> directory is absolutely insane, so not every user of vfs lookup helpers
> should be able to populate HSM content - should should simply fail
> (with a meaningful kmsg log).
Right. What you write makes a lot of sense. You've convinced me that
returning error from overlayfs (or similar users) when they try to access
HSM managed dir is the least painful solution :).
> > As I wrote in my first email what I'd like to avoid is having part of the
> > functionality accessible in one way (say through FAN_REPORT_DIR_FD) and
> > having to switch to different way (FAN_REPORT_DFID_NAME) for full
> > functionality. That is in my opinion confusing to users and makes the api
> > messy in the long term. So I'd lean more towards implementing fid-based
> > events from the start. I don't think implementation-wise it's going to be
> > much higher effort than FAN_REPORT_DIR_FD. I agree that for users it is
> > somewhat more effort - they have to keep the private mount, open fhandle to
> > get to the dir so that they can fill it in. But that doesn't seem to be
> > that high bar either?
> >
>
> ok.
>
> > We even have some precedens that events for regular files support both fd
> > and fid events and for directory operations only fid events are supported.
> > We could do it similarly for HSM events...
>
> That's true.
>
> Another advantage is that FAN_REPORT_FID | FAN_CLASS_PRE_CONTENT
> has not been allowed so far, so we can use it to set new semantics
> that do not allow FAN_ONDIR and FAN_EVENT_ON_CHILD at all.
> The two would be fully implied from the event type, unlike today
> where we ignore them for some event types and use different meanings
> to other event types.
Right.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-06-17 9:43 ` Jan Kara
@ 2025-07-03 19:14 ` Amir Goldstein
2025-07-04 9:24 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2025-07-03 19:14 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, NeilBrown
On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > >
> > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > on its "work" mount - the path via which directories are populated.
> > > > > >
> > > > > > There may be a middle ground:
> > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > path data
> > > > > >
> > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > non-populated directories.
> > > > >
> > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > >
> > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > with the guarantee that:
> > > > 1. Non-populated objects can never be accessed
> > > > 2. If the remote fetch service is up and the objects are accessed
> > > > from a supported path (i.e. not overlayfs layer) then the objects
> > > > will be populated on access
> > > >
> > > > This is stronger and more useful than silently serving invalid content IMO.
> > > >
> > > > This is related to the discussion about persistent marks and how to protect
> > > > against access to non-populated objects while service is down, but since
> > > > we have at least one case that can result in an EIO error (service down)
> > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > >
> > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > I just don't see this would be all that useful.
> > >
> >
> > Regarding overlayfs, I think there is an even bigger problem.
> > There is the promise that we are not calling the blocking pre-content hook
> > with freeze protection held.
> > In overlayfs it is very common to take the upper layer freeze protection
> > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > and perform lookups on upper fs or lower fs within this scope.
> > I am afraid that cleaning that up is not going to be realistic.
> >
> > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > will be mutually exclusive features.
> >
> > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > auto mount point.
> >
> > Is it quite common for users to want overlayfs mounted over
> > /var/lib/docker/overlay2
> > on the root fs.
> > HSM is not likely to be running on / and /etc, but likely on a very
> > distinct lazy populated source dir or something.
> > We can easily document and deny mounting overlayfs over subtrees where
> > HSM is enabled (or just pre-path events).
> >
> > This way we can provide HSM lazy dir populate to the users that do not care
> > about overlayfs without having to solve very hard to unsolvable issues.
> >
> > I will need to audit all the other users of vfs lookup helpers other than
> > overlayfs and nfsd, to estimate how many of them are pre-content event
> > safe and how many are a hopeless case.
> >
> > On the top of my head, trying to make a cachefilesd directory an HSM
> > directory is absolutely insane, so not every user of vfs lookup helpers
> > should be able to populate HSM content - should should simply fail
> > (with a meaningful kmsg log).
>
> Right. What you write makes a lot of sense. You've convinced me that
> returning error from overlayfs (or similar users) when they try to access
> HSM managed dir is the least painful solution :).
>
Oh oh, now I need to try to convince you of a solution that is less painful
than the least painful solution ;)
I have been experimenting with some code and also did a first pass audit
of the vfs lookup callers.
First of all, Neil's work to categorize the callers into lookup_noperm*
and lookup_one* really helped this audit. (thanks Neil!)
The lookup_noperm* callers are not "vfs users" they are internal fs
callers that should not call fsnotify pre-content hooks IMO.
The lookup_one* callers are vfs callers like ovl,cachefiles, as well
as nfsd,ksmbd.
Some of the lookup_one() calls are made from locked context, so not
good for pre-content events, but most of them are not relevant anyway
because they are not first access to dir (e.g. readdirplus which already
started to iterate dir).
Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
lookup_one* callers and before fs locks are taken looks doable.
But the more important observation I had is that allowing access to
dirs with unpopulated content is not that big of a deal.
Allowing access to files that are sparse files before their content is filled
could have led applications to fault and users to suffer.
Allowing access to unpopulated dirs, e.g. from overlayfs or even from
nfsd, just results in getting ENOENT or viewing an empty directory.
My conclusion is, that if we place the fsnotify lookup hook in
lookup_slow() then the only thing we need to do is:
When doing lookup_one*() from possibly unsafe context,
in a fs that has pre-dir-content watchers,
we always allow the lookup,
but we never let it leave a negative dcache entry.
If the lookup finds a child entry, then dir is anyway populated.
If dir is not populated, the -ENOENT result will not be cached,
so future lookups of the same name from safe context will call the hook again,
populate the file or entire directory and create positive/negative dentry,
and then following lookups of the same name will not call the hook.
The only flaw in this plan is that the users that do not populate
directories can create entries in those directories, which can violate
O_CREAT | O_EXCL and mkdir(), w.r.t to a remote file.
But this flaw can be reported properly by the HSM daemon when
trying to populate a directory which is marked as unpopulated and
finding files inside it.
HSM could auto-resolve those conflicts or prompt admin for action
and can return ESTALE error (or a like) to users.
Was I clear? Does that sound reasonable to you?
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-07-03 19:14 ` Amir Goldstein
@ 2025-07-04 9:24 ` Jan Kara
2025-07-04 10:58 ` Amir Goldstein
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-07-04 9:24 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, NeilBrown
On Thu 03-07-25 21:14:11, Amir Goldstein wrote:
> On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@suse.cz> wrote:
> > On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > > >
> > > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > > on its "work" mount - the path via which directories are populated.
> > > > > > >
> > > > > > > There may be a middle ground:
> > > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > > path data
> > > > > > >
> > > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > > non-populated directories.
> > > > > >
> > > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > > >
> > > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > > with the guarantee that:
> > > > > 1. Non-populated objects can never be accessed
> > > > > 2. If the remote fetch service is up and the objects are accessed
> > > > > from a supported path (i.e. not overlayfs layer) then the objects
> > > > > will be populated on access
> > > > >
> > > > > This is stronger and more useful than silently serving invalid content IMO.
> > > > >
> > > > > This is related to the discussion about persistent marks and how to protect
> > > > > against access to non-populated objects while service is down, but since
> > > > > we have at least one case that can result in an EIO error (service down)
> > > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > > >
> > > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > > I just don't see this would be all that useful.
> > > >
> > >
> > > Regarding overlayfs, I think there is an even bigger problem.
> > > There is the promise that we are not calling the blocking pre-content hook
> > > with freeze protection held.
> > > In overlayfs it is very common to take the upper layer freeze protection
> > > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > > and perform lookups on upper fs or lower fs within this scope.
> > > I am afraid that cleaning that up is not going to be realistic.
> > >
> > > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > > will be mutually exclusive features.
> > >
> > > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > > auto mount point.
> > >
> > > Is it quite common for users to want overlayfs mounted over
> > > /var/lib/docker/overlay2
> > > on the root fs.
> > > HSM is not likely to be running on / and /etc, but likely on a very
> > > distinct lazy populated source dir or something.
> > > We can easily document and deny mounting overlayfs over subtrees where
> > > HSM is enabled (or just pre-path events).
> > >
> > > This way we can provide HSM lazy dir populate to the users that do not care
> > > about overlayfs without having to solve very hard to unsolvable issues.
> > >
> > > I will need to audit all the other users of vfs lookup helpers other than
> > > overlayfs and nfsd, to estimate how many of them are pre-content event
> > > safe and how many are a hopeless case.
> > >
> > > On the top of my head, trying to make a cachefilesd directory an HSM
> > > directory is absolutely insane, so not every user of vfs lookup helpers
> > > should be able to populate HSM content - should should simply fail
> > > (with a meaningful kmsg log).
> >
> > Right. What you write makes a lot of sense. You've convinced me that
> > returning error from overlayfs (or similar users) when they try to access
> > HSM managed dir is the least painful solution :).
> >
>
> Oh oh, now I need to try to convince you of a solution that is less painful
> than the least painful solution ;)
:)
> I have been experimenting with some code and also did a first pass audit
> of the vfs lookup callers.
>
> First of all, Neil's work to categorize the callers into lookup_noperm*
> and lookup_one* really helped this audit. (thanks Neil!)
>
> The lookup_noperm* callers are not "vfs users" they are internal fs
> callers that should not call fsnotify pre-content hooks IMO.
>
> The lookup_one* callers are vfs callers like ovl,cachefiles, as well
> as nfsd,ksmbd.
>
> Some of the lookup_one() calls are made from locked context, so not
> good for pre-content events, but most of them are not relevant anyway
> because they are not first access to dir (e.g. readdirplus which already
> started to iterate dir).
>
> Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
> lookup_one* callers and before fs locks are taken looks doable.
>
> But the more important observation I had is that allowing access to
> dirs with unpopulated content is not that big of a deal.
>
> Allowing access to files that are sparse files before their content is filled
> could have led applications to fault and users to suffer.
>
> Allowing access to unpopulated dirs, e.g. from overlayfs or even from
> nfsd, just results in getting ENOENT or viewing an empty directory.
Right. Although if some important files would be missing, you'd still cause
troubles to applications and possible crashes (or app shutdowns). But I
take the ENOENT return in this case as a particular implementation of the
"just return error to userspace if we have no chance to handle the lookup
in this context" policy.
> My conclusion is, that if we place the fsnotify lookup hook in
> lookup_slow() then the only thing we need to do is:
> When doing lookup_one*() from possibly unsafe context,
> in a fs that has pre-dir-content watchers,
> we always allow the lookup,
> but we never let it leave a negative dcache entry.
>
> If the lookup finds a child entry, then dir is anyway populated.
> If dir is not populated, the -ENOENT result will not be cached,
> so future lookups of the same name from safe context will call the hook again,
> populate the file or entire directory and create positive/negative dentry,
> and then following lookups of the same name will not call the hook.
Yes, this looks pretty much like what we've agreed on before, just now the
implementation is getting more concrete shape. Or am I missing something?
> The only flaw in this plan is that the users that do not populate
> directories can create entries in those directories, which can violate
> O_CREAT | O_EXCL and mkdir(), w.r.t to a remote file.
>
> But this flaw can be reported properly by the HSM daemon when
> trying to populate a directory which is marked as unpopulated and
> finding files inside it.
>
> HSM could auto-resolve those conflicts or prompt admin for action
> and can return ESTALE error (or a like) to users.
>
> Was I clear? Does that sound reasonable to you?
So far we have only one-way synchronization for files (i.e., we don't
expect the HSM client to actually modify the filesystem, the server is the
ultimate source of truth). Aren't we going to do it similarly for
directories? It would be weird to try to handle dir modifications without
supporting file modifications. And if we aim for one-way synchronization,
then I'd expect the HSM service to maybe just expose RO mount to
applications (superblock and the private mount used for filling in data
have to be obviously RW). If it lets applications write to the fs for
whatever reason, it has to keep all the pieces together, I don't think the
kernel is responsible there. Hmm?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-07-04 9:24 ` Jan Kara
@ 2025-07-04 10:58 ` Amir Goldstein
2025-07-08 15:32 ` Amir Goldstein
0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2025-07-04 10:58 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, NeilBrown
On Fri, Jul 4, 2025 at 11:24 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 03-07-25 21:14:11, Amir Goldstein wrote:
> > On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@suse.cz> wrote:
> > > On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > > > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > > > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > > > >
> > > > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > > > on its "work" mount - the path via which directories are populated.
> > > > > > > >
> > > > > > > > There may be a middle ground:
> > > > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > > > path data
> > > > > > > >
> > > > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > > > non-populated directories.
> > > > > > >
> > > > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > > > >
> > > > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > > > with the guarantee that:
> > > > > > 1. Non-populated objects can never be accessed
> > > > > > 2. If the remote fetch service is up and the objects are accessed
> > > > > > from a supported path (i.e. not overlayfs layer) then the objects
> > > > > > will be populated on access
> > > > > >
> > > > > > This is stronger and more useful than silently serving invalid content IMO.
> > > > > >
> > > > > > This is related to the discussion about persistent marks and how to protect
> > > > > > against access to non-populated objects while service is down, but since
> > > > > > we have at least one case that can result in an EIO error (service down)
> > > > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > > > >
> > > > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > > > I just don't see this would be all that useful.
> > > > >
> > > >
> > > > Regarding overlayfs, I think there is an even bigger problem.
> > > > There is the promise that we are not calling the blocking pre-content hook
> > > > with freeze protection held.
> > > > In overlayfs it is very common to take the upper layer freeze protection
> > > > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > > > and perform lookups on upper fs or lower fs within this scope.
> > > > I am afraid that cleaning that up is not going to be realistic.
> > > >
> > > > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > > > will be mutually exclusive features.
> > > >
> > > > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > > > auto mount point.
> > > >
> > > > Is it quite common for users to want overlayfs mounted over
> > > > /var/lib/docker/overlay2
> > > > on the root fs.
> > > > HSM is not likely to be running on / and /etc, but likely on a very
> > > > distinct lazy populated source dir or something.
> > > > We can easily document and deny mounting overlayfs over subtrees where
> > > > HSM is enabled (or just pre-path events).
> > > >
> > > > This way we can provide HSM lazy dir populate to the users that do not care
> > > > about overlayfs without having to solve very hard to unsolvable issues.
> > > >
> > > > I will need to audit all the other users of vfs lookup helpers other than
> > > > overlayfs and nfsd, to estimate how many of them are pre-content event
> > > > safe and how many are a hopeless case.
> > > >
> > > > On the top of my head, trying to make a cachefilesd directory an HSM
> > > > directory is absolutely insane, so not every user of vfs lookup helpers
> > > > should be able to populate HSM content - should should simply fail
> > > > (with a meaningful kmsg log).
> > >
> > > Right. What you write makes a lot of sense. You've convinced me that
> > > returning error from overlayfs (or similar users) when they try to access
> > > HSM managed dir is the least painful solution :).
> > >
> >
> > Oh oh, now I need to try to convince you of a solution that is less painful
> > than the least painful solution ;)
>
> :)
>
> > I have been experimenting with some code and also did a first pass audit
> > of the vfs lookup callers.
> >
> > First of all, Neil's work to categorize the callers into lookup_noperm*
> > and lookup_one* really helped this audit. (thanks Neil!)
> >
> > The lookup_noperm* callers are not "vfs users" they are internal fs
> > callers that should not call fsnotify pre-content hooks IMO.
> >
> > The lookup_one* callers are vfs callers like ovl,cachefiles, as well
> > as nfsd,ksmbd.
> >
> > Some of the lookup_one() calls are made from locked context, so not
> > good for pre-content events, but most of them are not relevant anyway
> > because they are not first access to dir (e.g. readdirplus which already
> > started to iterate dir).
> >
> > Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
> > lookup_one* callers and before fs locks are taken looks doable.
> >
> > But the more important observation I had is that allowing access to
> > dirs with unpopulated content is not that big of a deal.
> >
> > Allowing access to files that are sparse files before their content is filled
> > could have led applications to fault and users to suffer.
> >
> > Allowing access to unpopulated dirs, e.g. from overlayfs or even from
> > nfsd, just results in getting ENOENT or viewing an empty directory.
>
> Right. Although if some important files would be missing, you'd still cause
> troubles to applications and possible crashes (or app shutdowns). But I
> take the ENOENT return in this case as a particular implementation of the
> "just return error to userspace if we have no chance to handle the lookup
> in this context" policy.
>
> > My conclusion is, that if we place the fsnotify lookup hook in
> > lookup_slow() then the only thing we need to do is:
> > When doing lookup_one*() from possibly unsafe context,
> > in a fs that has pre-dir-content watchers,
> > we always allow the lookup,
> > but we never let it leave a negative dcache entry.
> >
> > If the lookup finds a child entry, then dir is anyway populated.
> > If dir is not populated, the -ENOENT result will not be cached,
> > so future lookups of the same name from safe context will call the hook again,
> > populate the file or entire directory and create positive/negative dentry,
> > and then following lookups of the same name will not call the hook.
>
> Yes, this looks pretty much like what we've agreed on before, just now the
> implementation is getting more concrete shape. Or am I missing something?
>
What (I think) we discussed before was to fail *any* lookup from
internal vfs callers to HSM moderated fs, so ovl would also not be able to
access a populated directory in that case.
What I am suggesting is to always allow the lookup in HSM fs
and depending on a negative lookup result do "something".
There is a nuance here.
Obviously, userspace will get ENOENT in this case, but
does lookup_one() succeed and returns a negative unhashed
dentry (e.g. to ovl) or does it drop the dentry and fail with -ENOENT?
I was thinking of the former, but I think you are implying the latter,
which is indeed a bit closer to what we agreed on.
For callers that use lookup_one_positive_unlocked()
like ovl_lookup(), it makes no difference, but for callers that
create new entries like ovl_create_upper() it means failure to create
and that is desirable IMO.
I guess, if ovl_mkdir() fails with -ENOENT users would be a bit confused
but in a way, this parent directory does not fully exist yet, so
it may be good enough.
We could also annotate those calls as lookup_one_for_create()
to return -EROFS in the negative lookup result in HSM moderated dir,
but not sure that this is needed or if it is less confusing to users.
What's even nicer is that for overlayfs it is the more likely case that
only the lower layer fs is HSM moderated, e.g. for a composefs
"image repository".
Adding safe pre-dir-content hooks for overlayfs lower layer lookup
may be possible down the road. A lot easier that supporting
lazy dir populates in a rw layer.
> > The only flaw in this plan is that the users that do not populate
> > directories can create entries in those directories, which can violate
> > O_CREAT | O_EXCL and mkdir(), w.r.t to a remote file.
> >
> > But this flaw can be reported properly by the HSM daemon when
> > trying to populate a directory which is marked as unpopulated and
> > finding files inside it.
> >
> > HSM could auto-resolve those conflicts or prompt admin for action
> > and can return ESTALE error (or a like) to users.
> >
> > Was I clear? Does that sound reasonable to you?
>
> So far we have only one-way synchronization for files (i.e., we don't
> expect the HSM client to actually modify the filesystem, the server is the
> ultimate source of truth). Aren't we going to do it similarly for
> directories? It would be weird to try to handle dir modifications without
> supporting file modifications. And if we aim for one-way synchronization,
> then I'd expect the HSM service to maybe just expose RO mount to
> applications (superblock and the private mount used for filling in data
> have to be obviously RW). If it lets applications write to the fs for
> whatever reason, it has to keep all the pieces together, I don't think the
> kernel is responsible there. Hmm?
My use case is "cloud sync engine" so the users do get a RW mount
and it's the server's job to keep all the pieces together.
But for lazy dir populare (subtree never accessed by user), the server
expects to get an empty dir to and to populate it.
So it is much better if the kernel is able to block creating entries in
an HSM moderated dir by the non-blocking internal vfs users.
I think the explicit ENOENT for negative lookup results may just be
enough to get this done, so I will give it a shot.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-07-04 10:58 ` Amir Goldstein
@ 2025-07-08 15:32 ` Amir Goldstein
2025-07-09 9:00 ` Amir Goldstein
0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2025-07-08 15:32 UTC (permalink / raw)
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, NeilBrown
On Fri, Jul 4, 2025 at 12:58 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Jul 4, 2025 at 11:24 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 03-07-25 21:14:11, Amir Goldstein wrote:
> > > On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@suse.cz> wrote:
> > > > On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > > > > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > > > > >
> > > > > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > > > > on its "work" mount - the path via which directories are populated.
> > > > > > > > >
> > > > > > > > > There may be a middle ground:
> > > > > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > > > > path data
> > > > > > > > >
> > > > > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > > > > non-populated directories.
> > > > > > > >
> > > > > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > > > > >
> > > > > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > > > > with the guarantee that:
> > > > > > > 1. Non-populated objects can never be accessed
> > > > > > > 2. If the remote fetch service is up and the objects are accessed
> > > > > > > from a supported path (i.e. not overlayfs layer) then the objects
> > > > > > > will be populated on access
> > > > > > >
> > > > > > > This is stronger and more useful than silently serving invalid content IMO.
> > > > > > >
> > > > > > > This is related to the discussion about persistent marks and how to protect
> > > > > > > against access to non-populated objects while service is down, but since
> > > > > > > we have at least one case that can result in an EIO error (service down)
> > > > > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > > > > >
> > > > > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > > > > I just don't see this would be all that useful.
> > > > > >
> > > > >
> > > > > Regarding overlayfs, I think there is an even bigger problem.
> > > > > There is the promise that we are not calling the blocking pre-content hook
> > > > > with freeze protection held.
> > > > > In overlayfs it is very common to take the upper layer freeze protection
> > > > > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > > > > and perform lookups on upper fs or lower fs within this scope.
> > > > > I am afraid that cleaning that up is not going to be realistic.
> > > > >
> > > > > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > > > > will be mutually exclusive features.
> > > > >
> > > > > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > > > > auto mount point.
> > > > >
> > > > > Is it quite common for users to want overlayfs mounted over
> > > > > /var/lib/docker/overlay2
> > > > > on the root fs.
> > > > > HSM is not likely to be running on / and /etc, but likely on a very
> > > > > distinct lazy populated source dir or something.
> > > > > We can easily document and deny mounting overlayfs over subtrees where
> > > > > HSM is enabled (or just pre-path events).
> > > > >
> > > > > This way we can provide HSM lazy dir populate to the users that do not care
> > > > > about overlayfs without having to solve very hard to unsolvable issues.
> > > > >
> > > > > I will need to audit all the other users of vfs lookup helpers other than
> > > > > overlayfs and nfsd, to estimate how many of them are pre-content event
> > > > > safe and how many are a hopeless case.
> > > > >
> > > > > On the top of my head, trying to make a cachefilesd directory an HSM
> > > > > directory is absolutely insane, so not every user of vfs lookup helpers
> > > > > should be able to populate HSM content - should should simply fail
> > > > > (with a meaningful kmsg log).
> > > >
> > > > Right. What you write makes a lot of sense. You've convinced me that
> > > > returning error from overlayfs (or similar users) when they try to access
> > > > HSM managed dir is the least painful solution :).
> > > >
> > >
> > > Oh oh, now I need to try to convince you of a solution that is less painful
> > > than the least painful solution ;)
> >
> > :)
> >
> > > I have been experimenting with some code and also did a first pass audit
> > > of the vfs lookup callers.
> > >
> > > First of all, Neil's work to categorize the callers into lookup_noperm*
> > > and lookup_one* really helped this audit. (thanks Neil!)
> > >
> > > The lookup_noperm* callers are not "vfs users" they are internal fs
> > > callers that should not call fsnotify pre-content hooks IMO.
> > >
> > > The lookup_one* callers are vfs callers like ovl,cachefiles, as well
> > > as nfsd,ksmbd.
> > >
> > > Some of the lookup_one() calls are made from locked context, so not
> > > good for pre-content events, but most of them are not relevant anyway
> > > because they are not first access to dir (e.g. readdirplus which already
> > > started to iterate dir).
> > >
> > > Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
> > > lookup_one* callers and before fs locks are taken looks doable.
> > >
> > > But the more important observation I had is that allowing access to
> > > dirs with unpopulated content is not that big of a deal.
> > >
> > > Allowing access to files that are sparse files before their content is filled
> > > could have led applications to fault and users to suffer.
> > >
> > > Allowing access to unpopulated dirs, e.g. from overlayfs or even from
> > > nfsd, just results in getting ENOENT or viewing an empty directory.
> >
> > Right. Although if some important files would be missing, you'd still cause
> > troubles to applications and possible crashes (or app shutdowns). But I
> > take the ENOENT return in this case as a particular implementation of the
> > "just return error to userspace if we have no chance to handle the lookup
> > in this context" policy.
> >
> > > My conclusion is, that if we place the fsnotify lookup hook in
> > > lookup_slow() then the only thing we need to do is:
> > > When doing lookup_one*() from possibly unsafe context,
> > > in a fs that has pre-dir-content watchers,
> > > we always allow the lookup,
> > > but we never let it leave a negative dcache entry.
> > >
> > > If the lookup finds a child entry, then dir is anyway populated.
> > > If dir is not populated, the -ENOENT result will not be cached,
> > > so future lookups of the same name from safe context will call the hook again,
> > > populate the file or entire directory and create positive/negative dentry,
> > > and then following lookups of the same name will not call the hook.
> >
> > Yes, this looks pretty much like what we've agreed on before, just now the
> > implementation is getting more concrete shape. Or am I missing something?
> >
>
> What (I think) we discussed before was to fail *any* lookup from
> internal vfs callers to HSM moderated fs, so ovl would also not be able to
> access a populated directory in that case.
>
> What I am suggesting is to always allow the lookup in HSM fs
> and depending on a negative lookup result do "something".
>
> There is a nuance here.
> Obviously, userspace will get ENOENT in this case, but
> does lookup_one() succeed and returns a negative unhashed
> dentry (e.g. to ovl) or does it drop the dentry and fail with -ENOENT?
>
> I was thinking of the former, but I think you are implying the latter,
> which is indeed a bit closer to what we agreed on.
>
> For callers that use lookup_one_positive_unlocked()
> like ovl_lookup(), it makes no difference, but for callers that
> create new entries like ovl_create_upper() it means failure to create
> and that is desirable IMO.
>
> I guess, if ovl_mkdir() fails with -ENOENT users would be a bit confused
> but in a way, this parent directory does not fully exist yet, so
> it may be good enough.
>
> We could also annotate those calls as lookup_one_for_create()
> to return -EROFS in the negative lookup result in HSM moderated dir,
> but not sure that this is needed or if it is less confusing to users.
>
> What's even nicer is that for overlayfs it is the more likely case that
> only the lower layer fs is HSM moderated, e.g. for a composefs
> "image repository".
>
> Adding safe pre-dir-content hooks for overlayfs lower layer lookup
> may be possible down the road. A lot easier that supporting
> lazy dir populates in a rw layer.
>
>
FYI, here is a WIP branch for the scheme that we discussed here:
https://github.com/amir73il/linux/commits/fan_pre_dir_access/
There is an LTP branch of the same name that passes tests.
I also added two simple patches for nfsd support for pre-dir-content events
but they are optional, to demonstrate that internal users could be
supported later.
If you have time to look and give me initial feedback that would be great.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] fanotify HSM events for directories
2025-07-08 15:32 ` Amir Goldstein
@ 2025-07-09 9:00 ` Amir Goldstein
0 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2025-07-09 9:00 UTC (permalink / raw)
To: Jan Kara
Cc: Christian Brauner, linux-fsdevel, Chuck Lever, Jeff Layton,
NeilBrown
On Tue, Jul 8, 2025 at 5:32 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Jul 4, 2025 at 12:58 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Fri, Jul 4, 2025 at 11:24 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 03-07-25 21:14:11, Amir Goldstein wrote:
> > > > On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@suse.cz> wrote:
> > > > > On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > > > > > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > > > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > > > > > >
> > > > > > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > > > > > on its "work" mount - the path via which directories are populated.
> > > > > > > > > >
> > > > > > > > > > There may be a middle ground:
> > > > > > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > > > > > path data
> > > > > > > > > >
> > > > > > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > > > > > non-populated directories.
> > > > > > > > >
> > > > > > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > > > > > >
> > > > > > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > > > > > with the guarantee that:
> > > > > > > > 1. Non-populated objects can never be accessed
> > > > > > > > 2. If the remote fetch service is up and the objects are accessed
> > > > > > > > from a supported path (i.e. not overlayfs layer) then the objects
> > > > > > > > will be populated on access
> > > > > > > >
> > > > > > > > This is stronger and more useful than silently serving invalid content IMO.
> > > > > > > >
> > > > > > > > This is related to the discussion about persistent marks and how to protect
> > > > > > > > against access to non-populated objects while service is down, but since
> > > > > > > > we have at least one case that can result in an EIO error (service down)
> > > > > > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > > > > > >
> > > > > > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > > > > > I just don't see this would be all that useful.
> > > > > > >
> > > > > >
> > > > > > Regarding overlayfs, I think there is an even bigger problem.
> > > > > > There is the promise that we are not calling the blocking pre-content hook
> > > > > > with freeze protection held.
> > > > > > In overlayfs it is very common to take the upper layer freeze protection
> > > > > > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > > > > > and perform lookups on upper fs or lower fs within this scope.
> > > > > > I am afraid that cleaning that up is not going to be realistic.
> > > > > >
> > > > > > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > > > > > will be mutually exclusive features.
> > > > > >
> > > > > > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > > > > > auto mount point.
> > > > > >
> > > > > > Is it quite common for users to want overlayfs mounted over
> > > > > > /var/lib/docker/overlay2
> > > > > > on the root fs.
> > > > > > HSM is not likely to be running on / and /etc, but likely on a very
> > > > > > distinct lazy populated source dir or something.
> > > > > > We can easily document and deny mounting overlayfs over subtrees where
> > > > > > HSM is enabled (or just pre-path events).
> > > > > >
> > > > > > This way we can provide HSM lazy dir populate to the users that do not care
> > > > > > about overlayfs without having to solve very hard to unsolvable issues.
> > > > > >
> > > > > > I will need to audit all the other users of vfs lookup helpers other than
> > > > > > overlayfs and nfsd, to estimate how many of them are pre-content event
> > > > > > safe and how many are a hopeless case.
> > > > > >
> > > > > > On the top of my head, trying to make a cachefilesd directory an HSM
> > > > > > directory is absolutely insane, so not every user of vfs lookup helpers
> > > > > > should be able to populate HSM content - should should simply fail
> > > > > > (with a meaningful kmsg log).
> > > > >
> > > > > Right. What you write makes a lot of sense. You've convinced me that
> > > > > returning error from overlayfs (or similar users) when they try to access
> > > > > HSM managed dir is the least painful solution :).
> > > > >
> > > >
> > > > Oh oh, now I need to try to convince you of a solution that is less painful
> > > > than the least painful solution ;)
> > >
> > > :)
> > >
> > > > I have been experimenting with some code and also did a first pass audit
> > > > of the vfs lookup callers.
> > > >
> > > > First of all, Neil's work to categorize the callers into lookup_noperm*
> > > > and lookup_one* really helped this audit. (thanks Neil!)
> > > >
> > > > The lookup_noperm* callers are not "vfs users" they are internal fs
> > > > callers that should not call fsnotify pre-content hooks IMO.
> > > >
> > > > The lookup_one* callers are vfs callers like ovl,cachefiles, as well
> > > > as nfsd,ksmbd.
> > > >
> > > > Some of the lookup_one() calls are made from locked context, so not
> > > > good for pre-content events, but most of them are not relevant anyway
> > > > because they are not first access to dir (e.g. readdirplus which already
> > > > started to iterate dir).
> > > >
> > > > Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
> > > > lookup_one* callers and before fs locks are taken looks doable.
> > > >
> > > > But the more important observation I had is that allowing access to
> > > > dirs with unpopulated content is not that big of a deal.
> > > >
> > > > Allowing access to files that are sparse files before their content is filled
> > > > could have led applications to fault and users to suffer.
> > > >
> > > > Allowing access to unpopulated dirs, e.g. from overlayfs or even from
> > > > nfsd, just results in getting ENOENT or viewing an empty directory.
> > >
> > > Right. Although if some important files would be missing, you'd still cause
> > > troubles to applications and possible crashes (or app shutdowns). But I
> > > take the ENOENT return in this case as a particular implementation of the
> > > "just return error to userspace if we have no chance to handle the lookup
> > > in this context" policy.
> > >
> > > > My conclusion is, that if we place the fsnotify lookup hook in
> > > > lookup_slow() then the only thing we need to do is:
> > > > When doing lookup_one*() from possibly unsafe context,
> > > > in a fs that has pre-dir-content watchers,
> > > > we always allow the lookup,
> > > > but we never let it leave a negative dcache entry.
> > > >
> > > > If the lookup finds a child entry, then dir is anyway populated.
> > > > If dir is not populated, the -ENOENT result will not be cached,
> > > > so future lookups of the same name from safe context will call the hook again,
> > > > populate the file or entire directory and create positive/negative dentry,
> > > > and then following lookups of the same name will not call the hook.
> > >
> > > Yes, this looks pretty much like what we've agreed on before, just now the
> > > implementation is getting more concrete shape. Or am I missing something?
> > >
> >
> > What (I think) we discussed before was to fail *any* lookup from
> > internal vfs callers to HSM moderated fs, so ovl would also not be able to
> > access a populated directory in that case.
> >
> > What I am suggesting is to always allow the lookup in HSM fs
> > and depending on a negative lookup result do "something".
> >
> > There is a nuance here.
> > Obviously, userspace will get ENOENT in this case, but
> > does lookup_one() succeed and returns a negative unhashed
> > dentry (e.g. to ovl) or does it drop the dentry and fail with -ENOENT?
> >
> > I was thinking of the former, but I think you are implying the latter,
> > which is indeed a bit closer to what we agreed on.
> >
> > For callers that use lookup_one_positive_unlocked()
> > like ovl_lookup(), it makes no difference, but for callers that
> > create new entries like ovl_create_upper() it means failure to create
> > and that is desirable IMO.
> >
> > I guess, if ovl_mkdir() fails with -ENOENT users would be a bit confused
> > but in a way, this parent directory does not fully exist yet, so
> > it may be good enough.
> >
> > We could also annotate those calls as lookup_one_for_create()
> > to return -EROFS in the negative lookup result in HSM moderated dir,
> > but not sure that this is needed or if it is less confusing to users.
> >
> > What's even nicer is that for overlayfs it is the more likely case that
> > only the lower layer fs is HSM moderated, e.g. for a composefs
> > "image repository".
> >
> > Adding safe pre-dir-content hooks for overlayfs lower layer lookup
> > may be possible down the road. A lot easier that supporting
> > lazy dir populates in a rw layer.
> >
> >
>
> FYI, here is a WIP branch for the scheme that we discussed here:
>
> https://github.com/amir73il/linux/commits/fan_pre_dir_access/
>
Some more commentary of design choices in this WIP patch set.
We have discussed in the context of page fault events the concept of
a "handle once" HSM event per file range, where events are not
generated if page cache pages are already populated, even if said
page cache pages were populated before the HSM marks were set up.
The pre-content events on page fault did not happen eventually,
but as far as API documentation is concerned, we are still allowed
to suppress multiple pre-content events on the same file range.
A similar concept was applied to pre-dir-content events w.r.t dcache,
but with a few nuances.
The pre-dir-content event from lookup on NAME in DIR is only generated
in lookup_slow() when the dcache entry of NAME is not populated.
If a positive/negative dentry was created before setup of HSM mark,
the event will not be generated.
***This is the new part that we did not discuss that I implemented:***
When a pre-dir-content event WITHOUT a NAME (i.e. from readdir)
is handled by HSM (i.e. FAN_ALLOW response), the dentry is marked
DCACHE_HSM_ONCE and no further pre-dir-content events are generated
on that directory.
The DCACHE_HSM_ONCE flag is not set when there are no HSM marks
or when a directory has an HSM ignore mark!
***
IMO this behavior is inline with the former "handle once" strategy.
> There is an LTP branch of the same name that passes tests.
>
> I also added two simple patches for nfsd support for pre-dir-content events
> but they are optional, to demonstrate that internal users could be
> supported later.
>
The "handle once" design helps with the implementation of nfsd
pre-dir-content hook.
For simplification, those hooks were implemented in the permission check inside
fh_verify(fhp, NFSD_MAY_EXEC) which is called before every nfsd lookup_one()
operation. fhp is the context for the entire "transaction", and this
hook position
is usually [*] safe w.r.t held fs locks.
The DCACHE_HSM_ONCE design, guarantees that there will be a single
"safe context" pre-dir-content event in fh_verify(fhp, NFSD_MAY_EXEC)
and the latter hooks in lookup_one() with freeze protection held will be
suppressed.
However, the nfsd "safe" pre-dir-content hooks are not a must for merging
the pre-dir-content event feature.
If the nfsd "safe" hooks are not merged, nfsd (as will overlayfs) will still be
able to safely do lookup_one() on directories where HSM marks exist
with some caveats:
- A directory that was already populated (DCACHE_HSM_ONCE)
will be fully accessible (read/write) for nfsd
- An existing child name (positive dentry) can be looked up by nfsd
- A lookup of non-existing name lookup will return -ENOENT
as it should (without leaving a negative dentry behind)
- An attempt to create a positive dentry (create/rename) will fail
with -ENOENT (as if directory is IS_DEADDIR)
[*] in nfsd_create_locked() the fh_verify(fhp, NFSD_MAY_EXEC) is not
in safe context, so there is also a preemptive hook also in fh_want_write(fhp)
in nfsd_create().
CC nfsd guys to explain to me why nfsd_create() has a NFSD_MAY_NOP
access permission and not NFSD_MAY_EXEC.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-07-09 9:00 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04 16:09 [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 1/3] fanotify: allow creating FAN_PRE_ACCESS events on directories Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 2/3] fanotify: allow O_PATH flag in event_f_flags Amir Goldstein
2025-06-04 16:09 ` [RFC PATCH v2 3/3] fanotify: introduce FAN_PATH_ACCESS event Amir Goldstein
2025-06-04 16:35 ` [RFC PATCH v2 0/3] fanotify HSM events for directories Amir Goldstein
2025-06-10 13:49 ` Jan Kara
2025-06-10 15:25 ` Amir Goldstein
2025-06-14 17:03 ` Amir Goldstein
2025-06-16 9:07 ` Jan Kara
2025-06-16 17:00 ` Amir Goldstein
2025-06-17 9:43 ` Jan Kara
2025-07-03 19:14 ` Amir Goldstein
2025-07-04 9:24 ` Jan Kara
2025-07-04 10:58 ` Amir Goldstein
2025-07-08 15:32 ` Amir Goldstein
2025-07-09 9:00 ` Amir Goldstein
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).