* [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup
@ 2025-11-07 14:21 Mateusz Guzik
2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-11-07 14:21 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso,
torvalds, josef, linux-btrfs, Mateusz Guzik
Commit message in patch 1 says it all.
In short, MAY_WRITE checks are elided.
This obsoletes the idea of pre-computing if perm checks are necessary as
that turned out to be too hairy. The new code has 2 more branches per
path component compared to that idea, but the perf difference for
typical paths (< 6 components) was basically within noise. To be
revisited if someone(tm) removes other slowdowns.
Instead of the pre-computing thing I added IOP_FASTPERM_MAY_EXEC so that
filesystems like btrfs can still avoid the hard work.
v3:
- drop the pre-computation idea and inline the perm check
- add IOP_FASTPERM_MAY_EXEC for filesystems with ->permission hooks so
that they can also take advantage of it
Mateusz Guzik (3):
fs: speed up path lookup with cheaper handling of MAY_EXEC
btrfs: utilize IOP_FASTPERM_MAY_EXEC
fs: retire now stale MAY_WRITE predicts in inode_permission()
fs/btrfs/inode.c | 12 +++++++++++-
fs/namei.c | 47 ++++++++++++++++++++++++++++++++++++++++++----
include/linux/fs.h | 13 +++++++------
3 files changed, 61 insertions(+), 11 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-07 14:21 [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Mateusz Guzik @ 2025-11-07 14:21 ` Mateusz Guzik 2025-11-10 9:32 ` Jan Kara 2025-11-11 9:41 ` Christian Brauner 2025-11-07 14:21 ` [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC Mateusz Guzik ` (2 subsequent siblings) 3 siblings, 2 replies; 14+ messages in thread From: Mateusz Guzik @ 2025-11-07 14:21 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs, Mateusz Guzik The generic inode_permission() routine does work which is known to be of no significance for lookup. There are checks for MAY_WRITE, while the requested permission is MAY_EXEC. Additionally devcgroup_inode_permission() is called to check for devices, but it is an invariant the inode is a directory. Absent a ->permission func, execution lands in generic_permission() which checks upfront if the requested permission is granted for everyone. We can elide the branches which are guaranteed to be false and cut straight to the check if everyone happens to be allowed MAY_EXEC on the inode (which holds true most of the time). Moreover, filesystems which provide their own ->permission routine can take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC flag on their inodes, which they can legitimately do if their MAY_EXEC handling matches generic_permission(). As a simple benchmark, as part of compilation gcc issues access(2) on numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s): before: 3797556 after: 3987789 (+5%) Note: this depends on the not-yet-landed ext4 patch to mark inodes with cache_no_acl() Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++++-- include/linux/fs.h | 13 +++++++------ 2 files changed, 48 insertions(+), 8 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index a9f9d0453425..6b2a5a5478e7 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -540,6 +540,9 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) * * Separate out file-system wide checks from inode-specific permission checks. + * + * Note: lookup_inode_permission_may_exec() does not call here. If you add + * MAY_EXEC checks, adjust it. */ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) { @@ -602,6 +605,42 @@ int inode_permission(struct mnt_idmap *idmap, } EXPORT_SYMBOL(inode_permission); +/** + * lookup_inode_permission_may_exec - Check traversal right for given inode + * + * This is a special case routine for may_lookup() making assumptions specific + * to path traversal. Use inode_permission() if you are doing something else. + * + * Work is shaved off compared to inode_permission() as follows: + * - we know for a fact there is no MAY_WRITE to worry about + * - it is an invariant the inode is a directory + * + * Since majority of real-world traversal happens on inodes which grant it for + * everyone, we check it upfront and only resort to more expensive work if it + * fails. + * + * Filesystems which have their own ->permission hook and consequently miss out + * on IOP_FASTPERM can still get the optimization if they set IOP_FASTPERM_MAY_EXEC + * on their directory inodes. + */ +static __always_inline int lookup_inode_permission_may_exec(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + /* Lookup already checked this to return -ENOTDIR */ + VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode), inode); + VFS_BUG_ON((mask & ~MAY_NOT_BLOCK) != 0); + + mask |= MAY_EXEC; + + if (unlikely(!(inode->i_opflags & (IOP_FASTPERM | IOP_FASTPERM_MAY_EXEC)))) + return inode_permission(idmap, inode, mask); + + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) + return inode_permission(idmap, inode, mask); + + return security_inode_permission(inode, mask); +} + /** * path_get - get a reference to a path * @path: path to get the reference to @@ -1855,7 +1894,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, int err, mask; mask = nd->flags & LOOKUP_RCU ? MAY_NOT_BLOCK : 0; - err = inode_permission(idmap, nd->inode, mask | MAY_EXEC); + err = lookup_inode_permission_may_exec(idmap, nd->inode, mask); if (likely(!err)) return 0; @@ -1870,7 +1909,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, if (err != -ECHILD) // hard error return err; - return inode_permission(idmap, nd->inode, MAY_EXEC); + return lookup_inode_permission_may_exec(idmap, nd->inode, 0); } static int reserve_stack(struct nameidata *nd, struct path *link) diff --git a/include/linux/fs.h b/include/linux/fs.h index 03e450dd5211..7d5de647ac7b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -647,13 +647,14 @@ is_uncached_acl(struct posix_acl *acl) return (long)acl & 1; } -#define IOP_FASTPERM 0x0001 -#define IOP_LOOKUP 0x0002 -#define IOP_NOFOLLOW 0x0004 -#define IOP_XATTR 0x0008 +#define IOP_FASTPERM 0x0001 +#define IOP_LOOKUP 0x0002 +#define IOP_NOFOLLOW 0x0004 +#define IOP_XATTR 0x0008 #define IOP_DEFAULT_READLINK 0x0010 -#define IOP_MGTIME 0x0020 -#define IOP_CACHED_LINK 0x0040 +#define IOP_MGTIME 0x0020 +#define IOP_CACHED_LINK 0x0040 +#define IOP_FASTPERM_MAY_EXEC 0x0080 /* * Inode state bits. Protected by inode->i_lock -- 2.48.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik @ 2025-11-10 9:32 ` Jan Kara 2025-11-10 9:46 ` Mateusz Guzik 2025-11-11 9:41 ` Christian Brauner 1 sibling, 1 reply; 14+ messages in thread From: Jan Kara @ 2025-11-10 9:32 UTC (permalink / raw) To: Mateusz Guzik Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Fri 07-11-25 15:21:47, Mateusz Guzik wrote: > The generic inode_permission() routine does work which is known to be of > no significance for lookup. There are checks for MAY_WRITE, while the > requested permission is MAY_EXEC. Additionally devcgroup_inode_permission() > is called to check for devices, but it is an invariant the inode is a > directory. > > Absent a ->permission func, execution lands in generic_permission() > which checks upfront if the requested permission is granted for > everyone. > > We can elide the branches which are guaranteed to be false and cut > straight to the check if everyone happens to be allowed MAY_EXEC on the > inode (which holds true most of the time). > > Moreover, filesystems which provide their own ->permission routine can > take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC > flag on their inodes, which they can legitimately do if their MAY_EXEC > handling matches generic_permission(). > > As a simple benchmark, as part of compilation gcc issues access(2) on > numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o > > Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s): > before: 3797556 > after: 3987789 (+5%) > > Note: this depends on the not-yet-landed ext4 patch to mark inodes with > cache_no_acl() > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> The gain is nice. I'm just wondering where exactly is it coming from? I don't see that we'd be saving some memory load or significant amount of work. So is it really coming from the more compact code and saved several unlikely branches and function calls? Honza > --- > fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++++-- > include/linux/fs.h | 13 +++++++------ > 2 files changed, 48 insertions(+), 8 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index a9f9d0453425..6b2a5a5478e7 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -540,6 +540,9 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, > * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > * > * Separate out file-system wide checks from inode-specific permission checks. > + * > + * Note: lookup_inode_permission_may_exec() does not call here. If you add > + * MAY_EXEC checks, adjust it. > */ > static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > { > @@ -602,6 +605,42 @@ int inode_permission(struct mnt_idmap *idmap, > } > EXPORT_SYMBOL(inode_permission); > > +/** > + * lookup_inode_permission_may_exec - Check traversal right for given inode > + * > + * This is a special case routine for may_lookup() making assumptions specific > + * to path traversal. Use inode_permission() if you are doing something else. > + * > + * Work is shaved off compared to inode_permission() as follows: > + * - we know for a fact there is no MAY_WRITE to worry about > + * - it is an invariant the inode is a directory > + * > + * Since majority of real-world traversal happens on inodes which grant it for > + * everyone, we check it upfront and only resort to more expensive work if it > + * fails. > + * > + * Filesystems which have their own ->permission hook and consequently miss out > + * on IOP_FASTPERM can still get the optimization if they set IOP_FASTPERM_MAY_EXEC > + * on their directory inodes. > + */ > +static __always_inline int lookup_inode_permission_may_exec(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + /* Lookup already checked this to return -ENOTDIR */ > + VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode), inode); > + VFS_BUG_ON((mask & ~MAY_NOT_BLOCK) != 0); > + > + mask |= MAY_EXEC; > + > + if (unlikely(!(inode->i_opflags & (IOP_FASTPERM | IOP_FASTPERM_MAY_EXEC)))) > + return inode_permission(idmap, inode, mask); > + > + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) > + return inode_permission(idmap, inode, mask); > + > + return security_inode_permission(inode, mask); > +} > + > /** > * path_get - get a reference to a path > * @path: path to get the reference to > @@ -1855,7 +1894,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > int err, mask; > > mask = nd->flags & LOOKUP_RCU ? MAY_NOT_BLOCK : 0; > - err = inode_permission(idmap, nd->inode, mask | MAY_EXEC); > + err = lookup_inode_permission_may_exec(idmap, nd->inode, mask); > if (likely(!err)) > return 0; > > @@ -1870,7 +1909,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > if (err != -ECHILD) // hard error > return err; > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return lookup_inode_permission_may_exec(idmap, nd->inode, 0); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 03e450dd5211..7d5de647ac7b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -647,13 +647,14 @@ is_uncached_acl(struct posix_acl *acl) > return (long)acl & 1; > } > > -#define IOP_FASTPERM 0x0001 > -#define IOP_LOOKUP 0x0002 > -#define IOP_NOFOLLOW 0x0004 > -#define IOP_XATTR 0x0008 > +#define IOP_FASTPERM 0x0001 > +#define IOP_LOOKUP 0x0002 > +#define IOP_NOFOLLOW 0x0004 > +#define IOP_XATTR 0x0008 > #define IOP_DEFAULT_READLINK 0x0010 > -#define IOP_MGTIME 0x0020 > -#define IOP_CACHED_LINK 0x0040 > +#define IOP_MGTIME 0x0020 > +#define IOP_CACHED_LINK 0x0040 > +#define IOP_FASTPERM_MAY_EXEC 0x0080 > > /* > * Inode state bits. Protected by inode->i_lock > -- > 2.48.1 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-10 9:32 ` Jan Kara @ 2025-11-10 9:46 ` Mateusz Guzik 2025-11-10 10:13 ` Jan Kara 0 siblings, 1 reply; 14+ messages in thread From: Mateusz Guzik @ 2025-11-10 9:46 UTC (permalink / raw) To: Jan Kara Cc: brauner, viro, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Mon, Nov 10, 2025 at 10:32 AM Jan Kara <jack@suse.cz> wrote: > > On Fri 07-11-25 15:21:47, Mateusz Guzik wrote: > > The generic inode_permission() routine does work which is known to be of > > no significance for lookup. There are checks for MAY_WRITE, while the > > requested permission is MAY_EXEC. Additionally devcgroup_inode_permission() > > is called to check for devices, but it is an invariant the inode is a > > directory. > > > > Absent a ->permission func, execution lands in generic_permission() > > which checks upfront if the requested permission is granted for > > everyone. > > > > We can elide the branches which are guaranteed to be false and cut > > straight to the check if everyone happens to be allowed MAY_EXEC on the > > inode (which holds true most of the time). > > > > Moreover, filesystems which provide their own ->permission routine can > > take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC > > flag on their inodes, which they can legitimately do if their MAY_EXEC > > handling matches generic_permission(). > > > > As a simple benchmark, as part of compilation gcc issues access(2) on > > numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o > > > > Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s): > > before: 3797556 > > after: 3987789 (+5%) > > > > Note: this depends on the not-yet-landed ext4 patch to mark inodes with > > cache_no_acl() > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > The gain is nice. I'm just wondering where exactly is it coming from? I > don't see that we'd be saving some memory load or significant amount of > work. So is it really coming from the more compact code and saved several > unlikely branches and function calls? > That's several branches and 2 function calls per path component on the way to the terminal inode. In the path at hand, that's 10 function calls elided. > Honza > > > --- > > fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++++-- > > include/linux/fs.h | 13 +++++++------ > > 2 files changed, 48 insertions(+), 8 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index a9f9d0453425..6b2a5a5478e7 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -540,6 +540,9 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, > > * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > > * > > * Separate out file-system wide checks from inode-specific permission checks. > > + * > > + * Note: lookup_inode_permission_may_exec() does not call here. If you add > > + * MAY_EXEC checks, adjust it. > > */ > > static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > > { > > @@ -602,6 +605,42 @@ int inode_permission(struct mnt_idmap *idmap, > > } > > EXPORT_SYMBOL(inode_permission); > > > > +/** > > + * lookup_inode_permission_may_exec - Check traversal right for given inode > > + * > > + * This is a special case routine for may_lookup() making assumptions specific > > + * to path traversal. Use inode_permission() if you are doing something else. > > + * > > + * Work is shaved off compared to inode_permission() as follows: > > + * - we know for a fact there is no MAY_WRITE to worry about > > + * - it is an invariant the inode is a directory > > + * > > + * Since majority of real-world traversal happens on inodes which grant it for > > + * everyone, we check it upfront and only resort to more expensive work if it > > + * fails. > > + * > > + * Filesystems which have their own ->permission hook and consequently miss out > > + * on IOP_FASTPERM can still get the optimization if they set IOP_FASTPERM_MAY_EXEC > > + * on their directory inodes. > > + */ > > +static __always_inline int lookup_inode_permission_may_exec(struct mnt_idmap *idmap, > > + struct inode *inode, int mask) > > +{ > > + /* Lookup already checked this to return -ENOTDIR */ > > + VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode), inode); > > + VFS_BUG_ON((mask & ~MAY_NOT_BLOCK) != 0); > > + > > + mask |= MAY_EXEC; > > + > > + if (unlikely(!(inode->i_opflags & (IOP_FASTPERM | IOP_FASTPERM_MAY_EXEC)))) > > + return inode_permission(idmap, inode, mask); > > + > > + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) > > + return inode_permission(idmap, inode, mask); > > + > > + return security_inode_permission(inode, mask); > > +} > > + > > /** > > * path_get - get a reference to a path > > * @path: path to get the reference to > > @@ -1855,7 +1894,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > > int err, mask; > > > > mask = nd->flags & LOOKUP_RCU ? MAY_NOT_BLOCK : 0; > > - err = inode_permission(idmap, nd->inode, mask | MAY_EXEC); > > + err = lookup_inode_permission_may_exec(idmap, nd->inode, mask); > > if (likely(!err)) > > return 0; > > > > @@ -1870,7 +1909,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > > if (err != -ECHILD) // hard error > > return err; > > > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > > + return lookup_inode_permission_may_exec(idmap, nd->inode, 0); > > } > > > > static int reserve_stack(struct nameidata *nd, struct path *link) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 03e450dd5211..7d5de647ac7b 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -647,13 +647,14 @@ is_uncached_acl(struct posix_acl *acl) > > return (long)acl & 1; > > } > > > > -#define IOP_FASTPERM 0x0001 > > -#define IOP_LOOKUP 0x0002 > > -#define IOP_NOFOLLOW 0x0004 > > -#define IOP_XATTR 0x0008 > > +#define IOP_FASTPERM 0x0001 > > +#define IOP_LOOKUP 0x0002 > > +#define IOP_NOFOLLOW 0x0004 > > +#define IOP_XATTR 0x0008 > > #define IOP_DEFAULT_READLINK 0x0010 > > -#define IOP_MGTIME 0x0020 > > -#define IOP_CACHED_LINK 0x0040 > > +#define IOP_MGTIME 0x0020 > > +#define IOP_CACHED_LINK 0x0040 > > +#define IOP_FASTPERM_MAY_EXEC 0x0080 > > > > /* > > * Inode state bits. Protected by inode->i_lock > > -- > > 2.48.1 > > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-10 9:46 ` Mateusz Guzik @ 2025-11-10 10:13 ` Jan Kara 2025-11-10 12:42 ` Mateusz Guzik 0 siblings, 1 reply; 14+ messages in thread From: Jan Kara @ 2025-11-10 10:13 UTC (permalink / raw) To: Mateusz Guzik Cc: Jan Kara, brauner, viro, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Mon 10-11-25 10:46:38, Mateusz Guzik wrote: > On Mon, Nov 10, 2025 at 10:32 AM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 07-11-25 15:21:47, Mateusz Guzik wrote: > > > The generic inode_permission() routine does work which is known to be of > > > no significance for lookup. There are checks for MAY_WRITE, while the > > > requested permission is MAY_EXEC. Additionally devcgroup_inode_permission() > > > is called to check for devices, but it is an invariant the inode is a > > > directory. > > > > > > Absent a ->permission func, execution lands in generic_permission() > > > which checks upfront if the requested permission is granted for > > > everyone. > > > > > > We can elide the branches which are guaranteed to be false and cut > > > straight to the check if everyone happens to be allowed MAY_EXEC on the > > > inode (which holds true most of the time). > > > > > > Moreover, filesystems which provide their own ->permission routine can > > > take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC > > > flag on their inodes, which they can legitimately do if their MAY_EXEC > > > handling matches generic_permission(). > > > > > > As a simple benchmark, as part of compilation gcc issues access(2) on > > > numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o > > > > > > Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s): > > > before: 3797556 > > > after: 3987789 (+5%) > > > > > > Note: this depends on the not-yet-landed ext4 patch to mark inodes with > > > cache_no_acl() > > > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > > > The gain is nice. I'm just wondering where exactly is it coming from? I > > don't see that we'd be saving some memory load or significant amount of > > work. So is it really coming from the more compact code and saved several > > unlikely branches and function calls? > > That's several branches and 2 function calls per path component on the > way to the terminal inode. In the path at hand, that's 10 function > calls elided. OK, the path lookup is really light so I guess 10 function calls are visible enough. I guess this is hot enough path that the microoptimization is worth the code duplication. So feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-10 10:13 ` Jan Kara @ 2025-11-10 12:42 ` Mateusz Guzik 0 siblings, 0 replies; 14+ messages in thread From: Mateusz Guzik @ 2025-11-10 12:42 UTC (permalink / raw) To: Jan Kara Cc: brauner, viro, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4270 bytes --] On Mon, Nov 10, 2025 at 11:13 AM Jan Kara <jack@suse.cz> wrote: > OK, the path lookup is really light I would not go that far ;) The current code has function calls which can be either inlined or elided. More importantly it is a massive branch-fest, notably with repeated LOOKUP_RCU checks. Based on my work on the same stuff $elsewhere, most of the time the entry in the cache is there and is a directory you can traverse through and which is not mounted on. While there is a bunch of likely/unlikely usage to help out, the code is not structured in a way which allows for easy use of it. Instead some of the branches are repeated or have to be present to begin with. Ideally lookup could roll forward over a pathname without function calls as long as fast path conditions hold. You would still need to pay to check permissions and that this is a non-mounted directory for every path component, but some of this can be combined. Per the above, the repeated LOOKUP_RCU checks would be whacked. Checking if this is a directory which got mounted on *OR* is it a symlink could be one branch and so on. On path parsing side, userspace could have passed something fucky like foo/////bar and this of course needs to be handled but it does not require the current ugliness to do so. This does happen with real programs (typically two slashes in a row), but is also constitutes a small minority of paths. The current code makes sure to skip the spurious slashes before looking up the name. My code $elsewhere instead notes it is an invariant that a name containing a slash cannot appear in the cache so it just goes forward with the lookup. If an entry is found, the name could not have started with / and the check is elided (common case). Should the entry be missing then indeed we check if slashes need to get rolled over. And so on. I think I can incrementally reduce a bunch of overhead, but it will always be leaving some perf on the table unless restructured. As for some profiling of the state, I booted up a kernel with all of my patches (including an extra to elide security_inode_permission) + sheaves and perf top'ed over a testcase which consists of series of access(2) calls lifted from strace on gcc and the linker. To the tune of 205 paths, some of them repeated and several deranged -- for example: access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/Scrt1.o", R_OK); The file is attached for interested. The profile: 20.43% [kernel] [k] __d_lookup_rcu 10.66% [kernel] [k] entry_SYSCALL_64 9.50% [kernel] [k] link_path_walk 6.98% libc.so.6 [.] __GI___access 6.04% [kernel] [k] strncpy_from_user 4.81% [kernel] [k] step_into 3.36% [kernel] [k] kmem_cache_alloc_noprof 2.80% [kernel] [k] kmem_cache_free 2.77% [kernel] [k] walk_component 2.18% [kernel] [k] lookup_fast 1.83% [kernel] [k] set_root 1.83% [kernel] [k] do_syscall_64 1.65% [kernel] [k] getname_flags.part.0 1.57% [kernel] [k] entry_SYSCALL_64_safe_stack 1.52% [kernel] [k] nd_jump_root 1.48% [kernel] [k] filename_lookup 1.34% [kernel] [k] path_init 1.33% [kernel] [k] do_faccessat 1.23% [kernel] [k] __legitimize_mnt 1.23% [kernel] [k] lockref_get_not_dead 0.96% [kernel] [k] path_lookupat 0.92% [kernel] [k] lockref_put_return 0.86% [kernel] [k] its_return_thunk 0.83% [kernel] [k] entry_SYSCALL_64_after_hwframe 0.80% [kernel] [k] map_id_range_down 0.68% [kernel] [k] user_path_at [-- Attachment #2: access_compile.c --] [-- Type: text/x-csrc, Size: 12408 bytes --] #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> char *testcase_description = "access(2) calls invoked by gcc and the linker"; void testcase(unsigned long long *iterations, unsigned long nr) { while (1) { access("/etc/ld.so.preload", R_OK); access("/bin/sh", X_OK); access("/etc/ld.so.preload", R_OK); access("/etc/ld.so.preload", R_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/cc1", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/liblto_plugin.so", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtbeginS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/bin/ld", X_OK); access("/usr/bin/nm", X_OK); access("/usr/bin/strip", X_OK); access("/usr/bin/cc", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/etc/ld.so.preload", R_OK); access("/bin/sh", X_OK); access("/etc/ld.so.preload", R_OK); access("/etc/ld.so.preload", R_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/cc1", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/bin/cc", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/cc1", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/bin/cc", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/cc1", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/liblto_plugin.so", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtbeginS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/bin/ld", X_OK); access("/usr/bin/nm", X_OK); access("/usr/bin/strip", X_OK); access("/usr/bin/cc", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/bin/cc", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/local/sbin/cc", X_OK); access("/usr/local/bin/cc", X_OK); access("/usr/sbin/cc", X_OK); access("/usr/bin/cc", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/specs", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/cc1", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/liblto_plugin.so", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/Scrt1.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crti.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtbeginS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/lib/../lib/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/12/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu/crtn.o", R_OK); access("/usr/lib/gcc/x86_64-linux-gnu/12/collect2", X_OK); access("/etc/ld.so.preload", R_OK); access("/usr/bin/ld", X_OK); access("/usr/bin/nm", X_OK); access("/usr/bin/strip", X_OK); access("/usr/bin/cc", X_OK); access("/tmp", R_OK|W_OK|X_OK); access("/etc/ld.so.preload", R_OK); (*iterations)++; } } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik 2025-11-10 9:32 ` Jan Kara @ 2025-11-11 9:41 ` Christian Brauner 2025-11-11 10:51 ` Mateusz Guzik 1 sibling, 1 reply; 14+ messages in thread From: Christian Brauner @ 2025-11-11 9:41 UTC (permalink / raw) To: Mateusz Guzik Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Fri, Nov 07, 2025 at 03:21:47PM +0100, Mateusz Guzik wrote: > The generic inode_permission() routine does work which is known to be of > no significance for lookup. There are checks for MAY_WRITE, while the > requested permission is MAY_EXEC. Additionally devcgroup_inode_permission() > is called to check for devices, but it is an invariant the inode is a > directory. > > Absent a ->permission func, execution lands in generic_permission() > which checks upfront if the requested permission is granted for > everyone. > > We can elide the branches which are guaranteed to be false and cut > straight to the check if everyone happens to be allowed MAY_EXEC on the > inode (which holds true most of the time). > > Moreover, filesystems which provide their own ->permission routine can > take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC > flag on their inodes, which they can legitimately do if their MAY_EXEC > handling matches generic_permission(). > > As a simple benchmark, as part of compilation gcc issues access(2) on > numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o > > Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s): > before: 3797556 > after: 3987789 (+5%) > > Note: this depends on the not-yet-landed ext4 patch to mark inodes with > cache_no_acl() > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > --- > fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++++-- > include/linux/fs.h | 13 +++++++------ > 2 files changed, 48 insertions(+), 8 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index a9f9d0453425..6b2a5a5478e7 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -540,6 +540,9 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, > * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > * > * Separate out file-system wide checks from inode-specific permission checks. > + * > + * Note: lookup_inode_permission_may_exec() does not call here. If you add > + * MAY_EXEC checks, adjust it. > */ > static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > { > @@ -602,6 +605,42 @@ int inode_permission(struct mnt_idmap *idmap, > } > EXPORT_SYMBOL(inode_permission); > > +/** > + * lookup_inode_permission_may_exec - Check traversal right for given inode > + * > + * This is a special case routine for may_lookup() making assumptions specific > + * to path traversal. Use inode_permission() if you are doing something else. > + * > + * Work is shaved off compared to inode_permission() as follows: > + * - we know for a fact there is no MAY_WRITE to worry about > + * - it is an invariant the inode is a directory > + * > + * Since majority of real-world traversal happens on inodes which grant it for > + * everyone, we check it upfront and only resort to more expensive work if it > + * fails. > + * > + * Filesystems which have their own ->permission hook and consequently miss out > + * on IOP_FASTPERM can still get the optimization if they set IOP_FASTPERM_MAY_EXEC > + * on their directory inodes. > + */ > +static __always_inline int lookup_inode_permission_may_exec(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + /* Lookup already checked this to return -ENOTDIR */ > + VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode), inode); > + VFS_BUG_ON((mask & ~MAY_NOT_BLOCK) != 0); > + > + mask |= MAY_EXEC; > + > + if (unlikely(!(inode->i_opflags & (IOP_FASTPERM | IOP_FASTPERM_MAY_EXEC)))) > + return inode_permission(idmap, inode, mask); > + > + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) Can you send a follow-up where 0111 is a constant with some descriptive name, please? Can be local to the file. I hate these raw-coded permission masks with a passion. > + return inode_permission(idmap, inode, mask); > + > + return security_inode_permission(inode, mask); > +} > + > /** > * path_get - get a reference to a path > * @path: path to get the reference to > @@ -1855,7 +1894,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > int err, mask; > > mask = nd->flags & LOOKUP_RCU ? MAY_NOT_BLOCK : 0; > - err = inode_permission(idmap, nd->inode, mask | MAY_EXEC); > + err = lookup_inode_permission_may_exec(idmap, nd->inode, mask); > if (likely(!err)) > return 0; > > @@ -1870,7 +1909,7 @@ static inline int may_lookup(struct mnt_idmap *idmap, > if (err != -ECHILD) // hard error > return err; > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return lookup_inode_permission_may_exec(idmap, nd->inode, 0); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 03e450dd5211..7d5de647ac7b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -647,13 +647,14 @@ is_uncached_acl(struct posix_acl *acl) > return (long)acl & 1; > } > > -#define IOP_FASTPERM 0x0001 > -#define IOP_LOOKUP 0x0002 > -#define IOP_NOFOLLOW 0x0004 > -#define IOP_XATTR 0x0008 > +#define IOP_FASTPERM 0x0001 > +#define IOP_LOOKUP 0x0002 > +#define IOP_NOFOLLOW 0x0004 > +#define IOP_XATTR 0x0008 > #define IOP_DEFAULT_READLINK 0x0010 > -#define IOP_MGTIME 0x0020 > -#define IOP_CACHED_LINK 0x0040 > +#define IOP_MGTIME 0x0020 > +#define IOP_CACHED_LINK 0x0040 > +#define IOP_FASTPERM_MAY_EXEC 0x0080 > > /* > * Inode state bits. Protected by inode->i_lock > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-11 9:41 ` Christian Brauner @ 2025-11-11 10:51 ` Mateusz Guzik 2025-11-11 11:47 ` Mateusz Guzik 0 siblings, 1 reply; 14+ messages in thread From: Mateusz Guzik @ 2025-11-11 10:51 UTC (permalink / raw) To: Christian Brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Tue, Nov 11, 2025 at 10:41 AM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Nov 07, 2025 at 03:21:47PM +0100, Mateusz Guzik wrote: > > + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) > > Can you send a follow-up where 0111 is a constant with some descriptive > name, please? Can be local to the file. I hate these raw-coded > permission masks with a passion. > #define UNIX_PERM_ALL_X 0111? I have no opinion about hardcoding this vs using a macro, but don't have a good name for that one either. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC 2025-11-11 10:51 ` Mateusz Guzik @ 2025-11-11 11:47 ` Mateusz Guzik 0 siblings, 0 replies; 14+ messages in thread From: Mateusz Guzik @ 2025-11-11 11:47 UTC (permalink / raw) To: Christian Brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Tue, Nov 11, 2025 at 11:51 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Tue, Nov 11, 2025 at 10:41 AM Christian Brauner <brauner@kernel.org> wrote: > > > > On Fri, Nov 07, 2025 at 03:21:47PM +0100, Mateusz Guzik wrote: > > > + if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode))) > > > > Can you send a follow-up where 0111 is a constant with some descriptive > > name, please? Can be local to the file. I hate these raw-coded > > permission masks with a passion. > > > > #define UNIX_PERM_ALL_X 0111? > > I have no opinion about hardcoding this vs using a macro, but don't > have a good name for that one either. Apart from usage added by me here there is: fs/coredump.c: if ((READ_ONCE(file_inode(vma->vm_file)->i_mode) & 0111) != 0) fs/namei.c: * - multiplying by 0111 spreads them out to all of ugo fs/namei.c: if (!((mask & 7) * 0111 & ~mode)) { That's ignoring other spots which definitely want 0111 spelled out in per-fs code. I would argue the other 2 in namei.c want this spelled out numerically as well: │* - 'mask&7' is the requested permission bit set │* - multiplying by 0111 spreads them out to all of ugo │* - '& ~mode' looks for missing inode permission bits │* - the '!' is for "no missing permissions" [snip] if (!((mask & 7) * 0111 & ~mode)) { But then it may make sense to keep this numerical in the new code as well so that anyone looking at lookup_inode_permission_may_exec() and inode_permission()->generic_permission()->acl_permission_check() can see it's the same thing. I figured maybe a comment would do the trick above the 0111 usage, but the commentary added at the top of the func imo covers it: * Since majority of real-world traversal happens on inodes which grant it for * everyone, we check it upfront and only resort to more expensive work if it * fails. All that said, now that I look at it, I think the code is best left off with spelled out 0111 in place so I wont be submitting a patch to change that. Given that hiding it behind some name or adding a comment is a trivial edit, I don't think it's much of a burden for you to do it should you chose to make such a change anyway. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC 2025-11-07 14:21 [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Mateusz Guzik 2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik @ 2025-11-07 14:21 ` Mateusz Guzik 2025-11-10 15:40 ` David Sterba 2025-11-07 14:21 ` [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() Mateusz Guzik 2025-11-11 9:42 ` [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Christian Brauner 3 siblings, 1 reply; 14+ messages in thread From: Mateusz Guzik @ 2025-11-07 14:21 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs, Mateusz Guzik Root filesystem was ext4, btrfs was mounted on /testfs. Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h on Sapphire Rapids (ops/s): before: 3447976 after: 3620879 (+5%) Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/btrfs/inode.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 42da39c1e5b5..1a560f7298bf 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5852,6 +5852,8 @@ struct btrfs_inode *btrfs_iget(u64 ino, struct btrfs_root *root) if (ret) return ERR_PTR(ret); + if (S_ISDIR(inode->vfs_inode.i_mode)) + inode->vfs_inode.i_opflags |= IOP_FASTPERM_MAY_EXEC; unlock_new_inode(&inode->vfs_inode); return inode; } @@ -6803,8 +6805,11 @@ static int btrfs_create_common(struct inode *dir, struct dentry *dentry, } ret = btrfs_create_new_inode(trans, &new_inode_args); - if (!ret) + if (!ret) { + if (S_ISDIR(inode->i_mode)) + inode->i_opflags |= IOP_FASTPERM_MAY_EXEC; d_instantiate_new(dentry, inode); + } btrfs_end_transaction(trans); btrfs_btree_balance_dirty(fs_info); @@ -9163,6 +9168,11 @@ int btrfs_prealloc_file_range_trans(struct inode *inode, min_size, actual_len, alloc_hint, trans); } +/* + * NOTE: in case you are adding MAY_EXEC check for directories: + * we are marking them with IOP_FASTPERM_MAY_EXEC, allowing path lookup to + * elide calls here. + */ static int btrfs_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { -- 2.48.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC 2025-11-07 14:21 ` [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC Mateusz Guzik @ 2025-11-10 15:40 ` David Sterba 0 siblings, 0 replies; 14+ messages in thread From: David Sterba @ 2025-11-10 15:40 UTC (permalink / raw) To: Mateusz Guzik Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Fri, Nov 07, 2025 at 03:21:48PM +0100, Mateusz Guzik wrote: > Root filesystem was ext4, btrfs was mounted on /testfs. > > Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h > on Sapphire Rapids (ops/s): > > before: 3447976 > after: 3620879 (+5%) > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Thanks. Acked-by: David Sterba <dsterba@suse.com> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() 2025-11-07 14:21 [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Mateusz Guzik 2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik 2025-11-07 14:21 ` [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC Mateusz Guzik @ 2025-11-07 14:21 ` Mateusz Guzik 2025-11-10 10:15 ` Jan Kara 2025-11-11 9:42 ` [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Christian Brauner 3 siblings, 1 reply; 14+ messages in thread From: Mateusz Guzik @ 2025-11-07 14:21 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs, Mateusz Guzik The primary non-MAY_WRITE consumer now uses lookup_inode_permission_may_exec(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/namei.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 6b2a5a5478e7..2a112b2c0951 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -546,7 +546,7 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, */ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) { - if (unlikely(mask & MAY_WRITE)) { + if (mask & MAY_WRITE) { umode_t mode = inode->i_mode; /* Nobody gets write access to a read-only fs. */ @@ -577,7 +577,7 @@ int inode_permission(struct mnt_idmap *idmap, if (unlikely(retval)) return retval; - if (unlikely(mask & MAY_WRITE)) { + if (mask & MAY_WRITE) { /* * Nobody gets write access to an immutable file. */ -- 2.48.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() 2025-11-07 14:21 ` [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() Mateusz Guzik @ 2025-11-10 10:15 ` Jan Kara 0 siblings, 0 replies; 14+ messages in thread From: Jan Kara @ 2025-11-10 10:15 UTC (permalink / raw) To: Mateusz Guzik Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Fri 07-11-25 15:21:49, Mateusz Guzik wrote: > The primary non-MAY_WRITE consumer now uses lookup_inode_permission_may_exec(). > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Makes sense. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/namei.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 6b2a5a5478e7..2a112b2c0951 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -546,7 +546,7 @@ static inline int do_inode_permission(struct mnt_idmap *idmap, > */ > static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > { > - if (unlikely(mask & MAY_WRITE)) { > + if (mask & MAY_WRITE) { > umode_t mode = inode->i_mode; > > /* Nobody gets write access to a read-only fs. */ > @@ -577,7 +577,7 @@ int inode_permission(struct mnt_idmap *idmap, > if (unlikely(retval)) > return retval; > > - if (unlikely(mask & MAY_WRITE)) { > + if (mask & MAY_WRITE) { > /* > * Nobody gets write access to an immutable file. > */ > -- > 2.48.1 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup 2025-11-07 14:21 [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Mateusz Guzik ` (2 preceding siblings ...) 2025-11-07 14:21 ` [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() Mateusz Guzik @ 2025-11-11 9:42 ` Christian Brauner 3 siblings, 0 replies; 14+ messages in thread From: Christian Brauner @ 2025-11-11 9:42 UTC (permalink / raw) To: Mateusz Guzik Cc: Christian Brauner, viro, jack, linux-kernel, linux-fsdevel, linux-ext4, tytso, torvalds, josef, linux-btrfs On Fri, 07 Nov 2025 15:21:46 +0100, Mateusz Guzik wrote: > Commit message in patch 1 says it all. > > In short, MAY_WRITE checks are elided. > > This obsoletes the idea of pre-computing if perm checks are necessary as > that turned out to be too hairy. The new code has 2 more branches per > path component compared to that idea, but the perf difference for > typical paths (< 6 components) was basically within noise. To be > revisited if someone(tm) removes other slowdowns. > > [...] Applied to the vfs-6.19.misc branch of the vfs/vfs.git tree. Patches in the vfs-6.19.misc branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs-6.19.misc [1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC https://git.kernel.org/vfs/vfs/c/5ecf656231cc [2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC https://git.kernel.org/vfs/vfs/c/d0231059c7f2 [3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() https://git.kernel.org/vfs/vfs/c/e3059792dec1 ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-11-11 11:47 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-11-07 14:21 [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Mateusz Guzik 2025-11-07 14:21 ` [PATCH v3 1/3] fs: speed up path lookup with cheaper handling of MAY_EXEC Mateusz Guzik 2025-11-10 9:32 ` Jan Kara 2025-11-10 9:46 ` Mateusz Guzik 2025-11-10 10:13 ` Jan Kara 2025-11-10 12:42 ` Mateusz Guzik 2025-11-11 9:41 ` Christian Brauner 2025-11-11 10:51 ` Mateusz Guzik 2025-11-11 11:47 ` Mateusz Guzik 2025-11-07 14:21 ` [PATCH v3 2/3] btrfs: utilize IOP_FASTPERM_MAY_EXEC Mateusz Guzik 2025-11-10 15:40 ` David Sterba 2025-11-07 14:21 ` [PATCH v3 3/3] fs: retire now stale MAY_WRITE predicts in inode_permission() Mateusz Guzik 2025-11-10 10:15 ` Jan Kara 2025-11-11 9:42 ` [PATCH v3 0/3] cheaper MAY_EXEC handling for path lookup Christian Brauner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).