* [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
@ 2024-08-06 14:32 Jeff Layton
2024-08-06 15:25 ` Mateusz Guzik
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Jeff Layton @ 2024-08-06 14:32 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
Cc: Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel,
Jeff Layton
Today, when opening a file we'll typically do a fast lookup, but if
O_CREAT is set, the kernel always takes the exclusive inode lock. I
assume this was done with the expectation that O_CREAT means that we
always expect to do the create, but that's often not the case. Many
programs set O_CREAT even in scenarios where the file already exists.
This patch rearranges the pathwalk-for-open code to also attempt a
fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
inode_lock can be avoided altogether, and if auditing isn't enabled, it
can stay in rcuwalk mode for the last step_into.
One notable exception that is hopefully temporary: if we're doing an
rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
dentry in that case is more expensive than taking the i_rwsem for now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Here's a revised patch that does a fast_lookup in the O_CREAT codepath
too. The main difference here is that if a positive dentry is found and
audit_dummy_context is true, then we keep the walk lazy for the last
component, which avoids having to take any locks on the parent (just
like with non-O_CREAT opens).
The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
With this patch, it runs in about 1s:
#define _GNU_SOURCE 1
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#define PROCS 70
#define LOOPS 500000
static int openloop(int tnum)
{
char *file;
int i, ret;
ret = asprintf(&file, "./testfile%d", tnum);
if (ret < 0) {
printf("asprintf failed for proc %d", tnum);
return 1;
}
for (i = 0; i < LOOPS; ++i) {
int fd = open(file, O_RDWR|O_CREAT, 0644);
if (fd < 0) {
perror("open");
return 1;
}
close(fd);
}
unlink(file);
free(file);
return 0;
}
int main(int argc, char **argv) {
pid_t kids[PROCS];
int i, ret = 0;
for (i = 0; i < PROCS; ++i) {
kids[i] = fork();
if (kids[i] > 0)
return openloop(i);
if (kids[i] < 0)
perror("fork");
}
for (i = 0; i < PROCS; ++i) {
int ret2;
if (kids[i] > 0) {
wait(&ret2);
if (ret2 != 0)
ret = ret2;
}
}
return ret;
}
---
Changes in v2:
- drop the lockref patch since Mateusz is working on a better approach
- add trailing_slashes helper function
- add a lookup_fast_for_open helper function
- make lookup_fast_for_open skip the lookup if auditing is enabled
- if we find a positive dentry and auditing is disabled, don't unlazy
- Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org
---
fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 53 insertions(+), 9 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1e05a0f3f04d..2d716fb114c9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
return ERR_PTR(error);
}
+static inline bool trailing_slashes(struct nameidata *nd)
+{
+ return (bool)nd->last.name[nd->last.len];
+}
+
+static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
+{
+ struct dentry *dentry;
+
+ if (open_flag & O_CREAT) {
+ /* Don't bother on an O_EXCL create */
+ if (open_flag & O_EXCL)
+ return NULL;
+
+ /*
+ * FIXME: If auditing is enabled, then we'll have to unlazy to
+ * use the dentry. For now, don't do this, since it shifts
+ * contention from parent's i_rwsem to its d_lockref spinlock.
+ * Reconsider this once dentry refcounting handles heavy
+ * contention better.
+ */
+ if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
+ return NULL;
+ }
+
+ if (trailing_slashes(nd))
+ nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+
+ dentry = lookup_fast(nd);
+
+ if (open_flag & O_CREAT) {
+ /* Discard negative dentries. Need inode_lock to do the create */
+ if (dentry && !dentry->d_inode) {
+ if (!(nd->flags & LOOKUP_RCU))
+ dput(dentry);
+ dentry = NULL;
+ }
+ }
+ return dentry;
+}
+
static const char *open_last_lookups(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
@@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd,
return handle_dots(nd, nd->last_type);
}
+ /* We _can_ be in RCU mode here */
+ dentry = lookup_fast_for_open(nd, open_flag);
+ if (IS_ERR(dentry))
+ return ERR_CAST(dentry);
+
if (!(open_flag & O_CREAT)) {
- if (nd->last.name[nd->last.len])
- nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
- /* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd);
- if (IS_ERR(dentry))
- return ERR_CAST(dentry);
if (likely(dentry))
goto finish_lookup;
if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
return ERR_PTR(-ECHILD);
} else {
- /* create side of things */
if (nd->flags & LOOKUP_RCU) {
+ /* can stay in rcuwalk if not auditing */
+ if (dentry && audit_dummy_context())
+ goto check_slashes;
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
}
audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
- /* trailing slashes? */
- if (unlikely(nd->last.name[nd->last.len]))
+check_slashes:
+ if (trailing_slashes(nd))
return ERR_PTR(-EISDIR);
+ if (dentry)
+ goto finish_lookup;
}
if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20240723-openfast-ac49a7b6ade2
Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 14:32 [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
@ 2024-08-06 15:25 ` Mateusz Guzik
2024-08-06 16:17 ` Jeff Layton
2024-08-06 19:11 ` Andi Kleen
2024-08-06 19:51 ` Jeff Layton
2024-08-07 14:26 ` Christian Brauner
2 siblings, 2 replies; 33+ messages in thread
From: Mateusz Guzik @ 2024-08-06 15:25 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Today, when opening a file we'll typically do a fast lookup, but if
> O_CREAT is set, the kernel always takes the exclusive inode lock. I
> assume this was done with the expectation that O_CREAT means that we
> always expect to do the create, but that's often not the case. Many
> programs set O_CREAT even in scenarios where the file already exists.
>
> This patch rearranges the pathwalk-for-open code to also attempt a
> fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
> inode_lock can be avoided altogether, and if auditing isn't enabled, it
> can stay in rcuwalk mode for the last step_into.
>
> One notable exception that is hopefully temporary: if we're doing an
> rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
> dentry in that case is more expensive than taking the i_rwsem for now.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Here's a revised patch that does a fast_lookup in the O_CREAT codepath
> too. The main difference here is that if a positive dentry is found and
> audit_dummy_context is true, then we keep the walk lazy for the last
> component, which avoids having to take any locks on the parent (just
> like with non-O_CREAT opens).
>
> The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
> With this patch, it runs in about 1s:
>
I don't have an opinion on the patch.
If your kernel does not use apparmor and the patch manages to dodge
refing the parent, then indeed this should be fully deserialized just
like non-creat opens.
Instead of the hand-rolled benchmark may I interest you in using
will-it-scale instead? Notably it reports the achieved rate once per
second, so you can check if there is funky business going on between
reruns, gives the cpu the time to kick off turbo boost if applicable
etc.
I would bench with that myself, but I temporarily don't have handy
access to bigger hw. Even so, the below is completely optional and
perhaps more of a suggestion for the future :)
I hacked up the test case based on tests/open1.c.
git clone https://github.com/antonblanchard/will-it-scale.git
For example plop into tests/opencreate1.c && gmake &&
./opencreate1_processes -t 70:
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
char *testcase_description = "Separate file open/close + O_CREAT";
#define template "/tmp/willitscale.XXXXXX"
static char (*tmpfiles)[sizeof(template)];
static unsigned long local_nr_tasks;
void testcase_prepare(unsigned long nr_tasks)
{
int i;
tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
* nr_tasks);
assert(tmpfiles);
for (i = 0; i < nr_tasks; i++) {
strcpy(tmpfiles[i], template);
char *tmpfile = tmpfiles[i];
int fd = mkstemp(tmpfile);
assert(fd >= 0);
close(fd);
}
local_nr_tasks = nr_tasks;
}
void testcase(unsigned long long *iterations, unsigned long nr)
{
char *tmpfile = tmpfiles[nr];
while (1) {
int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
assert(fd >= 0);
close(fd);
(*iterations)++;
}
}
void testcase_cleanup(void)
{
int i;
for (i = 0; i < local_nr_tasks; i++) {
unlink(tmpfiles[i]);
}
free(tmpfiles);
}
> ---
> fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 53 insertions(+), 9 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1e05a0f3f04d..2d716fb114c9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> return ERR_PTR(error);
> }
>
> +static inline bool trailing_slashes(struct nameidata *nd)
> +{
> + return (bool)nd->last.name[nd->last.len];
> +}
> +
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> + struct dentry *dentry;
> +
> + if (open_flag & O_CREAT) {
> + /* Don't bother on an O_EXCL create */
> + if (open_flag & O_EXCL)
> + return NULL;
> +
> + /*
> + * FIXME: If auditing is enabled, then we'll have to unlazy to
> + * use the dentry. For now, don't do this, since it shifts
> + * contention from parent's i_rwsem to its d_lockref spinlock.
> + * Reconsider this once dentry refcounting handles heavy
> + * contention better.
> + */
> + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> + return NULL;
> + }
> +
> + if (trailing_slashes(nd))
> + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +
> + dentry = lookup_fast(nd);
> +
> + if (open_flag & O_CREAT) {
> + /* Discard negative dentries. Need inode_lock to do the create */
> + if (dentry && !dentry->d_inode) {
> + if (!(nd->flags & LOOKUP_RCU))
> + dput(dentry);
> + dentry = NULL;
> + }
> + }
> + return dentry;
> +}
> +
> static const char *open_last_lookups(struct nameidata *nd,
> struct file *file, const struct open_flags *op)
> {
> @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd,
> return handle_dots(nd, nd->last_type);
> }
>
> + /* We _can_ be in RCU mode here */
> + dentry = lookup_fast_for_open(nd, open_flag);
> + if (IS_ERR(dentry))
> + return ERR_CAST(dentry);
> +
> if (!(open_flag & O_CREAT)) {
> - if (nd->last.name[nd->last.len])
> - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> - /* we _can_ be in RCU mode here */
> - dentry = lookup_fast(nd);
> - if (IS_ERR(dentry))
> - return ERR_CAST(dentry);
> if (likely(dentry))
> goto finish_lookup;
>
> if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
> return ERR_PTR(-ECHILD);
> } else {
> - /* create side of things */
> if (nd->flags & LOOKUP_RCU) {
> + /* can stay in rcuwalk if not auditing */
> + if (dentry && audit_dummy_context())
> + goto check_slashes;
> if (!try_to_unlazy(nd))
> return ERR_PTR(-ECHILD);
> }
> audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> - /* trailing slashes? */
> - if (unlikely(nd->last.name[nd->last.len]))
> +check_slashes:
> + if (trailing_slashes(nd))
> return ERR_PTR(-EISDIR);
> + if (dentry)
> + goto finish_lookup;
> }
>
> if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
>
> ---
> base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
> change-id: 20240723-openfast-ac49a7b6ade2
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>
>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 15:25 ` Mateusz Guzik
@ 2024-08-06 16:17 ` Jeff Layton
2024-08-06 16:42 ` Mateusz Guzik
2024-08-06 19:11 ` Andi Kleen
1 sibling, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-06 16:17 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote:
> On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org>
> wrote:
> >
> > Today, when opening a file we'll typically do a fast lookup, but if
> > O_CREAT is set, the kernel always takes the exclusive inode lock. I
> > assume this was done with the expectation that O_CREAT means that
> > we
> > always expect to do the create, but that's often not the case. Many
> > programs set O_CREAT even in scenarios where the file already
> > exists.
> >
> > This patch rearranges the pathwalk-for-open code to also attempt a
> > fast_lookup in certain O_CREAT cases. If a positive dentry is
> > found, the
> > inode_lock can be avoided altogether, and if auditing isn't
> > enabled, it
> > can stay in rcuwalk mode for the last step_into.
> >
> > One notable exception that is hopefully temporary: if we're doing
> > an
> > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing
> > the
> > dentry in that case is more expensive than taking the i_rwsem for
> > now.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > Here's a revised patch that does a fast_lookup in the O_CREAT
> > codepath
> > too. The main difference here is that if a positive dentry is found
> > and
> > audit_dummy_context is true, then we keep the walk lazy for the
> > last
> > component, which avoids having to take any locks on the parent
> > (just
> > like with non-O_CREAT opens).
> >
> > The testcase below runs in about 18s on v6.10 (on an 80 CPU
> > machine).
> > With this patch, it runs in about 1s:
> >
>
> I don't have an opinion on the patch.
>
> If your kernel does not use apparmor and the patch manages to dodge
> refing the parent, then indeed this should be fully deserialized just
> like non-creat opens.
>
Yep. Pity that auditing will slow things down, but them's the breaks...
> Instead of the hand-rolled benchmark may I interest you in using
> will-it-scale instead? Notably it reports the achieved rate once per
> second, so you can check if there is funky business going on between
> reruns, gives the cpu the time to kick off turbo boost if applicable
> etc.
>
> I would bench with that myself, but I temporarily don't have handy
> access to bigger hw. Even so, the below is completely optional and
> perhaps more of a suggestion for the future :)
>
> I hacked up the test case based on tests/open1.c.
>
> git clone https://github.com/antonblanchard/will-it-scale.git
>
> For example plop into tests/opencreate1.c && gmake &&
> ./opencreate1_processes -t 70:
>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <string.h>
>
> char *testcase_description = "Separate file open/close + O_CREAT";
>
> #define template "/tmp/willitscale.XXXXXX"
> static char (*tmpfiles)[sizeof(template)];
> static unsigned long local_nr_tasks;
>
> void testcase_prepare(unsigned long nr_tasks)
> {
> int i;
> tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
> * nr_tasks);
> assert(tmpfiles);
>
> for (i = 0; i < nr_tasks; i++) {
> strcpy(tmpfiles[i], template);
> char *tmpfile = tmpfiles[i];
> int fd = mkstemp(tmpfile);
>
> assert(fd >= 0);
> close(fd);
> }
>
> local_nr_tasks = nr_tasks;
> }
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> char *tmpfile = tmpfiles[nr];
>
> while (1) {
> int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
> assert(fd >= 0);
> close(fd);
>
> (*iterations)++;
> }
> }
>
> void testcase_cleanup(void)
> {
> int i;
> for (i = 0; i < local_nr_tasks; i++) {
> unlink(tmpfiles[i]);
> }
> free(tmpfiles);
> }
>
>
Good suggestion and thanks for the testcase. With v6.10 kernel, I'm
seeing numbers like this at -t 70:
min:4873 max:11510 total:418915
min:4884 max:10598 total:408848
min:3704 max:12371 total:467658
min:2842 max:11792 total:418239
min:2966 max:11511 total:414144
min:4756 max:11381 total:413137
min:4557 max:10789 total:404628
min:4780 max:11125 total:413349
min:4757 max:11156 total:405963
...with the patched kernel, things are significantly faster:
min:265865 max:508909 total:21464553
min:263252 max:500084 total:21242190
min:263989 max:504929 total:21396968
min:263343 max:505852 total:21346829
min:263023 max:507303 total:21410217
min:263420 max:506593 total:21426307
min:259556 max:494529 total:20927169
min:264451 max:508967 total:21433676
min:263486 max:509460 total:21399874
min:263906 max:507400 total:21393015
I can get some fancier plots if anyone is interested, but the benefit
of this patch seems pretty clear.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 16:17 ` Jeff Layton
@ 2024-08-06 16:42 ` Mateusz Guzik
0 siblings, 0 replies; 33+ messages in thread
From: Mateusz Guzik @ 2024-08-06 16:42 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 6, 2024 at 6:17 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote:
> > On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org>
> > wrote:
> > >
> > > Today, when opening a file we'll typically do a fast lookup, but if
> > > O_CREAT is set, the kernel always takes the exclusive inode lock. I
> > > assume this was done with the expectation that O_CREAT means that
> > > we
> > > always expect to do the create, but that's often not the case. Many
> > > programs set O_CREAT even in scenarios where the file already
> > > exists.
> > >
> > > This patch rearranges the pathwalk-for-open code to also attempt a
> > > fast_lookup in certain O_CREAT cases. If a positive dentry is
> > > found, the
> > > inode_lock can be avoided altogether, and if auditing isn't
> > > enabled, it
> > > can stay in rcuwalk mode for the last step_into.
> > >
> > > One notable exception that is hopefully temporary: if we're doing
> > > an
> > > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing
> > > the
> > > dentry in that case is more expensive than taking the i_rwsem for
> > > now.
> > >
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > Here's a revised patch that does a fast_lookup in the O_CREAT
> > > codepath
> > > too. The main difference here is that if a positive dentry is found
> > > and
> > > audit_dummy_context is true, then we keep the walk lazy for the
> > > last
> > > component, which avoids having to take any locks on the parent
> > > (just
> > > like with non-O_CREAT opens).
> > >
> > > The testcase below runs in about 18s on v6.10 (on an 80 CPU
> > > machine).
> > > With this patch, it runs in about 1s:
> > >
> >
> > I don't have an opinion on the patch.
> >
> > If your kernel does not use apparmor and the patch manages to dodge
> > refing the parent, then indeed this should be fully deserialized just
> > like non-creat opens.
> >
>
> Yep. Pity that auditing will slow things down, but them's the breaks...
>
> > Instead of the hand-rolled benchmark may I interest you in using
> > will-it-scale instead? Notably it reports the achieved rate once per
> > second, so you can check if there is funky business going on between
> > reruns, gives the cpu the time to kick off turbo boost if applicable
> > etc.
> >
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> >
> > I hacked up the test case based on tests/open1.c.
> >
> > git clone https://github.com/antonblanchard/will-it-scale.git
> >
> > For example plop into tests/opencreate1.c && gmake &&
> > ./opencreate1_processes -t 70:
> >
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <assert.h>
> > #include <string.h>
> >
> > char *testcase_description = "Separate file open/close + O_CREAT";
> >
> > #define template "/tmp/willitscale.XXXXXX"
> > static char (*tmpfiles)[sizeof(template)];
> > static unsigned long local_nr_tasks;
> >
> > void testcase_prepare(unsigned long nr_tasks)
> > {
> > int i;
> > tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
> > * nr_tasks);
> > assert(tmpfiles);
> >
> > for (i = 0; i < nr_tasks; i++) {
> > strcpy(tmpfiles[i], template);
> > char *tmpfile = tmpfiles[i];
> > int fd = mkstemp(tmpfile);
> >
> > assert(fd >= 0);
> > close(fd);
> > }
> >
> > local_nr_tasks = nr_tasks;
> > }
> >
> > void testcase(unsigned long long *iterations, unsigned long nr)
> > {
> > char *tmpfile = tmpfiles[nr];
> >
> > while (1) {
> > int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
> > assert(fd >= 0);
> > close(fd);
> >
> > (*iterations)++;
> > }
> > }
> >
> > void testcase_cleanup(void)
> > {
> > int i;
> > for (i = 0; i < local_nr_tasks; i++) {
> > unlink(tmpfiles[i]);
> > }
> > free(tmpfiles);
> > }
> >
> >
>
> Good suggestion and thanks for the testcase. With v6.10 kernel, I'm
> seeing numbers like this at -t 70:
>
> min:4873 max:11510 total:418915
> min:4884 max:10598 total:408848
> min:3704 max:12371 total:467658
> min:2842 max:11792 total:418239
> min:2966 max:11511 total:414144
> min:4756 max:11381 total:413137
> min:4557 max:10789 total:404628
> min:4780 max:11125 total:413349
> min:4757 max:11156 total:405963
>
> ...with the patched kernel, things are significantly faster:
>
> min:265865 max:508909 total:21464553
> min:263252 max:500084 total:21242190
> min:263989 max:504929 total:21396968
> min:263343 max:505852 total:21346829
> min:263023 max:507303 total:21410217
> min:263420 max:506593 total:21426307
> min:259556 max:494529 total:20927169
> min:264451 max:508967 total:21433676
> min:263486 max:509460 total:21399874
> min:263906 max:507400 total:21393015
>
> I can get some fancier plots if anyone is interested, but the benefit
> of this patch seems pretty clear.
In the commit message I would replace that handrolled bench thing with +/-:
<quote>
70-way open(.., O_RDWR | O_CREAT) calls on already existing files, one
per-worker:
before: 467658
after: 21464553 (+4589%)
</quote>
I guess it would make sense to check if unlink1_processes is doing
fine if it's not too much hassle.
overall pretty decent win :P
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 15:25 ` Mateusz Guzik
2024-08-06 16:17 ` Jeff Layton
@ 2024-08-06 19:11 ` Andi Kleen
2024-08-06 19:22 ` Mateusz Guzik
2024-08-06 19:26 ` Jeff Layton
1 sibling, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2024-08-06 19:11 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel
Mateusz Guzik <mjguzik@gmail.com> writes:
>
> I would bench with that myself, but I temporarily don't have handy
> access to bigger hw. Even so, the below is completely optional and
> perhaps more of a suggestion for the future :)
>
> I hacked up the test case based on tests/open1.c.
Don't you need two test cases? One where the file exists and one
where it doesn't. Because the "doesn't exist" will likely be slower
than before because it will do the lookups twice,
and it will likely even slow single threaded.
I assume the penalty will also depend on the number of entries
in the path.
That all seem to be an important considerations in judging the benefits
of the patch.
-Andi
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 19:11 ` Andi Kleen
@ 2024-08-06 19:22 ` Mateusz Guzik
2024-08-06 20:42 ` Jeff Layton
2024-08-06 19:26 ` Jeff Layton
1 sibling, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2024-08-06 19:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> >
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> >
> > I hacked up the test case based on tests/open1.c.
>
> Don't you need two test cases? One where the file exists and one
> where it doesn't. Because the "doesn't exist" will likely be slower
> than before because it will do the lookups twice,
> and it will likely even slow single threaded.
>
> I assume the penalty will also depend on the number of entries
> in the path.
>
> That all seem to be an important considerations in judging the benefits
> of the patch.
>
This is why I suggested separately running "unlink1" which is
guaranteed to create a file every time -- all iterations will fail the
proposed fast path.
Unless you meant a mixed variant where only some of the threads create
files. Perhaps worthwhile to add, not hard to do (one can switch the
mode based on passed worker number).
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 19:11 ` Andi Kleen
2024-08-06 19:22 ` Mateusz Guzik
@ 2024-08-06 19:26 ` Jeff Layton
2024-08-06 20:03 ` Mateusz Guzik
1 sibling, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-06 19:26 UTC (permalink / raw)
To: Andi Kleen, Mateusz Guzik
Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
> >
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> >
> > I hacked up the test case based on tests/open1.c.
>
> Don't you need two test cases? One where the file exists and one
> where it doesn't. Because the "doesn't exist" will likely be slower
> than before because it will do the lookups twice,
> and it will likely even slow single threaded.
>
> I assume the penalty will also depend on the number of entries
> in the path.
>
> That all seem to be an important considerations in judging the benefits
> of the patch.
>
Definitely.
FWIW, I did test a single threaded (bespoke) test case that did a bunch
of O_CREAT opens, closes and then unlinks. I didn't measure any
discernable difference with this patch. My conclusion from that was
that the extra lockless lookup should be cheap.
That said, this could show a difference if you have rather long hash
chains that need to be walked completely, and you have to actually do
the create every time. In practice though, time spent under the
inode_lock and doing the create tends to dominate in that case, so I
*think* this should still be worthwhile.
I'll plan to add a test like that to will_it_scale unless Mateusz beats
me to it. A long soak in linux-next is probably also justified with
this patch.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 14:32 [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-06 15:25 ` Mateusz Guzik
@ 2024-08-06 19:51 ` Jeff Layton
2024-08-14 2:18 ` Al Viro
2024-08-07 14:26 ` Christian Brauner
2 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-06 19:51 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
Cc: Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel
On Tue, 2024-08-06 at 10:32 -0400, Jeff Layton wrote:
> Today, when opening a file we'll typically do a fast lookup, but if
> O_CREAT is set, the kernel always takes the exclusive inode lock. I
> assume this was done with the expectation that O_CREAT means that we
> always expect to do the create, but that's often not the case. Many
> programs set O_CREAT even in scenarios where the file already exists.
>
> This patch rearranges the pathwalk-for-open code to also attempt a
> fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
> inode_lock can be avoided altogether, and if auditing isn't enabled, it
> can stay in rcuwalk mode for the last step_into.
>
> One notable exception that is hopefully temporary: if we're doing an
> rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
> dentry in that case is more expensive than taking the i_rwsem for now.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Here's a revised patch that does a fast_lookup in the O_CREAT codepath
> too. The main difference here is that if a positive dentry is found and
> audit_dummy_context is true, then we keep the walk lazy for the last
> component, which avoids having to take any locks on the parent (just
> like with non-O_CREAT opens).
>
> The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
> With this patch, it runs in about 1s:
>
> #define _GNU_SOURCE 1
> #include <stdio.h>
> #include <unistd.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <sys/wait.h>
>
> #define PROCS 70
> #define LOOPS 500000
>
> static int openloop(int tnum)
> {
> char *file;
> int i, ret;
>
> ret = asprintf(&file, "./testfile%d", tnum);
> if (ret < 0) {
> printf("asprintf failed for proc %d", tnum);
> return 1;
> }
>
> for (i = 0; i < LOOPS; ++i) {
> int fd = open(file, O_RDWR|O_CREAT, 0644);
>
> if (fd < 0) {
> perror("open");
> return 1;
> }
> close(fd);
> }
> unlink(file);
> free(file);
> return 0;
> }
>
> int main(int argc, char **argv) {
> pid_t kids[PROCS];
> int i, ret = 0;
>
> for (i = 0; i < PROCS; ++i) {
> kids[i] = fork();
> if (kids[i] > 0)
> return openloop(i);
> if (kids[i] < 0)
> perror("fork");
> }
>
> for (i = 0; i < PROCS; ++i) {
> int ret2;
>
> if (kids[i] > 0) {
> wait(&ret2);
> if (ret2 != 0)
> ret = ret2;
> }
> }
> return ret;
> }
> ---
> Changes in v2:
> - drop the lockref patch since Mateusz is working on a better approach
> - add trailing_slashes helper function
> - add a lookup_fast_for_open helper function
> - make lookup_fast_for_open skip the lookup if auditing is enabled
> - if we find a positive dentry and auditing is disabled, don't unlazy
> - Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org
> ---
> fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 53 insertions(+), 9 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1e05a0f3f04d..2d716fb114c9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> return ERR_PTR(error);
> }
>
> +static inline bool trailing_slashes(struct nameidata *nd)
> +{
> + return (bool)nd->last.name[nd->last.len];
> +}
> +
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> + struct dentry *dentry;
> +
> + if (open_flag & O_CREAT) {
> + /* Don't bother on an O_EXCL create */
> + if (open_flag & O_EXCL)
> + return NULL;
> +
> + /*
> + * FIXME: If auditing is enabled, then we'll have to unlazy to
> + * use the dentry. For now, don't do this, since it shifts
> + * contention from parent's i_rwsem to its d_lockref spinlock.
> + * Reconsider this once dentry refcounting handles heavy
> + * contention better.
> + */
> + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> + return NULL;
> + }
> +
> + if (trailing_slashes(nd))
> + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +
> + dentry = lookup_fast(nd);
Self-NAK on this patch. We have to test for IS_ERR on the returned
dentry here. I'll send a v3 along after I've retested it.
> +
> + if (open_flag & O_CREAT) {
> + /* Discard negative dentries. Need inode_lock to do the create */
> + if (dentry && !dentry->d_inode) {
> + if (!(nd->flags & LOOKUP_RCU))
> + dput(dentry);
> + dentry = NULL;
> + }
> + }
> + return dentry;
> +}
> +
> static const char *open_last_lookups(struct nameidata *nd,
> struct file *file, const struct open_flags *op)
> {
> @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd,
> return handle_dots(nd, nd->last_type);
> }
>
> + /* We _can_ be in RCU mode here */
> + dentry = lookup_fast_for_open(nd, open_flag);
> + if (IS_ERR(dentry))
> + return ERR_CAST(dentry);
> +
> if (!(open_flag & O_CREAT)) {
> - if (nd->last.name[nd->last.len])
> - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> - /* we _can_ be in RCU mode here */
> - dentry = lookup_fast(nd);
> - if (IS_ERR(dentry))
> - return ERR_CAST(dentry);
> if (likely(dentry))
> goto finish_lookup;
>
> if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
> return ERR_PTR(-ECHILD);
> } else {
> - /* create side of things */
> if (nd->flags & LOOKUP_RCU) {
> + /* can stay in rcuwalk if not auditing */
> + if (dentry && audit_dummy_context())
> + goto check_slashes;
> if (!try_to_unlazy(nd))
> return ERR_PTR(-ECHILD);
> }
> audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> - /* trailing slashes? */
> - if (unlikely(nd->last.name[nd->last.len]))
> +check_slashes:
> + if (trailing_slashes(nd))
> return ERR_PTR(-EISDIR);
> + if (dentry)
> + goto finish_lookup;
> }
>
> if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
>
> ---
> base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
> change-id: 20240723-openfast-ac49a7b6ade2
>
> Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 19:26 ` Jeff Layton
@ 2024-08-06 20:03 ` Mateusz Guzik
2024-08-06 20:47 ` Andi Kleen
0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2024-08-06 20:03 UTC (permalink / raw)
To: Jeff Layton
Cc: Andi Kleen, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 6, 2024 at 9:26 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote:
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > >
> > > I would bench with that myself, but I temporarily don't have handy
> > > access to bigger hw. Even so, the below is completely optional and
> > > perhaps more of a suggestion for the future :)
> > >
> > > I hacked up the test case based on tests/open1.c.
> >
> > Don't you need two test cases? One where the file exists and one
> > where it doesn't. Because the "doesn't exist" will likely be slower
> > than before because it will do the lookups twice,
> > and it will likely even slow single threaded.
> >
> > I assume the penalty will also depend on the number of entries
> > in the path.
> >
> > That all seem to be an important considerations in judging the benefits
> > of the patch.
> >
>
> Definitely.
>
> FWIW, I did test a single threaded (bespoke) test case that did a bunch
> of O_CREAT opens, closes and then unlinks. I didn't measure any
> discernable difference with this patch. My conclusion from that was
> that the extra lockless lookup should be cheap.
>
> That said, this could show a difference if you have rather long hash
> chains that need to be walked completely, and you have to actually do
> the create every time. In practice though, time spent under the
> inode_lock and doing the create tends to dominate in that case, so I
> *think* this should still be worthwhile.
>
> I'll plan to add a test like that to will_it_scale unless Mateusz beats
> me to it. A long soak in linux-next is probably also justified with
> this patch.
Well if we are really going there I have to point out a couple of
three things concerning single-threaded performance from entering the
kernel to getting back with a file descriptor.
The headline is that there is a lot of single-threaded perf left on
the table and modulo a significant hit in terms of cycles it is going
to be hard to measure a meaningful result.
Before I get to the vfs layer, there is a significant loss in the
memory allocator because of memcg -- it takes several irq off/on trips
for every alloc (needed to grab struct file *). I have a plan what to
do with it (handle stuff with local cmpxchg (note no lock prefix)),
which I'm trying to get around to. Apart from that you may note the
allocator fast path performs a 16-byte cmpxchg, which is again dog
slow and executes twice (once for the file obj, another time for the
namei buffer). Someone(tm) should patch it up and I have some vague
ideas, but 0 idea when I can take a serious stab.
As for vfs, just from atomics perspective:
- opening a results in a spurious ref/unref cycle on the dentry, i
posted a patch [1] to sort it out (maybe Al will write his own
instead). single-threaded it gives about +5% to open/close rate
- there is an avoidable smp_mb in do_dentry_open, posted a patch [2]
and it got acked. i did not bother benching it
- someone sneaked in atomic refcount management to "struct filename"
to appease io_uring vs audit. this results in another atomic added to
every single path lookup. I have an idea what to do there
- opening for write (which you do with O_CREAT) calls
mnt_get_write_access which contains another smp_mb. this probably can
be sorted out the same way percpu semaphores got taken care of
- __legitimize_mnt has yet another smp_mb to synchro against unmount.
this *probably* can be sorted out with synchronize_rcu, but some care
is needed to not induce massive delays on unmount
- credential management is also done with atomics (so that's get + put
for the open/close cycle), $elsewhere I implemented a scheme where the
local count is cached in the thread itself and only rolled up on
credential change. this whacks the problem and is something i'm
planning on proposing at some point
and more
That is to say I completely agree this does add extra work, but the
kernel is not in shape where true single-threaded impact can be
sensibly measured. I do find it prudent to validate nothing happened
to somehow floor scalability of the case which does need to create
stuff though.
[1] https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20240806172846.886570-1-mjguzik@gmail.com/
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 19:22 ` Mateusz Guzik
@ 2024-08-06 20:42 ` Jeff Layton
0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2024-08-06 20:42 UTC (permalink / raw)
To: Mateusz Guzik, Andi Kleen
Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, 2024-08-06 at 21:22 +0200, Mateusz Guzik wrote:
> On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote:
> >
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > >
> > > I would bench with that myself, but I temporarily don't have
> > > handy
> > > access to bigger hw. Even so, the below is completely optional
> > > and
> > > perhaps more of a suggestion for the future :)
> > >
> > > I hacked up the test case based on tests/open1.c.
> >
> > Don't you need two test cases? One where the file exists and one
> > where it doesn't. Because the "doesn't exist" will likely be slower
> > than before because it will do the lookups twice,
> > and it will likely even slow single threaded.
> >
> > I assume the penalty will also depend on the number of entries
> > in the path.
> >
> > That all seem to be an important considerations in judging the
> > benefits
> > of the patch.
> >
>
> This is why I suggested separately running "unlink1" which is
> guaranteed to create a file every time -- all iterations will fail
> the
> proposed fast path.
>
> Unless you meant a mixed variant where only some of the threads
> create
> files. Perhaps worthwhile to add, not hard to do (one can switch the
> mode based on passed worker number).
>
Well...
# ./unlink1_processes -t 70 -s 100
average:
v6.10: 114455
v6.10 + patch: 149513
I suspect what's happening here is that this patch relieves contention
for the inode_lock and that allows the unlinks to proceed faster.
Running it with a single process though:
average:
v6.10: 200106
v6.10 + patch: 199188
So, ~.4% degradation there? That doesn't seem too bad given the gain in
the threaded test.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 20:03 ` Mateusz Guzik
@ 2024-08-06 20:47 ` Andi Kleen
2024-08-15 15:07 ` Mateusz Guzik
0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2024-08-06 20:47 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel
> Before I get to the vfs layer, there is a significant loss in the
> memory allocator because of memcg -- it takes several irq off/on trips
> for every alloc (needed to grab struct file *). I have a plan what to
> do with it (handle stuff with local cmpxchg (note no lock prefix)),
> which I'm trying to get around to. Apart from that you may note the
> allocator fast path performs a 16-byte cmpxchg, which is again dog
> slow and executes twice (once for the file obj, another time for the
> namei buffer). Someone(tm) should patch it up and I have some vague
> ideas, but 0 idea when I can take a serious stab.
I just LBR sampled it on my skylake and it doesn't look
particularly slow. You see the whole massive block including CMPXCHG16
gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
a missing cache line.
kmem_cache_free:
ffffffff9944ce20 nop %edi, %edx
ffffffff9944ce24 nopl %eax, (%rax,%rax,1)
ffffffff9944ce29 pushq %rbp
ffffffff9944ce2a mov %rdi, %rdx
ffffffff9944ce2d mov %rsp, %rbp
ffffffff9944ce30 pushq %r15
ffffffff9944ce32 pushq %r14
ffffffff9944ce34 pushq %r13
ffffffff9944ce36 pushq %r12
ffffffff9944ce38 mov $0x80000000, %r12d
ffffffff9944ce3e pushq %rbx
ffffffff9944ce3f mov %rsi, %rbx
ffffffff9944ce42 and $0xfffffffffffffff0, %rsp
ffffffff9944ce46 sub $0x10, %rsp
ffffffff9944ce4a movq %gs:0x28, %rax
ffffffff9944ce53 movq %rax, 0x8(%rsp)
ffffffff9944ce58 xor %eax, %eax
ffffffff9944ce5a add %rsi, %r12
ffffffff9944ce5d jb 0xffffffff9944d1ea
ffffffff9944ce63 mov $0xffffffff80000000, %rax
ffffffff9944ce6a xor %r13d, %r13d
ffffffff9944ce6d subq 0x17b068c(%rip), %rax
ffffffff9944ce74 add %r12, %rax
ffffffff9944ce77 shr $0xc, %rax
ffffffff9944ce7b shl $0x6, %rax
ffffffff9944ce7f addq 0x17b066a(%rip), %rax
ffffffff9944ce86 movq 0x8(%rax), %rcx
ffffffff9944ce8a test $0x1, %cl
ffffffff9944ce8d jnz 0xffffffff9944d15c
ffffffff9944ce93 nopl %eax, (%rax,%rax,1)
ffffffff9944ce98 movq (%rax), %rcx
ffffffff9944ce9b and $0x8, %ch
ffffffff9944ce9e jz 0xffffffff9944cfea
ffffffff9944cea4 test %rax, %rax
ffffffff9944cea7 jz 0xffffffff9944cfea
ffffffff9944cead movq 0x8(%rax), %r14
ffffffff9944ceb1 test %r14, %r14
ffffffff9944ceb4 jz 0xffffffff9944cfac
ffffffff9944ceba cmp %r14, %rdx
ffffffff9944cebd jnz 0xffffffff9944d165
ffffffff9944cec3 test %r14, %r14
ffffffff9944cec6 jz 0xffffffff9944cfac
ffffffff9944cecc movq 0x8(%rbp), %r15
ffffffff9944ced0 nopl %eax, (%rax,%rax,1)
ffffffff9944ced5 movq 0x1fe5134(%rip), %rax
ffffffff9944cedc test %r13, %r13
ffffffff9944cedf jnz 0xffffffff9944ceef
ffffffff9944cee1 mov $0xffffffff80000000, %rax
ffffffff9944cee8 subq 0x17b0611(%rip), %rax
ffffffff9944ceef add %rax, %r12
ffffffff9944cef2 shr $0xc, %r12
ffffffff9944cef6 shl $0x6, %r12
ffffffff9944cefa addq 0x17b05ef(%rip), %r12
ffffffff9944cf01 movq 0x8(%r12), %rax
ffffffff9944cf06 mov %r12, %r13
ffffffff9944cf09 test $0x1, %al
ffffffff9944cf0b jnz 0xffffffff9944d1b1
ffffffff9944cf11 nopl %eax, (%rax,%rax,1)
ffffffff9944cf16 movq (%r13), %rax
ffffffff9944cf1a movq %rbx, (%rsp)
ffffffff9944cf1e test $0x8, %ah
ffffffff9944cf21 mov $0x0, %eax
ffffffff9944cf26 cmovz %rax, %r13
ffffffff9944cf2a data16 nop
ffffffff9944cf2c movq 0x38(%r13), %r8
ffffffff9944cf30 cmp $0x3, %r8
ffffffff9944cf34 jnbe 0xffffffff9944d1ca
ffffffff9944cf3a nopl %eax, (%rax,%rax,1)
ffffffff9944cf3f movq 0x23d6f72(%rip), %rax
ffffffff9944cf46 mov %rbx, %rdx
ffffffff9944cf49 sub %rax, %rdx
ffffffff9944cf4c cmp $0x1fffff, %rdx
ffffffff9944cf53 jbe 0xffffffff9944d03a
ffffffff9944cf59 movq (%r14), %rax
ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax
ffffffff9944cf64 movq 0x8(%rax), %rdx
ffffffff9944cf68 cmpq %r13, 0x10(%rax)
ffffffff9944cf6c jnz 0xffffffff9944d192
ffffffff9944cf72 movl 0x28(%r14), %ecx
ffffffff9944cf76 movq (%rax), %rax
ffffffff9944cf79 add %rbx, %rcx
ffffffff9944cf7c cmp %rbx, %rax
ffffffff9944cf7f jz 0xffffffff9944d1ba
ffffffff9944cf85 movq 0xb8(%r14), %rsi
ffffffff9944cf8c mov %rcx, %rdi
ffffffff9944cf8f bswap %rdi
ffffffff9944cf92 xor %rax, %rsi
ffffffff9944cf95 xor %rdi, %rsi
ffffffff9944cf98 movq %rsi, (%rcx)
ffffffff9944cf9b leaq 0x2000(%rdx), %rcx
ffffffff9944cfa2 movq (%r14), %rsi
ffffffff9944cfa5 cmpxchg16bx %gs:(%rsi)
ffffffff9944cfaa jnz 0xffffffff9944cf59
ffffffff9944cfac movq 0x8(%rsp), %rax
ffffffff9944cfb1 subq %gs:0x28, %rax
ffffffff9944cfba jnz 0xffffffff9944d1fc
ffffffff9944cfc0 leaq -0x28(%rbp), %rsp
ffffffff9944cfc4 popq %rbx
ffffffff9944cfc5 popq %r12
ffffffff9944cfc7 popq %r13
ffffffff9944cfc9 popq %r14
ffffffff9944cfcb popq %r15
ffffffff9944cfcd popq %rbp
ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <-------------
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 14:32 [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-06 15:25 ` Mateusz Guzik
2024-08-06 19:51 ` Jeff Layton
@ 2024-08-07 14:26 ` Christian Brauner
2024-08-07 14:36 ` Jeff Layton
2 siblings, 1 reply; 33+ messages in thread
From: Christian Brauner @ 2024-08-07 14:26 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> + struct dentry *dentry;
> +
> + if (open_flag & O_CREAT) {
> + /* Don't bother on an O_EXCL create */
> + if (open_flag & O_EXCL)
> + return NULL;
> +
> + /*
> + * FIXME: If auditing is enabled, then we'll have to unlazy to
> + * use the dentry. For now, don't do this, since it shifts
> + * contention from parent's i_rwsem to its d_lockref spinlock.
> + * Reconsider this once dentry refcounting handles heavy
> + * contention better.
> + */
> + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> + return NULL;
Hm, the audit_inode() on the parent is done independent of whether the
file was actually created or not. But the audit_inode() on the file
itself is only done when it was actually created. Imho, there's no need
to do audit_inode() on the parent when we immediately find that file
already existed. If we accept that then this makes the change a lot
simpler.
The inconsistency would partially remain though. When the file doesn't
exist audit_inode() on the parent is called but by the time we've
grabbed the inode lock someone else might already have created the file
and then again we wouldn't audit_inode() on the file but we would have
on the parent.
I think that's fine. But if that's bothersome the more aggressive thing
to do would be to pull that audit_inode() on the parent further down
after we created the file. Imho, that should be fine?...
See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
for a completely untested draft of what I mean.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-07 14:26 ` Christian Brauner
@ 2024-08-07 14:36 ` Jeff Layton
2024-08-08 10:36 ` Christian Brauner
0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-07 14:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Alexander Viro, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > +{
> > + struct dentry *dentry;
> > +
> > + if (open_flag & O_CREAT) {
> > + /* Don't bother on an O_EXCL create */
> > + if (open_flag & O_EXCL)
> > + return NULL;
> > +
> > + /*
> > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > + * use the dentry. For now, don't do this, since it shifts
> > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > + * Reconsider this once dentry refcounting handles heavy
> > + * contention better.
> > + */
> > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > + return NULL;
>
> Hm, the audit_inode() on the parent is done independent of whether the
> file was actually created or not. But the audit_inode() on the file
> itself is only done when it was actually created. Imho, there's no need
> to do audit_inode() on the parent when we immediately find that file
> already existed. If we accept that then this makes the change a lot
> simpler.
>
> The inconsistency would partially remain though. When the file doesn't
> exist audit_inode() on the parent is called but by the time we've
> grabbed the inode lock someone else might already have created the file
> and then again we wouldn't audit_inode() on the file but we would have
> on the parent.
>
> I think that's fine. But if that's bothersome the more aggressive thing
> to do would be to pull that audit_inode() on the parent further down
> after we created the file. Imho, that should be fine?...
>
> See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> for a completely untested draft of what I mean.
Yeah, that's a lot simpler. That said, my experience when I've worked
with audit in the past is that people who are using it are _very_
sensitive to changes of when records get emitted or not. I don't like
this, because I think the rules here are ad-hoc and somewhat arbitrary,
but keeping everything working exactly the same has been my MO whenever
I have to work in there.
If a certain access pattern suddenly generates a different set of
records (or some are missing, as would be in this case), we might get
bug reports about this. I'm ok with simplifying this code in the way
you suggest, but we may want to do it in a patch on top of mine, to
make it simple to revert later if that becomes necessary.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-07 14:36 ` Jeff Layton
@ 2024-08-08 10:36 ` Christian Brauner
2024-08-08 10:54 ` Jeff Layton
2024-08-08 17:11 ` Jan Kara
0 siblings, 2 replies; 33+ messages in thread
From: Christian Brauner @ 2024-08-08 10:36 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3282 bytes --]
On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > +{
> > > + struct dentry *dentry;
> > > +
> > > + if (open_flag & O_CREAT) {
> > > + /* Don't bother on an O_EXCL create */
> > > + if (open_flag & O_EXCL)
> > > + return NULL;
> > > +
> > > + /*
> > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > + * use the dentry. For now, don't do this, since it shifts
> > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > + * Reconsider this once dentry refcounting handles heavy
> > > + * contention better.
> > > + */
> > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > + return NULL;
> >
> > Hm, the audit_inode() on the parent is done independent of whether the
> > file was actually created or not. But the audit_inode() on the file
> > itself is only done when it was actually created. Imho, there's no need
> > to do audit_inode() on the parent when we immediately find that file
> > already existed. If we accept that then this makes the change a lot
> > simpler.
> >
> > The inconsistency would partially remain though. When the file doesn't
> > exist audit_inode() on the parent is called but by the time we've
> > grabbed the inode lock someone else might already have created the file
> > and then again we wouldn't audit_inode() on the file but we would have
> > on the parent.
> >
> > I think that's fine. But if that's bothersome the more aggressive thing
> > to do would be to pull that audit_inode() on the parent further down
> > after we created the file. Imho, that should be fine?...
> >
> > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > for a completely untested draft of what I mean.
>
> Yeah, that's a lot simpler. That said, my experience when I've worked
> with audit in the past is that people who are using it are _very_
> sensitive to changes of when records get emitted or not. I don't like
> this, because I think the rules here are ad-hoc and somewhat arbitrary,
> but keeping everything working exactly the same has been my MO whenever
> I have to work in there.
>
> If a certain access pattern suddenly generates a different set of
> records (or some are missing, as would be in this case), we might get
> bug reports about this. I'm ok with simplifying this code in the way
> you suggest, but we may want to do it in a patch on top of mine, to
> make it simple to revert later if that becomes necessary.
Fwiw, even with the rearranged checks in v3 of the patch audit records
will be dropped because we may find a positive dentry but the path may
have trailing slashes. At that point we just return without audit
whereas before we always would've done that audit.
Honestly, we should move that audit event as right now it's just really
weird and see if that works. Otherwise the change is somewhat horrible
complicating the already convoluted logic even more.
So I'm appending the patches that I have on top of your patch in
vfs.misc. Can you (other as well ofc) take a look and tell me whether
that's not breaking anything completely other than later audit events?
[-- Attachment #2: 0001-fs-move-audit-parent-inode.patch --]
[-- Type: text/x-diff, Size: 1856 bytes --]
From c3dd9f0e320a631bdb12ecd298cedcc43358b80f Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 Aug 2024 10:16:42 +0200
Subject: [PATCH 1/4] fs: move audit parent inode
During O_CREAT we unconditionally audit the parent inode. This makes it
difficult to support a fastpath for O_CREAT when the file already exists
because we have to drop out of RCU lookup needlessly.
We worked around this by checking whether audit was actually active but
that's also suboptimal. Instead, move the audit of the parent inode down
into lookup_open() at a point where it's mostly certain that the file
needs to be created.
This also reduced the inconsistency that currently exists: while audit
on the parent is done independent of whether or no the file already
existed an audit on the file is only performed if it has been created.
By moving the audit down a bit we emit the audit a little later but it
will allow us to simplify the fastpath for O_CREAT significantly.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3e34f4d97d83..745415fcda57 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3535,6 +3535,9 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
return dentry;
}
+ if (open_flag & O_CREAT)
+ audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
+
/*
* Checking write permission is tricky, bacuse we don't know if we are
* going to actually need it: O_CREAT opens should work as long as the
@@ -3691,7 +3694,6 @@ static const char *open_last_lookups(struct nameidata *nd,
if (!unlazied)
return ERR_PTR(-ECHILD);
}
- audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
if (trailing_slashes(nd)) {
dput(dentry);
return ERR_PTR(-EISDIR);
--
2.43.0
[-- Attachment #3: 0002-fs-pull-up-trailing-slashes-check-for-O_CREAT.patch --]
[-- Type: text/x-diff, Size: 1491 bytes --]
From ac4db275670c1311fecb00145f585127c3a7e648 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 Aug 2024 11:12:50 +0200
Subject: [PATCH 2/4] fs: pull up trailing slashes check for O_CREAT
Perform the check for trailing slashes right in the fastpath check and
don't bother with any additional work.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 745415fcda57..08eb9a53beb7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3618,6 +3618,9 @@ static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
struct dentry *dentry;
if (open_flag & O_CREAT) {
+ if (trailing_slashes(nd))
+ return ERR_PTR(-EISDIR);
+
/* Don't bother on an O_EXCL create */
if (open_flag & O_EXCL)
return NULL;
@@ -3684,20 +3687,13 @@ static const char *open_last_lookups(struct nameidata *nd,
bool unlazied;
/* can stay in rcuwalk if not auditing */
- if (dentry && audit_dummy_context()) {
- if (trailing_slashes(nd))
- return ERR_PTR(-EISDIR);
+ if (dentry && audit_dummy_context())
goto finish_lookup;
- }
unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
try_to_unlazy(nd);
if (!unlazied)
return ERR_PTR(-ECHILD);
}
- if (trailing_slashes(nd)) {
- dput(dentry);
- return ERR_PTR(-EISDIR);
- }
if (dentry)
goto finish_lookup;
}
--
2.43.0
[-- Attachment #4: 0003-fs-remove-audit-dummy-context-check.patch --]
[-- Type: text/x-diff, Size: 1503 bytes --]
From 155a11570398943dcb623a2390ac27f9eae3d8b3 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 Aug 2024 11:15:41 +0200
Subject: [PATCH 3/4] fs: remove audit dummy context check
Now that we audit later during lookup_open() we can remove the audit
dummy context check. This simplifies things a lot.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 08eb9a53beb7..96c79dec7184 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3624,16 +3624,6 @@ static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
/* Don't bother on an O_EXCL create */
if (open_flag & O_EXCL)
return NULL;
-
- /*
- * FIXME: If auditing is enabled, then we'll have to unlazy to
- * use the dentry. For now, don't do this, since it shifts
- * contention from parent's i_rwsem to its d_lockref spinlock.
- * Reconsider this once dentry refcounting handles heavy
- * contention better.
- */
- if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
- return NULL;
}
if (trailing_slashes(nd))
@@ -3687,7 +3677,7 @@ static const char *open_last_lookups(struct nameidata *nd,
bool unlazied;
/* can stay in rcuwalk if not auditing */
- if (dentry && audit_dummy_context())
+ if (dentry)
goto finish_lookup;
unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
try_to_unlazy(nd);
--
2.43.0
[-- Attachment #5: 0004-fs-rearrange-general-fastpath-check-now-that-O_CREAT.patch --]
[-- Type: text/x-diff, Size: 1427 bytes --]
From a2e0e55a57f88e08745cbd7bcdf8de8692306284 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 Aug 2024 11:16:42 +0200
Subject: [PATCH 4/4] fs: rearrange general fastpath check now that O_CREAT
uses it
If we find a positive dentry we can now simply try and open it. All
prelimiary checks are already done with or without O_CREAT.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 17 ++++-------------
1 file changed, 4 insertions(+), 13 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 96c79dec7184..2699601bf8e9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3666,26 +3666,17 @@ static const char *open_last_lookups(struct nameidata *nd,
if (IS_ERR(dentry))
return ERR_CAST(dentry);
- if (!(open_flag & O_CREAT)) {
- if (likely(dentry))
- goto finish_lookup;
+ if (likely(dentry))
+ goto finish_lookup;
+ if (!(open_flag & O_CREAT)) {
if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
return ERR_PTR(-ECHILD);
} else {
if (nd->flags & LOOKUP_RCU) {
- bool unlazied;
-
- /* can stay in rcuwalk if not auditing */
- if (dentry)
- goto finish_lookup;
- unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
- try_to_unlazy(nd);
- if (!unlazied)
+ if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
}
- if (dentry)
- goto finish_lookup;
}
if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
--
2.43.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 10:36 ` Christian Brauner
@ 2024-08-08 10:54 ` Jeff Layton
2024-08-08 11:18 ` Christian Brauner
2024-08-08 17:11 ` Jan Kara
1 sibling, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-08 10:54 UTC (permalink / raw)
To: Christian Brauner
Cc: Alexander Viro, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Thu, 2024-08-08 at 12:36 +0200, Christian Brauner wrote:
> On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > +{
> > > > + struct dentry *dentry;
> > > > +
> > > > + if (open_flag & O_CREAT) {
> > > > + /* Don't bother on an O_EXCL create */
> > > > + if (open_flag & O_EXCL)
> > > > + return NULL;
> > > > +
> > > > + /*
> > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > + * use the dentry. For now, don't do this, since it shifts
> > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > + * Reconsider this once dentry refcounting handles heavy
> > > > + * contention better.
> > > > + */
> > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > + return NULL;
> > >
> > > Hm, the audit_inode() on the parent is done independent of whether the
> > > file was actually created or not. But the audit_inode() on the file
> > > itself is only done when it was actually created. Imho, there's no need
> > > to do audit_inode() on the parent when we immediately find that file
> > > already existed. If we accept that then this makes the change a lot
> > > simpler.
> > >
> > > The inconsistency would partially remain though. When the file doesn't
> > > exist audit_inode() on the parent is called but by the time we've
> > > grabbed the inode lock someone else might already have created the file
> > > and then again we wouldn't audit_inode() on the file but we would have
> > > on the parent.
> > >
> > > I think that's fine. But if that's bothersome the more aggressive thing
> > > to do would be to pull that audit_inode() on the parent further down
> > > after we created the file. Imho, that should be fine?...
> > >
> > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > for a completely untested draft of what I mean.
> >
> > Yeah, that's a lot simpler. That said, my experience when I've worked
> > with audit in the past is that people who are using it are _very_
> > sensitive to changes of when records get emitted or not. I don't like
> > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > but keeping everything working exactly the same has been my MO whenever
> > I have to work in there.
> >
> > If a certain access pattern suddenly generates a different set of
> > records (or some are missing, as would be in this case), we might get
> > bug reports about this. I'm ok with simplifying this code in the way
> > you suggest, but we may want to do it in a patch on top of mine, to
> > make it simple to revert later if that becomes necessary.
>
> Fwiw, even with the rearranged checks in v3 of the patch audit records
> will be dropped because we may find a positive dentry but the path may
> have trailing slashes. At that point we just return without audit
> whereas before we always would've done that audit.
>
I don't think so. v3 has it drop out of rcuwalk and do the audit_inode
call in the case of trailing slashes. I took great pains here to ensure
that if we emitted a record before that we still do it after. Do let me
know if I missed a case though.
> Honestly, we should move that audit event as right now it's just really
> weird and see if that works. Otherwise the change is somewhat horrible
> complicating the already convoluted logic even more.
>
> So I'm appending the patches that I have on top of your patch in
> vfs.misc. Can you (other as well ofc) take a look and tell me whether
> that's not breaking anything completely other than later audit events?
That all looks fine to me. I'm not a heavy user of audit, but the
change you've made seems sane to me. Hopefully no one will notice.
I'll plan to do some testing with it later today.
Thanks!
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 10:54 ` Jeff Layton
@ 2024-08-08 11:18 ` Christian Brauner
0 siblings, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2024-08-08 11:18 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
> I don't think so. v3 has it drop out of rcuwalk and do the audit_inode
> call in the case of trailing slashes. I took great pains here to ensure
> that if we emitted a record before that we still do it after. Do let me
> know if I missed a case though.
Ah, I missed that this was predicated on audit not being enabled. Sorry
about the noise...
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 10:36 ` Christian Brauner
2024-08-08 10:54 ` Jeff Layton
@ 2024-08-08 17:11 ` Jan Kara
2024-08-08 21:12 ` Paul Moore
1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2024-08-08 17:11 UTC (permalink / raw)
To: Christian Brauner
Cc: Jeff Layton, Alexander Viro, Jan Kara, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel,
Paul Moore, audit
On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > +{
> > > > + struct dentry *dentry;
> > > > +
> > > > + if (open_flag & O_CREAT) {
> > > > + /* Don't bother on an O_EXCL create */
> > > > + if (open_flag & O_EXCL)
> > > > + return NULL;
> > > > +
> > > > + /*
> > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > + * use the dentry. For now, don't do this, since it shifts
> > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > + * Reconsider this once dentry refcounting handles heavy
> > > > + * contention better.
> > > > + */
> > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > + return NULL;
> > >
> > > Hm, the audit_inode() on the parent is done independent of whether the
> > > file was actually created or not. But the audit_inode() on the file
> > > itself is only done when it was actually created. Imho, there's no need
> > > to do audit_inode() on the parent when we immediately find that file
> > > already existed. If we accept that then this makes the change a lot
> > > simpler.
> > >
> > > The inconsistency would partially remain though. When the file doesn't
> > > exist audit_inode() on the parent is called but by the time we've
> > > grabbed the inode lock someone else might already have created the file
> > > and then again we wouldn't audit_inode() on the file but we would have
> > > on the parent.
> > >
> > > I think that's fine. But if that's bothersome the more aggressive thing
> > > to do would be to pull that audit_inode() on the parent further down
> > > after we created the file. Imho, that should be fine?...
> > >
> > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > for a completely untested draft of what I mean.
> >
> > Yeah, that's a lot simpler. That said, my experience when I've worked
> > with audit in the past is that people who are using it are _very_
> > sensitive to changes of when records get emitted or not. I don't like
> > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > but keeping everything working exactly the same has been my MO whenever
> > I have to work in there.
> >
> > If a certain access pattern suddenly generates a different set of
> > records (or some are missing, as would be in this case), we might get
> > bug reports about this. I'm ok with simplifying this code in the way
> > you suggest, but we may want to do it in a patch on top of mine, to
> > make it simple to revert later if that becomes necessary.
>
> Fwiw, even with the rearranged checks in v3 of the patch audit records
> will be dropped because we may find a positive dentry but the path may
> have trailing slashes. At that point we just return without audit
> whereas before we always would've done that audit.
>
> Honestly, we should move that audit event as right now it's just really
> weird and see if that works. Otherwise the change is somewhat horrible
> complicating the already convoluted logic even more.
>
> So I'm appending the patches that I have on top of your patch in
> vfs.misc. Can you (other as well ofc) take a look and tell me whether
> that's not breaking anything completely other than later audit events?
The changes look good as far as I'm concerned but let me CC audit guys if
they have some thoughts regarding the change in generating audit event for
the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
event for the parent when we are failing open due to trailing slashes in
the pathname? Essentially we are speaking about moving:
audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
from open_last_lookups() into lookup_open().
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 17:11 ` Jan Kara
@ 2024-08-08 21:12 ` Paul Moore
2024-08-08 23:43 ` Jeff Layton
0 siblings, 1 reply; 33+ messages in thread
From: Paul Moore @ 2024-08-08 21:12 UTC (permalink / raw)
To: Jan Kara
Cc: Christian Brauner, Jeff Layton, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > +{
> > > > > + struct dentry *dentry;
> > > > > +
> > > > > + if (open_flag & O_CREAT) {
> > > > > + /* Don't bother on an O_EXCL create */
> > > > > + if (open_flag & O_EXCL)
> > > > > + return NULL;
> > > > > +
> > > > > + /*
> > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > + * use the dentry. For now, don't do this, since it shifts
> > > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > + * Reconsider this once dentry refcounting handles heavy
> > > > > + * contention better.
> > > > > + */
> > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > + return NULL;
> > > >
> > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > file was actually created or not. But the audit_inode() on the file
> > > > itself is only done when it was actually created. Imho, there's no need
> > > > to do audit_inode() on the parent when we immediately find that file
> > > > already existed. If we accept that then this makes the change a lot
> > > > simpler.
> > > >
> > > > The inconsistency would partially remain though. When the file doesn't
> > > > exist audit_inode() on the parent is called but by the time we've
> > > > grabbed the inode lock someone else might already have created the file
> > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > on the parent.
> > > >
> > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > to do would be to pull that audit_inode() on the parent further down
> > > > after we created the file. Imho, that should be fine?...
> > > >
> > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > for a completely untested draft of what I mean.
> > >
> > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > with audit in the past is that people who are using it are _very_
> > > sensitive to changes of when records get emitted or not. I don't like
> > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > but keeping everything working exactly the same has been my MO whenever
> > > I have to work in there.
> > >
> > > If a certain access pattern suddenly generates a different set of
> > > records (or some are missing, as would be in this case), we might get
> > > bug reports about this. I'm ok with simplifying this code in the way
> > > you suggest, but we may want to do it in a patch on top of mine, to
> > > make it simple to revert later if that becomes necessary.
> >
> > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > will be dropped because we may find a positive dentry but the path may
> > have trailing slashes. At that point we just return without audit
> > whereas before we always would've done that audit.
> >
> > Honestly, we should move that audit event as right now it's just really
> > weird and see if that works. Otherwise the change is somewhat horrible
> > complicating the already convoluted logic even more.
> >
> > So I'm appending the patches that I have on top of your patch in
> > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > that's not breaking anything completely other than later audit events?
>
> The changes look good as far as I'm concerned but let me CC audit guys if
> they have some thoughts regarding the change in generating audit event for
> the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> event for the parent when we are failing open due to trailing slashes in
> the pathname? Essentially we are speaking about moving:
>
> audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
>
> from open_last_lookups() into lookup_open().
Thanks for adding the audit mailing list to the CC, Jan. I would ask
for others to do the same when discussing changes that could impact
audit (similar requests for the LSM framework, SELinux, etc.).
The inode/path logging in audit is ... something. I have a
longstanding todo item to go revisit the audit inode logging, both to
fix some known bugs, and see what we can improve (I'm guessing quite a
bit). Unfortunately, there is always something else which is burning
a little bit hotter and I haven't been able to get to it yet.
The general idea with audit is that you want to record the information
both on success and failure. It's easy to understand the success
case, as it is a record of what actually happened on the system, but
you also want to record the failure case as it can provide some
insight on what a process/user is attempting to do, and that can be
very important for certain classes of users. I haven't dug into the
patches in Christian's tree, but in general I think Jeff's guidance
about not changing what is recorded in the audit log is probably good
advice (there will surely be exceptions to that, but it's still good
guidance).
--
paul-moore.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 21:12 ` Paul Moore
@ 2024-08-08 23:43 ` Jeff Layton
2024-08-09 0:28 ` Paul Moore
0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-08 23:43 UTC (permalink / raw)
To: Paul Moore, Jan Kara
Cc: Christian Brauner, Alexander Viro, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > +{
> > > > > > + struct dentry *dentry;
> > > > > > +
> > > > > > + if (open_flag & O_CREAT) {
> > > > > > + /* Don't bother on an O_EXCL create */
> > > > > > + if (open_flag & O_EXCL)
> > > > > > + return NULL;
> > > > > > +
> > > > > > + /*
> > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > + * use the dentry. For now, don't do this, since it shifts
> > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > + * Reconsider this once dentry refcounting handles heavy
> > > > > > + * contention better.
> > > > > > + */
> > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > + return NULL;
> > > > >
> > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > file was actually created or not. But the audit_inode() on the file
> > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > already existed. If we accept that then this makes the change a lot
> > > > > simpler.
> > > > >
> > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > grabbed the inode lock someone else might already have created the file
> > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > on the parent.
> > > > >
> > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > after we created the file. Imho, that should be fine?...
> > > > >
> > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > for a completely untested draft of what I mean.
> > > >
> > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > with audit in the past is that people who are using it are _very_
> > > > sensitive to changes of when records get emitted or not. I don't like
> > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > but keeping everything working exactly the same has been my MO whenever
> > > > I have to work in there.
> > > >
> > > > If a certain access pattern suddenly generates a different set of
> > > > records (or some are missing, as would be in this case), we might get
> > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > make it simple to revert later if that becomes necessary.
> > >
> > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > will be dropped because we may find a positive dentry but the path may
> > > have trailing slashes. At that point we just return without audit
> > > whereas before we always would've done that audit.
> > >
> > > Honestly, we should move that audit event as right now it's just really
> > > weird and see if that works. Otherwise the change is somewhat horrible
> > > complicating the already convoluted logic even more.
> > >
> > > So I'm appending the patches that I have on top of your patch in
> > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > that's not breaking anything completely other than later audit events?
> >
> > The changes look good as far as I'm concerned but let me CC audit guys if
> > they have some thoughts regarding the change in generating audit event for
> > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > event for the parent when we are failing open due to trailing slashes in
> > the pathname? Essentially we are speaking about moving:
> >
> > audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> >
> > from open_last_lookups() into lookup_open().
>
> Thanks for adding the audit mailing list to the CC, Jan. I would ask
> for others to do the same when discussing changes that could impact
> audit (similar requests for the LSM framework, SELinux, etc.).
>
> The inode/path logging in audit is ... something. I have a
> longstanding todo item to go revisit the audit inode logging, both to
> fix some known bugs, and see what we can improve (I'm guessing quite a
> bit). Unfortunately, there is always something else which is burning
> a little bit hotter and I haven't been able to get to it yet.
>
It is "something" alright. The audit logging just happens at strange
and inconvenient times vs. what else we're trying to do wrt pathwalking
and such. In particular here, the fact __audit_inode can block is what
really sucks.
Since we're discussing it...
ISTM that the inode/path logging here is something like a tracepoint.
In particular, we're looking to record a specific set of information at
specific points in the code. One of the big differences between them
however is that tracepoints don't block. The catch is that we can't
just drop messages if we run out of audit logging space, so that would
have to be handled reasonably.
I wonder if we could leverage the tracepoint infrastructure to help us
record the necessary info somehow? Copy the records into a specific
ring buffer, and then copy them out to the audit infrastructure in
task_work?
I don't have any concrete ideas here, but the path/inode audit code has
been a burden for a while now and it'd be good to think about how we
could do this better.
> The general idea with audit is that you want to record the information
> both on success and failure. It's easy to understand the success
> case, as it is a record of what actually happened on the system, but
> you also want to record the failure case as it can provide some
> insight on what a process/user is attempting to do, and that can be
> very important for certain classes of users. I haven't dug into the
> patches in Christian's tree, but in general I think Jeff's guidance
> about not changing what is recorded in the audit log is probably good
> advice (there will surely be exceptions to that, but it's still good
> guidance).
>
In this particular case, the question is:
Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
file, just because O_CREAT was set? We don't emit such a record when
opening without O_CREAT set.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-08 23:43 ` Jeff Layton
@ 2024-08-09 0:28 ` Paul Moore
2024-08-09 0:33 ` Jeff Layton
0 siblings, 1 reply; 33+ messages in thread
From: Paul Moore @ 2024-08-09 0:28 UTC (permalink / raw)
To: Jeff Layton
Cc: Jan Kara, Christian Brauner, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > +{
> > > > > > > + struct dentry *dentry;
> > > > > > > +
> > > > > > > + if (open_flag & O_CREAT) {
> > > > > > > + /* Don't bother on an O_EXCL create */
> > > > > > > + if (open_flag & O_EXCL)
> > > > > > > + return NULL;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > + * use the dentry. For now, don't do this, since it shifts
> > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > + * Reconsider this once dentry refcounting handles heavy
> > > > > > > + * contention better.
> > > > > > > + */
> > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > + return NULL;
> > > > > >
> > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > simpler.
> > > > > >
> > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > on the parent.
> > > > > >
> > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > after we created the file. Imho, that should be fine?...
> > > > > >
> > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > for a completely untested draft of what I mean.
> > > > >
> > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > with audit in the past is that people who are using it are _very_
> > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > I have to work in there.
> > > > >
> > > > > If a certain access pattern suddenly generates a different set of
> > > > > records (or some are missing, as would be in this case), we might get
> > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > make it simple to revert later if that becomes necessary.
> > > >
> > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > will be dropped because we may find a positive dentry but the path may
> > > > have trailing slashes. At that point we just return without audit
> > > > whereas before we always would've done that audit.
> > > >
> > > > Honestly, we should move that audit event as right now it's just really
> > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > complicating the already convoluted logic even more.
> > > >
> > > > So I'm appending the patches that I have on top of your patch in
> > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > that's not breaking anything completely other than later audit events?
> > >
> > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > they have some thoughts regarding the change in generating audit event for
> > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > event for the parent when we are failing open due to trailing slashes in
> > > the pathname? Essentially we are speaking about moving:
> > >
> > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > >
> > > from open_last_lookups() into lookup_open().
> >
> > Thanks for adding the audit mailing list to the CC, Jan. I would ask
> > for others to do the same when discussing changes that could impact
> > audit (similar requests for the LSM framework, SELinux, etc.).
> >
> > The inode/path logging in audit is ... something. I have a
> > longstanding todo item to go revisit the audit inode logging, both to
> > fix some known bugs, and see what we can improve (I'm guessing quite a
> > bit). Unfortunately, there is always something else which is burning
> > a little bit hotter and I haven't been able to get to it yet.
> >
>
> It is "something" alright. The audit logging just happens at strange
> and inconvenient times vs. what else we're trying to do wrt pathwalking
> and such. In particular here, the fact __audit_inode can block is what
> really sucks.
>
> Since we're discussing it...
>
> ISTM that the inode/path logging here is something like a tracepoint.
> In particular, we're looking to record a specific set of information at
> specific points in the code. One of the big differences between them
> however is that tracepoints don't block. The catch is that we can't
> just drop messages if we run out of audit logging space, so that would
> have to be handled reasonably.
Yes, the buffer allocation is the tricky bit. Audit does preallocate
some structs for tracking names which ideally should handle the vast
majority of the cases, but yes, we need something to handle all of the
corner cases too without having to resort to audit_panic().
> I wonder if we could leverage the tracepoint infrastructure to help us
> record the necessary info somehow? Copy the records into a specific
> ring buffer, and then copy them out to the audit infrastructure in
> task_work?
I believe using task_work will cause a number of challenges for the
audit subsystem as we try to bring everything together into a single
audit event. We've had a lot of problems with io_uring doing similar
things, some of which are still unresolved.
> I don't have any concrete ideas here, but the path/inode audit code has
> been a burden for a while now and it'd be good to think about how we
> could do this better.
I've got some grand ideas on how to cut down on a lot of our
allocations and string generation in the critical path, not just with
the inodes, but with audit records in general. Sadly I just haven't
had the time to get to any of it.
> > The general idea with audit is that you want to record the information
> > both on success and failure. It's easy to understand the success
> > case, as it is a record of what actually happened on the system, but
> > you also want to record the failure case as it can provide some
> > insight on what a process/user is attempting to do, and that can be
> > very important for certain classes of users. I haven't dug into the
> > patches in Christian's tree, but in general I think Jeff's guidance
> > about not changing what is recorded in the audit log is probably good
> > advice (there will surely be exceptions to that, but it's still good
> > guidance).
> >
>
> In this particular case, the question is:
>
> Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> file, just because O_CREAT was set? We don't emit such a record when
> opening without O_CREAT set.
I'm not as current on the third-party security requirements as I used
to be, but I do know that oftentimes when a file is created the parent
directory is an important bit of information to have in the audit log.
--
paul-moore.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-09 0:28 ` Paul Moore
@ 2024-08-09 0:33 ` Jeff Layton
2024-08-09 1:22 ` Paul Moore
0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-09 0:33 UTC (permalink / raw)
To: Paul Moore
Cc: Jan Kara, Christian Brauner, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > > +{
> > > > > > > > + struct dentry *dentry;
> > > > > > > > +
> > > > > > > > + if (open_flag & O_CREAT) {
> > > > > > > > + /* Don't bother on an O_EXCL create */
> > > > > > > > + if (open_flag & O_EXCL)
> > > > > > > > + return NULL;
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > > + * use the dentry. For now, don't do this, since it shifts
> > > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > > + * Reconsider this once dentry refcounting handles heavy
> > > > > > > > + * contention better.
> > > > > > > > + */
> > > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > > + return NULL;
> > > > > > >
> > > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > > simpler.
> > > > > > >
> > > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > > on the parent.
> > > > > > >
> > > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > > after we created the file. Imho, that should be fine?...
> > > > > > >
> > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > for a completely untested draft of what I mean.
> > > > > >
> > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > > with audit in the past is that people who are using it are _very_
> > > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > > I have to work in there.
> > > > > >
> > > > > > If a certain access pattern suddenly generates a different set of
> > > > > > records (or some are missing, as would be in this case), we might get
> > > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > > make it simple to revert later if that becomes necessary.
> > > > >
> > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > > will be dropped because we may find a positive dentry but the path may
> > > > > have trailing slashes. At that point we just return without audit
> > > > > whereas before we always would've done that audit.
> > > > >
> > > > > Honestly, we should move that audit event as right now it's just really
> > > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > > complicating the already convoluted logic even more.
> > > > >
> > > > > So I'm appending the patches that I have on top of your patch in
> > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > > that's not breaking anything completely other than later audit events?
> > > >
> > > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > > they have some thoughts regarding the change in generating audit event for
> > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > > event for the parent when we are failing open due to trailing slashes in
> > > > the pathname? Essentially we are speaking about moving:
> > > >
> > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > >
> > > > from open_last_lookups() into lookup_open().
> > >
> > > Thanks for adding the audit mailing list to the CC, Jan. I would ask
> > > for others to do the same when discussing changes that could impact
> > > audit (similar requests for the LSM framework, SELinux, etc.).
> > >
> > > The inode/path logging in audit is ... something. I have a
> > > longstanding todo item to go revisit the audit inode logging, both to
> > > fix some known bugs, and see what we can improve (I'm guessing quite a
> > > bit). Unfortunately, there is always something else which is burning
> > > a little bit hotter and I haven't been able to get to it yet.
> > >
> >
> > It is "something" alright. The audit logging just happens at strange
> > and inconvenient times vs. what else we're trying to do wrt pathwalking
> > and such. In particular here, the fact __audit_inode can block is what
> > really sucks.
> >
> > Since we're discussing it...
> >
> > ISTM that the inode/path logging here is something like a tracepoint.
> > In particular, we're looking to record a specific set of information at
> > specific points in the code. One of the big differences between them
> > however is that tracepoints don't block. The catch is that we can't
> > just drop messages if we run out of audit logging space, so that would
> > have to be handled reasonably.
>
> Yes, the buffer allocation is the tricky bit. Audit does preallocate
> some structs for tracking names which ideally should handle the vast
> majority of the cases, but yes, we need something to handle all of the
> corner cases too without having to resort to audit_panic().
>
> > I wonder if we could leverage the tracepoint infrastructure to help us
> > record the necessary info somehow? Copy the records into a specific
> > ring buffer, and then copy them out to the audit infrastructure in
> > task_work?
>
> I believe using task_work will cause a number of challenges for the
> audit subsystem as we try to bring everything together into a single
> audit event. We've had a lot of problems with io_uring doing similar
> things, some of which are still unresolved.
>
> > I don't have any concrete ideas here, but the path/inode audit code has
> > been a burden for a while now and it'd be good to think about how we
> > could do this better.
>
> I've got some grand ideas on how to cut down on a lot of our
> allocations and string generation in the critical path, not just with
> the inodes, but with audit records in general. Sadly I just haven't
> had the time to get to any of it.
>
> > > The general idea with audit is that you want to record the information
> > > both on success and failure. It's easy to understand the success
> > > case, as it is a record of what actually happened on the system, but
> > > you also want to record the failure case as it can provide some
> > > insight on what a process/user is attempting to do, and that can be
> > > very important for certain classes of users. I haven't dug into the
> > > patches in Christian's tree, but in general I think Jeff's guidance
> > > about not changing what is recorded in the audit log is probably good
> > > advice (there will surely be exceptions to that, but it's still good
> > > guidance).
> > >
> >
> > In this particular case, the question is:
> >
> > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> > file, just because O_CREAT was set? We don't emit such a record when
> > opening without O_CREAT set.
>
> I'm not as current on the third-party security requirements as I used
> to be, but I do know that oftentimes when a file is created the parent
> directory is an important bit of information to have in the audit log.
>
Right. We'd still have that here since we have to unlazy to actually
create the file.
The question here is about the case where O_CREAT is set, but the file
already exists. Nothing is being created in that case, so do we need to
emit an audit record for the parent?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-09 0:33 ` Jeff Layton
@ 2024-08-09 1:22 ` Paul Moore
2024-08-09 14:21 ` Jeff Layton
0 siblings, 1 reply; 33+ messages in thread
From: Paul Moore @ 2024-08-09 1:22 UTC (permalink / raw)
To: Jeff Layton
Cc: Jan Kara, Christian Brauner, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > > > +{
> > > > > > > > > + struct dentry *dentry;
> > > > > > > > > +
> > > > > > > > > + if (open_flag & O_CREAT) {
> > > > > > > > > + /* Don't bother on an O_EXCL create */
> > > > > > > > > + if (open_flag & O_EXCL)
> > > > > > > > > + return NULL;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > > > + * use the dentry. For now, don't do this, since it shifts
> > > > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > > > + * Reconsider this once dentry refcounting handles heavy
> > > > > > > > > + * contention better.
> > > > > > > > > + */
> > > > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > > > + return NULL;
> > > > > > > >
> > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > > > simpler.
> > > > > > > >
> > > > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > > > on the parent.
> > > > > > > >
> > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > > > after we created the file. Imho, that should be fine?...
> > > > > > > >
> > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > > for a completely untested draft of what I mean.
> > > > > > >
> > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > > > with audit in the past is that people who are using it are _very_
> > > > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > > > I have to work in there.
> > > > > > >
> > > > > > > If a certain access pattern suddenly generates a different set of
> > > > > > > records (or some are missing, as would be in this case), we might get
> > > > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > > > make it simple to revert later if that becomes necessary.
> > > > > >
> > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > > > will be dropped because we may find a positive dentry but the path may
> > > > > > have trailing slashes. At that point we just return without audit
> > > > > > whereas before we always would've done that audit.
> > > > > >
> > > > > > Honestly, we should move that audit event as right now it's just really
> > > > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > > > complicating the already convoluted logic even more.
> > > > > >
> > > > > > So I'm appending the patches that I have on top of your patch in
> > > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > > > that's not breaking anything completely other than later audit events?
> > > > >
> > > > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > > > they have some thoughts regarding the change in generating audit event for
> > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > > > event for the parent when we are failing open due to trailing slashes in
> > > > > the pathname? Essentially we are speaking about moving:
> > > > >
> > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > >
> > > > > from open_last_lookups() into lookup_open().
> > > >
> > > > Thanks for adding the audit mailing list to the CC, Jan. I would ask
> > > > for others to do the same when discussing changes that could impact
> > > > audit (similar requests for the LSM framework, SELinux, etc.).
> > > >
> > > > The inode/path logging in audit is ... something. I have a
> > > > longstanding todo item to go revisit the audit inode logging, both to
> > > > fix some known bugs, and see what we can improve (I'm guessing quite a
> > > > bit). Unfortunately, there is always something else which is burning
> > > > a little bit hotter and I haven't been able to get to it yet.
> > > >
> > >
> > > It is "something" alright. The audit logging just happens at strange
> > > and inconvenient times vs. what else we're trying to do wrt pathwalking
> > > and such. In particular here, the fact __audit_inode can block is what
> > > really sucks.
> > >
> > > Since we're discussing it...
> > >
> > > ISTM that the inode/path logging here is something like a tracepoint.
> > > In particular, we're looking to record a specific set of information at
> > > specific points in the code. One of the big differences between them
> > > however is that tracepoints don't block. The catch is that we can't
> > > just drop messages if we run out of audit logging space, so that would
> > > have to be handled reasonably.
> >
> > Yes, the buffer allocation is the tricky bit. Audit does preallocate
> > some structs for tracking names which ideally should handle the vast
> > majority of the cases, but yes, we need something to handle all of the
> > corner cases too without having to resort to audit_panic().
> >
> > > I wonder if we could leverage the tracepoint infrastructure to help us
> > > record the necessary info somehow? Copy the records into a specific
> > > ring buffer, and then copy them out to the audit infrastructure in
> > > task_work?
> >
> > I believe using task_work will cause a number of challenges for the
> > audit subsystem as we try to bring everything together into a single
> > audit event. We've had a lot of problems with io_uring doing similar
> > things, some of which are still unresolved.
> >
> > > I don't have any concrete ideas here, but the path/inode audit code has
> > > been a burden for a while now and it'd be good to think about how we
> > > could do this better.
> >
> > I've got some grand ideas on how to cut down on a lot of our
> > allocations and string generation in the critical path, not just with
> > the inodes, but with audit records in general. Sadly I just haven't
> > had the time to get to any of it.
> >
> > > > The general idea with audit is that you want to record the information
> > > > both on success and failure. It's easy to understand the success
> > > > case, as it is a record of what actually happened on the system, but
> > > > you also want to record the failure case as it can provide some
> > > > insight on what a process/user is attempting to do, and that can be
> > > > very important for certain classes of users. I haven't dug into the
> > > > patches in Christian's tree, but in general I think Jeff's guidance
> > > > about not changing what is recorded in the audit log is probably good
> > > > advice (there will surely be exceptions to that, but it's still good
> > > > guidance).
> > > >
> > >
> > > In this particular case, the question is:
> > >
> > > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> > > file, just because O_CREAT was set? We don't emit such a record when
> > > opening without O_CREAT set.
> >
> > I'm not as current on the third-party security requirements as I used
> > to be, but I do know that oftentimes when a file is created the parent
> > directory is an important bit of information to have in the audit log.
> >
>
> Right. We'd still have that here since we have to unlazy to actually
> create the file.
>
> The question here is about the case where O_CREAT is set, but the file
> already exists. Nothing is being created in that case, so do we need to
> emit an audit record for the parent?
As long as the full path information is present in the existing file's
audit record it should be okay.
--
paul-moore.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-09 1:22 ` Paul Moore
@ 2024-08-09 14:21 ` Jeff Layton
2024-08-11 21:52 ` Paul Moore
0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2024-08-09 14:21 UTC (permalink / raw)
To: Paul Moore
Cc: Jan Kara, Christian Brauner, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org>
> wrote:
> > On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> > > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org>
> > > wrote:
> > > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton
> > > > > > > wrote:
> > > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner
> > > > > > > > wrote:
> > > > > > > > > > +static struct dentry *lookup_fast_for_open(struct
> > > > > > > > > > nameidata *nd, int open_flag)
> > > > > > > > > > +{
> > > > > > > > > > + struct dentry *dentry;
> > > > > > > > > > +
> > > > > > > > > > + if (open_flag & O_CREAT) {
> > > > > > > > > > + /* Don't bother on an O_EXCL create
> > > > > > > > > > */
> > > > > > > > > > + if (open_flag & O_EXCL)
> > > > > > > > > > + return NULL;
> > > > > > > > > > +
> > > > > > > > > > + /*
> > > > > > > > > > + * FIXME: If auditing is enabled,
> > > > > > > > > > then we'll have to unlazy to
> > > > > > > > > > + * use the dentry. For now, don't
> > > > > > > > > > do this, since it shifts
> > > > > > > > > > + * contention from parent's i_rwsem
> > > > > > > > > > to its d_lockref spinlock.
> > > > > > > > > > + * Reconsider this once dentry
> > > > > > > > > > refcounting handles heavy
> > > > > > > > > > + * contention better.
> > > > > > > > > > + */
> > > > > > > > > > + if ((nd->flags & LOOKUP_RCU) &&
> > > > > > > > > > !audit_dummy_context())
> > > > > > > > > > + return NULL;
> > > > > > > > >
> > > > > > > > > Hm, the audit_inode() on the parent is done
> > > > > > > > > independent of whether the
> > > > > > > > > file was actually created or not. But the
> > > > > > > > > audit_inode() on the file
> > > > > > > > > itself is only done when it was actually created.
> > > > > > > > > Imho, there's no need
> > > > > > > > > to do audit_inode() on the parent when we immediately
> > > > > > > > > find that file
> > > > > > > > > already existed. If we accept that then this makes
> > > > > > > > > the change a lot
> > > > > > > > > simpler.
> > > > > > > > >
> > > > > > > > > The inconsistency would partially remain though. When
> > > > > > > > > the file doesn't
> > > > > > > > > exist audit_inode() on the parent is called but by
> > > > > > > > > the time we've
> > > > > > > > > grabbed the inode lock someone else might already
> > > > > > > > > have created the file
> > > > > > > > > and then again we wouldn't audit_inode() on the file
> > > > > > > > > but we would have
> > > > > > > > > on the parent.
> > > > > > > > >
> > > > > > > > > I think that's fine. But if that's bothersome the
> > > > > > > > > more aggressive thing
> > > > > > > > > to do would be to pull that audit_inode() on the
> > > > > > > > > parent further down
> > > > > > > > > after we created the file. Imho, that should be
> > > > > > > > > fine?...
> > > > > > > > >
> > > > > > > > > See
> > > > > > > > > https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > > > for a completely untested draft of what I mean.
> > > > > > > >
> > > > > > > > Yeah, that's a lot simpler. That said, my experience
> > > > > > > > when I've worked
> > > > > > > > with audit in the past is that people who are using it
> > > > > > > > are _very_
> > > > > > > > sensitive to changes of when records get emitted or
> > > > > > > > not. I don't like
> > > > > > > > this, because I think the rules here are ad-hoc and
> > > > > > > > somewhat arbitrary,
> > > > > > > > but keeping everything working exactly the same has
> > > > > > > > been my MO whenever
> > > > > > > > I have to work in there.
> > > > > > > >
> > > > > > > > If a certain access pattern suddenly generates a
> > > > > > > > different set of
> > > > > > > > records (or some are missing, as would be in this
> > > > > > > > case), we might get
> > > > > > > > bug reports about this. I'm ok with simplifying this
> > > > > > > > code in the way
> > > > > > > > you suggest, but we may want to do it in a patch on top
> > > > > > > > of mine, to
> > > > > > > > make it simple to revert later if that becomes
> > > > > > > > necessary.
> > > > > > >
> > > > > > > Fwiw, even with the rearranged checks in v3 of the patch
> > > > > > > audit records
> > > > > > > will be dropped because we may find a positive dentry but
> > > > > > > the path may
> > > > > > > have trailing slashes. At that point we just return
> > > > > > > without audit
> > > > > > > whereas before we always would've done that audit.
> > > > > > >
> > > > > > > Honestly, we should move that audit event as right now
> > > > > > > it's just really
> > > > > > > weird and see if that works. Otherwise the change is
> > > > > > > somewhat horrible
> > > > > > > complicating the already convoluted logic even more.
> > > > > > >
> > > > > > > So I'm appending the patches that I have on top of your
> > > > > > > patch in
> > > > > > > vfs.misc. Can you (other as well ofc) take a look and
> > > > > > > tell me whether
> > > > > > > that's not breaking anything completely other than later
> > > > > > > audit events?
> > > > > >
> > > > > > The changes look good as far as I'm concerned but let me CC
> > > > > > audit guys if
> > > > > > they have some thoughts regarding the change in generating
> > > > > > audit event for
> > > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't
> > > > > > generate audit
> > > > > > event for the parent when we are failing open due to
> > > > > > trailing slashes in
> > > > > > the pathname? Essentially we are speaking about moving:
> > > > > >
> > > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > > >
> > > > > > from open_last_lookups() into lookup_open().
> > > > >
> > > > > Thanks for adding the audit mailing list to the CC, Jan. I
> > > > > would ask
> > > > > for others to do the same when discussing changes that could
> > > > > impact
> > > > > audit (similar requests for the LSM framework, SELinux,
> > > > > etc.).
> > > > >
> > > > > The inode/path logging in audit is ... something. I have a
> > > > > longstanding todo item to go revisit the audit inode logging,
> > > > > both to
> > > > > fix some known bugs, and see what we can improve (I'm
> > > > > guessing quite a
> > > > > bit). Unfortunately, there is always something else which is
> > > > > burning
> > > > > a little bit hotter and I haven't been able to get to it yet.
> > > > >
> > > >
> > > > It is "something" alright. The audit logging just happens at
> > > > strange
> > > > and inconvenient times vs. what else we're trying to do wrt
> > > > pathwalking
> > > > and such. In particular here, the fact __audit_inode can block
> > > > is what
> > > > really sucks.
> > > >
> > > > Since we're discussing it...
> > > >
> > > > ISTM that the inode/path logging here is something like a
> > > > tracepoint.
> > > > In particular, we're looking to record a specific set of
> > > > information at
> > > > specific points in the code. One of the big differences between
> > > > them
> > > > however is that tracepoints don't block. The catch is that we
> > > > can't
> > > > just drop messages if we run out of audit logging space, so
> > > > that would
> > > > have to be handled reasonably.
> > >
> > > Yes, the buffer allocation is the tricky bit. Audit does
> > > preallocate
> > > some structs for tracking names which ideally should handle the
> > > vast
> > > majority of the cases, but yes, we need something to handle all
> > > of the
> > > corner cases too without having to resort to audit_panic().
> > >
> > > > I wonder if we could leverage the tracepoint infrastructure to
> > > > help us
> > > > record the necessary info somehow? Copy the records into a
> > > > specific
> > > > ring buffer, and then copy them out to the audit infrastructure
> > > > in
> > > > task_work?
> > >
> > > I believe using task_work will cause a number of challenges for
> > > the
> > > audit subsystem as we try to bring everything together into a
> > > single
> > > audit event. We've had a lot of problems with io_uring doing
> > > similar
> > > things, some of which are still unresolved.
> > >
> > > > I don't have any concrete ideas here, but the path/inode audit
> > > > code has
> > > > been a burden for a while now and it'd be good to think about
> > > > how we
> > > > could do this better.
> > >
> > > I've got some grand ideas on how to cut down on a lot of our
> > > allocations and string generation in the critical path, not just
> > > with
> > > the inodes, but with audit records in general. Sadly I just
> > > haven't
> > > had the time to get to any of it.
> > >
> > > > > The general idea with audit is that you want to record the
> > > > > information
> > > > > both on success and failure. It's easy to understand the
> > > > > success
> > > > > case, as it is a record of what actually happened on the
> > > > > system, but
> > > > > you also want to record the failure case as it can provide
> > > > > some
> > > > > insight on what a process/user is attempting to do, and that
> > > > > can be
> > > > > very important for certain classes of users. I haven't dug
> > > > > into the
> > > > > patches in Christian's tree, but in general I think Jeff's
> > > > > guidance
> > > > > about not changing what is recorded in the audit log is
> > > > > probably good
> > > > > advice (there will surely be exceptions to that, but it's
> > > > > still good
> > > > > guidance).
> > > > >
> > > >
> > > > In this particular case, the question is:
> > > >
> > > > Do we need to emit a AUDIT_INODE_PARENT record when opening an
> > > > existing
> > > > file, just because O_CREAT was set? We don't emit such a record
> > > > when
> > > > opening without O_CREAT set.
> > >
> > > I'm not as current on the third-party security requirements as I
> > > used
> > > to be, but I do know that oftentimes when a file is created the
> > > parent
> > > directory is an important bit of information to have in the audit
> > > log.
> > >
> >
> > Right. We'd still have that here since we have to unlazy to
> > actually
> > create the file.
> >
> > The question here is about the case where O_CREAT is set, but the
> > file
> > already exists. Nothing is being created in that case, so do we
> > need to
> > emit an audit record for the parent?
>
> As long as the full path information is present in the existing
> file's
> audit record it should be okay.
>
O_CREAT is ignored when the dentry already exists, so doing the same
thing that we do when O_CREAT isn't set seems reasonable.
We do call this in do_open, which would apply in this case:
if (!(file->f_mode & FMODE_CREATED))
audit_inode(nd->name, nd->path.dentry, 0);
That should have the necessary path info. If that's the case, then I
think Christian's cleanup series on top of mine should be OK. I think
that the only thing that would be missing is the AUDIT_INODE_PARENT
record for the directory in the case where the dentry already exists,
which should be superfluous.
ISTR that Red Hat has a pretty extensive testsuite for audit. We might
want to get them to run their tests on Christian's changes to be sure
there are no surprises, if they are amenable.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-09 14:21 ` Jeff Layton
@ 2024-08-11 21:52 ` Paul Moore
0 siblings, 0 replies; 33+ messages in thread
From: Paul Moore @ 2024-08-11 21:52 UTC (permalink / raw)
To: Jeff Layton
Cc: Jan Kara, Christian Brauner, Alexander Viro, Andrew Morton,
Mateusz Guzik, Josef Bacik, linux-fsdevel, linux-kernel, audit
On Fri, Aug 9, 2024 at 10:21 AM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org>
> > wrote:
...
> > > The question here is about the case where O_CREAT is set, but the
> > > file
> > > already exists. Nothing is being created in that case, so do we
> > > need to
> > > emit an audit record for the parent?
> >
> > As long as the full path information is present in the existing
> > file's
> > audit record it should be okay.
>
> O_CREAT is ignored when the dentry already exists, so doing the same
> thing that we do when O_CREAT isn't set seems reasonable.
>
> We do call this in do_open, which would apply in this case:
>
> if (!(file->f_mode & FMODE_CREATED))
> audit_inode(nd->name, nd->path.dentry, 0);
>
> That should have the necessary path info. If that's the case, then I
> think Christian's cleanup series on top of mine should be OK. I think
> that the only thing that would be missing is the AUDIT_INODE_PARENT
> record for the directory in the case where the dentry already exists,
> which should be superfluous.
>
> ISTR that Red Hat has a pretty extensive testsuite for audit. We might
> want to get them to run their tests on Christian's changes to be sure
> there are no surprises, if they are amenable.
I believe you are thinking of the audit-test project, which started as
a community effort to develop a test suite suitable for the popular
Common Criteria protection profiles of the time (of which auditing was
an important requirement). Unfortunately, after a couple rounds of
certifications I couldn't get the various companies involved to
continue to maintain the public test suite anymore, so everyone went
their own way with private forks. I have no idea if the community
project still works, but someone at IBM/RH should have a recent~ish
version that either runs on a modern system or is close to it. FWIW,
the dead/dormant community project can be found at the link below
(yes, that is a sf.net link):
https://sourceforge.net/projects/audit-test
It's been a while since I've been at RH, so I'm not sure if my test/QA
contacts are still there, but if you don't have a contact at IBM/RH
Jeff let me know and I can try to reach out.
--
paul-moore.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 19:51 ` Jeff Layton
@ 2024-08-14 2:18 ` Al Viro
2024-08-14 2:40 ` Al Viro
0 siblings, 1 reply; 33+ messages in thread
From: Al Viro @ 2024-08-14 2:18 UTC (permalink / raw)
To: Jeff Layton
Cc: Christian Brauner, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 06, 2024 at 03:51:35PM -0400, Jeff Layton wrote:
> > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > +{
> > + struct dentry *dentry;
> > +
> > + if (open_flag & O_CREAT) {
> > + /* Don't bother on an O_EXCL create */
> > + if (open_flag & O_EXCL)
> > + return NULL;
> > +
> > + /*
> > + * FIXME: If auditing is enabled, then we'll have to unlazy to
> > + * use the dentry. For now, don't do this, since it shifts
> > + * contention from parent's i_rwsem to its d_lockref spinlock.
> > + * Reconsider this once dentry refcounting handles heavy
> > + * contention better.
> > + */
> > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > + return NULL;
> > + }
> > +
> > + if (trailing_slashes(nd))
> > + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> > +
> > + dentry = lookup_fast(nd);
>
> Self-NAK on this patch. We have to test for IS_ERR on the returned
> dentry here. I'll send a v3 along after I've retested it.
That's not the only problem; your "is it negative" test is inherently
racy in RCU mode. IOW, what is positive at the time you get here can
bloody well go negative immediately afterwards. Hit that with
O_CREAT and you've got a bogus ENOENT...
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 2:18 ` Al Viro
@ 2024-08-14 2:40 ` Al Viro
2024-08-14 11:48 ` Jeff Layton
0 siblings, 1 reply; 33+ messages in thread
From: Al Viro @ 2024-08-14 2:40 UTC (permalink / raw)
To: Jeff Layton
Cc: Christian Brauner, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> That's not the only problem; your "is it negative" test is inherently
> racy in RCU mode. IOW, what is positive at the time you get here can
> bloody well go negative immediately afterwards. Hit that with
> O_CREAT and you've got a bogus ENOENT...
Hmm... OTOH, in that case you end up in step_into(), which will do the
right thing...
How well does that series survive NFS client regression tests?
That's where I'd expect potentially subtle shite, what with short-circuited
->d_revalidate() on the final pathwalk step in open()...
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 2:40 ` Al Viro
@ 2024-08-14 11:48 ` Jeff Layton
2024-08-14 12:40 ` Christian Brauner
2024-08-14 15:42 ` Al Viro
0 siblings, 2 replies; 33+ messages in thread
From: Jeff Layton @ 2024-08-14 11:48 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
>
> > That's not the only problem; your "is it negative" test is inherently
> > racy in RCU mode. IOW, what is positive at the time you get here can
> > bloody well go negative immediately afterwards. Hit that with
> > O_CREAT and you've got a bogus ENOENT...
>
> Hmm... OTOH, in that case you end up in step_into(), which will do the
> right thing...
>
> How well does that series survive NFS client regression tests?
> That's where I'd expect potentially subtle shite, what with short-circuited
> ->d_revalidate() on the final pathwalk step in open()...
Christian took in my v3 patch which is a bit different from this one.
It seems to be doing fine in testing with NFS and otherwise.
I don't think we short-circuit the d_revalidate though, do we? That
version calls lookup_fast on the last component which should
d_revalidate the last dentry before returning it.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 11:48 ` Jeff Layton
@ 2024-08-14 12:40 ` Christian Brauner
2024-08-14 15:44 ` Al Viro
2024-08-14 15:42 ` Al Viro
1 sibling, 1 reply; 33+ messages in thread
From: Christian Brauner @ 2024-08-14 12:40 UTC (permalink / raw)
To: Jeff Layton
Cc: Al Viro, Jan Kara, Andrew Morton, Mateusz Guzik, Josef Bacik,
linux-fsdevel, linux-kernel
> Christian took in my v3 patch which is a bit different from this one.
> It seems to be doing fine in testing with NFS and otherwise.
Every branch gets tested with nfs fstests (in addition to the usual
suspects):
Failures: generic/732
Failed 1 of 600 tests
And that just fails because it's missing your 4fd042e0465c
("generic/732: don't run it on NFS")
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 11:48 ` Jeff Layton
2024-08-14 12:40 ` Christian Brauner
@ 2024-08-14 15:42 ` Al Viro
2024-08-14 16:46 ` Jeff Layton
1 sibling, 1 reply; 33+ messages in thread
From: Al Viro @ 2024-08-14 15:42 UTC (permalink / raw)
To: Jeff Layton
Cc: Christian Brauner, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote:
> On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> >
> > > That's not the only problem; your "is it negative" test is inherently
> > > racy in RCU mode. IOW, what is positive at the time you get here can
> > > bloody well go negative immediately afterwards. Hit that with
> > > O_CREAT and you've got a bogus ENOENT...
> >
> > Hmm... OTOH, in that case you end up in step_into(), which will do the
> > right thing...
> >
> > How well does that series survive NFS client regression tests?
> > That's where I'd expect potentially subtle shite, what with short-circuited
> > ->d_revalidate() on the final pathwalk step in open()...
>
> Christian took in my v3 patch which is a bit different from this one.
> It seems to be doing fine in testing with NFS and otherwise.
>
> I don't think we short-circuit the d_revalidate though, do we? That
> version calls lookup_fast on the last component which should
> d_revalidate the last dentry before returning it.
It's not about a skipped call of ->d_revalidate(); it's about the NFS
(especially NFS4) dances inside ->d_revalidate(), where it tries to
cut down on roundtrips where possible. The interplay with ->atomic_open()
and ->open() is subtle and I'm not sure that we do not depend upon the
details of ->i_rwsem locking by fs/namei.c in there - proof of correctness
used to be rather convoluted there, especially wrt the unhashing and
rehashing aliases.
I'm not saying that your changes break things in there, but that's one
area where I would look for trouble. NFS has fairly extensive regression
tests, and it would be a good idea to beat that patchset with those.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 12:40 ` Christian Brauner
@ 2024-08-14 15:44 ` Al Viro
2024-08-16 8:34 ` Christian Brauner
0 siblings, 1 reply; 33+ messages in thread
From: Al Viro @ 2024-08-14 15:44 UTC (permalink / raw)
To: Christian Brauner
Cc: Jeff Layton, Jan Kara, Andrew Morton, Mateusz Guzik, Josef Bacik,
linux-fsdevel, linux-kernel
On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote:
> > Christian took in my v3 patch which is a bit different from this one.
> > It seems to be doing fine in testing with NFS and otherwise.
>
> Every branch gets tested with nfs fstests (in addition to the usual
> suspects):
>
> Failures: generic/732
> Failed 1 of 600 tests
>
> And that just fails because it's missing your 4fd042e0465c
> ("generic/732: don't run it on NFS")
connectathon would be more interesting...
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 15:42 ` Al Viro
@ 2024-08-14 16:46 ` Jeff Layton
0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2024-08-14 16:46 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Andrew Morton, Mateusz Guzik,
Josef Bacik, linux-fsdevel, linux-kernel
On Wed, 2024-08-14 at 16:42 +0100, Al Viro wrote:
> On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote:
> > On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> > > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> > >
> > > > That's not the only problem; your "is it negative" test is
> > > > inherently
> > > > racy in RCU mode. IOW, what is positive at the time you get
> > > > here can
> > > > bloody well go negative immediately afterwards. Hit that with
> > > > O_CREAT and you've got a bogus ENOENT...
> > >
> > > Hmm... OTOH, in that case you end up in step_into(), which will
> > > do the
> > > right thing...
> > >
> > > How well does that series survive NFS client regression
> > > tests?
> > > That's where I'd expect potentially subtle shite, what with
> > > short-circuited
> > > ->d_revalidate() on the final pathwalk step in open()...
> >
> > Christian took in my v3 patch which is a bit different from this
> > one.
> > It seems to be doing fine in testing with NFS and otherwise.
> >
> > I don't think we short-circuit the d_revalidate though, do we? That
> > version calls lookup_fast on the last component which should
> > d_revalidate the last dentry before returning it.
>
> It's not about a skipped call of ->d_revalidate(); it's about the NFS
> (especially NFS4) dances inside ->d_revalidate(), where it tries to
> cut down on roundtrips where possible. The interplay with -
> >atomic_open()
> and ->open() is subtle and I'm not sure that we do not depend upon
> the
> details of ->i_rwsem locking by fs/namei.c in there - proof of
> correctness
> used to be rather convoluted there, especially wrt the unhashing and
> rehashing aliases.
>
> I'm not saying that your changes break things in there, but that's
> one
> area where I would look for trouble. NFS has fairly extensive
> regression
> tests, and it would be a good idea to beat that patchset with those.
I've already run a bunch of NFS tests on it and it seems to be OK so
far, but I'll keep testing it. My take:
Opening an extant file with O_CREAT set should behave the same as with
O_CREAT not set.
I did crawl through NFS's d_revalidate functions. There are a couple of
places that check for O_CREAT, but they didn't seem to depend on the
i_rwsem or any particular locking.
Please do let me know if you see anything I missed though.
Thanks,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-06 20:47 ` Andi Kleen
@ 2024-08-15 15:07 ` Mateusz Guzik
0 siblings, 0 replies; 33+ messages in thread
From: Mateusz Guzik @ 2024-08-15 15:07 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel
On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Before I get to the vfs layer, there is a significant loss in the
> > memory allocator because of memcg -- it takes several irq off/on trips
> > for every alloc (needed to grab struct file *). I have a plan what to
> > do with it (handle stuff with local cmpxchg (note no lock prefix)),
> > which I'm trying to get around to. Apart from that you may note the
> > allocator fast path performs a 16-byte cmpxchg, which is again dog
> > slow and executes twice (once for the file obj, another time for the
> > namei buffer). Someone(tm) should patch it up and I have some vague
> > ideas, but 0 idea when I can take a serious stab.
>
> I just LBR sampled it on my skylake and it doesn't look
> particularly slow. You see the whole massive block including CMPXCHG16
> gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
> a missing cache line.
>
> kmem_cache_free:
> ffffffff9944ce20 nop %edi, %edx
> ffffffff9944ce24 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce29 pushq %rbp
> ffffffff9944ce2a mov %rdi, %rdx
> ffffffff9944ce2d mov %rsp, %rbp
> ffffffff9944ce30 pushq %r15
> ffffffff9944ce32 pushq %r14
> ffffffff9944ce34 pushq %r13
> ffffffff9944ce36 pushq %r12
> ffffffff9944ce38 mov $0x80000000, %r12d
> ffffffff9944ce3e pushq %rbx
> ffffffff9944ce3f mov %rsi, %rbx
> ffffffff9944ce42 and $0xfffffffffffffff0, %rsp
> ffffffff9944ce46 sub $0x10, %rsp
> ffffffff9944ce4a movq %gs:0x28, %rax
> ffffffff9944ce53 movq %rax, 0x8(%rsp)
> ffffffff9944ce58 xor %eax, %eax
> ffffffff9944ce5a add %rsi, %r12
> ffffffff9944ce5d jb 0xffffffff9944d1ea
> ffffffff9944ce63 mov $0xffffffff80000000, %rax
> ffffffff9944ce6a xor %r13d, %r13d
> ffffffff9944ce6d subq 0x17b068c(%rip), %rax
> ffffffff9944ce74 add %r12, %rax
> ffffffff9944ce77 shr $0xc, %rax
> ffffffff9944ce7b shl $0x6, %rax
> ffffffff9944ce7f addq 0x17b066a(%rip), %rax
> ffffffff9944ce86 movq 0x8(%rax), %rcx
> ffffffff9944ce8a test $0x1, %cl
> ffffffff9944ce8d jnz 0xffffffff9944d15c
> ffffffff9944ce93 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce98 movq (%rax), %rcx
> ffffffff9944ce9b and $0x8, %ch
> ffffffff9944ce9e jz 0xffffffff9944cfea
> ffffffff9944cea4 test %rax, %rax
> ffffffff9944cea7 jz 0xffffffff9944cfea
> ffffffff9944cead movq 0x8(%rax), %r14
> ffffffff9944ceb1 test %r14, %r14
> ffffffff9944ceb4 jz 0xffffffff9944cfac
> ffffffff9944ceba cmp %r14, %rdx
> ffffffff9944cebd jnz 0xffffffff9944d165
> ffffffff9944cec3 test %r14, %r14
> ffffffff9944cec6 jz 0xffffffff9944cfac
> ffffffff9944cecc movq 0x8(%rbp), %r15
> ffffffff9944ced0 nopl %eax, (%rax,%rax,1)
> ffffffff9944ced5 movq 0x1fe5134(%rip), %rax
> ffffffff9944cedc test %r13, %r13
> ffffffff9944cedf jnz 0xffffffff9944ceef
> ffffffff9944cee1 mov $0xffffffff80000000, %rax
> ffffffff9944cee8 subq 0x17b0611(%rip), %rax
> ffffffff9944ceef add %rax, %r12
> ffffffff9944cef2 shr $0xc, %r12
> ffffffff9944cef6 shl $0x6, %r12
> ffffffff9944cefa addq 0x17b05ef(%rip), %r12
> ffffffff9944cf01 movq 0x8(%r12), %rax
> ffffffff9944cf06 mov %r12, %r13
> ffffffff9944cf09 test $0x1, %al
> ffffffff9944cf0b jnz 0xffffffff9944d1b1
> ffffffff9944cf11 nopl %eax, (%rax,%rax,1)
> ffffffff9944cf16 movq (%r13), %rax
> ffffffff9944cf1a movq %rbx, (%rsp)
> ffffffff9944cf1e test $0x8, %ah
> ffffffff9944cf21 mov $0x0, %eax
> ffffffff9944cf26 cmovz %rax, %r13
> ffffffff9944cf2a data16 nop
> ffffffff9944cf2c movq 0x38(%r13), %r8
> ffffffff9944cf30 cmp $0x3, %r8
> ffffffff9944cf34 jnbe 0xffffffff9944d1ca
> ffffffff9944cf3a nopl %eax, (%rax,%rax,1)
> ffffffff9944cf3f movq 0x23d6f72(%rip), %rax
> ffffffff9944cf46 mov %rbx, %rdx
> ffffffff9944cf49 sub %rax, %rdx
> ffffffff9944cf4c cmp $0x1fffff, %rdx
> ffffffff9944cf53 jbe 0xffffffff9944d03a
> ffffffff9944cf59 movq (%r14), %rax
> ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax
> ffffffff9944cf64 movq 0x8(%rax), %rdx
> ffffffff9944cf68 cmpq %r13, 0x10(%rax)
> ffffffff9944cf6c jnz 0xffffffff9944d192
> ffffffff9944cf72 movl 0x28(%r14), %ecx
> ffffffff9944cf76 movq (%rax), %rax
> ffffffff9944cf79 add %rbx, %rcx
> ffffffff9944cf7c cmp %rbx, %rax
> ffffffff9944cf7f jz 0xffffffff9944d1ba
> ffffffff9944cf85 movq 0xb8(%r14), %rsi
> ffffffff9944cf8c mov %rcx, %rdi
> ffffffff9944cf8f bswap %rdi
> ffffffff9944cf92 xor %rax, %rsi
> ffffffff9944cf95 xor %rdi, %rsi
> ffffffff9944cf98 movq %rsi, (%rcx)
> ffffffff9944cf9b leaq 0x2000(%rdx), %rcx
> ffffffff9944cfa2 movq (%r14), %rsi
> ffffffff9944cfa5 cmpxchg16bx %gs:(%rsi)
> ffffffff9944cfaa jnz 0xffffffff9944cf59
> ffffffff9944cfac movq 0x8(%rsp), %rax
> ffffffff9944cfb1 subq %gs:0x28, %rax
> ffffffff9944cfba jnz 0xffffffff9944d1fc
> ffffffff9944cfc0 leaq -0x28(%rbp), %rsp
> ffffffff9944cfc4 popq %rbx
> ffffffff9944cfc5 popq %r12
> ffffffff9944cfc7 popq %r13
> ffffffff9944cfc9 popq %r14
> ffffffff9944cfcb popq %r15
> ffffffff9944cfcd popq %rbp
> ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <-------------
Sorry for late reply, my test box was temporarily unavailable and then
I forgot about this e-mail :)
I don't have a good scientific test(tm) and I don't think coming up
with one is warranted at the moment.
But to illustrate, I slapped together a test case for will-it-scale
where I either cmpxchg8 or 16 in a loop. No lock prefix on these.
On Sapphire Rapids I see well over twice the throughput for the 8-byte variant:
# ./cmpxchg8_processes
warmup
min:481465497 max:481465497 total:481465497
min:464439645 max:464439645 total:464439645
min:461884735 max:461884735 total:461884735
min:460850043 max:460850043 total:460850043
min:461066452 max:461066452 total:461066452
min:463984473 max:463984473 total:463984473
measurement
min:461317703 max:461317703 total:461317703
min:458608942 max:458608942 total:458608942
min:460846336 max:460846336 total:460846336
[snip]
# ./cmpxchg16b_processes
warmup
min:205207128 max:205207128 total:205207128
min:205010535 max:205010535 total:205010535
min:204877781 max:204877781 total:204877781
min:204163814 max:204163814 total:204163814
min:204392000 max:204392000 total:204392000
min:204094222 max:204094222 total:204094222
measurement
min:204243282 max:204243282 total:204243282
min:204136589 max:204136589 total:204136589
min:203504119 max:203504119 total:203504119
So I would say trying it out in a real alloc is worth looking at.
Of course the 16-byte variant is not used just for kicks, so going to
8 bytes is more involved than just replacing the instruction.
The current code follows the standard idea on how to deal with the ABA
problem -- apart from replacing a pointer you validate this is what
you thought by checking the counter in the same instruction.
I note that in the kernel we can do better, but I don't have have all
kinks worked out yet. The core idea builds on the fact that we can
cheaply detect a pending alloc on the same cpu and should a
conflicting free be executing from an interrupt, it can instead add
the returning buffer to a different list and the aba problem
disappears. Should the alloc fast path fail to find a free buffer, it
can disable interrupts an take a look at the fallback list.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
2024-08-14 15:44 ` Al Viro
@ 2024-08-16 8:34 ` Christian Brauner
0 siblings, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2024-08-16 8:34 UTC (permalink / raw)
To: Al Viro
Cc: Jeff Layton, Jan Kara, Andrew Morton, Mateusz Guzik, Josef Bacik,
linux-fsdevel, linux-kernel
On Wed, Aug 14, 2024 at 04:44:19PM GMT, Al Viro wrote:
> On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote:
> > > Christian took in my v3 patch which is a bit different from this one.
> > > It seems to be doing fine in testing with NFS and otherwise.
> >
> > Every branch gets tested with nfs fstests (in addition to the usual
> > suspects):
> >
> > Failures: generic/732
> > Failed 1 of 600 tests
> >
> > And that just fails because it's missing your 4fd042e0465c
> > ("generic/732: don't run it on NFS")
>
> connectathon would be more interesting...
Fwiw, I wasted almost a day to get more testing done here. But that
connectathon thing at [1] just failed to compile on anything reasonably
new and with errors that indicate that certain apis it uses have been
deprecated for a very long time.
In any case, I went out of my way and found that LTP has stresstests for
NFS for creation, open, mkdir, unlink etc and I added them to testing
now.
[1]: https://github.com/dkruchinin/cthon-nfs-tests
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2024-08-16 8:34 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-06 14:32 [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-06 15:25 ` Mateusz Guzik
2024-08-06 16:17 ` Jeff Layton
2024-08-06 16:42 ` Mateusz Guzik
2024-08-06 19:11 ` Andi Kleen
2024-08-06 19:22 ` Mateusz Guzik
2024-08-06 20:42 ` Jeff Layton
2024-08-06 19:26 ` Jeff Layton
2024-08-06 20:03 ` Mateusz Guzik
2024-08-06 20:47 ` Andi Kleen
2024-08-15 15:07 ` Mateusz Guzik
2024-08-06 19:51 ` Jeff Layton
2024-08-14 2:18 ` Al Viro
2024-08-14 2:40 ` Al Viro
2024-08-14 11:48 ` Jeff Layton
2024-08-14 12:40 ` Christian Brauner
2024-08-14 15:44 ` Al Viro
2024-08-16 8:34 ` Christian Brauner
2024-08-14 15:42 ` Al Viro
2024-08-14 16:46 ` Jeff Layton
2024-08-07 14:26 ` Christian Brauner
2024-08-07 14:36 ` Jeff Layton
2024-08-08 10:36 ` Christian Brauner
2024-08-08 10:54 ` Jeff Layton
2024-08-08 11:18 ` Christian Brauner
2024-08-08 17:11 ` Jan Kara
2024-08-08 21:12 ` Paul Moore
2024-08-08 23:43 ` Jeff Layton
2024-08-09 0:28 ` Paul Moore
2024-08-09 0:33 ` Jeff Layton
2024-08-09 1:22 ` Paul Moore
2024-08-09 14:21 ` Jeff Layton
2024-08-11 21:52 ` Paul Moore
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).