public inbox for audit@vger.kernel.org
* [PATCH 0/2] audit: Further reduce syscall latency
@ 2023-12-12 10:28 Håkon Bugge
  2023-12-12 10:28 ` [PATCH 1/2] audit: Vary struct audit_entry alignment Håkon Bugge
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Håkon Bugge @ 2023-12-12 10:28 UTC (permalink / raw)
  To: Paul Moore, Eric Paris, audit; +Cc: Ankur Arora

On an Intel Cascade Lake, booted with "audit=0" and "mitigations=off",
syscall latency in nanoseconds over 10 reboots and 5 runs is:

getpid() latency:
Boot parameters           kernel            min    avg     max      pstdev
                                           (ns)   (ns)    (ns)        (ns)

audit=0, mitigations=off  v6.6-rc4           55    55       58    0.797245
audit=1, mitigations=off  v6.6-rc4          205   210      227    6.402000
audit=1, mitigations=off  v6.6-rc4+[1]      203   203      209    0.954149
audit=1, mitigations=off  v6.6-rc4+[1]+[2]  173   173      178    0.884534

So, audit contributes significantly to the cost of a system call.

This series, hopefully applicable to audit/next (if accepted), reduces
the syscall latency by a decent 21% on an Intel Cascade Lake system.

The above numbers are derived using the same methodology as mentioned
in the commit messages.

The first commit, "audit: Vary struct audit_entry alignment", fixes a
huge L1D miss ratio and greatly reduces the variability of the three
metrics: nanoseconds per syscall, L1D misses per syscall, and
instructions per cycle (IPC). On its own, it yields only a modest 3.5%
reduction in syscall latency, but it serves as a prerequisite for the
second commit, "audit: Apply codegen optimizations".

Please review.

Håkon Bugge (2):
[1] audit: Vary struct audit_entry alignment
[2] audit: Apply codegen optimizations

 kernel/auditfilter.c | 14 +++++++++++---
 kernel/auditsc.c     |  2 ++
 2 files changed, 13 insertions(+), 3 deletions(-)

--
2.39.3


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-12 10:28 [PATCH 0/2] audit: Further reduce syscall latency Håkon Bugge
@ 2023-12-12 10:28 ` Håkon Bugge
  2023-12-13 23:54   ` Paul Moore
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply codegen optimizations Håkon Bugge
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply special optimizations Håkon Bugge
  2 siblings, 1 reply; 15+ messages in thread
From: Håkon Bugge @ 2023-12-12 10:28 UTC (permalink / raw)
  To: Paul Moore, Eric Paris, audit; +Cc: Ankur Arora

We allocate struct audit_entry with kzalloc(), which serves the
request from the kmalloc-512 slab and thus aligns every entry on a
512-byte boundary.

That means the lower-order 9 bits are identical for all these
allocations, which, on architectures with a limited number of L1D
cache sets, causes conflict misses.

With STIG compliant audit rules, up to 26 L1D misses per syscall have
been observed. This, of course, varies from boot to boot.
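To see why 512-byte-aligned allocations collide, consider the set
index math. The sketch below assumes a hypothetical L1D geometry (64
sets of 64-byte lines, as in a 32 KiB 8-way cache); the exact numbers
are an illustration, not taken from the patch:

```c
#include <stdint.h>

/* Assumed L1D geometry: 64-byte lines, 64 sets (32 KiB, 8-way). */
#define LINE_SHIFT 6
#define NSETS      64u

/* The set index is taken from address bits 6..11 with this geometry. */
static unsigned cache_set(uintptr_t addr)
{
	return (addr >> LINE_SHIFT) & (NSETS - 1);
}

/* Count how many distinct L1D sets n objects spaced stride bytes
 * apart land in. With a 512-byte stride, bits 6..8 of every address
 * are identical, so only every 8th set is ever used. */
static unsigned sets_used(uintptr_t base, unsigned n, unsigned stride)
{
	uint64_t used = 0;
	unsigned i;

	for (i = 0; i < n; i++)
		used |= 1ull << cache_set(base + (uintptr_t)i * stride);
	return (unsigned)__builtin_popcountll(used);
}
```

With 16 objects, a 512-byte stride touches only 8 of the 64 sets,
while cacheline-granular placement can use all of them.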

Fix this by using a kmem_cache aligned at cacheline width, which
greatly increases the entropy in the lower order VA bits. With this
fix, we observe a maximum of 0.8 misses per syscall using the same set
of audit rules.

Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
simple getpid() loop running 100 million rounds. This test is run 10
times, with a reboot in between. After each reboot, we wait until
the one-minute load average is less than 1.0.
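The syscall_lat benchmark itself is not part of the series; a minimal
sketch of such a latency loop (our reconstruction, not the author's
code) could look like this:

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Measure the average getpid() syscall latency in nanoseconds over
 * the given number of rounds. Going through syscall(SYS_getpid)
 * guarantees a real kernel entry on every iteration. */
static double getpid_latency_ns(long rounds)
{
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < rounds; i++)
		syscall(SYS_getpid);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (double)rounds;
}
```

A real run would use 100 million rounds under "perf stat -d -r 5" as
described above.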

We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
amplify the changes in the audit system.

Base vs. base + this commit gives:

ns per call:
  min avg max   pstdev
- 205 210 227 6.402000
+ 203 203 209 0.954149

L1d misses per syscall:
    min    avg    max   pstdev
- 3.147 12.017 26.045 6.568284
+ 0.012  0.103  0.817 0.238352

ipc:
    min    avg    max   pstdev
- 2.090  2.259  2.300 0.060075
+ 2.320  2.329  2.330 0.003000

The above metrics are all improved, and the L1D misses per syscall
show a significant reduction.

Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 kernel/auditfilter.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 8317a37dea0bb..b7597a63a3950 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -80,6 +80,7 @@ static void audit_free_lsm_field(struct audit_field *f)
 	}
 }
 
+static struct kmem_cache *entry_cache;
 static inline void audit_free_rule(struct audit_entry *e)
 {
 	int i;
@@ -93,7 +94,7 @@ static inline void audit_free_rule(struct audit_entry *e)
 			audit_free_lsm_field(&erule->fields[i]);
 	kfree(erule->fields);
 	kfree(erule->filterkey);
-	kfree(e);
+	kmem_cache_free(entry_cache, e);
 }
 
 void audit_free_rule_rcu(struct rcu_head *head)
@@ -108,13 +109,20 @@ static inline struct audit_entry *audit_init_entry(u32 field_count)
 	struct audit_entry *entry;
 	struct audit_field *fields;
 
-	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry_cache) {
+		entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
+						SLAB_HWCACHE_ALIGN, NULL);
+		if (!entry_cache)
+			return NULL;
+	}
+
+	entry = kmem_cache_zalloc(entry_cache, GFP_KERNEL);
 	if (unlikely(!entry))
 		return NULL;
 
 	fields = kcalloc(field_count, sizeof(*fields), GFP_KERNEL);
 	if (unlikely(!fields)) {
-		kfree(entry);
+		kmem_cache_free(entry_cache, entry);
 		return NULL;
 	}
 	entry->rule.fields = fields;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-12 10:28 [PATCH 0/2] audit: Further reduce syscall latency Håkon Bugge
  2023-12-12 10:28 ` [PATCH 1/2] audit: Vary struct audit_entry alignment Håkon Bugge
@ 2023-12-12 10:28 ` Håkon Bugge
  2023-12-13 23:45   ` Paul Moore
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply special optimizations Håkon Bugge
  2 siblings, 1 reply; 15+ messages in thread
From: Håkon Bugge @ 2023-12-12 10:28 UTC (permalink / raw)
  To: Paul Moore, Eric Paris, audit; +Cc: Ankur Arora

For the most time-consuming function, when running a syscall benchmark
with STIG compliant audit rules:

  Overhead  Command       Shared Object      Symbol
 .........  ............  .................  ........................

    27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op

we apply codegen optimizations, which improve syscall performance by
around 17% on an Intel Cascade Lake system.

We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
application that measures average syscall latency from getpid()
running 100 million rounds.

Between each perf run, we reboot the system and wait until the
one-minute load average is less than 1.0.
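The "wait until the load settles" step can be automated; a hedged
userspace sketch reading /proc/loadavg (our illustration, not part of
the submitted series) might be:

```c
#include <stdio.h>
#include <unistd.h>

/* Return the one-minute load average from /proc/loadavg,
 * or -1.0 if it cannot be read. */
static double one_minute_load(void)
{
	FILE *f = fopen("/proc/loadavg", "r");
	double load = -1.0;

	if (f) {
		if (fscanf(f, "%lf", &load) != 1)
			load = -1.0;
		fclose(f);
	}
	return load;
}

/* Poll after a reboot until the machine has settled below threshold. */
static void wait_for_idle(double threshold)
{
	while (one_minute_load() >= threshold)
		sleep(10);
}
```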

We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
amplify the changes in the audit system.

Let the base kernel be v6.6-rc4, booted with "audit=1" and
"mitigations=off", with the commit "audit: Vary struct audit_entry
alignment" applied, on an Intel Cascade Lake system. The following
three metrics are reported: nanoseconds per syscall, L1D misses per
syscall, and instructions per cycle (IPC).

Base vs. base + this commit gives:

ns per call:
  min avg max   pstdev
- 203 203 209 0.954149
+ 173 173 178 0.884534

L1d misses per syscall:
     min    avg    max   pstdev
-  0.012  0.103  0.817 0.238352
+  0.010  0.209  1.235 0.399416

ipc:
    min    avg    max   pstdev
- 2.320  2.329  2.330 0.003000
+ 2.430  2.436  2.440 0.004899

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 kernel/auditsc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523fa..84d0dfe75a4ac 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
  * parameter can be NULL, but all others must be specified.
  * Returns 1/true if the filter finds a match, 0/false if none are found.
  */
+#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
 static int __audit_filter_op(struct task_struct *tsk,
 			   struct audit_context *ctx,
 			   struct list_head *list,
@@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk,
 	}
 	return 0;
 }
+#pragma GCC reset_options
 
 /**
  * audit_filter_uring - apply filters to an io_uring operation
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] audit: Apply special optimizations
  2023-12-12 10:28 [PATCH 0/2] audit: Further reduce syscall latency Håkon Bugge
  2023-12-12 10:28 ` [PATCH 1/2] audit: Vary struct audit_entry alignment Håkon Bugge
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply codegen optimizations Håkon Bugge
@ 2023-12-12 10:28 ` Håkon Bugge
  2023-12-12 10:39   ` Haakon Bugge
  2 siblings, 1 reply; 15+ messages in thread
From: Håkon Bugge @ 2023-12-12 10:28 UTC (permalink / raw)
  To: Paul Moore, Eric Paris, audit; +Cc: Ankur Arora

For the most time-consuming function, when running a syscall benchmark
with STIG compliant audit rules:

  Overhead  Command       Shared Object      Symbol
 .........  ............  .................  ........................

    27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op

we apply special optimizations, which improve syscall performance by
around 17% on an Intel Cascade Lake system.

We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
application that measures average syscall latency from getpid()
running 100 million rounds.

Between each perf run, we reboot the system and wait until the
one-minute load average is less than 1.0.

We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
amplify the changes in the audit system.

Let the base kernel be v6.6-rc4, booted with "audit=1" and
"mitigations=off", with the commit "audit: Vary struct audit_entry
alignment" applied, on an Intel Cascade Lake system. The following
three metrics are reported: nanoseconds per syscall, L1D misses per
syscall, and instructions per cycle (IPC).

Base vs. base + this commit gives:

ns per call:
  min avg max   pstdev
- 203 203 209 0.954149
+ 173 173 178 0.884534

L1d misses per syscall:
     min    avg    max   pstdev
-  0.012  0.103  0.817 0.238352
+  0.010  0.209  1.235 0.399416

ipc:
    min    avg    max   pstdev
- 2.320  2.329  2.330 0.003000
+ 2.430  2.436  2.440 0.004899

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 kernel/auditsc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523fa..84d0dfe75a4ac 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
  * parameter can be NULL, but all others must be specified.
  * Returns 1/true if the filter finds a match, 0/false if none are found.
  */
+#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
 static int __audit_filter_op(struct task_struct *tsk,
 			   struct audit_context *ctx,
 			   struct list_head *list,
@@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk,
 	}
 	return 0;
 }
+#pragma GCC reset_options
 
 /**
  * audit_filter_uring - apply filters to an io_uring operation
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply special optimizations
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply special optimizations Håkon Bugge
@ 2023-12-12 10:39   ` Haakon Bugge
  0 siblings, 0 replies; 15+ messages in thread
From: Haakon Bugge @ 2023-12-12 10:39 UTC (permalink / raw)
  To: Haakon Bugge; +Cc: Paul Moore, Eric Paris, audit@vger.kernel.org, Ankur Arora



> On 12 Dec 2023, at 11:28, Håkon Bugge <Haakon.Bugge@oracle.com> wrote:
> 
> For the most time-consuming function, when running a syscall benchmark
> with STIG compliant audit rules:
> 
>  Overhead  Command       Shared Object      Symbol
> .........  ............  .................  ........................
> 
>    27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
> 
> we apply special optimizations, which speeds up the syscall
> performance by around 17% on an Intel Cascade Lake system.
> 
> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> application that measures average syscall latency from getpid()
> running 100 million rounds.
> 
> Between each perf run, we reboot the system and waits until the last
> minute load is less than 1.0.
> 
> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> amplify the changes in the audit system.
> 
> Let the base kernel be v6.6-rc4 with booted with "audit=1" and
> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> alignment" on an Intel Cascade Lake system. The following three
> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
> and finally Intructions Per Cycle, ipc.
> 
> Base vs. base + this commit gives:
> 
> ns per call:
>  min avg max   pstdev
> - 203 203 209 0.954149
> + 173 173 178 0.884534
> 
> L1d misses per syscall:
>     min    avg    max   pstdev
> -  0.012  0.103  0.817 0.238352
> +  0.010  0.209  1.235 0.399416
> 
> ipc:
>    min    avg    max   pstdev
> - 2.320  2.329  2.330 0.003000
> + 2.430  2.436  2.440 0.004899
> 
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>

Please disregard this message.


Thxs, Håkon

> ---
> kernel/auditsc.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>  * parameter can be NULL, but all others must be specified.
>  * Returns 1/true if the filter finds a match, 0/false if none are found.
>  */
> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> static int __audit_filter_op(struct task_struct *tsk,
>   struct audit_context *ctx,
>   struct list_head *list,
> @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk,
> }
> return 0;
> }
> +#pragma GCC reset_options
> 
> /**
>  * audit_filter_uring - apply filters to an io_uring operation
> -- 
> 2.39.3
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-12 10:28 ` [PATCH 2/2] audit: Apply codegen optimizations Håkon Bugge
@ 2023-12-13 23:45   ` Paul Moore
  2023-12-16 16:28     ` Haakon Bugge
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Moore @ 2023-12-13 23:45 UTC (permalink / raw)
  To: Håkon Bugge; +Cc: Eric Paris, audit, Ankur Arora

On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>
> For the most time-consuming function, when running a syscall benchmark
> with STIG compliant audit rules:
>
>   Overhead  Command       Shared Object      Symbol
>  .........  ............  .................  ........................
>
>     27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
>
> we apply codegen optimizations, which speeds up the syscall
> performance by around 17% on an Intel Cascade Lake system.
>
> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> application that measures average syscall latency from getpid()
> running 100 million rounds.
>
> Between each perf run, we reboot the system and waits until the last
> minute load is less than 1.0.
>
> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> amplify the changes in the audit system.
>
> Let the base kernel be v6.6-rc4 with booted with "audit=1" and
> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> alignment" on an Intel Cascade Lake system. The following three
> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
> and finally Intructions Per Cycle, ipc.
>
> Base vs. base + this commit gives:
>
> ns per call:
>   min avg max   pstdev
> - 203 203 209 0.954149
> + 173 173 178 0.884534
>
> L1d misses per syscall:
>      min    avg    max   pstdev
> -  0.012  0.103  0.817 0.238352
> +  0.010  0.209  1.235 0.399416
>
> ipc:
>     min    avg    max   pstdev
> - 2.320  2.329  2.330 0.003000
> + 2.430  2.436  2.440 0.004899
>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  kernel/auditsc.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>   * parameter can be NULL, but all others must be specified.
>   * Returns 1/true if the filter finds a match, 0/false if none are found.
>   */
> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")

The kernel doesn't really make use of #pragma optimization statements
like this, at least not in any of the core areas, and I'm not
interested in being the first to do so.  I appreciate the time and
effort that you have spent profiling the audit subsystem, but this
isn't a patch I can accept at this point in time, I'm sorry.

>  static int __audit_filter_op(struct task_struct *tsk,
>                            struct audit_context *ctx,
>                            struct list_head *list,
> @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk,
>         }
>         return 0;
>  }
> +#pragma GCC reset_options
>
>  /**
>   * audit_filter_uring - apply filters to an io_uring operation
> --
> 2.39.3

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-12 10:28 ` [PATCH 1/2] audit: Vary struct audit_entry alignment Håkon Bugge
@ 2023-12-13 23:54   ` Paul Moore
  2023-12-16 16:25     ` Haakon Bugge
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Moore @ 2023-12-13 23:54 UTC (permalink / raw)
  To: Håkon Bugge; +Cc: Eric Paris, audit, Ankur Arora

On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>
> We allocate struct audit_entry using kzalloc() which aligns the
> structure at its natural boundary and so uses the kmalloc-512
> SLAB.
>
> That means that the lower order 9 bits are equal for these allocations.
> Which on architectures with limited L1D cache-sets width would cause
> conflict misses.
>
> With STIG compliant audit rules, up-to 26 L1D misses per syscall have
> been observed. This, of course, varies from boot to boot.
>
> Fix this by using a kmem_cache aligned at cacheline width, which
> greatly increases the entropy in the lower order VA bits. With this
> fix, we observe a maximum of 0.8 misses per syscall using the same set
> of audit rules.
>
> Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
> simple getpid() loop, running 100 million rounds. This test is run 10
> times, with a reboot in between. After each reboot, we wait until
> the last minute load is less than 1.0.
>
> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> amplify the changes in the audit system.
>
> Base vs. base + this commit gives:
>
> ns per call:
>   min avg max   pstdev
> - 205 210 227 6.402000
> + 203 203 209 0.954149
>
> L1d misses per syscall:
>     min    avg    max   pstdev
> - 3.147 12.017 26.045 6.568284
> + 0.012  0.103  0.817 0.238352
>
> ipc:
>     min    avg    max   pstdev
> - 2.090  2.259  2.300 0.060075
> + 2.320  2.329  2.330 0.003000
>
> The above metrics are all improved, and the L1D misses per syscall has
> a significant reduction.
>
> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  kernel/auditfilter.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index 8317a37dea0bb..b7597a63a3950 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -80,6 +80,7 @@ static void audit_free_lsm_field(struct audit_field *f)
>         }
>  }
>
> +static struct kmem_cache *entry_cache;
>  static inline void audit_free_rule(struct audit_entry *e)
>  {
>         int i;
> @@ -93,7 +94,7 @@ static inline void audit_free_rule(struct audit_entry *e)
>                         audit_free_lsm_field(&erule->fields[i]);
>         kfree(erule->fields);
>         kfree(erule->filterkey);
> -       kfree(e);
> +       kmem_cache_free(entry_cache, e);
>  }
>
>  void audit_free_rule_rcu(struct rcu_head *head)
> @@ -108,13 +109,20 @@ static inline struct audit_entry *audit_init_entry(u32 field_count)
>         struct audit_entry *entry;
>         struct audit_field *fields;
>
> -       entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> +       if (!entry_cache) {
> +               entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
> +                                               SLAB_HWCACHE_ALIGN, NULL);
> +               if (!entry_cache)
> +                       return NULL;

Two things:

1. If we are going to create a kmem_cache pool we shouldn't create it
here, it should be in its own audit_filter_init() function which is
called from audit_init().

2. I'm not sure it makes a lot of sense to create a kmem_cache pool
for audit filter entries, especially given the modest performance
gains.  Is there not some way to request cacheline alignment with
kmalloc() or similar?

> +       }
> +
> +       entry = kmem_cache_zalloc(entry_cache, GFP_KERNEL);
>         if (unlikely(!entry))
>                 return NULL;
>
>         fields = kcalloc(field_count, sizeof(*fields), GFP_KERNEL);
>         if (unlikely(!fields)) {
> -               kfree(entry);
> +               kmem_cache_free(entry_cache, entry);
>                 return NULL;
>         }
>         entry->rule.fields = fields;
> --
> 2.39.3

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-13 23:54   ` Paul Moore
@ 2023-12-16 16:25     ` Haakon Bugge
  2023-12-18 22:07       ` Paul Moore
  0 siblings, 1 reply; 15+ messages in thread
From: Haakon Bugge @ 2023-12-16 16:25 UTC (permalink / raw)
  To: Paul Moore; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora



> On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> 
> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>> 
>> We allocate struct audit_entry using kzalloc() which aligns the
>> structure at its natural boundary and so uses the kmalloc-512
>> SLAB.
>> 
>> That means that the lower order 9 bits are equal for these allocations.
>> Which on architectures with limited L1D cache-sets width would cause
>> conflict misses.
>> 
>> With STIG compliant audit rules, up-to 26 L1D misses per syscall have
>> been observed. This, of course, varies from boot to boot.
>> 
>> Fix this by using a kmem_cache aligned at cacheline width, which
>> greatly increases the entropy in the lower order VA bits. With this
>> fix, we observe a maximum of 0.8 misses per syscall using the same set
>> of audit rules.
>> 
>> Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
>> simple getpid() loop, running 100 million rounds. This test is run 10
>> times, with a reboot in between. After each reboot, we wait until
>> the last minute load is less than 1.0.
>> 
>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
>> amplify the changes in the audit system.
>> 
>> Base vs. base + this commit gives:
>> 
>> ns per call:
>>  min avg max   pstdev
>> - 205 210 227 6.402000
>> + 203 203 209 0.954149
>> 
>> L1d misses per syscall:
>>    min    avg    max   pstdev
>> - 3.147 12.017 26.045 6.568284
>> + 0.012  0.103  0.817 0.238352
>> 
>> ipc:
>>    min    avg    max   pstdev
>> - 2.090  2.259  2.300 0.060075
>> + 2.320  2.329  2.330 0.003000
>> 
>> The above metrics are all improved, and the L1D misses per syscall has
>> a significant reduction.
>> 
>> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>> ---
>> kernel/auditfilter.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>> 
>> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
>> index 8317a37dea0bb..b7597a63a3950 100644
>> --- a/kernel/auditfilter.c
>> +++ b/kernel/auditfilter.c
>> @@ -80,6 +80,7 @@ static void audit_free_lsm_field(struct audit_field *f)
>>        }
>> }
>> 
>> +static struct kmem_cache *entry_cache;
>> static inline void audit_free_rule(struct audit_entry *e)
>> {
>>        int i;
>> @@ -93,7 +94,7 @@ static inline void audit_free_rule(struct audit_entry *e)
>>                        audit_free_lsm_field(&erule->fields[i]);
>>        kfree(erule->fields);
>>        kfree(erule->filterkey);
>> -       kfree(e);
>> +       kmem_cache_free(entry_cache, e);
>> }
>> 
>> void audit_free_rule_rcu(struct rcu_head *head)
>> @@ -108,13 +109,20 @@ static inline struct audit_entry *audit_init_entry(u32 field_count)
>>        struct audit_entry *entry;
>>        struct audit_field *fields;
>> 
>> -       entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>> +       if (!entry_cache) {
>> +               entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
>> +                                               SLAB_HWCACHE_ALIGN, NULL);
>> +               if (!entry_cache)
>> +                       return NULL;
> 
> Two things:
> 
> 1. If we are going to create a kmem_cache pool we shouldn't create it
> here, it should be in its own audit_filter_init() function which is
> called from audit_init().

Understood. Will fix.

> 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> for audit filter entries, especially given the modest performance
> gains.  Is there not some way to request cacheline alignment with
> kmalloc() or similar?

The problem with today's kzalloc() is the lack of entropy in the lower-order address bits, because the audit filter entries are all aligned on a 512-byte boundary. IOW, they are too much aligned. The increased entropy is exactly what we get from using a kmem_cache, which yields more L1D cache sets being used.

Although the performance gain is modest, the reduction in L1D cache misses is substantial and that will improve performance on most archs that employ a virtually indexed L1D cache. And, this commit acts as a prerequisite to avoid high variability in performance gain from the second commit in this series.


Thxs, Håkon


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-13 23:45   ` Paul Moore
@ 2023-12-16 16:28     ` Haakon Bugge
  2023-12-18 22:09       ` Paul Moore
  0 siblings, 1 reply; 15+ messages in thread
From: Haakon Bugge @ 2023-12-16 16:28 UTC (permalink / raw)
  To: Paul Moore; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora



> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote:
> 
> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>> 
>> For the most time-consuming function, when running a syscall benchmark
>> with STIG compliant audit rules:
>> 
>>  Overhead  Command       Shared Object      Symbol
>> .........  ............  .................  ........................
>> 
>>    27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
>> 
>> we apply codegen optimizations, which speeds up the syscall
>> performance by around 17% on an Intel Cascade Lake system.
>> 
>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
>> application that measures average syscall latency from getpid()
>> running 100 million rounds.
>> 
>> Between each perf run, we reboot the system and waits until the last
>> minute load is less than 1.0.
>> 
>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
>> amplify the changes in the audit system.
>> 
>> Let the base kernel be v6.6-rc4 with booted with "audit=1" and
>> "mitigations=off" and with the commit "audit: Vary struct audit_entry
>> alignment" on an Intel Cascade Lake system. The following three
>> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
>> and finally Intructions Per Cycle, ipc.
>> 
>> Base vs. base + this commit gives:
>> 
>> ns per call:
>>  min avg max   pstdev
>> - 203 203 209 0.954149
>> + 173 173 178 0.884534
>> 
>> L1d misses per syscall:
>>     min    avg    max   pstdev
>> -  0.012  0.103  0.817 0.238352
>> +  0.010  0.209  1.235 0.399416
>> 
>> ipc:
>>    min    avg    max   pstdev
>> - 2.320  2.329  2.330 0.003000
>> + 2.430  2.436  2.440 0.004899
>> 
>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>> ---
>> kernel/auditsc.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>  * parameter can be NULL, but all others must be specified.
>>  * Returns 1/true if the filter finds a match, 0/false if none are found.
>>  */
>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> 
> The kernel doesn't really make use of #pragma optimization statements
> like this, at least not in any of the core areas, and I'm not
> interested in being the first to do so.  I appreciate the time and
> effort that you have spent profiling the audit subsystem, but this
> isn't a patch I can accept at this point in time, I'm sorry.

Fair enough. Would a function attribute, i.e.:

__attribute__((optimize("foo=bar")))

be acceptable to you?
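For context, a standalone sketch of what such an attribute-annotated
function looks like, using a toy loop rather than the actual audit
filter (GCC-specific; other compilers may ignore the attribute):

```c
/* GCC-only: apply per-function codegen options without a #pragma.
 * The option names mirror the ones from the patch. */
__attribute__((optimize("unswitch-loops", "align-loops=16", "align-jumps=16")))
static int sum_matching(const int *vals, int n, int flag)
{
	int sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		/* Loop-invariant test: -funswitch-loops can hoist it
		 * out, producing two specialized loop bodies. */
		if (flag)
			sum += vals[i];
	}
	return sum;
}
```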


Thxs, Håkon

> 
>> static int __audit_filter_op(struct task_struct *tsk,
>>                           struct audit_context *ctx,
>>                           struct list_head *list,
>> @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk,
>>        }
>>        return 0;
>> }
>> +#pragma GCC reset_options
>> 
>> /**
>>  * audit_filter_uring - apply filters to an io_uring operation
>> --
>> 2.39.3
> 
> -- 
> paul-moore.com
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-16 16:25     ` Haakon Bugge
@ 2023-12-18 22:07       ` Paul Moore
  2023-12-19 21:07         ` Ankur Arora
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Moore @ 2023-12-18 22:07 UTC (permalink / raw)
  To: Haakon Bugge; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora

On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> >
> > Two things:
> >
> > 1. If we are going to create a kmem_cache pool we shouldn't create it
> > here, it should be in its own audit_filter_init() function which is
> > called from audit_init().
>
> Understood. Will fix.
>
> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> > for audit filter entries, especially given the modest performance
> > gains.  Is there not some way to request cacheline alignment with
> > kmalloc() or similar?
>
> The problem with today's kzmalloc() is lack of entropy on the lower order address bits, because the audit filter entries are aligned on a 512B boundary. IOW, they are too much aligned. The increased entropy is exactly what we get from using a kmem_cache which yields more L1D cache sets to be used.
>
> Although the performance gain is modest, the reduction in L1D cache misses is substantial and that will improve performance on most archs that employ a virtually indexed L1D cache. And, this commit acts as a prerequisite to avoid high variability in performance gain from the second commit in this series.

My hesitation about using a kmem_cache object here remains; given the
relatively limited and static filter rule configuration, I would rather
use a k*malloc()-based approach.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-16 16:28     ` Haakon Bugge
@ 2023-12-18 22:09       ` Paul Moore
  2023-12-21 19:05         ` Haakon Bugge
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Moore @ 2023-12-18 22:09 UTC (permalink / raw)
  To: Haakon Bugge; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora

On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> > On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote:
> > On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
> >>
> >> For the most time-consuming function, when running a syscall benchmark
> >> with STIG compliant audit rules:
> >>
> >>  Overhead  Command       Shared Object      Symbol
> >> .........  ............  .................  ........................
> >>
> >>    27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
> >>
> >> we apply codegen optimizations, which speed up syscall
> >> performance by around 17% on an Intel Cascade Lake system.
> >>
> >> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> >> application that measures average syscall latency from getpid()
> >> running 100 million rounds.
> >>
> >> Between each perf run, we reboot the system and wait until the
> >> one-minute load average is less than 1.0.
> >>
> >> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> >> amplify the changes in the audit system.
> >>
> >> Let the base kernel be v6.6-rc4, booted with "audit=1" and
> >> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> >> alignment" on an Intel Cascade Lake system. The following three
> >> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
> >> and finally Instructions Per Cycle (ipc).
> >>
> >> Base vs. base + this commit gives:
> >>
> >> ns per call:
> >>  min avg max   pstdev
> >> - 203 203 209 0.954149
> >> + 173 173 178 0.884534
> >>
> >> L1d misses per syscall:
> >>     min    avg    max   pstdev
> >> -  0.012  0.103  0.817 0.238352
> >> +  0.010  0.209  1.235 0.399416
> >>
> >> ipc:
> >>    min    avg    max   pstdev
> >> - 2.320  2.329  2.330 0.003000
> >> + 2.430  2.436  2.440 0.004899
> >>
> >> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> >> ---
> >> kernel/auditsc.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >>
> >> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> >> --- a/kernel/auditsc.c
> >> +++ b/kernel/auditsc.c
> >> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> >>  * parameter can be NULL, but all others must be specified.
> >>  * Returns 1/true if the filter finds a match, 0/false if none are found.
> >>  */
> >> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> >
> > The kernel doesn't really make use of #pragma optimization statements
> > like this, at least not in any of the core areas, and I'm not
> > interested in being the first to do so.  I appreciate the time and
> > effort that you have spent profiling the audit subsystem, but this
> > isn't a patch I can accept at this point in time, I'm sorry.
>
> Fair enough. Will a function attribute aka:
>
> __attribute__((optimize("foo=bar")))
>
> be acceptable for you?

Unless you can show me widespread acceptance of these types of
optimizations in core kernel code, my answer is no.  I'm sorry.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-18 22:07       ` Paul Moore
@ 2023-12-19 21:07         ` Ankur Arora
  2023-12-19 21:27           ` Paul Moore
  0 siblings, 1 reply; 15+ messages in thread
From: Ankur Arora @ 2023-12-19 21:07 UTC (permalink / raw)
  To: Paul Moore; +Cc: Haakon Bugge, Eric Paris, audit@vger.kernel.org, Ankur Arora


Paul Moore <paul@paul-moore.com> writes:

> On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
>> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
>> >
>> > Two things:
>> >
>> > 1. If we are going to create a kmem_cache pool we shouldn't create it
>> > here, it should be in its own audit_filter_init() function which is
>> > called from audit_init().
>>
>> Understood. Will fix.
>>
>> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
>> > for audit filter entries, especially given the modest performance
>> > gains.  Is there not some way to request cacheline alignment with
>> > kmalloc() or similar?
>>
>> The problem with today's kzalloc() is the lack of entropy in the low-order address bits, because the audit filter entries are all aligned on a 512B boundary. IOW, they are over-aligned. The increased entropy is exactly what we get from using a kmem_cache, which allows more L1D cache sets to be used.
>>
>> Although the performance gain is modest, the reduction in L1D cache misses is substantial, and that will improve performance on most archs that employ a virtually indexed L1D cache. This commit also acts as a prerequisite for avoiding high variability in the performance gain from the second commit in this series.
>
> My hesitation about using a kmem_cache object here remains; given the
> relatively limited and static filter rule configuration, I would rather
> use a k*malloc()-based approach.

AFAICT, kmalloc() and friends only allow a fixed alignment. From
Documentation/core-api/memory-allocation.rst:

  The address of a chunk allocated with `kmalloc` is aligned to at least
  ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
  alignment is also guaranteed to be at least the respective size.

I had sent out a patch a while ago that reduced the cost of the same
alignment issue. For me, the most pernicious part was that syscall
latency was good or poor depending on how boot-time factors affected
audit allocations.

So while I do agree with your hesitation on kmem_cache not being quite
the right interface for what are static allocations, I think it might
be worth it given the cost.

Thanks
--
ankur

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] audit: Vary struct audit_entry alignment
  2023-12-19 21:07         ` Ankur Arora
@ 2023-12-19 21:27           ` Paul Moore
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Moore @ 2023-12-19 21:27 UTC (permalink / raw)
  To: Ankur Arora; +Cc: Haakon Bugge, Eric Paris, audit@vger.kernel.org

On Tue, Dec 19, 2023 at 4:07 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> >> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> >> >
> >> > Two things:
> >> >
> >> > 1. If we are going to create a kmem_cache pool we shouldn't create it
> >> > here, it should be in its own audit_filter_init() function which is
> >> > called from audit_init().
> >>
> >> Understood. Will fix.
> >>
> >> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> >> > for audit filter entries, especially given the modest performance
> >> > gains.  Is there not some way to request cacheline alignment with
> >> > kmalloc() or similar?
> >>
> >> The problem with today's kzalloc() is the lack of entropy in the low-order address bits, because the audit filter entries are all aligned on a 512B boundary. IOW, they are over-aligned. The increased entropy is exactly what we get from using a kmem_cache, which allows more L1D cache sets to be used.
> >>
> >> Although the performance gain is modest, the reduction in L1D cache misses is substantial, and that will improve performance on most archs that employ a virtually indexed L1D cache. This commit also acts as a prerequisite for avoiding high variability in the performance gain from the second commit in this series.
> >
> > My hesitation about using a kmem_cache object here remains; given the
> > relatively limited and static filter rule configuration, I would rather
> > use a k*malloc()-based approach.
>
> AFAICT, kmalloc() and friends only allow a fixed alignment. From
> Documentation/core-api/memory-allocation.rst:
>
>   The address of a chunk allocated with `kmalloc` is aligned to at least
>   ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
>   alignment is also guaranteed to be at least the respective size.
>
> I had sent out a patch a while ago that reduced the cost of the same
> alignment issue. For me, the most pernicious part was that syscall
> latency was good or poor depending on how boot-time factors affected
> audit allocations.
>
> So while I do agree with your hesitation on kmem_cache not being quite
> the right interface for what are static allocations, I think it might
> be worth it given the cost.

Once again, I appreciate the time and effort that you have put into
this patchset, but it is not something I want to accept at this point
in time.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-18 22:09       ` Paul Moore
@ 2023-12-21 19:05         ` Haakon Bugge
  2023-12-22  4:31           ` Paul Moore
  0 siblings, 1 reply; 15+ messages in thread
From: Haakon Bugge @ 2023-12-21 19:05 UTC (permalink / raw)
  To: Paul Moore; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora



> On 18 Dec 2023, at 23:09, Paul Moore <paul@paul-moore.com> wrote:
> 
> On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
>>> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote:
>>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>>>> 
>>>> For the most time-consuming function, when running a syscall benchmark
>>>> with STIG compliant audit rules:
>>>> 
>>>> Overhead  Command       Shared Object      Symbol
>>>> .........  ............  .................  ........................
>>>> 
>>>>   27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
>>>> 
>>>> we apply codegen optimizations, which speed up syscall
>>>> performance by around 17% on an Intel Cascade Lake system.
>>>> 
>>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
>>>> application that measures average syscall latency from getpid()
>>>> running 100 million rounds.
>>>> 
>>>> Between each perf run, we reboot the system and wait until the
>>>> one-minute load average is less than 1.0.
>>>> 
>>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
>>>> amplify the changes in the audit system.
>>>> 
>>>> Let the base kernel be v6.6-rc4, booted with "audit=1" and
>>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry
>>>> alignment" on an Intel Cascade Lake system. The following three
>>>> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
>>>> and finally Instructions Per Cycle (ipc).
>>>> 
>>>> Base vs. base + this commit gives:
>>>> 
>>>> ns per call:
>>>> min avg max   pstdev
>>>> - 203 203 209 0.954149
>>>> + 173 173 178 0.884534
>>>> 
>>>> L1d misses per syscall:
>>>>    min    avg    max   pstdev
>>>> -  0.012  0.103  0.817 0.238352
>>>> +  0.010  0.209  1.235 0.399416
>>>> 
>>>> ipc:
>>>>   min    avg    max   pstdev
>>>> - 2.320  2.329  2.330 0.003000
>>>> + 2.430  2.436  2.440 0.004899
>>>> 
>>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>>>> ---
>>>> kernel/auditsc.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>> 
>>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
>>>> --- a/kernel/auditsc.c
>>>> +++ b/kernel/auditsc.c
>>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>>> * parameter can be NULL, but all others must be specified.
>>>> * Returns 1/true if the filter finds a match, 0/false if none are found.
>>>> */
>>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
>>> 
>>> The kernel doesn't really make use of #pragma optimization statements
>>> like this, at least not in any of the core areas, and I'm not
>>> interested in being the first to do so.  I appreciate the time and
>>> effort that you have spent profiling the audit subsystem, but this
>>> isn't a patch I can accept at this point in time, I'm sorry.
>> 
>> Fair enough. Will a function attribute aka:
>> 
>> __attribute__((optimize("foo=bar")))
>> 
>> be acceptable for you?
> 
> Unless you can show me widespread acceptance of these types of
> optimizations in core kernel code, my answer is no.  I'm sorry.

Hi Paul,

Well, I guess we both know that I cannot show you any "widespread acceptance" of the sort. Here is what I see:

arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize"
(not what you describe as core kernel)

lib/zstd/common/compiler.h:#    define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
(core, but not widespread)

Then of course, there are more examples of specific optimizations, or the disabling thereof, in Makefiles:

kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c
kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
kernel/kcsan/Makefile: -fno-stack-protector -DDISABLE_BRANCH_PROFILING
kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer
kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector
kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5
kernel/sched/Makefile:# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector
kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer
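
Following the per-file CFLAGS pattern above, the same optimizations could hypothetically be expressed as a Makefile change instead of a pragma (an untested sketch; the flag names simply mirror the pragma from patch 2/2):

```make
# Hypothetical sketch: restrict the codegen options to auditsc.o only,
# using cc-option so non-GCC or older compilers silently drop them.
CFLAGS_auditsc.o += $(call cc-option,-funswitch-loops) \
                    $(call cc-option,-falign-loops=16) \
                    $(call cc-option,-falign-jumps=16)
```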

My view is:

Writing time-critical functions that access structs allocated in close temporal proximity by k{m,z}alloc() and family is, with all respect, not a good programming practice, as the example in this patch series depicts.

I have to accept that you will not take this patch series, or a v2 thereof.

But this leaves (210 - 173) = 37 nanoseconds of unnecessary per-syscall overhead on the floor.

It should be unnecessary to quote the number of Cascade Lake systems deployed, and how many of them run with audit enabled and STIG-compliant rules, to point out the significance of those nanoseconds.

I would be very appreciative if you could come up with a better, more acceptable solution to shave off these nanoseconds.


Thxs, Håkon




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] audit: Apply codegen optimizations
  2023-12-21 19:05         ` Haakon Bugge
@ 2023-12-22  4:31           ` Paul Moore
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Moore @ 2023-12-22  4:31 UTC (permalink / raw)
  To: Haakon Bugge; +Cc: Eric Paris, audit@vger.kernel.org, Ankur Arora

On Thu, Dec 21, 2023 at 2:05 PM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> > On 18 Dec 2023, at 23:09, Paul Moore <paul@paul-moore.com> wrote:
> > On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> >>> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote:
> >>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
> >>>>
> >>>> For the most time-consuming function, when running a syscall benchmark
> >>>> with STIG compliant audit rules:
> >>>>
> >>>> Overhead  Command       Shared Object      Symbol
> >>>> .........  ............  .................  ........................
> >>>>
> >>>>   27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
> >>>>
> >>>> we apply codegen optimizations, which speed up syscall
> >>>> performance by around 17% on an Intel Cascade Lake system.
> >>>>
> >>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> >>>> application that measures average syscall latency from getpid()
> >>>> running 100 million rounds.
> >>>>
> >>>> Between each perf run, we reboot the system and wait until the
> >>>> one-minute load average is less than 1.0.
> >>>>
> >>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> >>>> amplify the changes in the audit system.
> >>>>
> >>>> Let the base kernel be v6.6-rc4, booted with "audit=1" and
> >>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> >>>> alignment" on an Intel Cascade Lake system. The following three
> >>>> metrics are reported, nanoseconds per syscall, L1D misses per syscall,
> >>>> and finally Instructions Per Cycle (ipc).
> >>>>
> >>>> Base vs. base + this commit gives:
> >>>>
> >>>> ns per call:
> >>>> min avg max   pstdev
> >>>> - 203 203 209 0.954149
> >>>> + 173 173 178 0.884534
> >>>>
> >>>> L1d misses per syscall:
> >>>>    min    avg    max   pstdev
> >>>> -  0.012  0.103  0.817 0.238352
> >>>> +  0.010  0.209  1.235 0.399416
> >>>>
> >>>> ipc:
> >>>>   min    avg    max   pstdev
> >>>> - 2.320  2.329  2.330 0.003000
> >>>> + 2.430  2.436  2.440 0.004899
> >>>>
> >>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> >>>> ---
> >>>> kernel/auditsc.c | 2 ++
> >>>> 1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> >>>> --- a/kernel/auditsc.c
> >>>> +++ b/kernel/auditsc.c
> >>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> >>>> * parameter can be NULL, but all others must be specified.
> >>>> * Returns 1/true if the filter finds a match, 0/false if none are found.
> >>>> */
> >>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> >>>
> >>> The kernel doesn't really make use of #pragma optimization statements
> >>> like this, at least not in any of the core areas, and I'm not
> >>> interested in being the first to do so.  I appreciate the time and
> >>> effort that you have spent profiling the audit subsystem, but this
> >>> isn't a patch I can accept at this point in time, I'm sorry.
> >>
> >> Fair enough. Will a function attribute aka:
> >>
> >> __attribute__((optimize("foo=bar")))
> >>
> >> be acceptable for you?
> >
> > Unless you can show me widespread acceptance of these types of
> > optimizations in core kernel code, my answer is no.  I'm sorry.
>
> Hi Paul,
>
> Well, I guess we both know that I cannot show you any "widespread acceptance" of the sort. Here is what I see:
>
> arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize"
> (not what you describe as core kernel)
>
> lib/zstd/common/compiler.h:#    define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
> (core, but not widespread)
>
> Then of course, there are more examples of specific optimizations, or the disabling thereof, in Makefiles:
>
> kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c
> kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
> kernel/kcsan/Makefile: -fno-stack-protector -DDISABLE_BRANCH_PROFILING
> kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer
> kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector
> kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5
> kernel/sched/Makefile:# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
> kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
> kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
> kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
> kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector
> kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer
>
> My view is:
>
> Writing time-critical functions that access structs allocated in close temporal proximity by k{m,z}alloc() and family is, with all respect, not a good programming practice, as the example in this patch series depicts.
>
> I have to accept that you will not take this patch series, or a v2 thereof.
>
> But this leaves (210 - 173) = 37 nanoseconds of unnecessary per-syscall overhead on the floor.
>
> It should be unnecessary to quote the number of Cascade Lake systems deployed, and how many of them run with audit enabled and STIG-compliant rules, to point out the significance of those nanoseconds.
>
> I would be very appreciative if you could come up with a better, more acceptable solution to shave off these nanoseconds.

I appreciate and understand your frustration, but sometimes the
subjective quality and maintainability of the code needs to take
precedence, and in my mind the tradeoff for a 37ns win is not
desirable.  Thank you again for your time and effort.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-12-22  4:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2023-12-12 10:28 [PATCH 0/2] audit: Further reduce syscall latency Håkon Bugge
2023-12-12 10:28 ` [PATCH 1/2] audit: Vary struct audit_entry alignment Håkon Bugge
2023-12-13 23:54   ` Paul Moore
2023-12-16 16:25     ` Haakon Bugge
2023-12-18 22:07       ` Paul Moore
2023-12-19 21:07         ` Ankur Arora
2023-12-19 21:27           ` Paul Moore
2023-12-12 10:28 ` [PATCH 2/2] audit: Apply codegen optimizations Håkon Bugge
2023-12-13 23:45   ` Paul Moore
2023-12-16 16:28     ` Haakon Bugge
2023-12-18 22:09       ` Paul Moore
2023-12-21 19:05         ` Haakon Bugge
2023-12-22  4:31           ` Paul Moore
2023-12-12 10:28 ` [PATCH 2/2] audit: Apply special optimizations Håkon Bugge
2023-12-12 10:39   ` Haakon Bugge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox