From: Ankur Arora <ankur.a.arora@oracle.com>
To: Steve Grubb <sgrubb@redhat.com>
Cc: konrad.wilk@oracle.com, linux-kernel@vger.kernel.org,
linux-audit@redhat.com, eparis@redhat.com,
boris.ostrovsky@oracle.com
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
Date: Thu, 06 Oct 2022 17:55:06 -0700 [thread overview]
Message-ID: <87a668bggl.fsf@oracle.com> (raw)
In-Reply-To: <8171459.NyiUUSuA9g@x2>
Steve Grubb <sgrubb@redhat.com> writes:
> Hello,
>
> Thanks for the detailed notes on this investigation. It really is a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.
Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.
Thanks
Ankur
>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>> similar alignment as audit_entry::list audit_entry. For cases
>> with a lot of audit rules, doing this reduces one source of contention
>> from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>> does cast manipulation and error checking on ctx->major:
>>
>> audit_in_mask(const struct audit_krule *rule, unsigned long val):
>> if (val > 0xffffffff)
>> return false;
>>
>> word = AUDIT_WORD(val);
>> if (word >= AUDIT_BITMASK_SIZE)
>> return false;
>>
>> bit = AUDIT_BIT(val);
>>
>> return rule->mask[word] & bit;
>>
>> The clauses related to the rule need to be evaluated in the loop, but
>> the rest is unnecessarily re-evaluated for every loop iteration.
>> (Note, however, that most of these are cheap ALU ops and the branches
>> are perfectly predicted. However, see discussion on cycles
>> improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>> Min Mean Median Max pstdev
>> (ns) (ns) (ns) (ns)
>>
>> - 201.30 216.14 216.22 228.46 (+- 1.45%)
>> + 196.63 207.86 206.60 230.98 (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>> cycles 836.89 ( +- .80% )
>> instructions 2000.19 ( +- .03% )
>> IPC 2.39 ( +- .83% )
>> branches 430.14 ( +- .03% )
>> branch-misses 1.48 ( +- 3.37% )
>> L1-dcache-loads 471.11 ( +- .05% )
>> L1-dcache-load-misses 7.62 ( +- 46.98% )
>>
>> to:
>> cycles 805.58 ( +- 4.11% )
>> instructions 1654.11 ( +- .05% )
>> IPC 2.06 ( +- 3.39% )
>> branches 430.02 ( +- .05% )
>> branch-misses 1.55 ( +- 7.09% )
>> L1-dcache-loads 440.01 ( +- .09% )
>> L1-dcache-load-misses 9.05 ( +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/auditsc.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>> struct audit_entry *e;
>> enum audit_state state;
>> + unsigned long major = ctx->major;
>>
>> if (auditd_test_task(tsk))
>> return;
>>
>> rcu_read_lock();
>> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> - if (audit_in_mask(&e->rule, ctx->major) &&
>> + if (audit_in_mask(&e->rule, major) &&
>> audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> &state, false)) {
>> rcu_read_unlock();
--
ankur
--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit
WARNING: multiple messages have this Message-ID (diff)
From: Ankur Arora <ankur.a.arora@oracle.com>
To: Steve Grubb <sgrubb@redhat.com>
Cc: linux-audit@redhat.com, boris.ostrovsky@oracle.com,
konrad.wilk@oracle.com, eparis@redhat.com,
linux-kernel@vger.kernel.org,
Ankur Arora <ankur.a.arora@oracle.com>
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
Date: Thu, 06 Oct 2022 17:55:06 -0700 [thread overview]
Message-ID: <87a668bggl.fsf@oracle.com> (raw)
In-Reply-To: <8171459.NyiUUSuA9g@x2>
Steve Grubb <sgrubb@redhat.com> writes:
> Hello,
>
> Thanks for the detailed notes on this investigation. It really is a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.
Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.
Thanks
Ankur
>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>> similar alignment as audit_entry::list audit_entry. For cases
>> with a lot of audit rules, doing this reduces one source of contention
>> from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>> does cast manipulation and error checking on ctx->major:
>>
>> audit_in_mask(const struct audit_krule *rule, unsigned long val):
>> if (val > 0xffffffff)
>> return false;
>>
>> word = AUDIT_WORD(val);
>> if (word >= AUDIT_BITMASK_SIZE)
>> return false;
>>
>> bit = AUDIT_BIT(val);
>>
>> return rule->mask[word] & bit;
>>
>> The clauses related to the rule need to be evaluated in the loop, but
>> the rest is unnecessarily re-evaluated for every loop iteration.
>> (Note, however, that most of these are cheap ALU ops and the branches
>> are perfectly predicted. However, see discussion on cycles
>> improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>> Min Mean Median Max pstdev
>> (ns) (ns) (ns) (ns)
>>
>> - 201.30 216.14 216.22 228.46 (+- 1.45%)
>> + 196.63 207.86 206.60 230.98 (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>> cycles 836.89 ( +- .80% )
>> instructions 2000.19 ( +- .03% )
>> IPC 2.39 ( +- .83% )
>> branches 430.14 ( +- .03% )
>> branch-misses 1.48 ( +- 3.37% )
>> L1-dcache-loads 471.11 ( +- .05% )
>> L1-dcache-load-misses 7.62 ( +- 46.98% )
>>
>> to:
>> cycles 805.58 ( +- 4.11% )
>> instructions 1654.11 ( +- .05% )
>> IPC 2.06 ( +- 3.39% )
>> branches 430.02 ( +- .05% )
>> branch-misses 1.55 ( +- 7.09% )
>> L1-dcache-loads 440.01 ( +- .09% )
>> L1-dcache-load-misses 9.05 ( +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/auditsc.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>> struct audit_entry *e;
>> enum audit_state state;
>> + unsigned long major = ctx->major;
>>
>> if (auditd_test_task(tsk))
>> return;
>>
>> rcu_read_lock();
>> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> - if (audit_in_mask(&e->rule, ctx->major) &&
>> + if (audit_in_mask(&e->rule, major) &&
>> audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> &state, false)) {
>> rcu_read_unlock();
--
ankur
next prev parent reply other threads:[~2022-10-07 0:55 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-27 22:59 [PATCH 0/3] improve audit syscall-exit latency Ankur Arora
2022-09-27 22:59 ` Ankur Arora
2022-09-27 22:59 ` [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall() Ankur Arora
2022-09-27 22:59 ` Ankur Arora
2022-09-28 22:03 ` Paul Moore
2022-09-28 22:03 ` Paul Moore
2022-09-29 20:20 ` Ankur Arora
2022-09-29 20:20 ` Ankur Arora
2022-10-17 18:23 ` Paul Moore
2022-10-17 18:23 ` Paul Moore
2022-09-30 17:45 ` Steve Grubb
2022-09-30 17:45 ` Steve Grubb
2022-09-30 18:22 ` Paul Moore
2022-09-30 18:22 ` Paul Moore
2022-10-07 0:55 ` Ankur Arora [this message]
2022-10-07 0:55 ` Ankur Arora
2022-09-27 22:59 ` [PATCH 2/3] audit: annotate branch direction for audit_in_mask() Ankur Arora
2022-09-27 22:59 ` Ankur Arora
2022-09-28 22:26 ` Paul Moore
2022-09-28 22:26 ` Paul Moore
2022-09-29 20:19 ` Ankur Arora
2022-09-29 20:19 ` Ankur Arora
2022-09-30 18:48 ` Paul Moore
2022-09-30 18:48 ` Paul Moore
2022-10-07 0:57 ` Ankur Arora
2022-10-07 0:57 ` Ankur Arora
2022-09-27 22:59 ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-09-27 22:59 ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-09-28 22:54 ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-09-28 22:54 ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-09-29 20:23 ` Ankur Arora
2022-09-29 20:23 ` Ankur Arora
2022-10-07 0:49 ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-10-07 0:49 ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-10-13 23:11 ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-10-13 23:11 ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-10-14 16:53 ` Ankur Arora
2022-10-14 16:53 ` Ankur Arora
2022-10-17 18:26 ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-10-17 18:26 ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87a668bggl.fsf@oracle.com \
--to=ankur.a.arora@oracle.com \
--cc=boris.ostrovsky@oracle.com \
--cc=eparis@redhat.com \
--cc=konrad.wilk@oracle.com \
--cc=linux-audit@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sgrubb@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.