All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ankur Arora <ankur.a.arora@oracle.com>
To: Steve Grubb <sgrubb@redhat.com>
Cc: konrad.wilk@oracle.com, linux-kernel@vger.kernel.org,
	linux-audit@redhat.com, eparis@redhat.com,
	boris.ostrovsky@oracle.com
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
Date: Thu, 06 Oct 2022 17:55:06 -0700	[thread overview]
Message-ID: <87a668bggl.fsf@oracle.com> (raw)
In-Reply-To: <8171459.NyiUUSuA9g@x2>


Steve Grubb <sgrubb@redhat.com> writes:

> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.

Thanks
Ankur

>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>>   similar alignment as audit_entry::list audit_entry. For cases
>>   with a lot of audit rules, doing this reduces one source of contention
>>   from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>>   does cast manipulation and error checking on ctx->major:
>>
>>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>>              if (val > 0xffffffff)
>>                      return false;
>>
>>              word = AUDIT_WORD(val);
>>              if (word >= AUDIT_BITMASK_SIZE)
>>                      return false;
>>
>>              bit = AUDIT_BIT(val);
>>
>>              return rule->mask[word] & bit;
>>
>>   The clauses related to the rule need to be evaluated in the loop, but
>>   the rest is unnecessarily re-evaluated for every loop iteration.
>>   (Note, however, that most of these are cheap ALU ops and the branches
>>    are perfectly predicted. However, see discussion on cycles
>>    improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>>              Min     Mean  Median     Max       pstdev
>>             (ns)     (ns)    (ns)    (ns)
>>
>>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               836.89  (  +-   .80% )
>>     instructions        2000.19  (  +-   .03% )
>>     IPC                    2.39  (  +-   .83% )
>>     branches             430.14  (  +-   .03% )
>>     branch-misses          1.48  (  +-  3.37% )
>>     L1-dcache-loads      471.11  (  +-   .05% )
>>     L1-dcache-load-misses  7.62  (  +- 46.98% )
>>
>>  to:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>>  	struct audit_entry *e;
>>  	enum audit_state state;
>> +	unsigned long major = ctx->major;
>>
>>  	if (auditd_test_task(tsk))
>>  		return;
>>
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> -		if (audit_in_mask(&e->rule, ctx->major) &&
>> +		if (audit_in_mask(&e->rule, major) &&
>>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>  				       &state, false)) {
>>  			rcu_read_unlock();


--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


WARNING: multiple messages have this Message-ID (diff)
From: Ankur Arora <ankur.a.arora@oracle.com>
To: Steve Grubb <sgrubb@redhat.com>
Cc: linux-audit@redhat.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, eparis@redhat.com,
	linux-kernel@vger.kernel.org,
	Ankur Arora <ankur.a.arora@oracle.com>
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
Date: Thu, 06 Oct 2022 17:55:06 -0700	[thread overview]
Message-ID: <87a668bggl.fsf@oracle.com> (raw)
In-Reply-To: <8171459.NyiUUSuA9g@x2>


Steve Grubb <sgrubb@redhat.com> writes:

> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.

Thanks
Ankur

>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>>   similar alignment as audit_entry::list audit_entry. For cases
>>   with a lot of audit rules, doing this reduces one source of contention
>>   from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>>   does cast manipulation and error checking on ctx->major:
>>
>>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>>              if (val > 0xffffffff)
>>                      return false;
>>
>>              word = AUDIT_WORD(val);
>>              if (word >= AUDIT_BITMASK_SIZE)
>>                      return false;
>>
>>              bit = AUDIT_BIT(val);
>>
>>              return rule->mask[word] & bit;
>>
>>   The clauses related to the rule need to be evaluated in the loop, but
>>   the rest is unnecessarily re-evaluated for every loop iteration.
>>   (Note, however, that most of these are cheap ALU ops and the branches
>>    are perfectly predicted. However, see discussion on cycles
>>    improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>>              Min     Mean  Median     Max       pstdev
>>             (ns)     (ns)    (ns)    (ns)
>>
>>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               836.89  (  +-   .80% )
>>     instructions        2000.19  (  +-   .03% )
>>     IPC                    2.39  (  +-   .83% )
>>     branches             430.14  (  +-   .03% )
>>     branch-misses          1.48  (  +-  3.37% )
>>     L1-dcache-loads      471.11  (  +-   .05% )
>>     L1-dcache-load-misses  7.62  (  +- 46.98% )
>>
>>  to:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>>  	struct audit_entry *e;
>>  	enum audit_state state;
>> +	unsigned long major = ctx->major;
>>
>>  	if (auditd_test_task(tsk))
>>  		return;
>>
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> -		if (audit_in_mask(&e->rule, ctx->major) &&
>> +		if (audit_in_mask(&e->rule, major) &&
>>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>  				       &state, false)) {
>>  			rcu_read_unlock();


--
ankur

  parent reply	other threads:[~2022-10-07  0:55 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-27 22:59 [PATCH 0/3] improve audit syscall-exit latency Ankur Arora
2022-09-27 22:59 ` Ankur Arora
2022-09-27 22:59 ` [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall() Ankur Arora
2022-09-27 22:59   ` Ankur Arora
2022-09-28 22:03   ` Paul Moore
2022-09-28 22:03     ` Paul Moore
2022-09-29 20:20     ` Ankur Arora
2022-09-29 20:20       ` Ankur Arora
2022-10-17 18:23     ` Paul Moore
2022-10-17 18:23       ` Paul Moore
2022-09-30 17:45   ` Steve Grubb
2022-09-30 17:45     ` Steve Grubb
2022-09-30 18:22     ` Paul Moore
2022-09-30 18:22       ` Paul Moore
2022-10-07  0:55     ` Ankur Arora [this message]
2022-10-07  0:55       ` Ankur Arora
2022-09-27 22:59 ` [PATCH 2/3] audit: annotate branch direction for audit_in_mask() Ankur Arora
2022-09-27 22:59   ` Ankur Arora
2022-09-28 22:26   ` Paul Moore
2022-09-28 22:26     ` Paul Moore
2022-09-29 20:19     ` Ankur Arora
2022-09-29 20:19       ` Ankur Arora
2022-09-30 18:48       ` Paul Moore
2022-09-30 18:48         ` Paul Moore
2022-10-07  0:57         ` Ankur Arora
2022-10-07  0:57           ` Ankur Arora
2022-09-27 22:59 ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-09-27 22:59   ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-09-28 22:54   ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-09-28 22:54     ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-09-29 20:23     ` Ankur Arora
2022-09-29 20:23       ` Ankur Arora
2022-10-07  0:49   ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-10-07  0:49     ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-10-13 23:11     ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-10-13 23:11       ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-10-14 16:53       ` Ankur Arora
2022-10-14 16:53         ` Ankur Arora
2022-10-17 18:26       ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-10-17 18:26         ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a668bggl.fsf@oracle.com \
    --to=ankur.a.arora@oracle.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=eparis@redhat.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-audit@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sgrubb@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.