* [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race
@ 2022-12-14 11:44 Peter Newman
  2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-14 11:44 UTC
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86, Peter Newman

Hi Reinette, Fenghua,

Now that we know there is a full barrier in context switch and that
smp_mb() on every task in a group update isn't nearly as expensive as
task_call_func(), I can present a much smaller fix. The diffs are small
enough to fit into a single patch. The changelogs are not much simpler,
but hopefully this isn't as big a concern.

This patch is different enough that I thought it necessary to remove
James's Reviewed-by.

Updates in v5:
- Just put an smp_mb() between the CLOSID/RMID stores and the
  task_curr() calls
- Add a diagram detailing the race to the changelog

Updates in v4:
- Reorder the patches so that the justification for sending more IPIs
  can reference the patch fixing __rdtgroup_move_task().
- Correct the tense of wording used in the changelog and comments

Updates in v3:
- Split the handling of multi-task and single-task operations into
  separate patches, now that they're handled differently.
- Clarify the justification in the commit message, including moving
  some of it out of an inline code comment.

Updates in v2:
- Following Reinette's suggestion: use task_call_func() for a single
  task, IPI broadcast for group movements.
- Rebased to v6.1-rc4

v1: https://lore.kernel.org/lkml/20221103141641.3055981-1-peternewman@google.com/
v2: https://lore.kernel.org/lkml/20221110135346.2209839-1-peternewman@google.com/
v3: https://lore.kernel.org/lkml/20221115141953.816851-1-peternewman@google.com/
v4: https://lore.kernel.org/lkml/20221129111055.953833-1-peternewman@google.com/

Thank you for your careful reviews and feedback so far.
-Peter

Peter Newman (1):
  x86/resctrl: Fix task CLOSID/RMID update race

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
-- 
2.39.0.rc1.256.g54fd8350bd-goog
* [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-14 11:44 [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race Peter Newman
@ 2022-12-14 11:44 ` Peter Newman
  2022-12-15 23:51   ` Reinette Chatre
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-14 11:44 UTC
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86, Peter Newman

When the user moves a running task to a new rdtgroup using the tasks
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s)' CPUs.

x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update, then it can start running with the
old CLOSID/RMID until it is switched again, because
__rdtgroup_move_task() failed to determine that it needs to be
interrupted to obtain the new CLOSID/RMID.

Refer to the diagram below:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
  IPI(t1->cpu)

A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.

In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.

It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, the benchmarking
results below showed too little performance impact from the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.

CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)

# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000

task-clock time ranges collected using:

# perf stat rmdir /sys/fs/resctrl/test

Baseline:                     1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	/*
 	 * Ensure the task's closid and rmid are written before determining if
 	 * the task is current that will decide if it will be interrupted.
+	 * This pairs with the full barrier between the rq->curr update and
+	 * resctrl_sched_in() during context switch.
 	 */
-	barrier();
+	smp_mb();
 
 	/*
 	 * By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 		WRITE_ONCE(t->closid, to->closid);
 		WRITE_ONCE(t->rmid, to->mon.rmid);
 
+		/*
+		 * Order the closid/rmid stores above before the loads
+		 * in task_curr(). This pairs with the full barrier
+		 * between the rq->curr update and resctrl_sched_in()
+		 * during context switch.
+		 */
+		smp_mb();
+
 		/*
 		 * If the task is on a CPU, set the CPU in the mask.
 		 * The detection is inaccurate as tasks might move or
-- 
2.39.0.rc1.256.g54fd8350bd-goog
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
@ 2022-12-15 23:51   ` Reinette Chatre
  2022-12-16 10:26     ` Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Reinette Chatre @ 2022-12-15 23:51 UTC
To: Peter Newman, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/14/2022 3:44 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface or by deleting its rdtgroup, the resulting change in
> CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
> task(s)' CPUs.
>
> x86 allows reordering loads with prior stores, so if the task starts
> running between a task_curr() check that the CPU hoisted before the
> stores in the CLOSID/RMID update, then it can start running with the
> old CLOSID/RMID until it is switched again, because
> __rdtgroup_move_task() failed to determine that it needs to be
> interrupted to obtain the new CLOSID/RMID.
>
> Refer to the diagram below:
>
> CPU 0                                   CPU 1
> -----                                   -----
> __rdtgroup_move_task():
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {1,1}
> t1->{closid,rmid} <- {2,2}
> if (curr == t1) // false
>   IPI(t1->cpu)
>
> A similar race impacts rdt_move_group_tasks(), which updates tasks in a
> deleted rdtgroup.
>
> In both cases, use smp_mb() to order the task_struct::{closid,rmid}
> stores before the loads in task_curr(). In particular, in the
> rdt_move_group_tasks() case, simply execute an smp_mb() on every
> iteration with a matching task.
>
> It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
> this would require two passes and a means of remembering which
> task_structs were updated in the first loop. However, the benchmarking
> results below showed too little performance impact from the simple
> approach to justify implementing the two-pass approach.
>
> Times below were collected using `perf stat` to measure the time to
> remove a group containing a 1600-task, parallel workload.
>
> CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
>
> # mkdir /sys/fs/resctrl/test
> # echo $$ > /sys/fs/resctrl/test/tasks
> # perf bench sched messaging -g 40 -l 100000
>
> task-clock time ranges collected using:
>
> # perf stat rmdir /sys/fs/resctrl/test
>
> Baseline:                     1.54 - 1.60 ms
> smp_mb() every matching task: 1.57 - 1.67 ms

For a fix, a Fixes: tag is expected. It looks like the following may be
relevant:
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

> Signed-off-by: Peter Newman <peternewman@google.com>

Also, please do let the stable team know about this via:
Cc: stable@vger.kernel.org

> ---

There is no need to submit with a cover letter, but please do keep the
history with this patch by including it here below the "---".

> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..5993da21d822 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  	/*
>  	 * Ensure the task's closid and rmid are written before determining if
>  	 * the task is current that will decide if it will be interrupted.
> +	 * This pairs with the full barrier between the rq->curr update and
> +	 * resctrl_sched_in() during context switch.
>  	 */
> -	barrier();
> +	smp_mb();
>
>  	/*
>  	 * By now, the task's closid and rmid are set. If the task is current
> @@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  		WRITE_ONCE(t->closid, to->closid);
>  		WRITE_ONCE(t->rmid, to->mon.rmid);
>
> +		/*
> +		 * Order the closid/rmid stores above before the loads
> +		 * in task_curr(). This pairs with the full barrier
> +		 * between the rq->curr update and resctrl_sched_in()
> +		 * during context switch.
> +		 */
> +		smp_mb();
> +
>  		/*
>  		 * If the task is on a CPU, set the CPU in the mask.
>  		 * The detection is inaccurate as tasks might move or

Thank you very much for sticking with this and always paying attention
to the details along the way.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-15 23:51 ` Reinette Chatre
@ 2022-12-16 10:26   ` Peter Newman
  2022-12-16 19:36     ` Reinette Chatre
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-16 10:26 UTC
To: Reinette Chatre
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> For a fix, a Fixes: tag is expected. It looks like the following
> may be relevant:
> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

Thanks for preparing these lines. I'll include them.

> > Signed-off-by: Peter Newman <peternewman@google.com>
>
> Also, please do let the stable team know about this via:
> Cc: stable@vger.kernel.org

I wasn't sure whether this fix met the criteria for backporting to
stable, because I found it by code inspection, so it doesn't meet the
"bothers people" criterion.

However, I can make a case that it's exploitable:

"In a memory bandwidth-metered compute host, malicious jobs could
exploit this race to remain in a previous CLOSID or RMID in order to
dodge a class-of-service downgrade imposed by an admin, or to steal
bandwidth."

> Thank you very much for sticking with this and always paying attention
> to the details along the way.
>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Thank you, Reinette! This has been a learning experience for me.

-Peter
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-16 10:26 ` Peter Newman
@ 2022-12-16 19:36   ` Reinette Chatre
  2022-12-19 10:22     ` Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Reinette Chatre @ 2022-12-16 19:36 UTC
To: Peter Newman
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/16/2022 2:26 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> For a fix, a Fixes: tag is expected. It looks like the following
>> may be relevant:
>> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
>> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
>
> Thanks for preparing these lines. I'll include them.
>
>>
>>> Signed-off-by: Peter Newman <peternewman@google.com>
>>
>> Also, please do let the stable team know about this via:
>> Cc: stable@vger.kernel.org
>
> I wasn't sure whether this fix met the criteria for backporting to
> stable, because I found it by code inspection, so it doesn't meet the
> "bothers people" criterion.

That is fair. Encountering the issue does not produce an obvious error;
the consequence is that there could be intervals during which tasks may
not get the resources/measurements they are entitled to. I do think that
this will be hard to test in order to demonstrate the impact. My
understanding was that this was encountered in your environment, where
actions are taken at large scale. If this remains theoretical, then
there is no need to include the stable team. With the Fixes tags, they
can decide whether it is something they would like to carry.

> However, I can make a case that it's exploitable:
>
> "In a memory bandwidth-metered compute host, malicious jobs could
> exploit this race to remain in a previous CLOSID or RMID in order to
> dodge a class-of-service downgrade imposed by an admin, or to steal
> bandwidth."

I am not comfortable with such high-level speculation. For this exploit
to work, the malicious job needs to control scheduler decisions as well
as time the exploit with the admin's decision to move the target task.

Reinette
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-16 19:36 ` Reinette Chatre
@ 2022-12-19 10:22   ` Peter Newman
  0 siblings, 0 replies; 6+ messages in thread

From: Peter Newman @ 2022-12-19 10:22 UTC
To: Reinette Chatre
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Fri, Dec 16, 2022 at 8:36 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/16/2022 2:26 AM, Peter Newman wrote:
> > However, I can make a case that it's exploitable:
> >
> > "In a memory bandwidth-metered compute host, malicious jobs could
> > exploit this race to remain in a previous CLOSID or RMID in order to
> > dodge a class-of-service downgrade imposed by an admin, or to steal
> > bandwidth."
>
> I am not comfortable with such high-level speculation. For this exploit
> to work, the malicious job needs to control scheduler decisions as well
> as time the exploit with the admin's decision to move the target task.

I imagined that if the malicious job maintained a large pool of threads
in short sleep-loops, then after it sees a drop in bandwidth, it could
cue the threads to measure their memory bandwidth to see whether any got
past the CLOSID change.

I don't know whether having fast, unmetered bandwidth until the next
context switch is enough of a payoff to bother with this, though. Our
workloads have too many context switches for this to be worth very much,
so I'm fine with letting others decide how important this fix is to
them.

-Peter
Thread overview: 6+ messages
2022-12-14 11:44 [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race Peter Newman
2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
2022-12-15 23:51   ` Reinette Chatre
2022-12-16 10:26     ` Peter Newman
2022-12-16 19:36       ` Reinette Chatre
2022-12-19 10:22         ` Peter Newman