* [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race
@ 2022-12-14 11:44 Peter Newman
  2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-14 11:44 UTC
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86, Peter Newman

Hi Reinette, Fenghua,

Now that we know there is a full barrier in context switch and that
smp_mb() on every task in a group update isn't nearly as expensive as
task_call_func(), I can present a much smaller fix. The diffs are small
enough to fit into a single patch. The changelogs are not much simpler,
but hopefully this isn't as big a concern.

This patch is different enough that I thought it necessary to remove
James's Reviewed-by.

Updates in v5:
- Just put an smp_mb() between the CLOSID/RMID stores and the
  task_curr() calls
- Add a diagram detailing the race to the changelog

Updates in v4:
- Reorder the patches so that the justification for sending more IPIs
  can reference the patch fixing __rdtgroup_move_task().
- Correct the tense of wording used in the changelog and comments

Updates in v3:
- Split the handling of multi-task and single-task operations into
  separate patches, now that they're handled differently.
- Clarify the justification in the commit message, including moving
  some of it out of an inline code comment.

Updates in v2:
- Following Reinette's suggestion: use task_call_func() for a single
  task, IPI broadcast for group movements.
- Rebased to v6.1-rc4

v1: https://lore.kernel.org/lkml/20221103141641.3055981-1-peternewman@google.com/
v2: https://lore.kernel.org/lkml/20221110135346.2209839-1-peternewman@google.com/
v3: https://lore.kernel.org/lkml/20221115141953.816851-1-peternewman@google.com/
v4: https://lore.kernel.org/lkml/20221129111055.953833-1-peternewman@google.com/

Thank you for your careful reviews and feedback so far.
-Peter

Peter Newman (1):
  x86/resctrl: Fix task CLOSID/RMID update race

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
-- 
2.39.0.rc1.256.g54fd8350bd-goog
* [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-14 11:44 [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race Peter Newman
@ 2022-12-14 11:44 ` Peter Newman
  2022-12-15 23:51   ` Reinette Chatre
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-14 11:44 UTC
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86, Peter Newman

When the user moves a running task to a new rdtgroup using the tasks
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s)' CPUs.

x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update, then it can start running with the
old CLOSID/RMID until it is switched again, because
__rdtgroup_move_task() failed to determine that it needs to be
interrupted to obtain the new CLOSID/RMID.

Refer to the diagram below:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
  IPI(t1->cpu)

A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.

In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.

It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, the benchmarking
results below showed too little performance impact from the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.

CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)

# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000

task-clock time ranges collected using:

# perf stat rmdir /sys/fs/resctrl/test

Baseline:                     1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	/*
 	 * Ensure the task's closid and rmid are written before determining if
 	 * the task is current that will decide if it will be interrupted.
+	 * This pairs with the full barrier between the rq->curr update and
+	 * resctrl_sched_in() during context switch.
 	 */
-	barrier();
+	smp_mb();
 
 	/*
 	 * By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 		WRITE_ONCE(t->closid, to->closid);
 		WRITE_ONCE(t->rmid, to->mon.rmid);
 
+		/*
+		 * Order the closid/rmid stores above before the loads
+		 * in task_curr(). This pairs with the full barrier
+		 * between the rq->curr update and resctrl_sched_in()
+		 * during context switch.
+		 */
+		smp_mb();
+
 		/*
 		 * If the task is on a CPU, set the CPU in the mask.
 		 * The detection is inaccurate as tasks might move or
-- 
2.39.0.rc1.256.g54fd8350bd-goog
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
@ 2022-12-15 23:51   ` Reinette Chatre
  2022-12-16 10:26     ` Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Reinette Chatre @ 2022-12-15 23:51 UTC
To: Peter Newman, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
    linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/14/2022 3:44 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface or by deleting its rdtgroup, the resulting change in
> CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
> task(s)' CPUs.
>
> x86 allows reordering loads with prior stores, so if the task starts
> running between a task_curr() check that the CPU hoisted before the
> stores in the CLOSID/RMID update, then it can start running with the
> old CLOSID/RMID until it is switched again, because
> __rdtgroup_move_task() failed to determine that it needs to be
> interrupted to obtain the new CLOSID/RMID.
>
> Refer to the diagram below:
>
> CPU 0                                   CPU 1
> -----                                   -----
> __rdtgroup_move_task():
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {1,1}
> t1->{closid,rmid} <- {2,2}
> if (curr == t1) // false
>   IPI(t1->cpu)
>
> A similar race impacts rdt_move_group_tasks(), which updates tasks in a
> deleted rdtgroup.
>
> In both cases, use smp_mb() to order the task_struct::{closid,rmid}
> stores before the loads in task_curr(). In particular, in the
> rdt_move_group_tasks() case, simply execute an smp_mb() on every
> iteration with a matching task.
>
> It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
> this would require two passes and a means of remembering which
> task_structs were updated in the first loop. However, the benchmarking
> results below showed too little performance impact from the simple
> approach to justify implementing the two-pass approach.
>
> Times below were collected using `perf stat` to measure the time to
> remove a group containing a 1600-task, parallel workload.
>
> CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
>
> # mkdir /sys/fs/resctrl/test
> # echo $$ > /sys/fs/resctrl/test/tasks
> # perf bench sched messaging -g 40 -l 100000
>
> task-clock time ranges collected using:
>
> # perf stat rmdir /sys/fs/resctrl/test
>
> Baseline:                     1.54 - 1.60 ms
> smp_mb() every matching task: 1.57 - 1.67 ms

For a fix, a Fixes: tag is expected. It looks like the following may be
relevant:
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

> Signed-off-by: Peter Newman <peternewman@google.com>

Also, please do let the stable team know about this via:
Cc: stable@vger.kernel.org

> ---

There is no need to submit with a cover letter, but please do keep the
history with this patch by including it here below the "---".

> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..5993da21d822 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  	/*
>  	 * Ensure the task's closid and rmid are written before determining if
>  	 * the task is current that will decide if it will be interrupted.
> +	 * This pairs with the full barrier between the rq->curr update and
> +	 * resctrl_sched_in() during context switch.
>  	 */
> -	barrier();
> +	smp_mb();
>
>  	/*
>  	 * By now, the task's closid and rmid are set. If the task is current
> @@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  		WRITE_ONCE(t->closid, to->closid);
>  		WRITE_ONCE(t->rmid, to->mon.rmid);
>
> +		/*
> +		 * Order the closid/rmid stores above before the loads
> +		 * in task_curr(). This pairs with the full barrier
> +		 * between the rq->curr update and resctrl_sched_in()
> +		 * during context switch.
> +		 */
> +		smp_mb();
> +
>  		/*
>  		 * If the task is on a CPU, set the CPU in the mask.
>  		 * The detection is inaccurate as tasks might move or

Thank you very much for sticking with this and always paying attention
to the details along the way.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-15 23:51 ` Reinette Chatre
@ 2022-12-16 10:26   ` Peter Newman
  2022-12-16 19:36     ` Reinette Chatre
  0 siblings, 1 reply; 6+ messages in thread

From: Peter Newman @ 2022-12-16 10:26 UTC
To: Reinette Chatre
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> For a fix, a Fixes: tag is expected. It looks like the following
> may be relevant:
> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

Thanks for preparing these lines. I'll include them.

> > Signed-off-by: Peter Newman <peternewman@google.com>
>
> Also, please do let the stable team know about this via:
> Cc: stable@vger.kernel.org

I wasn't sure whether this fix met the criteria for backporting to
stable, because I found it by code inspection, so it doesn't meet the
"bothers people" criterion.

However, I can make a case that it's exploitable:

"In a memory bandwidth-metered compute host, malicious jobs could
exploit this race to remain in a previous CLOSID or RMID in order to
dodge a class-of-service downgrade imposed by an admin, or to steal
bandwidth."

> Thank you very much for sticking with this and always paying attention
> to the details along the way.
>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Thank you, Reinette! This has been a learning experience for me.

-Peter
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-16 10:26 ` Peter Newman
@ 2022-12-16 19:36   ` Reinette Chatre
  2022-12-19 10:22     ` Peter Newman
  0 siblings, 1 reply; 6+ messages in thread

From: Reinette Chatre @ 2022-12-16 19:36 UTC
To: Peter Newman
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/16/2022 2:26 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> For a fix, a Fixes: tag is expected. It looks like the following
>> may be relevant:
>> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
>> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
>
> Thanks for preparing these lines. I'll include them.
>
>>
>>> Signed-off-by: Peter Newman <peternewman@google.com>
>>
>> Also, please do let the stable team know about this via:
>> Cc: stable@vger.kernel.org
>
> I wasn't sure whether this fix met the criteria for backporting to
> stable, because I found it by code inspection, so it doesn't meet the
> "bothers people" criterion.

That is fair. Encountering the issue does not produce an obvious error;
the consequence is that there could be intervals during which tasks may
not get the resources/measurements they are entitled to. I do think that
this will be hard to test in order to demonstrate the impact. My
understanding was that this was encountered in your environment, where
actions are taken at large scale. If this remains theoretical, then
there is no need to include the stable team. With the Fixes tags, they
can decide whether it is something they would like to carry.

> However, I can make a case that it's exploitable:
>
> "In a memory bandwidth-metered compute host, malicious jobs could
> exploit this race to remain in a previous CLOSID or RMID in order to
> dodge a class-of-service downgrade imposed by an admin, or to steal
> bandwidth."

I am not comfortable with such high-level speculation. For this exploit
to work, the malicious job needs to control scheduler decisions as well
as time the exploit with the admin's decision to move the target task.

Reinette
* Re: [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID update race
  2022-12-16 19:36 ` Reinette Chatre
@ 2022-12-19 10:22   ` Peter Newman
  0 siblings, 0 replies; 6+ messages in thread

From: Peter Newman @ 2022-12-19 10:22 UTC
To: Reinette Chatre
Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
    kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Fri, Dec 16, 2022 at 8:36 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/16/2022 2:26 AM, Peter Newman wrote:
> > However, I can make a case that it's exploitable:
> >
> > "In a memory bandwidth-metered compute host, malicious jobs could
> > exploit this race to remain in a previous CLOSID or RMID in order to
> > dodge a class-of-service downgrade imposed by an admin, or to steal
> > bandwidth."
>
> I am not comfortable with such high-level speculation. For this exploit
> to work, the malicious job needs to control scheduler decisions as well
> as time the exploit with the admin's decision to move the target task.

I imagined that if the malicious job maintained a large pool of threads
in short sleep-loops, then after it sees a drop in bandwidth, it could
cue the threads to measure their memory bandwidth to see whether any got
past the CLOSID change.

I don't know whether having fast, unmetered bandwidth until the next
context switch is enough of a payoff to bother with this, though. Our
workloads have too many context switches for this to be worth very much,
so I'm fine with letting others decide how important this fix is to
them.

-Peter
Thread overview: 6+ messages
2022-12-14 11:44 [PATCH v5 0/1] Subject: x86/resctrl: Fix task CLOSID update race Peter Newman
2022-12-14 11:44 ` [PATCH v5 1/1] x86/resctrl: Fix task CLOSID/RMID " Peter Newman
2022-12-15 23:51   ` Reinette Chatre
2022-12-16 10:26     ` Peter Newman
2022-12-16 19:36       ` Reinette Chatre
2022-12-19 10:22         ` Peter Newman