* [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race
@ 2022-11-29 11:10 Peter Newman
  2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman
  2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw)
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86, Peter Newman

Hi Reinette, Fenghua,

I've fixed the wording in the changelogs and code comments throughout
and clarified the explanations as Reinette had requested.

The patch series addresses the IPI race we discussed in the container
move RFD thread[1]. The changelogs in the patches should also provide a
good description.

Updates in v4:
- Reorder the patches so that the justification for sending more IPIs
  can reference the patch fixing __rdtgroup_move_task().
- Correct the tense of wording used in the changelog and comments.

Updates in v3:
- Split the handling of multi-task and single-task operations into
  separate patches, now that they're handled differently.
- Clarify the justification in the commit message, including moving some
  of it out of the inline code comment.

Updates in v2:
- Following Reinette's suggestion: use task_call_func() for a single
  task, IPI broadcast for group movements.
- Rebased to v6.1-rc4

v1: https://lore.kernel.org/lkml/20221103141641.3055981-1-peternewman@google.com/
v2: https://lore.kernel.org/lkml/20221110135346.2209839-1-peternewman@google.com/
v3: https://lore.kernel.org/lkml/20221115141953.816851-1-peternewman@google.com/

Thanks!
-Peter

[1] https://lore.kernel.org/all/CALPaoCg2-9ARbK+MEgdvdcjJtSy_2H6YeRkLrT97zgy8Aro3Vg@mail.gmail.com/

Peter Newman (2):
  x86/resctrl: Update task closid/rmid with task_call_func()
  x86/resctrl: IPI all online CPUs for group updates

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 130 ++++++++++++-------------
 1 file changed, 60 insertions(+), 70 deletions(-)

base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
-- 
2.38.1.584.g0f3c55d4c2-goog

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-11-29 11:10 [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race Peter Newman
@ 2022-11-29 11:10 ` Peter Newman
  2022-12-06 18:56   ` Reinette Chatre
  2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw)
To: reinette.chatre, fenghua.yu
Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86, Peter Newman

When the user moves a running task to a new rdtgroup using the tasks
file interface, the resulting change in CLOSID/RMID must be immediately
propagated to the PQR_ASSOC MSR on the task's CPU.

It is possible for a task to wake up or migrate while it is being moved
to a new group. If __rdtgroup_move_task() fails to observe that a task
has begun running or misses that it migrated to a new CPU, the task will
continue to use the old CLOSID or RMID until it switches in again.

__rdtgroup_move_task() assumes that if the task migrates off of its CPU
before it can IPI the task, then the task has already observed the
updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
delay stores until after loads, the following incorrect scenarios are
possible:

1. __rdtgroup_move_task() stores the new closid and rmid in
   the task structure after it loads task_curr() and task_cpu().

2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
   switch stores new task_curr() and task_cpu() values.

Use task_call_func() in __rdtgroup_move_task() to serialize updates to
the closid and rmid fields in the task_struct with context switch.
Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..59b7ffcd53bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
+{
+	struct rdtgroup *rdtgrp = arg;
+
+	/*
+	 * Although task_call_func() serializes the writes below with the paired
+	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
+	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
+	 * to conform.
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		WRITE_ONCE(t->closid, rdtgrp->closid);
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	}
+
+	/*
+	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+	 * updated to make the resource group go into effect. If the task is not
+	 * current, the MSR will be updated when the task is scheduled in.
+	 */
+	return task_curr(t);
+}
+
 static void _update_task_closid_rmid(void *task)
 {
 	/*
@@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task)
 	resctrl_sched_in();
 }
 
-static void update_task_closid_rmid(struct task_struct *t)
+static void update_task_closid_rmid(struct task_struct *t,
+				    struct rdtgroup *rdtgrp)
 {
-	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
-		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+	/*
+	 * Serialize the closid and rmid update with context switch. If
+	 * task_call_func() indicates that the task was running during
+	 * update_locked_task_closid_rmid(), then interrupt it.
+	 */
+	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
+	    IS_ENABLED(CONFIG_SMP))
+		/*
+		 * If the task has migrated away from the CPU indicated by
+		 * task_cpu() below, then it has already switched in on the
+		 * new CPU using the updated closid and rmid and the call below
+		 * is unnecessary, but harmless.
+		 */
+		smp_call_function_single(task_cpu(t),
+					 _update_task_closid_rmid, t, 1);
 	else
 		_update_task_closid_rmid(t);
 }
@@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
-	 *
-	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the task's closid and rmid are written before determining if
-	 * the task is current that will decide if it will be interrupted.
-	 */
-	barrier();
-
-	/*
-	 * By now, the task's closid and rmid are set. If the task is current
-	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
-	 * group go into effect. If the task is not current, the MSR will be
-	 * updated when the task is scheduled in.
-	 */
-	update_task_closid_rmid(tsk);
+	update_task_closid_rmid(tsk, rdtgrp);
 
 	return 0;
 }
-- 
2.38.1.584.g0f3c55d4c2-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman @ 2022-12-06 18:56 ` Reinette Chatre 2022-12-07 10:58 ` Peter Newman 0 siblings, 1 reply; 13+ messages in thread From: Reinette Chatre @ 2022-12-06 18:56 UTC (permalink / raw) To: Peter Newman, fenghua.yu Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Peter, On 11/29/2022 3:10 AM, Peter Newman wrote: > When the user moves a running task to a new rdtgroup using the tasks > file interface, the resulting change in CLOSID/RMID must be immediately > propagated to the PQR_ASSOC MSR on the task's CPU. > > It is possible for a task to wake up or migrate while it is being moved > to a new group. If __rdtgroup_move_task() fails to observe that a task > has begun running or misses that it migrated to a new CPU, the task will > continue to use the old CLOSID or RMID until it switches in again. > > __rdtgroup_move_task() assumes that if the task migrates off of its CPU > before it can IPI the task, then the task has already observed the > updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can > delay stores until after loads, the following incorrect scenarios are > possible: > > 1. __rdtgroup_move_task() stores the new closid and rmid in > the task structure after it loads task_curr() and task_cpu(). Stating how this scenario encounters the problem would help so perhaps something like (please feel free to change): "If the task starts running between a reordered task_curr() check and the CLOSID/RMID update then it will start running with the old CLOSID/RMID until it is switched again because __rdtgroup_move_task() failed to determine that it needs to be interrupted to obtain the new CLOSID/RMID." > 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context > switch stores new task_curr() and task_cpu() values. 
This scenario is not clear to me. Could you please provide more detail about it? I was trying to follow the context_switch() flow and resctrl_sched_in() is one of the last things done (context_switch()->switch_to()->resctrl_sched_in()). From what I can tell rq->curr, as used by task_curr() is set before even context_switch() is called ... and since the next task is picked from the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to a runqueue) it seems to me that the value used by task_cpu() would also be set early (before context_switch() is called). It is thus not clear to me how the above reordering could occur so an example would help a lot. > > Use task_call_func() in __rdtgroup_move_task() to serialize updates to > the closid and rmid fields in the task_struct with context switch. Is there a reason why there is a switch between the all caps CLOSID/RMID at the beginning to the no caps here? > Signed-off-by: Peter Newman <peternewman@google.com> > Reviewed-by: James Morse <james.morse@arm.com> > --- > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++---------- > 1 file changed, 47 insertions(+), 31 deletions(-) > > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > index e5a48f05e787..59b7ffcd53bb 100644 > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > @@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp) > kfree(rdtgrp); > } > > +static int update_locked_task_closid_rmid(struct task_struct *t, void *arg) > +{ > + struct rdtgroup *rdtgrp = arg; > + > + /* > + * Although task_call_func() serializes the writes below with the paired > + * reads in resctrl_sched_in(), resctrl_sched_in() still needs > + * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here > + * to conform. 
> + */ > + if (rdtgrp->type == RDTCTRL_GROUP) { > + WRITE_ONCE(t->closid, rdtgrp->closid); > + WRITE_ONCE(t->rmid, rdtgrp->mon.rmid); > + } else if (rdtgrp->type == RDTMON_GROUP) { > + WRITE_ONCE(t->rmid, rdtgrp->mon.rmid); > + } > + > + /* > + * If the task is current on a CPU, the PQR_ASSOC MSR needs to be > + * updated to make the resource group go into effect. If the task is not > + * current, the MSR will be updated when the task is scheduled in. > + */ > + return task_curr(t); > +} > + > static void _update_task_closid_rmid(void *task) > { > /* > @@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task) > resctrl_sched_in(); > } > > -static void update_task_closid_rmid(struct task_struct *t) > +static void update_task_closid_rmid(struct task_struct *t, > + struct rdtgroup *rdtgrp) > { > - if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) > - smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); > + /* > + * Serialize the closid and rmid update with context switch. If > + * task_call_func() indicates that the task was running during > + * update_locked_task_closid_rmid(), then interrupt it. > + */ > + if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) && > + IS_ENABLED(CONFIG_SMP)) > + /* > + * If the task has migrated away from the CPU indicated by > + * task_cpu() below, then it has already switched in on the > + * new CPU using the updated closid and rmid and the call below > + * is unnecessary, but harmless. > + */ > + smp_call_function_single(task_cpu(t), > + _update_task_closid_rmid, t, 1); > else > _update_task_closid_rmid(t); > } > @@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk, > return 0; > > /* > - * Set the task's closid/rmid before the PQR_ASSOC MSR can be > - * updated by them. > - * > - * For ctrl_mon groups, move both closid and rmid. > * For monitor groups, can move the tasks only from > * their parent CTRL group. 
> */ > - > - if (rdtgrp->type == RDTCTRL_GROUP) { > - WRITE_ONCE(tsk->closid, rdtgrp->closid); > - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid); > - } else if (rdtgrp->type == RDTMON_GROUP) { > - if (rdtgrp->mon.parent->closid == tsk->closid) { > - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid); > - } else { > - rdt_last_cmd_puts("Can't move task to different control group\n"); > - return -EINVAL; > - } > + if (rdtgrp->type == RDTMON_GROUP && > + rdtgrp->mon.parent->closid != tsk->closid) { > + rdt_last_cmd_puts("Can't move task to different control group\n"); > + return -EINVAL; > } > > - /* > - * Ensure the task's closid and rmid are written before determining if > - * the task is current that will decide if it will be interrupted. > - */ > - barrier(); > - > - /* > - * By now, the task's closid and rmid are set. If the task is current > - * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource > - * group go into effect. If the task is not current, the MSR will be > - * updated when the task is scheduled in. > - */ > - update_task_closid_rmid(tsk); > + update_task_closid_rmid(tsk, rdtgrp); > > return 0; > } The change looks good to me. Reinette ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-06 18:56 ` Reinette Chatre @ 2022-12-07 10:58 ` Peter Newman 2022-12-07 18:38 ` Reinette Chatre 0 siblings, 1 reply; 13+ messages in thread From: Peter Newman @ 2022-12-07 10:58 UTC (permalink / raw) To: Reinette Chatre Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Reinette, On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > On 11/29/2022 3:10 AM, Peter Newman wrote: > > When the user moves a running task to a new rdtgroup using the tasks > > file interface, the resulting change in CLOSID/RMID must be immediately > > propagated to the PQR_ASSOC MSR on the task's CPU. > > > > It is possible for a task to wake up or migrate while it is being moved > > to a new group. If __rdtgroup_move_task() fails to observe that a task > > has begun running or misses that it migrated to a new CPU, the task will > > continue to use the old CLOSID or RMID until it switches in again. > > > > __rdtgroup_move_task() assumes that if the task migrates off of its CPU > > before it can IPI the task, then the task has already observed the > > updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can > > delay stores until after loads, the following incorrect scenarios are > > possible: > > > > 1. __rdtgroup_move_task() stores the new closid and rmid in > > the task structure after it loads task_curr() and task_cpu(). > > Stating how this scenario encounters the problem would help > so perhaps something like (please feel free to change): > "If the task starts running between a reordered task_curr() check and > the CLOSID/RMID update then it will start running with the old CLOSID/RMID > until it is switched again because __rdtgroup_move_task() failed to determine > that it needs to be interrupted to obtain the new CLOSID/RMID." 
That is largely what I was trying to state in paragraph 2 above, though at a higher level. I hoped the paragraph following it would do enough to connect the high-level description with the low-level problem scenarios. > > > 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context > > switch stores new task_curr() and task_cpu() values. > > This scenario is not clear to me. Could you please provide more detail about it? > I was trying to follow the context_switch() flow and resctrl_sched_in() is > one of the last things done (context_switch()->switch_to()->resctrl_sched_in()). > From what I can tell rq->curr, as used by task_curr() is set before > even context_switch() is called ... and since the next task is picked from > the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to > a runqueue) it seems to me that the value used by task_cpu() would also > be set early (before context_switch() is called). It is thus not clear to > me how the above reordering could occur so an example would help a lot. Perhaps in both scenarios I didn't make it clear that reordering in the CPU can cause the incorrect behavior rather than the program order. In this explanation, items 1. and 2. are supposed to be completing the sentence ending with a ':' at the end of paragraph 3, so I thought that would keep focus on the CPU. I had assumed that the ordering requirements were well-understood, since they're stated in existing code comments a few times, and that making a case for how the expected ordering could be violated would be enough, but I'm happy to draw up a side-by-side example. > > > > > Use task_call_func() in __rdtgroup_move_task() to serialize updates to > > the closid and rmid fields in the task_struct with context switch. > > Is there a reason why there is a switch between the all caps CLOSID/RMID > at the beginning to the no caps here? It's because I referred to the task_struct fields explicitly here. 
-Peter ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-07 10:58 ` Peter Newman @ 2022-12-07 18:38 ` Reinette Chatre 2022-12-08 22:30 ` Peter Newman 0 siblings, 1 reply; 13+ messages in thread From: Reinette Chatre @ 2022-12-07 18:38 UTC (permalink / raw) To: Peter Newman Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Peter, On 12/7/2022 2:58 AM, Peter Newman wrote: > Hi Reinette, > > On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> On 11/29/2022 3:10 AM, Peter Newman wrote: >>> When the user moves a running task to a new rdtgroup using the tasks >>> file interface, the resulting change in CLOSID/RMID must be immediately >>> propagated to the PQR_ASSOC MSR on the task's CPU. >>> >>> It is possible for a task to wake up or migrate while it is being moved >>> to a new group. If __rdtgroup_move_task() fails to observe that a task >>> has begun running or misses that it migrated to a new CPU, the task will >>> continue to use the old CLOSID or RMID until it switches in again. >>> >>> __rdtgroup_move_task() assumes that if the task migrates off of its CPU >>> before it can IPI the task, then the task has already observed the >>> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can >>> delay stores until after loads, the following incorrect scenarios are >>> possible: >>> >>> 1. __rdtgroup_move_task() stores the new closid and rmid in >>> the task structure after it loads task_curr() and task_cpu(). >> >> Stating how this scenario encounters the problem would help >> so perhaps something like (please feel free to change): >> "If the task starts running between a reordered task_curr() check and >> the CLOSID/RMID update then it will start running with the old CLOSID/RMID >> until it is switched again because __rdtgroup_move_task() failed to determine >> that it needs to be interrupted to obtain the new CLOSID/RMID." 
> > That is largely what I was trying to state in paragraph 2 above, though > at a higher level. I hoped the paragraph following it would do enough to > connect the high-level description with the low-level problem scenarios. There is no need to require the reader to connect various snippets to create a problematic scenario themselves. The changelog should make the problem obvious. I understand that it is what you wanted to say, that is why I moved existing snippets to form a coherent problem scenario. It is ok if you do not like the way I wrote it, it was only an example on how it can be done. >>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context >>> switch stores new task_curr() and task_cpu() values. >> >> This scenario is not clear to me. Could you please provide more detail about it? >> I was trying to follow the context_switch() flow and resctrl_sched_in() is >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()). >> From what I can tell rq->curr, as used by task_curr() is set before >> even context_switch() is called ... and since the next task is picked from >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to >> a runqueue) it seems to me that the value used by task_cpu() would also >> be set early (before context_switch() is called). It is thus not clear to >> me how the above reordering could occur so an example would help a lot. > > Perhaps in both scenarios I didn't make it clear that reordering in the > CPU can cause the incorrect behavior rather than the program order. In > this explanation, items 1. and 2. are supposed to be completing the > sentence ending with a ':' at the end of paragraph 3, so I thought that > would keep focus on the CPU. You did make it clear that the cause is reordering in the CPU. I am just not able to see where the reordering is occurring in your scenario (2). 
> I had assumed that the ordering requirements were well-understood, since > they're stated in existing code comments a few times, and that making a > case for how the expected ordering could be violated would be enough, > but I'm happy to draw up a side-by-side example. Please do. Could you start by highlighting which resctrl_sched_in() you are referring to? I am trying to dissect (2) with the given information: Through "the calling context switch" the scenario is written to create understanding that it refers to: context_switch()->switch_to()->resctrl_sched_in() - so the calling context switch is the first in the above call path ... where does it (context_switch()) store the new task_curr() and task_cpu() values and how does that reorder with resctrl_sched_in() further down in call path? >>> Use task_call_func() in __rdtgroup_move_task() to serialize updates to >>> the closid and rmid fields in the task_struct with context switch. >> >> Is there a reason why there is a switch between the all caps CLOSID/RMID >> at the beginning to the no caps here? > > It's because I referred to the task_struct fields explicitly here. You can use task_struct::closid and task_struct::rmid to make this clear. Reinette ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-07 18:38 ` Reinette Chatre @ 2022-12-08 22:30 ` Peter Newman 2022-12-09 23:54 ` Reinette Chatre 0 siblings, 1 reply; 13+ messages in thread From: Peter Newman @ 2022-12-08 22:30 UTC (permalink / raw) To: reinette.chatre Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, peternewman, tglx, x86 Hi Reinette, On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > On 12/7/2022 2:58 AM, Peter Newman wrote: > >>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context > >>> switch stores new task_curr() and task_cpu() values. > >> > >> This scenario is not clear to me. Could you please provide more detail about it? > >> I was trying to follow the context_switch() flow and resctrl_sched_in() is > >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()). > >> From what I can tell rq->curr, as used by task_curr() is set before > >> even context_switch() is called ... and since the next task is picked from > >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to > >> a runqueue) it seems to me that the value used by task_cpu() would also > >> be set early (before context_switch() is called). It is thus not clear to > >> me how the above reordering could occur so an example would help a lot. > > > > Perhaps in both scenarios I didn't make it clear that reordering in the > > CPU can cause the incorrect behavior rather than the program order. In > > this explanation, items 1. and 2. are supposed to be completing the > > sentence ending with a ':' at the end of paragraph 3, so I thought that > > would keep focus on the CPU. > > You did make it clear that the cause is reordering in the CPU. I am just > not able to see where the reordering is occurring in your scenario (2). 
It will all come down to whether it can get from updating rq->curr to
reading task_struct::{closid,rmid} without encountering a full barrier.
I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because of older kernels where __rdtgroup_move_task()
is still reading task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

	/*
	 * RCU users of rcu_dereference(rq->curr) may not see
	 * changes to task_struct made by pick_next_task().
	 */
	RCU_INIT_POINTER(rq->curr, next);
	/*
	 * The membarrier system call requires each architecture
	 * to have a full memory barrier after updating
	 * rq->curr, before returning to user-space.
	 *
	 * Here are the schemes providing that barrier on the
	 * various architectures:
	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
	 * - finish_lock_switch() for weakly-ordered
	 *   architectures where spin_unlock is a full barrier,
	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
	 *   is a RELEASE barrier),
	 */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.
On arm64, it won't happen as long as it calls its resctrl_sched_in()
after the dsb(ish) in __switch_to(), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Based on this, I'll just sketch out the first scenario below and drop
(2) from the changelog. This also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering, because
there's a full barrier on context switch for it to pair with, so I don't
need to broadcast IPI anymore. I don't know whether task_call_func() is
faster than an smp_mb(). I'll take some measurements to see.

The presumed behavior is __rdtgroup_move_task() not seeing t1 running
yet implies that it observes the updated values:

  CPU 0                             CPU 1
  -----                             -----
  (t1->{closid,rmid} -> {1,1})      (rq->curr -> t0)

  __rdtgroup_move_task():
    t1->{closid,rmid} <- {2,2}
    curr <- t1->cpu->rq->curr
                                    __schedule():
                                      rq->curr <- t1
                                    resctrl_sched_in():
                                      t1->{closid,rmid} -> {2,2}
    if (curr == t1) // false
      IPI(t1->cpu)

In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
is current:

  CPU 0                             CPU 1
  -----                             -----
  __rdtgroup_move_task():
    curr <- t1->cpu->rq->curr
                                    __schedule():
                                      rq->curr <- t1
                                    resctrl_sched_in():
                                      t1->{closid,rmid} -> {1,1}
    t1->{closid,rmid} <- {2,2}
    if (curr == t1) // false
      IPI(t1->cpu)

Please let me know if these diagrams clarify things.

-Peter

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-08 22:30 ` Peter Newman @ 2022-12-09 23:54 ` Reinette Chatre 2022-12-12 17:36 ` Peter Newman 0 siblings, 1 reply; 13+ messages in thread From: Reinette Chatre @ 2022-12-09 23:54 UTC (permalink / raw) To: Peter Newman Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Peter, On 12/8/2022 2:30 PM, Peter Newman wrote: > On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote: >> On 12/7/2022 2:58 AM, Peter Newman wrote: >>>>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context >>>>> switch stores new task_curr() and task_cpu() values. ... > > Based on this, I'll just sketch out the first scenario below and drop > (2) from the changelog. This also implies that the group update cases ok, thank you for doing that analysis. > can use a single smp_mb() to provide all the necessary ordering, because > there's a full barrier on context switch for it to pair with, so I don't > need to broadcast IPI anymore. I don't know whether task_call_func() is This is not clear to me because rdt_move_group_tasks() seems to have the same code as shown below as vulnerable to re-ordering. Only difference is that it uses the "//false" checks to set a bit in the cpumask for a later IPI instead of an immediate IPI. > faster than an smp_mb(). I'll take some measurements to see. > > The presumed behavior is __rdtgroup_move_task() not seeing t1 running > yet implies that it observes the updated values: > > CPU 0 CPU 1 > ----- ----- > (t1->{closid,rmid} -> {1,1}) (rq->curr -> t0) > > __rdtgroup_move_task(): > t1->{closid,rmid} <- {2,2} > curr <- t1->cpu->rq->curr > __schedule(): > rq->curr <- t1 > resctrl_sched_in(): > t1->{closid,rmid} -> {2,2} > if (curr == t1) // false > IPI(t1->cpu) I understand that the test is false when it may be expected to be true, but there does not seem to be a problem because of that. 
t1 was scheduled in with the correct CLOSID/RMID and its CPU did not get an unnecessary IPI. > In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1 > is current: > > CPU 0 CPU 1 > ----- ----- > __rdtgroup_move_task(): > curr <- t1->cpu->rq->curr > __schedule(): > rq->curr <- t1 > resctrl_sched_in(): > t1->{closid,rmid} -> {1,1} > t1->{closid,rmid} <- {2,2} > if (curr == t1) // false > IPI(t1->cpu) Yes, this I understand to be the problematic scenario. > Please let me know if these diagrams clarify things. They do, thank you very much. Reinette ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-09 23:54 ` Reinette Chatre @ 2022-12-12 17:36 ` Peter Newman 2022-12-13 18:33 ` Reinette Chatre 0 siblings, 1 reply; 13+ messages in thread From: Peter Newman @ 2022-12-12 17:36 UTC (permalink / raw) To: reinette.chatre Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, peternewman, tglx, x86 Hi Reinette, On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote: > On 12/8/2022 2:30 PM, Peter Newman wrote: > > Based on this, I'll just sketch out the first scenario below and drop > > (2) from the changelog. This also implies that the group update cases > > ok, thank you for doing that analysis. > > > can use a single smp_mb() to provide all the necessary ordering, because > > there's a full barrier on context switch for it to pair with, so I don't > > need to broadcast IPI anymore. I don't know whether task_call_func() is > > This is not clear to me because rdt_move_group_tasks() seems to have the > same code as shown below as vulnerable to re-ordering. Only difference > is that it uses the "//false" checks to set a bit in the cpumask for a > later IPI instead of an immediate IPI. An smp_mb() between writing the new task_struct::{closid,rmid} and calling task_curr() would prevent the reordering I described, but I was worried about the cost of executing a full barrier for every matching task. 
I tried something like this: for_each_process_thread(p, t) { if (!from || is_closid_match(t, from) || is_rmid_match(t, from)) { WRITE_ONCE(t->closid, to->closid); WRITE_ONCE(t->rmid, to->mon.rmid); /* group moves are serialized by rdt */ t->resctrl_dirty = true; } } if (IS_ENABLED(CONFIG_SMP) && mask) { /* Order t->{closid,rmid} stores before loads in task_curr() */ smp_mb(); for_each_process_thread(p, t) { if (t->resctrl_dirty) { if (task_curr(t)) cpumask_set_cpu(task_cpu(t), mask); t->resctrl_dirty = false; } } } I repeated the `perf bench sched messaging -g 40 -l100000 ` benchmark from before[1] to compare this with the baseline, and found that it only increased the time to delete the benchmark's group from 1.65ms to 1.66ms, so it's an alternative to what I last posted. I could do something similar in the single-task move, but I don't think it makes much of a performance difference in that case. It's only a win for the group move because the synchronization cost doesn't grow with the group size. [1] https://lore.kernel.org/lkml/20221129111055.953833-3-peternewman@google.com/ > > > faster than an smp_mb(). I'll take some measurements to see. > > > > The presumed behavior is __rdtgroup_move_task() not seeing t1 running > > yet implies that it observes the updated values: > > > > CPU 0 CPU 1 > > ----- ----- > > (t1->{closid,rmid} -> {1,1}) (rq->curr -> t0) > > > > __rdtgroup_move_task(): > > t1->{closid,rmid} <- {2,2} > > curr <- t1->cpu->rq->curr > > __schedule(): > > rq->curr <- t1 > > resctrl_sched_in(): > > t1->{closid,rmid} -> {2,2} > > if (curr == t1) // false > > IPI(t1->cpu) > > I understand that the test is false when it may be expected to be true, but > there does not seem to be a problem because of that. t1 was scheduled in with > the correct CLOSID/RMID and its CPU did not get an unnecessary IPI. Yes, this one was reminding the reader of the correct behavior. I can just leave it out. -Peter ^ permalink raw reply [flat|nested] 13+ messages in thread
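[Editorial sketch] The two-pass scheme in the message above can be compiled and exercised in userspace. In this sketch, `struct fake_task`, `task_running()` and the bitmask are simplified stand-ins for `task_struct`, `task_curr()` and the cpumask; `resctrl_dirty` is the hypothetical flag from the snippet, not an existing kernel field, and `__atomic_thread_fence()` stands in for `smp_mb()`:

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_task {
	int closid;
	int rmid;
	int cpu;	/* CPU the task is running on, -1 if none */
	bool dirty;	/* stand-in for the proposed resctrl_dirty */
};

static bool task_running(const struct fake_task *t)
{
	return t->cpu >= 0;
}

static void move_group(struct fake_task *tasks, size_t n,
		       int from_closid, int to_closid, int to_rmid,
		       unsigned long *cpu_mask)
{
	size_t i;

	/* Pass 1: rewrite the IDs and mark every task that was moved. */
	for (i = 0; i < n; i++) {
		if (tasks[i].closid == from_closid) {
			tasks[i].closid = to_closid;
			tasks[i].rmid = to_rmid;
			tasks[i].dirty = true;
		}
	}

	/*
	 * One full barrier for the whole group, the analog of smp_mb():
	 * order the ID stores above before the task_running() checks below.
	 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);

	/* Pass 2: collect the CPUs of moved tasks that are still running. */
	for (i = 0; i < n; i++) {
		if (tasks[i].dirty) {
			if (task_running(&tasks[i]))
				*cpu_mask |= 1UL << tasks[i].cpu;
			tasks[i].dirty = false;
		}
	}
}
```

The point of the shape is visible here: the synchronization cost is a single fence per group move rather than one per matching task, which is why it does not grow with group size.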
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-12 17:36 ` Peter Newman @ 2022-12-13 18:33 ` Reinette Chatre 2022-12-14 10:05 ` Peter Newman 0 siblings, 1 reply; 13+ messages in thread From: Reinette Chatre @ 2022-12-13 18:33 UTC (permalink / raw) To: Peter Newman Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Peter, On 12/12/2022 9:36 AM, Peter Newman wrote: > On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote: >> On 12/8/2022 2:30 PM, Peter Newman wrote: >>> Based on this, I'll just sketch out the first scenario below and drop >>> (2) from the changelog. This also implies that the group update cases >> >> ok, thank you for doing that analysis. >> >>> can use a single smp_mb() to provide all the necessary ordering, because >>> there's a full barrier on context switch for it to pair with, so I don't >>> need to broadcast IPI anymore. I don't know whether task_call_func() is >> >> This is not clear to me because rdt_move_group_tasks() seems to have the >> same code as shown below as vulnerable to re-ordering. Only difference >> is that it uses the "//false" checks to set a bit in the cpumask for a >> later IPI instead of an immediate IPI. > > An smp_mb() between writing the new task_struct::{closid,rmid} and > calling task_curr() would prevent the reordering I described, but I > was worried about the cost of executing a full barrier for every > matching task. So for moving a single task the solution may just be to change the current barrier() to smp_mb()? 
> > I tried something like this: > > for_each_process_thread(p, t) { > if (!from || is_closid_match(t, from) || > is_rmid_match(t, from)) { > WRITE_ONCE(t->closid, to->closid); > WRITE_ONCE(t->rmid, to->mon.rmid); > /* group moves are serialized by rdt */ > t->resctrl_dirty = true; > } > } > if (IS_ENABLED(CONFIG_SMP) && mask) { > /* Order t->{closid,rmid} stores before loads in task_curr() */ > smp_mb(); > for_each_process_thread(p, t) { > if (t->resctrl_dirty) { > if (task_curr(t)) > cpumask_set_cpu(task_cpu(t), mask); > t->resctrl_dirty = false; > } > } > } > struct task_struct would not welcome a new member dedicated to resctrl's rare use for convenience. Another option may be to use a flag within the variables themselves but that seems like significant complication (flag need to be dealt with during scheduling) for which the benefit is not clear to me. I would prefer that we start with the simplest solution first (I see that as IPI to all CPUs). The changelog needs clear description of the problem needing to be solved and the solution chosen, noting the tradeoffs with other possible solutions. You can submit that, as an RFC if the "IPI to all CPUs" remains a concern, after which we can bring that submission to the attention of the experts who would have needed information then to point us in the right direction. Reinette ^ permalink raw reply [flat|nested] 13+ messages in thread
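[Editorial sketch] The message above opens by asking whether the single-task move could simply upgrade the existing barrier() to smp_mb(). A minimal userspace analog of that path is shown below; the types and helpers are simplified stand-ins (not the resctrl API), and `__atomic_thread_fence()` stands in for `smp_mb()`:

```c
#include <stdbool.h>

/* Simplified stand-ins for task_struct and task_curr(). */
struct fake_task {
	int closid;
	int rmid;
	int cpu;	/* CPU the task is running on, -1 if none */
};

static bool task_running(const struct fake_task *t)
{
	return t->cpu >= 0;
}

/*
 * Analog of __rdtgroup_move_task() with barrier() upgraded to a full
 * barrier: store the new IDs, fence, then decide whether an IPI is
 * needed. The fence pairs with the full barrier on the context-switch
 * path, so a task observed here as "not running" must see the new
 * closid/rmid when it is next scheduled in.
 */
static bool move_one_task(struct fake_task *t, int to_closid, int to_rmid)
{
	t->closid = to_closid;
	t->rmid = to_rmid;

	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* smp_mb() analog */

	return task_running(t);	/* true: caller must IPI t->cpu */
}
```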
* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() 2022-12-13 18:33 ` Reinette Chatre @ 2022-12-14 10:05 ` Peter Newman 0 siblings, 0 replies; 13+ messages in thread From: Peter Newman @ 2022-12-14 10:05 UTC (permalink / raw) To: Reinette Chatre Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Reinette, On Tue, Dec 13, 2022 at 7:34 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > On 12/12/2022 9:36 AM, Peter Newman wrote: > > On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote: > >> On 12/8/2022 2:30 PM, Peter Newman wrote: > >>> Based on this, I'll just sketch out the first scenario below and drop > >>> (2) from the changelog. This also implies that the group update cases > >> > >> ok, thank you for doing that analysis. > >> > >>> can use a single smp_mb() to provide all the necessary ordering, because > >>> there's a full barrier on context switch for it to pair with, so I don't > >>> need to broadcast IPI anymore. I don't know whether task_call_func() is > >> > >> This is not clear to me because rdt_move_group_tasks() seems to have the > >> same code as shown below as vulnerable to re-ordering. Only difference > >> is that it uses the "//false" checks to set a bit in the cpumask for a > >> later IPI instead of an immediate IPI. > > > > An smp_mb() between writing the new task_struct::{closid,rmid} and > > calling task_curr() would prevent the reordering I described, but I > > was worried about the cost of executing a full barrier for every > > matching task. > > So for moving a single task the solution may just be to change > the current barrier() to smp_mb()? Yes, that's a simpler change, so I'll just do that instead. 
> > > > > I tried something like this: > > > > for_each_process_thread(p, t) { > > if (!from || is_closid_match(t, from) || > > is_rmid_match(t, from)) { > > WRITE_ONCE(t->closid, to->closid); > > WRITE_ONCE(t->rmid, to->mon.rmid); > > /* group moves are serialized by rdt */ > > t->resctrl_dirty = true; > > } > > } > > if (IS_ENABLED(CONFIG_SMP) && mask) { > > /* Order t->{closid,rmid} stores before loads in task_curr() */ > > smp_mb(); > > for_each_process_thread(p, t) { > > if (t->resctrl_dirty) { > > if (task_curr(t)) > > cpumask_set_cpu(task_cpu(t), mask); > > t->resctrl_dirty = false; > > } > > } > > } > > > > struct task_struct would not welcome a new member dedicated to resctrl's > rare use for convenience. Another option may be to use a flag within > the variables themselves but that seems like significant complication > (flag need to be dealt with during scheduling) for which the benefit > is not clear to me. I would prefer that we start with the simplest > solution first (I see that as IPI to all CPUs). The changelog needs clear > description of the problem needing to be solved and the solution chosen, noting > the tradeoffs with other possible solutions. You can submit that, as an RFC > if the "IPI to all CPUs" remains a concern, after which we can bring that > submission to the attention of the experts who would have needed information then > to point us in the right direction. To be complete, I did the benchmark again with the simple addition of an smp_mb() on every iteration with a matching CLOSID/RMID and found that it didn't result in a substantial performance impact. (1.57ms -> 1.65ms). This isn't as significant as the 2x slowdown I saw when using task_call_func(), so maybe task_call_func() is just really expensive. That's more reason to just upgrade the barrier in the single-task move case. While I agree with your points on the IPI broadcast, it seems like a discussion I would prefer to just avoid given these measurements. 
-Peter
* [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates 2022-11-29 11:10 [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race Peter Newman 2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman @ 2022-11-29 11:10 ` Peter Newman 2022-12-06 18:57 ` Reinette Chatre 1 sibling, 1 reply; 13+ messages in thread From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw) To: reinette.chatre, fenghua.yu Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86, Peter Newman Removing a CTRL_MON or MON group directory moves all tasks to the parent group. The rmdir implementation therefore interrupts any running tasks which were in the deleted group to update their CLOSID/RMID to those of the parent. The rmdir operation iterates over all tasks in the deleted group while read-locking the tasklist_lock to ensure that no newly-created child tasks remain in the deleted group. Calling task_call_func() to perform the updates on every task in the deleted group, similar to the recent fix in __rdtgroup_move_task(), would result in a much longer tasklist_lock critical section. To avoid this, stop attempting to construct a precise mask of CPUs hosting the moved tasks in rdt_move_group_tasks(). Its callers instead perform the PQR_ASSOC MSR update on all online CPUs to ensure all affected tasks are notified. To measure the impact of the rdt_move_group_tasks() implementation options, the following command was run in an rdtgroup to produce a 1600-task workload: # mkdir /sys/fs/resctrl/test # echo $$ > /sys/fs/resctrl/test/tasks # perf bench sched messaging -g 40 -l 100000 Results collected using: # perf stat rmdir /sys/fs/resctrl/test CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) Calling task_call_func() on all tasks in the deleted group increased task-clock time from 1.54 to 2.35 ms, while the IPI broadcast reduced the time to 1.31 ms. 
Restructuring resctrl groups is assumed to be a rare act of system-level reconfiguration by the user, so the impact of additional IPIs resulting from this change to a CPU-isolated workload is not a concern. Signed-off-by: Peter Newman <peternewman@google.com> Reviewed-by: James Morse <james.morse@arm.com> --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++------------------- 1 file changed, 13 insertions(+), 39 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 59b7ffcd53bb..4a3c0b315484 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2401,12 +2401,10 @@ static int reset_all_ctrls(struct rdt_resource *r) * Move tasks from one to the other group. If @from is NULL, then all tasks * in the systems are moved unconditionally (used for teardown). * - * If @mask is not NULL the cpus on which moved tasks are running are set - * in that mask so the update smp function call is restricted to affected - * cpus. + * Following this operation, the caller should update PQR_ASSOC MSR and per-CPU + * storage on all online CPUs. */ -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, - struct cpumask *mask) +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to) { struct task_struct *p, *t; @@ -2416,16 +2414,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, is_rmid_match(t, from)) { WRITE_ONCE(t->closid, to->closid); WRITE_ONCE(t->rmid, to->mon.rmid); - - /* - * If the task is on a CPU, set the CPU in the mask. - * The detection is inaccurate as tasks might move or - * schedule before the smp function call takes place. - * In such a case the function call is pointless, but - * there is no other side effect. 
- */ - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) - cpumask_set_cpu(task_cpu(t), mask); } } read_unlock(&tasklist_lock); @@ -2456,7 +2444,7 @@ static void rmdir_all_sub(void) struct rdtgroup *rdtgrp, *tmp; /* Move all tasks to the default resource group */ - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + rdt_move_group_tasks(NULL, &rdtgroup_default); list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { /* Free any child rmids */ @@ -3115,23 +3103,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, return -EPERM; } -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp) { struct rdtgroup *prdtgrp = rdtgrp->mon.parent; int cpu; /* Give any tasks back to the parent group */ - rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); + rdt_move_group_tasks(rdtgrp, prdtgrp); /* Update per cpu rmid of the moved CPUs first */ for_each_cpu(cpu, &rdtgrp->cpu_mask) per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid; - /* - * Update the MSR on moved CPUs and CPUs which have moved - * task running on them. 
- */ - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); - update_closid_rmid(tmpmask, NULL); + + update_closid_rmid(cpu_online_mask, NULL); rdtgrp->flags = RDT_DELETED; free_rmid(rdtgrp->mon.rmid); @@ -3156,12 +3140,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) return 0; } -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp) { int cpu; /* Give any tasks back to the default group */ - rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); + rdt_move_group_tasks(rdtgrp, &rdtgroup_default); /* Give any CPUs back to the default group */ cpumask_or(&rdtgroup_default.cpu_mask, @@ -3173,12 +3157,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid; } - /* - * Update the MSR on moved CPUs and CPUs which have moved - * task running on them. - */ - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); - update_closid_rmid(tmpmask, NULL); + update_closid_rmid(cpu_online_mask, NULL); closid_free(rdtgrp->closid); free_rmid(rdtgrp->mon.rmid); @@ -3197,12 +3176,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn) { struct kernfs_node *parent_kn = kn->parent; struct rdtgroup *rdtgrp; - cpumask_var_t tmpmask; int ret = 0; - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) - return -ENOMEM; - rdtgrp = rdtgroup_kn_lock_live(kn); if (!rdtgrp) { ret = -EPERM; @@ -3222,18 +3197,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn) rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { ret = rdtgroup_ctrl_remove(rdtgrp); } else { - ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); + ret = rdtgroup_rmdir_ctrl(rdtgrp); } } else if (rdtgrp->type == RDTMON_GROUP && is_mon_groups(parent_kn, kn->name)) { - ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); + ret = rdtgroup_rmdir_mon(rdtgrp); } else { ret = -EPERM; } out: rdtgroup_kn_unlock(kn); - free_cpumask_var(tmpmask); return ret; } -- 2.38.1.584.g0f3c55d4c2-goog ^ 
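[Editorial sketch] The structure of the patch above — rdt_move_group_tasks() no longer builds a CPU mask, and the callers then update every online CPU — can be modeled in userspace. All names below are simplified stand-ins for the kernel structures, not the resctrl implementation itself:

```c
#include <stddef.h>

#define NR_FAKE_CPUS 4

struct fake_task {
	int closid;
	int rmid;
};

/* Per-CPU stand-in for the cached PQR_ASSOC contents. */
static int pqr_closid[NR_FAKE_CPUS];
static int pqr_rmid[NR_FAKE_CPUS];

/*
 * Analog of rdt_move_group_tasks() after the patch: rewrite the task
 * fields only; no mask of affected CPUs is built. from_closid < 0
 * mimics @from == NULL (move everything, as in rmdir_all_sub()).
 */
static void move_group_tasks(struct fake_task *tasks, size_t n,
			     int from_closid, int to_closid, int to_rmid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (from_closid < 0 || tasks[i].closid == from_closid) {
			tasks[i].closid = to_closid;
			tasks[i].rmid = to_rmid;
		}
	}
}

/*
 * Analog of update_closid_rmid(cpu_online_mask, NULL): every online
 * CPU refreshes its PQR_ASSOC state from its current task. For
 * simplicity, CPU i is assumed to be running tasks[i].
 */
static void update_all_cpus(const struct fake_task *tasks)
{
	int cpu;

	for (cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		pqr_closid[cpu] = tasks[cpu].closid;
		pqr_rmid[cpu] = tasks[cpu].rmid;
	}
}
```

Updating every CPU is redundant work for CPUs that ran no moved task, but it cannot miss one, which is the correctness property the precise mask could not guarantee without extra ordering.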
* Re: [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates 2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman @ 2022-12-06 18:57 ` Reinette Chatre 2022-12-07 11:04 ` Peter Newman 0 siblings, 1 reply; 13+ messages in thread From: Reinette Chatre @ 2022-12-06 18:57 UTC (permalink / raw) To: Peter Newman, fenghua.yu Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Peter, On 11/29/2022 3:10 AM, Peter Newman wrote: > Removing a CTRL_MON or MON group directory moves all tasks to the parent > group. The rmdir implementation therefore interrupts any running > tasks which were in the deleted group to update their CLOSID/RMID to > those of the parent. > > The rmdir operation iterates over all tasks in the deleted group while > read-locking the tasklist_lock to ensure that no newly-created child > tasks remain in the deleted group. The above describes the current behavior. This is great context. What follows in the changelog is a description of different fixes. This is unexpected because there is no description of a problem with the current behavior. Could you please describe the problem with the current implementation? Next you could state the two possible solutions and then I think the reader would be ready to parse what is written below. > Calling task_call_func() to perform > the updates on every task in the deleted group, similar to the recent > fix in __rdtgroup_move_task(), would result in a much longer > tasklist_lock critical section. I so still think it would help to state that this additional locking does not help to provide precise CPU mask. Especially since the next paragraph may be interpreted that a precise CPU mask is lost by giving up the additional locking. > To avoid this, stop attempting to construct a precise mask of CPUs > hosting the moved tasks in rdt_move_group_tasks(). 
Its callers instead > perform the PQR_ASSOC MSR update on all online CPUs to ensure all > affected tasks are notified. > > To measure the impact of the rdt_move_group_tasks() implementation > options, the following command was run in an rdtgroup to produce a > 1600-task workload: > > # mkdir /sys/fs/resctrl/test > # echo $$ > /sys/fs/resctrl/test/tasks > # perf bench sched messaging -g 40 -l 100000 > > Results collected using: > > # perf stat rmdir /sys/fs/resctrl/test > > CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) > > Calling task_call_func() on all tasks in the deleted group increased > task-clock time from 1.54 to 2.35 ms, while the IPI broadcast reduced > the time to 1.31 ms. Thank you very much for doing this testing. > > Restructuring resctrl groups is assumed to be a rare act of system-level > reconfiguration by the user, so the impact of additional IPIs resulting > from this change to a CPU-isolated workload is not a concern. > > Signed-off-by: Peter Newman <peternewman@google.com> > Reviewed-by: James Morse <james.morse@arm.com> > --- > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++------------------- > 1 file changed, 13 insertions(+), 39 deletions(-) > > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > index 59b7ffcd53bb..4a3c0b315484 100644 > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > @@ -2401,12 +2401,10 @@ static int reset_all_ctrls(struct rdt_resource *r) > * Move tasks from one to the other group. If @from is NULL, then all tasks > * in the systems are moved unconditionally (used for teardown). > * > - * If @mask is not NULL the cpus on which moved tasks are running are set > - * in that mask so the update smp function call is restricted to affected > - * cpus. > + * Following this operation, the caller should update PQR_ASSOC MSR and per-CPU > + * storage on all online CPUs. 
> */ > -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, > - struct cpumask *mask) > +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to) > { > struct task_struct *p, *t; > > @@ -2416,16 +2414,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, > is_rmid_match(t, from)) { > WRITE_ONCE(t->closid, to->closid); > WRITE_ONCE(t->rmid, to->mon.rmid); > - > - /* > - * If the task is on a CPU, set the CPU in the mask. > - * The detection is inaccurate as tasks might move or > - * schedule before the smp function call takes place. > - * In such a case the function call is pointless, but > - * there is no other side effect. > - */ > - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) > - cpumask_set_cpu(task_cpu(t), mask); > } > } > read_unlock(&tasklist_lock); > @@ -2456,7 +2444,7 @@ static void rmdir_all_sub(void) > struct rdtgroup *rdtgrp, *tmp; > > /* Move all tasks to the default resource group */ > - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); > + rdt_move_group_tasks(NULL, &rdtgroup_default); > > list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { > /* Free any child rmids */ > @@ -3115,23 +3103,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, > return -EPERM; > } > > -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) > +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp) > { > struct rdtgroup *prdtgrp = rdtgrp->mon.parent; > int cpu; > > /* Give any tasks back to the parent group */ > - rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); > + rdt_move_group_tasks(rdtgrp, prdtgrp); > > /* Update per cpu rmid of the moved CPUs first */ > for_each_cpu(cpu, &rdtgrp->cpu_mask) > per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid; > - /* > - * Update the MSR on moved CPUs and CPUs which have moved > - * task running on them. 
> - */ > - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); > - update_closid_rmid(tmpmask, NULL); > + > + update_closid_rmid(cpu_online_mask, NULL); > > rdtgrp->flags = RDT_DELETED; > free_rmid(rdtgrp->mon.rmid); > @@ -3156,12 +3140,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) > return 0; > } > > -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) > +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp) > { > int cpu; > > /* Give any tasks back to the default group */ > - rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); > + rdt_move_group_tasks(rdtgrp, &rdtgroup_default); > > /* Give any CPUs back to the default group */ > cpumask_or(&rdtgroup_default.cpu_mask, > @@ -3173,12 +3157,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) > per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid; > } > > - /* > - * Update the MSR on moved CPUs and CPUs which have moved > - * task running on them. 
> - */ > - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); > - update_closid_rmid(tmpmask, NULL); > + update_closid_rmid(cpu_online_mask, NULL); > > closid_free(rdtgrp->closid); > free_rmid(rdtgrp->mon.rmid); > @@ -3197,12 +3176,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn) > { > struct kernfs_node *parent_kn = kn->parent; > struct rdtgroup *rdtgrp; > - cpumask_var_t tmpmask; > int ret = 0; > > - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) > - return -ENOMEM; > - > rdtgrp = rdtgroup_kn_lock_live(kn); > if (!rdtgrp) { > ret = -EPERM; > @@ -3222,18 +3197,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn) > rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { > ret = rdtgroup_ctrl_remove(rdtgrp); > } else { > - ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); > + ret = rdtgroup_rmdir_ctrl(rdtgrp); > } > } else if (rdtgrp->type == RDTMON_GROUP && > is_mon_groups(parent_kn, kn->name)) { > - ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); > + ret = rdtgroup_rmdir_mon(rdtgrp); > } else { > ret = -EPERM; > } > > out: > rdtgroup_kn_unlock(kn); > - free_cpumask_var(tmpmask); > return ret; > } > The change looks good to me. Reinette ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates 2022-12-06 18:57 ` Reinette Chatre @ 2022-12-07 11:04 ` Peter Newman 0 siblings, 0 replies; 13+ messages in thread From: Peter Newman @ 2022-12-07 11:04 UTC (permalink / raw) To: Reinette Chatre Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh, kpsingh, linux-kernel, mingo, tglx, x86 Hi Reinette, On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > On 11/29/2022 3:10 AM, Peter Newman wrote: > > Removing a CTRL_MON or MON group directory moves all tasks to the parent > > group. The rmdir implementation therefore interrupts any running > > tasks which were in the deleted group to update their CLOSID/RMID to > > those of the parent. > > > > The rmdir operation iterates over all tasks in the deleted group while > > read-locking the tasklist_lock to ensure that no newly-created child > > tasks remain in the deleted group. > > The above describes the current behavior. This is great context. What > follows in the changelog is a description of different fixes. This is > unexpected because there is no description of a problem with the current > behavior. > > Could you please describe the problem with the current implementation? Next > you could state the two possible solutions and then I think the reader would > be ready to parse what is written below. Ok > > Calling task_call_func() to perform > > the updates on every task in the deleted group, similar to the recent > > fix in __rdtgroup_move_task(), would result in a much longer > > tasklist_lock critical section. > > > I so still think it would help to state that this additional locking > does not help to provide precise CPU mask. Especially since > the next paragraph may be interpreted that a precise CPU mask > is lost by giving up the additional locking. Yes, that's a very good point, and I'm afraid I've already made you reiterate it once before. I will make sure to work it into the next revision. 
-Peter
Thread overview: 13+ messages

2022-11-29 11:10 [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race Peter Newman
2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman
2022-12-06 18:56   ` Reinette Chatre
2022-12-07 10:58     ` Peter Newman
2022-12-07 18:38       ` Reinette Chatre
2022-12-08 22:30         ` Peter Newman
2022-12-09 23:54           ` Reinette Chatre
2022-12-12 17:36             ` Peter Newman
2022-12-13 18:33               ` Reinette Chatre
2022-12-14 10:05                 ` Peter Newman
2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
2022-12-06 18:57   ` Reinette Chatre
2022-12-07 11:04     ` Peter Newman