Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Gilles Chanteperdrix wrote: >>>>> Jan Kiszka wrote: >>>>>> Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Hi again, >>>>>>>> >>>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>>>>>> switches again: >>>>>>>> >>>>>>>> There were 3 race windows between setting active_mm of the current task >>>>>>>> and actually switching it (that's a noarch issue), there were several >>>>>>>> calls into switch_mm without proper hard interrupt protection (on archs >>>>>>>> that have no preemptible switch_mm, like x86) and there was a race in >>>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>>>>>> care of this). >>>>>>>> >>>>>>>> The patch is thought as a basis for further discussions about >>>>>>>> >>>>>>>> o how to solve all the issues for all archs technically (ideally >>>>>>>> without the need to patch noarch parts in an arch-specific way...) >>>>>>>> >>>>>>>> o if anyone thinks there could be more spots like these (I've checked >>>>>>>> the code only for x86 so far) >>>>>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>>>>> index 76da125..0286f0f 100644 >>>>>>>> --- a/fs/aio.c >>>>>>>> +++ b/fs/aio.c >>>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>>>>> { >>>>>>>> struct mm_struct *active_mm; >>>>>>>> struct task_struct *tsk = current; >>>>>>>> + unsigned long flags; >>>>>>>> >>>>>>>> task_lock(tsk); >>>>>>>> active_mm = tsk->active_mm; >>>>>>>> atomic_inc(&mm->mm_count); >>>>>>>> tsk->mm = mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> tsk->active_mm = mm; >>>>>>>> - switch_mm(active_mm, mm, tsk); >>>>>>>> + __switch_mm(active_mm, mm, tsk); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> task_unlock(tsk); >>>>>>>> >>>>>>>> mmdrop(active_mm); >>>>>>>> diff --git a/fs/exec.c b/fs/exec.c >>>>>>>> index 3b36c69..06591ac 100644 >>>>>>>> --- a/fs/exec.c >>>>>>>> +++ b/fs/exec.c >>>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>>> { >>>>>>>> struct task_struct *tsk; >>>>>>>> struct mm_struct * old_mm, *active_mm; >>>>>>>> + unsigned long flags; >>>>>>>> >>>>>>>> /* Notify parent that we're no longer interested in the old VM */ >>>>>>>> tsk = current; >>>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>>> task_lock(tsk); >>>>>>>> active_mm = tsk->active_mm; >>>>>>>> tsk->mm = mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> tsk->active_mm = mm; >>>>>>>> activate_mm(active_mm, mm); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> task_unlock(tsk); >>>>>>>> arch_pick_mmap_layout(mm); >>>>>>>> if (old_mm) { >>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>>>>> index 01a836b..cf3b68a 100644 >>>>>>>> --- a/kernel/fork.c >>>>>>>> +++ b/kernel/fork.c >>>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>>>>>> } >>>>>>>> >>>>>>>> if (new_mm) { >>>>>>>> + unsigned long flags; >>>>>>>> mm = current->mm; >>>>>>>> active_mm = current->active_mm; >>>>>>>> current->mm = new_mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> current->active_mm = new_mm; >>>>>>>> activate_mm(active_mm, new_mm); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> new_mm = mm; >>>>>>>> } >>>>>>> Ok. These are patches of the noarch code which are dependent on x86 >>>>>>> implementation. >>>>>>> >>>>>>> I think we need something like >>>>>>> >>>>>>> mm_change_enter(flags); >>>>>>> mm_change_leave(flags); >>>>>>> >>>>>>> which would resolve to ipipe_active_mm = NULL on architectures with >>>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>>>>>> x86? >>>>>> + we have to change switch_mm in the lockless case, and maybe also its >>>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to >>>>>> decide if to switch or not - see my explanation of the possible race in >>>>>> aio.c. >>>>> This is already the way it is currently working. the "prev" passed to >>>>> switch_mm is always ipipe_active_mm. >>>>> >>>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. >>>> >>>> OK, but this needs to be thought through in details again, specifically >>>> also the enter-root scenarios. I have no safe feeling about it yet, >>>> partly because switch_mm fiddles with ipipe_active_mm as well, but more >>>> generally because we used to have such critical races in this area for a >>>> "fairly" long time. >>> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit >>> is set, in which case the switch_mm preempted by Xenomai is restarted. >>> >> Right. But now we are also dealing with preemptions outside that loop. >> And we are dealing with archs that have atomic switch_mm, thus no such >> loops. >> >> The goal should be to find a solution against the races outside >> switch_mm that is applicable to all archs. But the question is still >> unanswered to me if we can simply reuse existing ipipe_active_mm without >> disturbing its current use in non-atomic switch_mm. >> >> Moreover, part of the problems on x86 was the non-atomic update of >> cpu_vm_mask vs. actually switching the hardware. How do you deal with >> cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection? >> My impression is that only the (costly) cpu_switch_mm should be allowed >> to be preemptible to have a safe state around it. I'm also asking as we >> would have to address this on x86 & co. once switch_mm could be called >> with undefined prev_mm. > > when ipipe_active_mm is set to NULL, switch_mm does not touch the > current mm cpu_vm_mask, since it does not know what the current mm is. > However, we have a problem when switching back to root: we do not clear > the previous mm cpu_vm_mask since we do not call switch_mm in > xnarch_switch_to, we should fix this. So, I think we can extend the zone > where ipipe_active_mm is set to NULL, to protect this area from the > effects of preemption by Xenomai. > > So, IMO, what we need is something like ipipe_mm_switch_protect which > sets ipipe_active_mm to NULL on ARM, and results in local_irq_save_hw on > x86. That sounds like a good step forward. I will split up my patch into a non-arch part that instruments the code and an x86 part that installs the required protection for that arch. Jan