Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
Date: Fri, 15 Nov 2013 12:17:36 +0100
Message-ID: <20131115121736.72170c36@mschwide>
In-Reply-To: <20131115121000.69219fa4@mschwide>

On Fri, 15 Nov 2013 12:10:00 +0100
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Fri, 15 Nov 2013 10:44:37 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Thu, Nov 14, 2013 at 04:33:59PM +0000, Martin Schwidefsky wrote:
> > > On Thu, 14 Nov 2013 13:22:23 +0000
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > 
> > > > On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> > > > > On Wed, 13 Nov 2013 16:16:35 +0000
> > > > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > 
> > > > > > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > > > > > index 5d1f950..e91afeb 100644
> > > > > > > --- a/arch/s390/include/asm/mmu_context.h
> > > > > > > +++ b/arch/s390/include/asm/mmu_context.h
> > > > > > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > > > > > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > > > > >                              struct task_struct *tsk)
> > > > > > >  {
> > > > > > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > > > > > -       update_mm(next, tsk);
> > > > > > > +       int cpu = smp_processor_id();
> > > > > > > +
> > > > > > > +       if (prev == next)
> > > > > > > +               return;
> > > > > > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > > > > > +               /* Delay update_mm until all TLB flushes are done. */
> > > > > > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > > > > > +       } else {
> > > > > > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > > > > > +               update_mm(next, tsk);
> > > > > > > +               if (next->context.flush_mm)
> > > > > > > +                       /* Flush pending TLBs */
> > > > > > > +                       __tlb_flush_mm(next);
> > > > > > > +       }
> > > > > > >         atomic_dec(&prev->context.attach_count);
> > > > > > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > > > > > -       atomic_inc(&next->context.attach_count);
> > > > > > > -       /* Check for TLBs not flushed yet */
> > > > > > > -       __tlb_flush_mm_lazy(next);
> > > > > > > +}
> > > > > > > +
> > > > > > > +#define finish_switch_mm finish_switch_mm
> > > > > > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > > > > > +                                   struct task_struct *tsk)
> > > > > > > +{
> > > > > > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > > > > > +               cpu_relax();
> > > > > > > +
> > > > > > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > > > > > +       update_mm(mm, tsk);
> > > > > > > +       if (mm->context.flush_mm)
> > > > > > > +               __tlb_flush_mm(mm);
> > > > > > >  }
> > > > > > 
> > > > > > Some care is needed here with preemption (we had this on arm and I
> > > > > > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > > > > > on a thread but you get preempted just before finish_switch_mm(). The
> > > > > > > new thread has the same mm as the preempted one and switch_mm() exits
> > > > > > early without setting another flag. So finish_switch_mm() wouldn't do
> > > > > > anything but you still switched to the new mm. The fix is to make the
> > > > > > flag per mm rather than thread (see commit bdae73cd374e).
> > > > > 
> > > > > Interesting. For s390 I need to make sure that each task attaching an
> > > > > mm waits for the completion of concurrent TLB flush operations. If the
> > > > > scheduler does not switch the mm I don't care, the mm is still attached.
> > > > 
> > > > I assume the actual hardware mm switch happens via update_mm(). If you
> > > > have a context_switch() to a thread which requires an update_mm() but you
> > > > defer this until finish_switch_mm(), you may be preempted before the
> > > > hardware update. If the new context_switch() schedules a thread with the
> > > > same mm as the preempted one, you no longer call update_mm(). So the new
> > > > thread actually uses an old hardware mm.
> > >  
> > > If the code gets preempted between switch_mm() and finish_switch_mm()
> > > the worst that can happen is that finish_switch_mm() is called twice.
> > 
> > Yes, it's called twice, but you only set the TIF_TLB_WAIT the first
> > time. Here's the scenario:
> > 
> > 1. thread-A running with mm-A
> > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
> >    update_mm(mm-B). Hardware still using mm-A
> > 4. scheduler unlocks and is about to call finish_switch_mm(mm-B)
> > 5. interrupt and preemption before finish_switch_mm(mm-B)
> > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
> >    that thread-B1 and thread-B2 have the same mm-B)
> > 7. switch_mm() as in this patch exits early because prev == next
> > 8. finish_switch_mm(mm-B) is indeed called but TIF_TLB_WAIT is not set
> >    for thread-B2, therefore no call to update_mm(mm-B)
> > 
> > So after point 8, you get thread-B2 running (and possibly returning to
> > user space) with mm-A. Do you see a problem here?
> 
> Oh, now I get it. Thanks for the patience, this is indeed a problem.
> And I concur, a per-mm flag is the 'obvious' solution.

Having said that, and looking at the code, I find this to be not as obvious
any more. If you have multiple cpus, using a per-mm flag can get you into
trouble:

1. cpu #1 calls switch_mm and finds that irqs are disabled.
   mm->context.switch_pending is set
2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
   mm->context.switch_pending is set again
3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
4. cpu #1 zeroes mm->context.switch_pending and calls cpu_switch_mm
5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
6. cpu #2 continues with the old mm

This is a race, no?
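
To make the interleaving concrete, here is a minimal single-threaded replay
of steps 1-6 in plain C. It is only a sketch of the sequence as I describe it
above, not kernel code: struct mm_model, struct cpu_model and the model_*
helpers are hypothetical stand-ins for mm->context.switch_pending, the
per-cpu hardware state, switch_mm with irqs disabled, and
finish_arch_post_lock_switch. It assumes both cpus really do defer the
switch and that nothing else serializes them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an mm with a single per-mm "switch pending" flag. */
struct mm_model {
	bool switch_pending;		/* models mm->context.switch_pending */
};

/* Hypothetical model of a cpu: which mm the hardware currently uses. */
struct cpu_model {
	const struct mm_model *hw_mm;
};

/* Models switch_mm with irqs disabled: defer the switch, set the flag. */
static void model_switch_mm_deferred(struct mm_model *next)
{
	next->switch_pending = true;
}

/* Models finish_arch_post_lock_switch: switch only if the flag is set. */
static void model_finish_switch(struct cpu_model *cpu, struct mm_model *mm)
{
	if (mm->switch_pending) {
		mm->switch_pending = false;
		cpu->hw_mm = mm;	/* models cpu_switch_mm */
	}
}

/* Replays steps 1-6; returns the number of cpus left on the old mm. */
static int replay_per_mm_flag_race(void)
{
	struct mm_model old_mm = { false }, new_mm = { false };
	struct cpu_model cpu1 = { &old_mm }, cpu2 = { &old_mm };
	int stale = 0;

	model_switch_mm_deferred(&new_mm);	/* step 1: cpu #1 sets flag */
	model_switch_mm_deferred(&new_mm);	/* step 2: cpu #2 sets it again */
	model_finish_switch(&cpu1, &new_mm);	/* steps 3-4: cpu #1 switches */
	model_finish_switch(&cpu2, &new_mm);	/* steps 5-6: flag already 0 */

	stale += (cpu1.hw_mm != &new_mm);
	stale += (cpu2.hw_mm != &new_mm);
	return stale;
}
```

Calling replay_per_mm_flag_race() from a test harness returns 1 in this
model: cpu #1 picks up the new mm, cpu #2 finds the flag already cleared and
keeps the old one. A single boolean cannot record that two cpus each still
owe a switch.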

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
