From: Martin Schwidefsky <schwidefsky@de.ibm.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
Date: Thu, 14 Nov 2013 09:00:10 +0100
Message-ID: <20131114090010.2dc05610@mschwide>
In-Reply-To: <20131113170358.GG18837@arm.com>
On Wed, 13 Nov 2013 17:03:58 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Nov 13, 2013 at 04:05:56PM +0000, Martin Schwidefsky wrote:
> > On Wed, 13 Nov 2013 12:19:09 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Nov 13, 2013 at 11:41:43AM +0000, Peter Zijlstra wrote:
> > > > On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > > > > fire_sched_in_preempt_notifiers(current);
> > > > > if (mm)
> > > > > @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
> > > > >
> > > > > BUG_ON(cpu_online(smp_processor_id()));
> > > > >
> > > > > - if (mm != &init_mm)
> > > > > + if (mm != &init_mm) {
> > > > > switch_mm(mm, &init_mm, current);
> > > > > + finish_switch_mm(&init_mm, current);
> > > > > + }
> > > > > mmdrop(mm);
> > > > > }
> > >
> > > Here finish_switch_mm() is called in the same context with switch_mm().
> > > What we have on ARM via switch_mm() is to check for irqs_disabled() and
> > > if yes, defer the actual switching via a flag until the
> > > finish_arch_post_lock_switch() hook. But on ARM we only cared about the
> > > interrupts being enabled.
> >
> > The guarantee s390 needs is that the rq-lock is not taken. What I have
> > seen with the wait loop in switch_mm is a dead lock because one CPU #0
> > was looping in switch_mm to wait for the TLB flush of another CPU #1.
> > CPU #1 got an interrupt that tried to wake-up a task which happened to
> > be on the run-queue of CPU #0.
>
> I'm not familiar with the s390 code, so how's the waiting done? Is it
> part of an on_each_cpu() call (that's what I got from smp_ptlb_all)?
Ok, the long explanation: the requirement of the architecture is that a
PTE needs to be modified in a coordinated fashion if there is the possibility
that another CPU is using the mm at the same time. Coordinated means that
the modification has to be done with specific instructions, IPTE and IDTE.
These instructions make sure that other CPUs can complete instructions
that use the PTE before the PTE is made invalid and the associated TLB
entries are flushed.
The problem is that these instructions are slow; basically they do an
on_each_cpu() under the hood. Therefore the code tries to batch PTE
operations. For each PTE the code decides whether it can do a batched/lazy
PTE update or whether the expensive way with IPTE is required. Now, if the
decision to use a batched PTE update has just been made while the PTE is
still valid, this can race with another CPU attaching the mm via switch_mm().
Until the first CPU is done with the PTE modification, the second CPU may
not use the mm yet.
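
To make the lazy-vs-direct decision a bit more concrete, a simplified
sketch (not the actual code; attach_count_low(), ipte() and the flush_mm
flag are illustrative names):

static void pte_update_sketch(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep, pte_t new)
{
	/* Is any CPU other than this one attached to the mm? */
	int active = (mm == current->active_mm) ? 1 : 0;

	if (attach_count_low(mm) > active) {
		/* Someone else may be using the mm: coordinated flush. */
		ipte(addr, ptep);		/* invalidate PTE + TLB entry */
		set_pte(ptep, new);
	} else {
		/* Only we are attached: lazy update, batch the TLB flush. */
		set_pte(ptep, new);
		mm->context.flush_mm = 1;	/* flush later, in one go */
	}
}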
The coordination between CPUs is done with mm->context.attach_count.
The upper 16 bits hold a flush counter, the lower 16 bits an attach counter.
Each attacher of an mm increments the lower 16 bits; the flusher uses this
to decide between lazy and direct flushing. Each flusher increments the
upper 16 bits while it is in the critical phase of the update. The attacher
uses this to loop until the flush counter is zero again. And this wait loop
may not hold any locks, otherwise it can deadlock.
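
In rough code the scheme looks like this (a sketch only, the names and the
0x10000 constant are illustrative, not the literal implementation):

#define FLUSH_ONE	0x10000	/* one flusher, counted in the upper 16 bits */

static void attach_mm_sketch(struct mm_struct *mm)
{
	atomic_inc(&mm->context.attach_count);	/* attacher: lower 16 bits */
	/*
	 * Wait until no flusher is in its critical phase.  This loop must
	 * not be entered while holding a lock that code running on the
	 * flushing CPU (e.g. a wake-up from an interrupt taking the
	 * run-queue lock) might need, or it can deadlock.
	 */
	while (atomic_read(&mm->context.attach_count) >> 16)
		cpu_relax();
}

static void flush_mm_sketch(struct mm_struct *mm)
{
	atomic_add(FLUSH_ONE, &mm->context.attach_count);	/* enter critical phase */
	/* ... modify PTEs and flush the TLB ... */
	atomic_sub(FLUSH_ONE, &mm->context.attach_count);	/* leave critical phase */
}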
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.