* Re: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 3:40 UTC
To: linux-ia64
CC'ing linux-ia64@vger.kernel.org. For those who haven't followed, this
is a discussion about the mmu_gather rework I'm in the middle of, to
make sure I do the right thing as far as ia64 is concerned.
> I suspect I just wanted to use something stable, so I wouldn't have to
> check tlb->nr. It looks like tlb->nr can be 0 or ~0 when nothing gets
> shot down, so I probably just didn't want to check for those
> conditions. Maybe it'd be better to just check tlb->start_addr and if
> it's something other than ~0UL, go ahead and use tlb->start_addr and
> tlb->end_addr.
>
> While looking over the code, it looks to me like tlb->need_flush never
> gets initialized (cleared to 0). On the other hand, tlb->start_addr
> gets initialized with ~0UL both in tlb_gather_mmu() and
> ia64_tlb_flush_mmu(). Am I missing something?
tlb->nr is initialised in tlb_gather_mmu(). The ~0 trick is the "fast"
mode used when non-SMP.
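For reference, the ia64 tlb_gather_mmu() does roughly this today (paraphrased
from include/asm-ia64/tlb.h from memory, so don't take the details as gospel):

	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

	tlb->mm = mm;
	/* ~0U selects the non-SMP "fast" mode: pages can be freed
	 * immediately instead of being gathered for a batched flush. */
	tlb->nr = (num_online_cpus() == 1) ? ~0U : 0U;
	tlb->fullmm = full_mm_flush;
	tlb->start_addr = ~0UL;
	return tlb;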
Thing is, the start/end arguments passed to tlb_finish_mmu() tend to
cover more than is really necessary. And you end up with your own
start_addr/end_addr tracking being pretty much useless, since you re-do
a whole-range flush at the end.
However, I think that you do need to do that because of a subtle issue
with the virtual mapping of the page tables themselves.
Basically, the generic freeing code first flushes all VMA's and thus all
PTEs, and then does a second pass with the page table pages themselves.
The problem is that if you get a racy access, you can end up with
somebody walking down a PGD/PUD/PMD all the way to a PTE page before
hitting the empty PTE and taking a fault. That means that even after the
pages have been unmapped (where the TLB flush will have flushed the TLB
entries for the virtual page tables), something can still bring in the
TLB translations for the page tables themselves, at least until the
PGD/PUD/PMD entries themselves have been cleared, which happens later.
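To make the ordering explicit, the unmap path does roughly this (simplified
sketch of the generic code, not the exact thing):

	tlb = tlb_gather_mmu(mm, 1);
	unmap_vmas(&tlb, ...);       /* clears the PTEs, flushes user translations */
	free_pgtables(&tlb, ...);    /* clears PMD/PUD/PGD entries, then frees
	                              * the page table pages themselves */
	tlb_finish_mmu(tlb, start, end);

Anything walking the tables between those two passes can still load a
translation for a page table page via the not-yet-cleared upper levels.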
I think the proper way to handle that is in fact to separate the
tracking of invalidated PTEs (flushes of user translations) from the
tracking of freed page tables (flushes of page table pages). The latter
can be done by keeping a separate pair of start/end and by adding the
virtual address argument to tlb_free_pte(), which I intend to do.
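Something along these lines (hypothetical sketch only, the field and helper
names below are made up):

	struct mmu_gather {
		/* ... existing fields ... */
		unsigned long	start_addr, end_addr;	/* invalidated user PTEs  */
		unsigned long	pgt_start, pgt_end;	/* freed page table pages */
	};

	/* called from __pte_free_tlb() & friends with the virtual address
	 * that the page table page was covering */
	static inline void tlb_track_pgtable(struct mmu_gather *tlb,
					     unsigned long addr)
	{
		if (addr < tlb->pgt_start)
			tlb->pgt_start = addr;
		if (addr + PMD_SIZE > tlb->pgt_end)
			tlb->pgt_end = addr + PMD_SIZE;
	}

The final flush in ia64_tlb_flush_mmu() would then cover both ranges instead
of just start_addr/end_addr.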
So unless you see something wrong with that approach, I'll produce and
try to test a patch doing just that.
In addition, however, I do have a question. I haven't tracked every bit
of MM code in ia64 land yet, and I'm wondering how the page table
translations are faulted in: via a SW miss handler, or some HW handler?
Is there some locking?
Because there's another possible issue that I can see: you aren't using
delayed batching or anything like that to actually free the page
tables. __pte_free_tlb() & friends just do the freeing directly, which
means those pages can be given away to somebody else pretty much right
away.
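(For reference, this is roughly what the ia64 side looks like at the moment,
quoting from memory so the details may be a bit off:

	/* include/asm-ia64/pgalloc.h */
	static inline void pte_free(struct page *pte)
	{
		quicklist_free_page(0, NULL, pte);
	}
	#define __pte_free_tlb(tlb, pte)	pte_free(pte)

i.e. the page goes straight back onto the quicklist without the gather ever
seeing it.)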
However, what if that races with some other thread walking those page
tables? I may very well be missing a bit here, but since we have
that specific race on ppc64, I was wondering if you have it too...
Typically something like that: Thread A is unmapping/destroying page
tables and thread B is trying to access the area that is being freed at
the same time. The PTEs have already been cleared and the TLB flushed,
but the page tables themselves haven't yet. We have A now in
free_pgtables(), freeing a PTE page.
        A                                   B

  gets into free_pte_range()           reads PMD entry
  write 0 to the PMD entry
  frees PTE page
                                        reads PTE from PTE page
As you can see above, thread B can try to read a PTE from a freed PTE page
which may in fact already be filled with unrelated PTEs. I think this
doesn't happen on x86 due to some HW tricks with the tablewalk.
On ppc64 we fix that by deferring page table freeing using RCU. That
involves an allocation, so if that allocation fails, we fall back to
sending an IPI to all CPUs to make sure they're all out of whatever
TLB or hash miss handler may have been caching the old entry before
we free the page.
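The shape of it is roughly this (heavily trimmed and from memory, so names
and details are approximate):

	struct pte_freelist_batch {
		struct rcu_head	rcu;
		unsigned int	index;
		pgtable_free_t	tables[0];
	};

	void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
	{
		struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);

		if (*batchp == NULL) {
			*batchp = (void *)__get_free_page(GFP_ATOMIC);
			if (*batchp == NULL) {
				/* no memory for a batch: IPI everybody so no CPU
				 * is still inside a TLB/hash miss handler, then
				 * free synchronously */
				pte_free_now(pgf);
				return;
			}
			(*batchp)->index = 0;
		}
		(*batchp)->tables[(*batchp)->index++] = pgf;
		if ((*batchp)->index == PTE_FREELIST_SIZE) {
			/* defer the actual freeing to an RCU callback */
			call_rcu(&(*batchp)->rcu, pte_free_rcu_callback);
			*batchp = NULL;
		}
	}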
So I wonder if you have that problem too. If you do, one option is to
use the batch to also free the page table pages (though that's a bit of
a problem with the quicklists), or to use something akin to the ppc64
use of RCU, which I may generalize.
Cheers,
Ben.
* RE: ia64 mmu_gather question
From: Luck, Tony @ 2007-08-02 5:38 UTC
To: linux-ia64
Answers to select questions (that I can do off the top of my head).
> In addition, however, I do have a question. I haven't tracked every bit
> of MM code in ia64 land yet, and I'm wondering how the page table
> translations are faulted in: via a SW miss handler, or some HW handler?
The page table translations are inserted in s/w (by the VHPT miss handler
in arch/ia64/kernel/ivt.S). Essentially the first TLB miss in a PMD range
will end up there and will insert both the page mapping that we actually
want and the mapping for the page table (so that a subsequent TLB miss on
this address, or on another address in the PMD range, can be serviced by
the h/w VHPT walker for as long as the page table mapping survives in the
TLB).
> Is there some locking?
No locking. But we do have race detection. After we chase the PGD>PUD>PMD>PTE
pointers we insert the TLB entry. Then we retrace the pointer chain and
make sure that the pte we find is still the same. If it isn't, then we
purge the entry we just inserted and go for a full page fault.
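In C-ish pseudo code (the real thing is assembly in ivt.S; the helpers below
just stand in for the itc.d/ptc.l instructions), it's roughly:

	pte = *ptep;			/* ptep found by chasing PGD>PUD>PMD */
	tlb_insert(address, pte);	/* itc.d - install the translation   */
	if (*ptep != pte) {		/* did somebody change it under us?  */
		tlb_purge(address);	/* ptc.l - purge what we just put in */
		goto slow_page_fault;	/* and take the full fault path      */
	}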
Time to tell bed-time stories to my daughter. More tomorrow (if someone
else doesn't fill in the rest of the answers before I get back to this).
-Tony
* RE: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 7:09 UTC
To: linux-ia64
On Wed, 2007-08-01 at 22:38 -0700, Luck, Tony wrote:
> No locking. But we do have race detection. After we chase the
> PGD>PUD>PMD>PTE pointers we insert the TLB entry. Then we retrace the
> pointer chain and make sure that the pte we find is still the same.
> If it isn't, then we purge the entry we just inserted and go for a
> full page fault.
>
> Time to tell bed-time stories to my daughter. More tomorrow (if someone
> else doesn't fill in the rest of the answers before I get back to this).
Ok, that's what I think I understood from the asm. However, what
prevents the very unlikely race where you insert a stale pgtable
mapping entry, and before you backtrack and remove it, another
CPU accesses the stale PTE?
I'm tempted, while I'm at it working on the mmu_gather, to add a generic
mechanism for putting the page table quicklist pages in there too.
Though that would only help archs that 1) have the problem (typically
ones whose TLB is SW loaded in one way or another) and 2) use an IPI for
the SMP flush (not ppc64, for example).
Cheers,
Ben.
* RE: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 7:57 UTC
To: linux-ia64
On Thu, 2007-08-02 at 17:10 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2007-08-01 at 22:38 -0700, Luck, Tony wrote:
> > No locking. But we do have race detection. After we chase the
> > PGD>PUD>PMD>PTE pointers we insert the TLB entry. Then we retrace the
> > pointer chain and make sure that the pte we find is still the same.
> > If it isn't, then we purge the entry we just inserted and go for a
> > full page fault.
> >
> > Time to tell bed-time stories to my daughter. More tomorrow (if someone
> > else doesn't fill in the rest of the answers before I get back to this).
>
> Ok, that's what I think I understood from the asm. However, what
> prevents the very unlikely race where you insert a stale pgtable
> mapping entry, and before you backtrack and remove it, another
> CPU accesses the stale PTE?
I was thinking about a case where the TLB is shared (SMT) between Linux
logical CPUs (threads), but ia64 is not SMT, right? Thus the TLB is
split, and the "other" CPU can't see the stale translation... should be
all right then.
Cheers,
Ben.
* RE: ia64 mmu_gather question
From: Luck, Tony @ 2007-08-02 17:16 UTC
To: linux-ia64
> I was thinking about a case where the TLB is shared (SMT) between Linux
> logical CPUs (threads), but ia64 is not SMT, right? Thus the TLB is
> split, and the "other" CPU can't see the stale translation... should be
> all right then.
Montecito is SMT, and the threads do share the TLB resources in that
there are a fixed number of TLB TC slots that are dynamically shared
between threads. But entries in the TLB have their virtual addresses tagged
with a thread identifier, so an entry inserted by one thread cannot
be used by another thread.
See sections 2.4.3.1.1 and 2.4.3.1.2 in "Dual-Core Update to the Intel®
Itanium® 2 Processor Reference Manual"
http://download.intel.com/design/Itanium2/manuals/30806501.pdf
-Tony
* RE: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 21:46 UTC
To: linux-ia64
On Thu, 2007-08-02 at 10:16 -0700, Luck, Tony wrote:
>
> Montecito is SMT, and the threads do share the TLB resources in that
> there are a fixed number of TLB TC slots that are dynamically shared
> between threads. But entries in the TLB have their virtual addresses
> tagged with a thread identifier, so an entry inserted by one thread
> cannot be used by another thread.
All right... that's a bit of a waste of TLB space though :-)
So I suspect at this stage that the race isn't affecting you. However,
it looks to me that you put the burden on the fairly hot TLB miss path
rather than on the much less hot invalidation path itself...
Cheers,
BenH
* RE: ia64 mmu_gather question
From: Luck, Tony @ 2007-08-02 21:56 UTC
To: linux-ia64
> So I suspect at this stage that the race isn't affecting you. However,
> it looks to me that you put the burden on the fairly hot TLB miss path
> rather than on the much less hot invalidation path itself...
Suggestions on how to move the burden gratefully received. I don't
think that it is all that bad though. The re-read of the PGD>PUD>PMD>PTE
should all hit in the L1-D cache, which has single cycle latency.
-Tony
* RE: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 22:00 UTC
To: linux-ia64
On Thu, 2007-08-02 at 14:56 -0700, Luck, Tony wrote:
> > So I suspect at this stage that the race isn't affecting you. However,
> > it looks to me that you put the burden on the fairly hot TLB miss path
> > rather than on the much less hot invalidation path itself...
>
> Suggestions on how to move the burden gratefully received. I don't
> think that it is all that bad though. The re-read of the PGD>PUD>PMD>PTE
> should all hit in the L1-D cache, which has single cycle latency.
All right. I'll look into it. Other archs have a similar issue and don't
currently fix it. The easy fix for archs that use IPIs for TLB flushes
is to batch the freeing of page table pages. That's made a bit harder
by the quicklists, but I may just end up adding support for those to the
mmu_gather.
Ben.
* Re: ia64 mmu_gather question
From: David Mosberger-Tang @ 2007-08-02 22:07 UTC
To: linux-ia64
Do keep in mind that there is a hardware walker, too! ;-)
--david
On 8/2/07, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Thu, 2007-08-02 at 14:56 -0700, Luck, Tony wrote:
> > > So I suspect at this stage that the race isn't affecting you. However,
> > > it looks to me that you put the burden on the fairly hot TLB miss path
> > > rather than on the much less hot invalidation path itself...
> >
> > Suggestions on how to move the burden gratefully received. I don't
> > think that it is all that bad though. The re-read of the PGD>PUD>PMD>PTE
> > should all hit in the L1-D cache, which has single cycle latency.
>
> All right. I'll look into it. Other archs have a similar issue and don't
> currently fix it. The easy fix for archs that use IPIs for TLB flushes
> is to batch the freeing of page table pages. That's made a bit harder
> by the quicklists, but I may just end up adding support for those to the
> mmu_gather.
>
> Ben.
--
Mosberger Consulting LLC, http://www.mosberger-consulting.com/
* Re: ia64 mmu_gather question
From: Benjamin Herrenschmidt @ 2007-08-02 22:32 UTC
To: linux-ia64
On Thu, 2007-08-02 at 16:07 -0600, David Mosberger-Tang wrote:
> Do keep in mind that there is a hardware walker, too! ;-)
Yup, I know. Your TLB miss rate gets lower as you fault in the page
table pages, though it's still higher than a full tree walker. How big
is the TLB btw?
Ben.
* Re: ia64 mmu_gather question
From: Tony Luck @ 2007-08-03 3:22 UTC
To: linux-ia64
> Yup, I know. Your TLB miss rate gets lower as you fault in the page
> table pages, though it's still higher than a full tree walker. How big
> is the TLB btw?
# of TLB entries is model specific ... most Itanium CPUs so far have been
around 128 entries, give or take a few.
-Tony