From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benjamin Herrenschmidt
Date: Thu, 02 Aug 2007 03:40:55 +0000
Subject: Re: ia64 mmu_gather question
Message-Id: <1186026055.5495.585.camel@localhost.localdomain>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

CC'ing linux-ia64@vger.kernel.org. For those who haven't followed, this
is a discussion about the rework of mmu_gather I'm doing, to make sure I
do the right thing as far as ia64 is concerned.

> I suspect I just wanted to use something stable, so I wouldn't have to
> check tlb->nr.  It looks like tlb->nr can be 0 or ~0 when nothing gets
> shot down, so I probably just didn't want to check for those
> conditions.  Maybe it'd be better to just check tlb->start_addr and if
> it's something other than ~0UL, go ahead and use tlb->start_addr and
> tlb->end_addr.
>
> While looking over the code, it looks to me like tlb->need_flush never
> gets initialized (cleared to 0).  On the other hand, tlb->start_addr
> gets initialized with ~0UL both in tlb_gather_mmu() and
> ia64_tlb_flush_mmu().  Am I missing something?

tlb->nr is initialised in tlb_gather_mmu(). The ~0 trick is the "fast"
mode used when non-SMP.

Thing is, the start/end arguments passed to tlb_finish_mmu() tend to
cover more than really necessary, and you end up with your own
start_addr/end_addr tracking pretty much useless, since you re-do a
whole range flush at the end.

However, I think you do need to do that because of a subtle issue with
the virtual mapping of the page tables themselves. Basically, the
generic freeing code first flushes all VMAs and thus all PTEs, and then
does a second pass with the page table pages themselves. The problem is
that on a racy access, somebody can walk down a PGD/PUD/PMD all the way
to a PTE page before hitting the empty PTE and taking a fault.
That means that even after the pages have been unmapped (where the TLB
flush will have flushed the TLB entries for the virtual page tables),
something can still bring in the TLB translations for the page tables
themselves, at least until the PGD/PUD/PMD entries themselves have been
cleared, which happens later.

I think the proper way to handle that is in fact to separate the
tracking of invalidated PTEs (flushes of user translations) from the
tracking of freed page tables (flushes of page table pages). The latter
can be done by keeping a separate pair of start/end and by adding the
virtual address argument to tlb_free_pte(), which I intend to do.

So unless you see something wrong with that approach, I'll produce and
try to test a patch doing just that.

In addition, however, I do have a question. I haven't tracked every bit
of MM code in ia64 land yet and I'm wondering how the page table
translations are faulted in. Via a SW miss handler? Or some HW handler?
Is there some locking? Because there's another possible issue that I can
see: you aren't using delayed batching or anything like that to actually
free the page tables. __pte_free_tlb() & friends just directly do the
freeing, which means those pages can be given away to somebody else
pretty much right away. However, what if that races with some other
thread walking those page tables?

I may very well be missing a bit here, but since we have that specific
race on ppc64, I was wondering if you had it too... Typically something
like this: thread A is unmapping/destroying page tables and thread B is
trying to access the area that is being freed at the same time. The PTEs
have already been cleared and the TLB flushed, but the page tables
themselves haven't been yet. We have A now in free_pgtables(), freeing a
PTE page.
	A				B

					gets into free_pte_range()
					reads PMD entry
	write 0 to the PMD entry
	frees PTE page
					reads PTE from PTE page

As you can see above, thread B can try to read a PTE from a freed PTE
page, which may in fact already be filled with unrelated PTEs. I think
this doesn't happen on x86 due to some HW tricks with the tablewalk.

On ppc64 we fix that by deferring page table freeing using RCU. That
involves an allocation, so if that allocation fails, we fall back to
sending an IPI to all CPUs to make sure they're all out of whatever TLB
or hash miss handler may have been caching the old entry before we free
the page.

So I wonder if you have that problem too. If you do, one option is to
use the batch to also free the page table pages, but that's a bit of a
problem with the quicklists; another is to use something akin to ppc64's
use of RCU, which I may generalize.

Cheers,
Ben.