* [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5
@ 2005-10-18 21:15 Seth, Rohit
2005-10-18 21:34 ` Andrew Morton
0 siblings, 1 reply; 21+ messages in thread
From: Seth, Rohit @ 2005-10-18 21:15 UTC (permalink / raw)
To: torvalds; +Cc: linux-kernel
Linus,
[PATCH]: Handle spurious page fault for hugetlb region
The hugetlb pages are currently pre-faulted. At the time of mmap of
hugepages, we populate the new PTEs. It is possible that HW has already cached
some of the unused PTEs internally. These stale entries never get a chance to
be purged in existing control flow.
This patch extends the check in page fault code for hugepages. Check if
a faulted address falls with in size for the hugetlb file backing it. We
return VM_FAULT_MINOR for these cases (assuming that the arch specific
page-faulting code purges the stale entry for the archs that need it).
Thanks,
-rohit
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
--- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700
+++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700
@@ -155,11 +155,24 @@
{
file->f_op = &hugetlbfs_file_operations;
}
+
+static inline int valid_hugetlb_file_off(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+ loff_t file_off = address - vma->vm_start;
+
+ file_off += (vma->vm_pgoff << PAGE_SHIFT);
+
+ return (file_off < inode->i_size);
+}
+
#else /* !CONFIG_HUGETLBFS */
#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define valid_hugetlb_file_off(vma, address) 0
#endif /* !CONFIG_HUGETLBFS */
--- linux-2.6.14-rc4-git5-x86/mm/memory.c 2005-10-18 12:24:14.153647328 -0700
+++ b/mm/memory.c 2005-10-18 10:39:41.980162632 -0700
@@ -2045,8 +2045,18 @@
inc_page_state(pgfault);
- if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ if (valid_hugetlb_file_off(vma, address))
+ /* We get here only if there was a stale(zero) TLB entry
+ * (because of HW prefetching).
+ * Low-level arch code (if needed) should have already
+ * purged the stale entry as part of this fault handling.
+ * Here we just return.
+ */
+ return VM_FAULT_MINOR;
+ else
+ return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ }
/*
* We need the page table lock to synchronize with kswapd
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-18 21:15 [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 Seth, Rohit @ 2005-10-18 21:34 ` Andrew Morton 2005-10-18 22:17 ` Rohit Seth 0 siblings, 1 reply; 21+ messages in thread From: Andrew Morton @ 2005-10-18 21:34 UTC (permalink / raw) To: Seth, Rohit; +Cc: torvalds, linux-kernel "Seth, Rohit" <rohit.seth@intel.com> wrote: > > Linus, > > [PATCH]: Handle spurious page fault for hugetlb region > > The hugetlb pages are currently pre-faulted. At the time of mmap of > hugepages, we populate the new PTEs. It is possible that HW has already cached > some of the unused PTEs internally. What's an "unused pte"? One which maps a regular-sized page at the same virtual address? How can such a thing come about, and why isn't it already a problem for regular-sized pages? From where does the hardware prefetch the pte contents? IOW: please tell us more about this hardware pte-fetcher. > These stale entries never get a chance to > be purged in existing control flow. I'd have thought that invalidating those ptes at mmap()-time would be a more consistent approach. > This patch extends the check in page fault code for hugepages. Check if > a faulted address falls with in size for the hugetlb file backing it. We > return VM_FAULT_MINOR for these cases (assuming that the arch specific > page-faulting code purges the stale entry for the archs that need it). Do you have an example of the code which does this purging? > --- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700 > +++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700 > @@ -155,11 +155,24 @@ > { > file->f_op = &hugetlbfs_file_operations; > } > + > +static inline int valid_hugetlb_file_off(struct vm_area_struct *vma, > + unsigned long address) > +{ > + struct inode *inode = vma->vm_file->f_dentry->d_inode; > + loff_t file_off = address - vma->vm_start; > + > + file_off += (vma->vm_pgoff << PAGE_SHIFT); > + > + return (file_off < inode->i_size); > +} I suppose we should use i_size_read() here. > + if (valid_hugetlb_file_off(vma, address)) > + /* We get here only if there was a stale(zero) TLB entry > + * (because of HW prefetching). > + * Low-level arch code (if needed) should have already > + * purged the stale entry as part of this fault handling. > + * Here we just return. > + */ If the low-level code has purged the stale pte then it knows what's happening. Perhaps it shouldn't call into handle_mm_fault() at all? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-18 21:34 ` Andrew Morton @ 2005-10-18 22:17 ` Rohit Seth 2005-10-19 0:25 ` Andrew Morton 0 siblings, 1 reply; 21+ messages in thread From: Rohit Seth @ 2005-10-18 22:17 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, linux-kernel On Tue, 2005-10-18 at 14:34 -0700, Andrew Morton wrote: > "Seth, Rohit" <rohit.seth@intel.com> wrote: > > > > Linus, > > > > [PATCH]: Handle spurious page fault for hugetlb region > > > > The hugetlb pages are currently pre-faulted. At the time of mmap of > > hugepages, we populate the new PTEs. It is possible that HW has already cached > > some of the unused PTEs internally. > > What's an "unused pte"? One which maps a regular-sized page at the same > virtual address? How can such a thing come about, and why isn't it already > a problem for regular-sized pages? From where does the hardware prefetch > the pte contents? > Unsused pte is where *pte == 0. Basically entries in leaf page table that does not map anything. If such a pte ends up mapping a normal page then you get a page fault and HW/handler does the right thing (in terms of purging old entries). > IOW: please tell us more about this hardware pte-fetcher. > > > These stale entries never get a chance to > > be purged in existing control flow. > > I'd have thought that invalidating those ptes at mmap()-time would be a > more consistent approach. That would be adding too many unconditional purges (which are very expensive operations) during mmap. And as we are only talking of speculative pre-fetches that are done by HW so IMO we should do this as lazily as possible (only if required). > > > This patch extends the check in page fault code for hugepages. Check if > > a faulted address falls with in size for the hugetlb file backing it. We > > return VM_FAULT_MINOR for these cases (assuming that the arch specific > > page-faulting code purges the stale entry for the archs that need it). > > Do you have an example of the code which does this purging? > Tha page_not_present vector in IA-64 for example. > > --- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700 > > +++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700 > > @@ -155,11 +155,24 @@ > > { > > file->f_op = &hugetlbfs_file_operations; > > } > > + > > +static inline int valid_hugetlb_file_off(struct vm_area_struct *vma, > > + unsigned long address) > > +{ > > + struct inode *inode = vma->vm_file->f_dentry->d_inode; > > + loff_t file_off = address - vma->vm_start; > > + > > + file_off += (vma->vm_pgoff << PAGE_SHIFT); > > + > > + return (file_off < inode->i_size); > > +} > > I suppose we should use i_size_read() here. You are right. > > > + if (valid_hugetlb_file_off(vma, address)) > > + /* We get here only if there was a stale(zero) TLB entry > > + * (because of HW prefetching). > > + * Low-level arch code (if needed) should have already > > + * purged the stale entry as part of this fault handling. > > + * Here we just return. > > + */ > > If the low-level code has purged the stale pte then it knows what's > happening. Perhaps it shouldn't call into handle_mm_fault() at all? Well, at that time the code does not know if the address belong to hugetlbfile. The archs that needs those purges in low level code need them for all (for example) page not present faults. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-18 22:17 ` Rohit Seth @ 2005-10-19 0:25 ` Andrew Morton 2005-10-19 3:25 ` Rohit Seth 0 siblings, 1 reply; 21+ messages in thread From: Andrew Morton @ 2005-10-19 0:25 UTC (permalink / raw) To: Rohit Seth; +Cc: torvalds, linux-kernel Rohit Seth <rohit.seth@intel.com> wrote: > > On Tue, 2005-10-18 at 14:34 -0700, Andrew Morton wrote: > > "Seth, Rohit" <rohit.seth@intel.com> wrote: > > > > > > Linus, > > > > > > [PATCH]: Handle spurious page fault for hugetlb region > > > > > > The hugetlb pages are currently pre-faulted. At the time of mmap of > > > hugepages, we populate the new PTEs. It is possible that HW has already cached > > > some of the unused PTEs internally. > > > > What's an "unused pte"? One which maps a regular-sized page at the same > > virtual address? How can such a thing come about, and why isn't it already > > a problem for regular-sized pages? From where does the hardware prefetch > > the pte contents? > > > > Unsused pte is where *pte == 0. Basically entries in leaf page table > that does not map anything. Oh. Then I'm still not understanding. > If such a pte ends up mapping a normal page > then you get a page fault and HW/handler does the right thing (in terms > of purging old entries). For starters, we don't actually _use_ pte's for the hugepage. We use an entry in a pmd page. Does that concept still apply for ia64? If so, are you saying that the hardware prefetching is happening at the pmd level? If not, how can it happen at the pte level when the hardware doesn't know where the pte page _is_? Has this problem been observed in testing? > > > IOW: please tell us more about this hardware pte-fetcher. > > > > > These stale entries never get a chance to > > > be purged in existing control flow. > > > > I'd have thought that invalidating those ptes at mmap()-time would be a > > more consistent approach. > > That would be adding too many unconditional purges (which are very > expensive operations) during mmap. And as we are only talking of > speculative pre-fetches that are done by HW so IMO we should do this as > lazily as possible (only if required). Is an mmap of a hugepage region very common? I guess it is, on ia32, or on 64-bit apps which are (poorly?) designed to also run on ia32. > > If the low-level code has purged the stale pte then it knows what's > > happening. Perhaps it shouldn't call into handle_mm_fault() at all? > > Well, at that time the code does not know if the address belong to > hugetlbfile. The archs that needs those purges in low level code need > them for all (for example) page not present faults. I still don't understand what's different about hugepages here. If the prefetching problem also occurs on regular pages and is handled OK for regular pages, why do hugepages need special treatment? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 0:25 ` Andrew Morton @ 2005-10-19 3:25 ` Rohit Seth 2005-10-19 4:07 ` Andrew Morton 2005-10-19 15:23 ` Hugh Dickins 0 siblings, 2 replies; 21+ messages in thread From: Rohit Seth @ 2005-10-19 3:25 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, torvalds On Tue, 2005-10-18 at 17:25 -0700, Andrew Morton wrote: > Rohit Seth <rohit.seth@intel.com> wrote: > > > > > > Unsused pte is where *pte == 0. Basically entries in leaf page table > > that does not map anything. > > Oh. Then I'm still not understanding. Following is the scenario: Kernel is going to prefault the huge page for user virtual address V. Let us say the pointer pte points to the page table entry (for x86 it is PDE/PMD) which maps the hugepage. At this point *pte == 0. CPU also has this entry cached in its TLB. Meaning, unless this entry is purged or displaced, for virtual address V CPU will generate the page fault (as the P bit is not set and assuming this fault has the highest precedence). Kernel updates the *pte so that it now maps the hugepage at virtual address V to physical address P. Later when the user process make a reference to V, because of stale TLB entry, the processor gets PAGE_FAULT. > > > If such a pte ends up mapping a normal page > > then you get a page fault and HW/handler does the right thing (in terms > > of purging old entries). > > For starters, we don't actually _use_ pte's for the hugepage. We use an > entry in a pmd page. Does that concept still apply for ia64? > With pte entry, I mean the entry that is mapping the hugepage (leaf level entry). This entry could be sitting in PDE/PMD for x86 or VHPT for IA-64. This entry is referring to what the processor uses for translation to get to huge page. > If so, are you saying that the hardware prefetching is happening at the pmd > level? If not, how can it happen at the pte level when the hardware > doesn't know where the pte page _is_? > IA-64, for example, directly maps the last level of page table. Processor directly uses this table to find the translations. So, there is no pmd here. But it is entirely possible on x86 arch as well that any level of PDE (PMD), PTE could be getting prefetched. That is why architecture requires invalidate/purge operations whenever any of this entry changes. > Has this problem been observed in testing? > Yes. On IA-64. > > > > I'd have thought that invalidating those ptes at mmap()-time would be a > > > more consistent approach. > > > > That would be adding too many unconditional purges (which are very > > expensive operations) during mmap. And as we are only talking of > > speculative pre-fetches that are done by HW so IMO we should do this as > > lazily as possible (only if required). > > Is an mmap of a hugepage region very common? I guess it is, on ia32, or on > 64-bit apps which are (poorly?) designed to also run on ia32. > > On x86 it surely is. For 64-bit there is 100G or more of mmaps... And then you will have to do this purge for all the processes that are trying to mmap hugepages. > > > If the low-level code has purged the stale pte then it knows what's > > > happening. Perhaps it shouldn't call into handle_mm_fault() at all? > > > > Well, at that time the code does not know if the address belong to > > hugetlbfile. The archs that needs those purges in low level code need > > them for all (for example) page not present faults. > > I still don't understand what's different about hugepages here. If the > prefetching problem also occurs on regular pages and is handled OK for > regular pages, why do hugepages need special treatment? The prefetching problem is handled OK for regular pages because we can handle page faults corresponding to those pages. That is currently not true for hugepages. Currently the kernel assumes that PAGE_FAULT happening against a hugetlb page is caused by truncate and returns SIGBUS. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 3:25 ` Rohit Seth @ 2005-10-19 4:07 ` Andrew Morton 2005-10-19 14:33 ` Adam Litke 2005-10-19 15:48 ` Hugh Dickins 2005-10-19 15:23 ` Hugh Dickins 1 sibling, 2 replies; 21+ messages in thread From: Andrew Morton @ 2005-10-19 4:07 UTC (permalink / raw) To: Rohit Seth; +Cc: linux-kernel, torvalds, Adam Litke Rohit Seth <rohit.seth@intel.com> wrote: > > The prefetching problem is handled OK for regular pages because we can > handle page faults corresponding to those pages. That is currently not > true for hugepages. Currently the kernel assumes that PAGE_FAULT > happening against a hugetlb page is caused by truncate and returns > SIGBUS. Doh. No fault handler. The penny finally drops. Adam, I think this patch is temporary? From: "Seth, Rohit" <rohit.seth@intel.com> We prefault hugepages at mmap() time, but hardware TLB prefetching may mean that the TLB has NULL pagetable entries in the places where the pagetable in fact has the desired virtual->physical translation. For regular pages this problem is resolved via the resulting pagefault, in the pagefault handler. But hugepages don't support pagefaults - they're supposed to be prefaulted. So we need minimal pagefault handling for these stale hugepage TLB entries. An alternative is to invalidate the relevant TLB entries at hugepage mmap()-time, but this is apparently too expensive. Note: Adam Litke <agl@us.ibm.com>'s demand-paging-for-hugepages patches are now in -mm. If/when these are merged up, this fix should probably be reverted. Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> --- include/linux/hugetlb.h | 13 +++++++++++++ mm/memory.c | 14 ++++++++++++-- 2 files changed, 25 insertions(+), 2 deletions(-) diff -puN include/linux/hugetlb.h~handling-spurious-page-fault-for-hugetlb-region include/linux/hugetlb.h --- devel/include/linux/hugetlb.h~handling-spurious-page-fault-for-hugetlb-region 2005-10-18 21:04:34.000000000 -0700 +++ devel-akpm/include/linux/hugetlb.h 2005-10-18 21:04:34.000000000 -0700 @@ -155,11 +155,24 @@ static inline void set_file_hugepages(st { file->f_op = &hugetlbfs_file_operations; } + +static inline int valid_hugetlb_file_off(struct vm_area_struct *vma, + unsigned long address) +{ + struct inode *inode = vma->vm_file->f_dentry->d_inode; + loff_t file_off = address - vma->vm_start; + + file_off += (vma->vm_pgoff << PAGE_SHIFT); + + return (file_off < inode->i_size); +} + #else /* !CONFIG_HUGETLBFS */ #define is_file_hugepages(file) 0 #define set_file_hugepages(file) BUG() #define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS) +#define valid_hugetlb_file_off(vma, address) 0 #endif /* !CONFIG_HUGETLBFS */ diff -puN mm/memory.c~handling-spurious-page-fault-for-hugetlb-region mm/memory.c --- devel/mm/memory.c~handling-spurious-page-fault-for-hugetlb-region 2005-10-18 21:04:34.000000000 -0700 +++ devel-akpm/mm/memory.c 2005-10-18 21:04:34.000000000 -0700 @@ -2045,8 +2045,18 @@ int __handle_mm_fault(struct mm_struct * inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + if (unlikely(is_vm_hugetlb_page(vma))) { + if (valid_hugetlb_file_off(vma, address)) + /* We get here only if there was a stale(zero) TLB entry + * (because of HW prefetching). + * Low-level arch code (if needed) should have already + * purged the stale entry as part of this fault handling. + * Here we just return. + */ + return VM_FAULT_MINOR; + else + return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + } /* * We need the page table lock to synchronize with kswapd _ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 4:07 ` Andrew Morton @ 2005-10-19 14:33 ` Adam Litke 2005-10-19 15:48 ` Hugh Dickins 1 sibling, 0 replies; 21+ messages in thread From: Adam Litke @ 2005-10-19 14:33 UTC (permalink / raw) To: Andrew Morton; +Cc: Rohit Seth, linux-kernel, torvalds On Tue, 2005-10-18 at 21:07 -0700, Andrew Morton wrote: > Rohit Seth <rohit.seth@intel.com> wrote: > > > > The prefetching problem is handled OK for regular pages because we can > > handle page faults corresponding to those pages. That is currently not > > true for hugepages. Currently the kernel assumes that PAGE_FAULT > > happening against a hugetlb page is caused by truncate and returns > > SIGBUS. > > Doh. No fault handler. The penny finally drops. > > Adam, I think this patch is temporary? Yeah, looks like it can be dropped now since we handle this case (in hugetlb_fault) as: > size = i_size_read(mapping->host) >> HPAGE_SHIFT; > if (idx >= size) > goto backout; > > ret = VM_FAULT_MINOR; > if (!pte_none(*pte)) > goto backout; -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 4:07 ` Andrew Morton 2005-10-19 14:33 ` Adam Litke @ 2005-10-19 15:48 ` Hugh Dickins 2005-10-19 19:05 ` Rohit Seth 1 sibling, 1 reply; 21+ messages in thread From: Hugh Dickins @ 2005-10-19 15:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Rohit Seth, linux-kernel, torvalds, Adam Litke On Tue, 18 Oct 2005, Andrew Morton wrote: > Rohit Seth <rohit.seth@intel.com> wrote: > > > > The prefetching problem is handled OK for regular pages because we can > > handle page faults corresponding to those pages. That is currently not > > true for hugepages. Currently the kernel assumes that PAGE_FAULT > > happening against a hugetlb page is caused by truncate and returns > > SIGBUS. Is Rohit's intended to be a late 2.6.14 fix? We seem to have done well without it for several years, and are just on the point of changing to prefaulting the hugetlb pages anyway, which will fix it up. > @@ -2045,8 +2045,18 @@ int __handle_mm_fault(struct mm_struct * > > inc_page_state(pgfault); > > - if (is_vm_hugetlb_page(vma)) > - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ > + if (unlikely(is_vm_hugetlb_page(vma))) { > + if (valid_hugetlb_file_off(vma, address)) > + /* We get here only if there was a stale(zero) TLB entry > + * (because of HW prefetching). > + * Low-level arch code (if needed) should have already > + * purged the stale entry as part of this fault handling. > + * Here we just return. > + */ > + return VM_FAULT_MINOR; > + else > + return VM_FAULT_SIGBUS; /* mapping truncation does this. */ > + } (I'm not surprised that the low-level arch code fixes this up as part of the fault handling. I am surprised that it does so only after giving the fault to the kernel. Sounds like something's gone wrong.) What happens when the hugetlb file is truncated down and back up after the mmap? Truncating down will remove a page from the mmap and flush TLB. Truncating up will let accesses to the gone page pass the valid...off test. But we've no support for hugetlb faulting in this version: so won't it get get stuck in a tight loop? If it's decided that this issue is actually an ia64 one, and does need to be fixed right now, then I'd have thought your idea of fixing it at the ia64 end better: arch/ia64/mm/fault.c already has code for discarding faults on speculative loads, and ia64 has an RGN_HPAGE set aside for hugetlb, so shouldn't it just discard speculative loads on that region? Hugh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 15:48 ` Hugh Dickins @ 2005-10-19 19:05 ` Rohit Seth 2005-10-19 20:00 ` Hugh Dickins 0 siblings, 1 reply; 21+ messages in thread From: Rohit Seth @ 2005-10-19 19:05 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, torvalds, Adam Litke On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote: > On Tue, 18 Oct 2005, Andrew Morton wrote: > > Rohit Seth <rohit.seth@intel.com> wrote: > > > > > > The prefetching problem is handled OK for regular pages because we can > > > handle page faults corresponding to those pages. That is currently not > > > true for hugepages. Currently the kernel assumes that PAGE_FAULT > > > happening against a hugetlb page is caused by truncate and returns > > > SIGBUS. > > Is Rohit's intended to be a late 2.6.14 fix? We seem to have done well > without it for several years, and are just on the point of changing to > prefaulting the hugetlb pages anyway, which will fix it up. > I understand this is (very) late in the cycle for 2.6.14 But this issue is happening on existing HW. Moreover, if demand paging of hugepages is going to make into next version release from Linus then it makes all the more sense to fill in the gap for prefaulting in this release itself. It is also quite possible that some OSVs use 2.6.14 as the base for their next releases. FWIW this patch is applicable to all the previous kernel versions as well. That done well for years logic ..... does not really hold good. There is always better HW and SW that brings up new issues in existing code that happened to be running well for years. > > > What happens when the hugetlb file is truncated down and back up after > the mmap? Truncating down will remove a page from the mmap and flush TLB. > Truncating up will let accesses to the gone page pass the valid...off test. > But we've no support for hugetlb faulting in this version: so won't it get > get stuck in a tight loop? > First of all, there is currently no support of extending the hugetlb file size using truncate in 2.6.14. (I agree that part is broken). So the above scenario can not happen. For the prefaulting hugepage case that exist today, I would assume that (if someone does extend ) using truncate to extend the hugetlb file size support, would update the PTEs of all the processes that have those hugepages mapped (just like when truncating it down currently kernel clear the ptes from all the processes PT). > If it's decided that this issue is actually an ia64 one, and does need to > be fixed right now, then I'd have thought your idea of fixing it at the > ia64 end better: arch/ia64/mm/fault.c already has code for discarding > faults on speculative loads, and ia64 has an RGN_HPAGE set aside for > hugetlb, so shouldn't it just discard speculative loads on that region? > This has nothing to do with speculative load fault on IA-64. The first time kernel finds out about if a page belongs to hugepage in the path of page_fault is in mm/memory.c Though the duplicated logic could be added in the arch specific tree. I wonder if other archs like ppc is also exposed to this behavior. -rohit ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 19:05 ` Rohit Seth @ 2005-10-19 20:00 ` Hugh Dickins 2005-10-19 20:19 ` Andrew Morton 0 siblings, 1 reply; 21+ messages in thread From: Hugh Dickins @ 2005-10-19 20:00 UTC (permalink / raw) To: Rohit Seth; +Cc: Andrew Morton, linux-kernel, torvalds, Adam Litke Many thanks for your helpful answers in other mail: much appreciated. On Wed, 19 Oct 2005, Rohit Seth wrote: > On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote: > > > > What happens when the hugetlb file is truncated down and back up after > > the mmap? Truncating down will remove a page from the mmap and flush TLB. > > Truncating up will let accesses to the gone page pass the valid...off test. > > But we've no support for hugetlb faulting in this version: so won't it get > > get stuck in a tight loop? > > First of all, there is currently no support of extending the hugetlb > file size using truncate in 2.6.14. (I agree that part is broken). So > the above scenario can not happen. I was forgetting that extending ftruncate wasn't supported in 2.6.14 and earlier, yes. But I'm afraid the above scenario can still happen there: extending is done, not by ftruncate, but by (somewhere else) mmapping the larger size. So your fix may still cause a tight infinite fault loop. > For the prefaulting hugepage case that exist today, I would assume that > (if someone does extend ) using truncate to extend the hugetlb file size > support, would update the PTEs of all the processes that have those > hugepages mapped (just like when truncating it down currently kernel > clear the ptes from all the processes PT). Sorry, no, you're dreaming there. And I very much doubt we'd want to implement such a behaviour! (But yes, it's a good example of why several of us are quite unhappy with the extend-to-fill-mmap behaviour.) > > If it's decided that this issue is actually an ia64 one, and does need to > > be fixed right now, then I'd have thought your idea of fixing it at the > > ia64 end better: arch/ia64/mm/fault.c already has code for discarding > > faults on speculative loads, and ia64 has an RGN_HPAGE set aside for > > hugetlb, so shouldn't it just discard speculative loads on that region? > > This has nothing to do with speculative load fault on IA-64. As you made clear in your other mail: sorry for wasting your time with my ignorant confusions! Hugh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 20:00 ` Hugh Dickins @ 2005-10-19 20:19 ` Andrew Morton 2005-10-19 20:28 ` Hugh Dickins 0 siblings, 1 reply; 21+ messages in thread From: Andrew Morton @ 2005-10-19 20:19 UTC (permalink / raw) To: Hugh Dickins; +Cc: rohit.seth, linux-kernel, torvalds, agl Hugh Dickins <hugh@veritas.com> wrote: > > On Wed, 19 Oct 2005, Rohit Seth wrote: > > On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote: > > > > > > What happens when the hugetlb file is truncated down and back up after > > > the mmap? Truncating down will remove a page from the mmap and flush TLB. > > > Truncating up will let accesses to the gone page pass the valid...off test. > > > But we've no support for hugetlb faulting in this version: so won't it get > > > get stuck in a tight loop? > > > > First of all, there is currently no support of extending the hugetlb > > file size using truncate in 2.6.14. (I agree that part is broken). So > > the above scenario can not happen. > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and > earlier, yes. But I'm afraid the above scenario can still happen there: > extending is done, not by ftruncate, but by (somewhere else) mmapping the > larger size. So your fix may still cause a tight infinite fault loop. Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma with hugepages. So I don't think there's ever any part of an address space which ia a) inside a hugepage vma and b) doesn't have a hugepage backing it. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 20:19 ` Andrew Morton @ 2005-10-19 20:28 ` Hugh Dickins 2005-10-19 23:53 ` Rohit Seth 0 siblings, 1 reply; 21+ messages in thread From: Hugh Dickins @ 2005-10-19 20:28 UTC (permalink / raw) To: Andrew Morton; +Cc: rohit.seth, linux-kernel, torvalds, agl On Wed, 19 Oct 2005, Andrew Morton wrote: > Hugh Dickins <hugh@veritas.com> wrote: > > > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and > > earlier, yes. But I'm afraid the above scenario can still happen there: > > extending is done, not by ftruncate, but by (somewhere else) mmapping the > > larger size. So your fix may still cause a tight infinite fault loop. > > Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma > with hugepages. So I don't think there's ever any part of an address space > which ia a) inside a hugepage vma and b) doesn't have a hugepage backing > it. The new vma, sure, will be fully populated. But the old vma, in this or some other process, which was created before the hugetlbfs file was truncated down, will be left with a hole at the end. Hugh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 20:28 ` Hugh Dickins @ 2005-10-19 23:53 ` Rohit Seth 2005-10-20 1:36 ` Rohit Seth 0 siblings, 1 reply; 21+ messages in thread From: Rohit Seth @ 2005-10-19 23:53 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, torvalds, agl On Wed, 2005-10-19 at 21:28 +0100, Hugh Dickins wrote: > On Wed, 19 Oct 2005, Andrew Morton wrote: > > Hugh Dickins <hugh@veritas.com> wrote: > > > > > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and > > > earlier, yes. But I'm afraid the above scenario can still happen there: > > > extending is done, not by ftruncate, but by (somewhere else) mmapping the > > > larger size. So your fix may still cause a tight infinite fault loop. > > > > Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma > > with hugepages. So I don't think there's ever any part of an address space > > which ia a) inside a hugepage vma and b) doesn't have a hugepage backing > > it. > > The new vma, sure, will be fully populated. But the old vma, in this > or some other process, which was created before the hugetlbfs file was > truncated down, will be left with a hole at the end. > Excellent catch. This broken truncation thing.... And I don't know what the right solution should be for this scenario at this point for 2.6.14....may be to actually look at the HUGEPTE corresponding to the hugetlb faulting address or don't allow mmaps to grow the hugetlb file bigger (except the first mmap). I understand that both of them don't sound too good... Any suggestions. -rohit ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 23:53 ` Rohit Seth @ 2005-10-20 1:36 ` Rohit Seth 2005-10-20 1:37 ` Andrew Morton 2005-10-20 6:17 ` Hugh Dickins 0 siblings, 2 replies; 21+ messages in thread From: Rohit Seth @ 2005-10-20 1:36 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, torvalds, agl On Wed, 2005-10-19 at 16:53 -0700, Rohit Seth wrote: > On Wed, 2005-10-19 at 21:28 +0100, Hugh Dickins wrote: > > On Wed, 19 Oct 2005, Andrew Morton wrote: > > > Hugh Dickins <hugh@veritas.com> wrote: > > > > > > > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and > > > > earlier, yes. But I'm afraid the above scenario can still happen there: > > > > extending is done, not by ftruncate, but by (somewhere else) mmapping the > > > > larger size. So your fix may still cause a tight infinite fault loop. > > > > > > Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma > > > with hugepages. So I don't think there's ever any part of an address space > > > which ia a) inside a hugepage vma and b) doesn't have a hugepage backing > > > it. > > > > The new vma, sure, will be fully populated. But the old vma, in this > > or some other process, which was created before the hugetlbfs file was > > truncated down, will be left with a hole at the end. > > > > Excellent catch. This broken truncation thing.... > > And I don't know what the right solution should be for this scenario at > this point for 2.6.14....may be to actually look at the HUGEPTE > corresponding to the hugetlb faulting address or don't allow mmaps to > grow the hugetlb file bigger (except the first mmap). I understand that > both of them don't sound too good... > > Any suggestions. > I would like to keep this patch. This at least takes care of bad things happening behind application's back by OS/HW. If the scenario that you mentioned happens then the application is knowingly doing unsupported things (at this point truncate is a broken operation on hugetlb). The application can be killed in this case. ...trying to avoid any heavy changes at this time for 2.6.14 -rohit ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-20 1:36 ` Rohit Seth @ 2005-10-20 1:37 ` Andrew Morton 2005-10-20 6:17 ` Hugh Dickins 1 sibling, 0 replies; 21+ messages in thread From: Andrew Morton @ 2005-10-20 1:37 UTC (permalink / raw) To: Rohit Seth; +Cc: hugh, linux-kernel, torvalds, agl Rohit Seth <rohit.seth@intel.com> wrote: > > > Excellent catch. This broken truncation thing.... > > > > And I don't know what the right solution should be for this scenario at > > this point for 2.6.14....may be to actually look at the HUGEPTE > > corresponding to the hugetlb faulting address or don't allow mmaps to > > grow the hugetlb file bigger (except the first mmap). I understand that > > both of them don't sound too good... > > > > Any suggestions. > > > > > I would like to keep this patch. This at least takes care of bad things > happening behind application's back by OS/HW. If the scenario that you > mentioned happens then the application is knowingly doing unsupported > things (at this point truncate is a broken operation on hugetlb). The > application can be killed in this case. Yes, we had an infinite-page-fault bug in the vmalloc code ages ago and the app was indeed killable. It'd be nice to confirm that this is still the case. If so, the stopgap approach seems reasonable. As long as the demand-paging patches get in, that is. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-20 1:36 ` Rohit Seth 2005-10-20 1:37 ` Andrew Morton @ 2005-10-20 6:17 ` Hugh Dickins 1 sibling, 0 replies; 21+ messages in thread From: Hugh Dickins @ 2005-10-20 6:17 UTC (permalink / raw) To: Rohit Seth; +Cc: Andrew Morton, linux-kernel, Linus Torvalds, Adam Litke On Wed, 19 Oct 2005, Rohit Seth wrote: > On Wed, 2005-10-19 at 16:53 -0700, Rohit Seth wrote: > > > > And I don't know what the right solution should be for this scenario at > > this point for 2.6.14....may be to actually look at the HUGEPTE > > corresponding to the hugetlb faulting address or don't allow mmaps to > > grow the hugetlb file bigger (except the first mmap). I understand that > > both of them don't sound too good... > > I would like to keep this patch. This at least takes care of bad things > happening behind application's back by OS/HW. If the scenario that you > mentioned happens then the application is knowingly doing unsupported > things (at this point truncate is a broken operation on hugetlb). The > application can be killed in this case. > > ...trying to avoid any heavy changes at this time for 2.6.14 I think your first suggestion, actually looking at the hugepte, is the right one. I've dived into Adam's forthcoming hugetlb fault-on-demand patch and extracted the minimum we need (but thought we might as well take all the arguments to hugetlb_fault, even those not yet used). This works fine so far as I can test it now, how about for you? [Adam, if this goes in, your patch shouldn't need to touch memory.c or hugetlb.h; and we hadn't noticed, yours was missing the hugetlb_fault definition for the #else config, giving compile warning - fixed here.] Signed-off-by: Hugh Dickins <hugh@veritas.com> --- include/linux/hugetlb.h | 3 +++ mm/hugetlb.c | 22 ++++++++++++++++++++++ mm/memory.c | 2 +- 3 files changed, 26 insertions(+), 1 deletion(-) --- 2.6.14-rc4-git7/include/linux/hugetlb.h 2005-10-11 12:07:52.000000000 +0100 +++ linux/include/linux/hugetlb.h 2005-10-20 06:27:51.000000000 +0100 @@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t); unsigned long hugetlb_total_pages(void); struct page *alloc_huge_page(void); void free_huge_page(struct page *); +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access); extern unsigned long max_huge_pages; extern const unsigned long hugetlb_zero, hugetlb_infinity; @@ -99,6 +101,7 @@ static inline unsigned long hugetlb_tota do { } while (0) #define alloc_huge_page() ({ NULL; }) #define free_huge_page(p) ({ (void)(p); BUG(); }) +#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; }) #ifndef HPAGE_MASK #define HPAGE_MASK 0 /* Keep the compiler happy */ --- 2.6.14-rc4-git7/mm/hugetlb.c 2005-10-11 12:07:55.000000000 +0100 +++ linux/mm/hugetlb.c 2005-10-20 06:27:51.000000000 +0100 @@ -393,6 +393,28 @@ out: return ret; } +/* + * On ia64 at least, it is possible to receive a hugetlb fault from a + * stale zero entry left in the TLB from earlier hardware prefetching. + * Low-level arch code should already have flushed the stale entry as + * part of its fault handling, but we do need to accept this minor fault + * and return successfully. Whereas the "normal" case is that this is + * an access to a hugetlb page which has been truncated off since mmap. + */ +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access) +{ + int ret = VM_FAULT_SIGBUS; + pte_t *pte; + + spin_lock(&mm->page_table_lock); + pte = huge_pte_offset(mm, address); + if (pte && !pte_none(*pte)) + ret = VM_FAULT_MINOR; + spin_unlock(&mm->page_table_lock); + return ret; +} + int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, int *length, int i) --- 2.6.14-rc4-git7/mm/memory.c 2005-10-11 12:07:55.000000000 +0100 +++ linux/mm/memory.c 2005-10-20 06:27:51.000000000 +0100 @@ -2046,7 +2046,7 @@ int __handle_mm_fault(struct mm_struct * inc_page_state(pgfault); if (is_vm_hugetlb_page(vma)) - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + return hugetlb_fault(mm, vma, address, write_access); /* * We need the page table lock to synchronize with kswapd ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 3:25 ` Rohit Seth 2005-10-19 4:07 ` Andrew Morton @ 2005-10-19 15:23 ` Hugh Dickins 2005-10-19 18:47 ` Rohit Seth 1 sibling, 1 reply; 21+ messages in thread From: Hugh Dickins @ 2005-10-19 15:23 UTC (permalink / raw) To: Rohit Seth; +Cc: Andrew Morton, linux-kernel, torvalds On Tue, 18 Oct 2005, Rohit Seth wrote: > On Tue, 2005-10-18 at 17:25 -0700, Andrew Morton wrote: > > Rohit Seth <rohit.seth@intel.com> wrote: > > > > > > > > > Unsused pte is where *pte == 0. Basically entries in leaf page table > > > that does not map anything. > > > > Oh. Then I'm still not understanding. I've had great trouble understanding it too. This mail was the most helpful, thanks. > Following is the scenario: > > Kernel is going to prefault the huge page for user virtual address V. > > Let us say the pointer pte points to the page table entry (for x86 it is > PDE/PMD) which maps the hugepage. > > At this point *pte == 0. CPU also has this entry cached in its TLB. I thought that the CPU never caches !present entries in the TLB? Or is that true of i386 (and x86_64), but untrue of ia64? Or do you have some new model or errata on some CPU where it's true? Or, final ghastly possibility ;), am I simply altogether wrong? > Meaning, unless this entry is purged or displaced, for virtual address V When you say "purged", is that what we elsewhere call "flushed" in relation to the TLB, or something else? > CPU will generate the page fault (as the P bit is not set and assuming > this fault has the highest precedence). > > Kernel updates the *pte so that it now maps the hugepage at virtual > address V to physical address P. > > Later when the user process make a reference to V, because of stale TLB > entry, the processor gets PAGE_FAULT. You seem to be saying that strictly, we ought to flush TLB even when we make a page present where none was before, but that the likelihood of it being needed is so low, and the overhead of TLB flush so high, and the existing code almost everywhere recovering safely from this condition, that the most effective thing to do is just fix up the hugetlb case. Is that correct? > > Has this problem been observed in testing? > > Yes. On IA-64. But not on i386 or x86_64. > > I still don't understand what's different about hugepages here. If the > > prefetching problem also occurs on regular pages and is handled OK for > > regular pages, why do hugepages need special treatment? Okay, I get that part of the puzzle. > The prefetching problem is handled OK for regular pages because we can But here you mention prefetching, and I think in another of these explanations you mention that this only happens via speculative prefetch. I thought prefetch was designed to have no such visible effects? Same series of doubts as with !present entries in the TLB; but after looking at the ia64 fault handler, that does seem to have stuff about speculative loads, so I'm guessing i386 and x86_64 prefetch does not cause faults (modulo errata), but ia64 does. > handle page faults corresponding to those pages. That is currently not > true for hugepages. Currently the kernel assumes that PAGE_FAULT > happening against a hugetlb page is caused by truncate and returns > SIGBUS. Testing against i_size is not good enough, but I'll explain that in other mail against Andrew's latest version of the patch. Once I started to understand this thread, I thought you were quite wrong to be changing hugetlb fault handling, thought I'd find several other places which would need fixing too e.g. kmap_atomic, remap_pfn_range. But no, I've found no others. Either miraculously, or by good design, all the kernel misfaults should be seamlessly handled by the lazy vmalloc path (on i386 anyway: I don't know what happens for ia64 there), and the userspace misfaults by handle_pte_fault's pte_present check. I think. Even with Nick's PageReserved changes in -mm, where he might have chosen to put a warning on VM_RESERVED higher in the fault path, it looks like it needs no change for this issue. Hugh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 15:23 ` Hugh Dickins @ 2005-10-19 18:47 ` Rohit Seth 2005-10-19 20:53 ` Linus Torvalds 0 siblings, 1 reply; 21+ messages in thread From: Rohit Seth @ 2005-10-19 18:47 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, torvalds On Wed, 2005-10-19 at 16:23 +0100, Hugh Dickins wrote: > I thought that the CPU never caches !present entries in the TLB? > Or is that true of i386 (and x86_64), but untrue of ia64? IA-64 can prefetch any entry from VHPT (last level page table) irrespective of its value. You are right that i386 and x86_64 does not cache !present entry. Though OS is suppose to handle those faults if happen. > Or do you have some new model or errata on some CPU where it's true? No errata here. > Or, final ghastly possibility ;), am I simply altogether wrong? > You are asking the right questions here. > > Meaning, unless this entry is purged or displaced, for virtual address V > > When you say "purged", is that what we elsewhere call "flushed" > in relation to the TLB, or something else? > I should use flush to be consistent. > > CPU will generate the page fault (as the P bit is not set and assuming > > this fault has the highest precedence). > > > > Kernel updates the *pte so that it now maps the hugepage at virtual > > address V to physical address P. > > > > Later when the user process make a reference to V, because of stale TLB > > entry, the processor gets PAGE_FAULT. > > You seem to be saying that strictly, we ought to flush TLB even when we > make a page present where none was before, but that the likelihood of it > being needed is so low, and the overhead of TLB flush so high, and the > existing code almost everywhere recovering safely from this condition, > that the most effective thing to do is just fix up the hugetlb case. > Is that correct? > Yes. At least for the architectures that can cache any translation in its TLB. IA-64 is again a good example here. It flushes the entry only at the fault time so that next time around you get the updated entry (for the cases where the fault happened because of any stale TLB). > > > Has this problem been observed in testing? > > > > Yes. On IA-64. > > But not on i386 or x86_64. > No. > Same series of doubts as with !present entries in the TLB; but after > looking at the ia64 fault handler, that does seem to have stuff about > speculative loads, so I'm guessing i386 and x86_64 prefetch does not > cause faults (modulo errata), but ia64 does. > Those speculative loads (are more of advanced loads generated by compiler in anticipation that they will be helpful) on IA-64 are different from prefetches that HW does for TLBs. HW Speculative loads never generates any fault. Whereas prefetched TLB entries in i386, x86_64 or IA-64 can cause fault if they are not flushed after updates. > Once I started to understand this thread, I thought you were quite > wrong to be changing hugetlb fault handling, thought I'd find several > other places which would need fixing too e.g. kmap_atomic, remap_pfn_range. > > But no, I've found no others. Either miraculously, or by good design, > all the kernel misfaults should be seamlessly handled by the lazy vmalloc > path (on i386 anyway: I don't know what happens for ia64 there), and the > userspace misfaults by handle_pte_fault's pte_present check. I think. > Good OS design :-) Though on IA-64 there was recently a similar issue for vmalloc area that got fixed in low level arch specific code. -rohit ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 18:47 ` Rohit Seth @ 2005-10-19 20:53 ` Linus Torvalds 2005-10-19 21:59 ` Tony Luck 2005-10-20 0:05 ` Rohit Seth 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2005-10-19 20:53 UTC (permalink / raw) To: Rohit Seth; +Cc: Hugh Dickins, Andrew Morton, linux-kernel On Wed, 19 Oct 2005, Rohit Seth wrote: > > IA-64 can prefetch any entry from VHPT (last level page table) > irrespective of its value. You are right that i386 and x86_64 does not > cache !present entry. Though OS is suppose to handle those faults if > happen. Well.. The fact is, the VM layer is designed for systems that do not cache not-present entries in their TLB. See for example the end of do_no_page() in mm/memory.c: /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: return ret; which _allows_ for hardware that caches not-present pages, but the architecture needs to catch them in the "update_mmu_cache()". IOW, the kernel is largely designed for present-only caching, and only has explicit tlb flush macros for that case. If ia64 caches non-present TLB entries, then that would seem to be a bug in the Linux ia64 port: - include/asm-ia64/pgtable.h: #define update_mmu_cache(vma, address, pte) do { } while (0) (Of course, you can and maybe do handle it differently: you can also decide to just take the TLB fault, and flush the TLB at fault time in your handler. I don't see that either on ia64, though. Although I didn't look into any of the asm code, so maybe it's hidden somewhere there). Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 20:53 ` Linus Torvalds @ 2005-10-19 21:59 ` Tony Luck 2005-10-20 0:05 ` Rohit Seth 1 sibling, 0 replies; 21+ messages in thread From: Tony Luck @ 2005-10-19 21:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rohit Seth, Hugh Dickins, Andrew Morton, linux-kernel On 10/19/05, Linus Torvalds <torvalds@osdl.org> wrote: > (Of course, you can and maybe do handle it differently: you can also > decide to just take the TLB fault, and flush the TLB at fault time in your > handler. I don't see that either on ia64, though. Although I didn't look > into any of the asm code, so maybe it's hidden somewhere there). Yes. In arch/ia64/kernel/ivt.S: ENTRY(page_not_present) DBG_FAULT(20) mov r16=cr.ifa rsm psr.dt /* * The Linux page fault handler doesn't expect non-present pages to be in * the TLB. Flush the existing entry now, so we meet that expectation. */ mov r17=PAGE_SHIFT<<2 ;; ptc.l r16,r17 -Tony ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 2005-10-19 20:53 ` Linus Torvalds 2005-10-19 21:59 ` Tony Luck @ 2005-10-20 0:05 ` Rohit Seth 1 sibling, 0 replies; 21+ messages in thread From: Rohit Seth @ 2005-10-20 0:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Hugh Dickins, Andrew Morton, linux-kernel On Wed, 2005-10-19 at 13:53 -0700, Linus Torvalds wrote: > > The fact is, the VM layer is designed for systems that do not cache > not-present entries in their TLB. See for example the end of do_no_page() > in mm/memory.c: > > /* no need to invalidate: a not-present page shouldn't be cached */ > update_mmu_cache(vma, address, entry); > lazy_mmu_prot_update(entry); > spin_unlock(&mm->page_table_lock); > out: > return ret; > > which _allows_ for hardware that caches not-present pages, but the > architecture needs to catch them in the "update_mmu_cache()". > I agree that one way would be to flush the TLB as part of update_mmu_cache. But that would result in too many extra global flushes. Instead in IA-64, we wait till fault time to do local flushes whenever needed. > If ia64 caches non-present TLB entries, then that would seem to be a bug > in the Linux ia64 port: > > - include/asm-ia64/pgtable.h: > #define update_mmu_cache(vma, address, pte) do { } while (0) > > (Of course, you can and maybe do handle it differently: you can also > decide to just take the TLB fault, and flush the TLB at fault time in your > handler. I don't see that either on ia64, though. Although I didn't look > into any of the asm code, so maybe it's hidden somewhere there). > The low level page_not_present vector (asm code) flushes the stale entry that could be sitting in TLB resulting in current page fault. But anyways now there is another scenario that Hugh has pointed out in the last mail that needs to be taken care of too... -rohit ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2005-10-20 6:18 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-10-18 21:15 [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5 Seth, Rohit 2005-10-18 21:34 ` Andrew Morton 2005-10-18 22:17 ` Rohit Seth 2005-10-19 0:25 ` Andrew Morton 2005-10-19 3:25 ` Rohit Seth 2005-10-19 4:07 ` Andrew Morton 2005-10-19 14:33 ` Adam Litke 2005-10-19 15:48 ` Hugh Dickins 2005-10-19 19:05 ` Rohit Seth 2005-10-19 20:00 ` Hugh Dickins 2005-10-19 20:19 ` Andrew Morton 2005-10-19 20:28 ` Hugh Dickins 2005-10-19 23:53 ` Rohit Seth 2005-10-20 1:36 ` Rohit Seth 2005-10-20 1:37 ` Andrew Morton 2005-10-20 6:17 ` Hugh Dickins 2005-10-19 15:23 ` Hugh Dickins 2005-10-19 18:47 ` Rohit Seth 2005-10-19 20:53 ` Linus Torvalds 2005-10-19 21:59 ` Tony Luck 2005-10-20 0:05 ` Rohit Seth
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox