* [PATCH v1 0/7] Huge page support for DAX
@ 2014-10-08 13:25 Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 1/7] thp: vma_adjust_trans_huge(): adjust file-backed VMA too Matthew Wilcox
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

This patchset, on top of the v11 DAX patchset I posted recently, adds
support for transparent huge pages.  In-memory databases and HPC apps are
particularly fond of using huge pages for their massive data sets.
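
For example, an application that wants its DAX mapping to be eligible for
2MB faults maps the file MAP_SHARED, with both the file offset and the
virtual address 2MB-aligned.  An illustrative sketch only (the path and
the address hint are made up):

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 30;			/* map 1GB of the file */
		void *hint = (void *)(1UL << 40);	/* 2MB-aligned address hint */
		int fd = open("/mnt/dax/bigfile", O_RDWR);
		char *p;

		if (fd < 0)
			return 1;
		/* offset 0 is 2MB-aligned; MAP_SHARED because write faults on
		 * MAP_PRIVATE mappings currently fall back to PTEs */
		p = mmap(hint, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* this fault can now be served by a 2MB mapping */
		return 0;
	}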

The actual DAX code here is not yet how I want it to be; for example, it
allocates on read faults instead of filling with zero pages until we get
a write fault (which is going to prove tricky without at least some of
Kirill's patches for supporting huge pages in the page cache).

I'm posting this for review now since I clearly don't understand the
Linux MM very well and I'm expecting to be told I've done all the huge
memory bits wrongly :-)

I'd like to thank Kirill for all his helpful suggestions ... I may not
have taken all of them, but this would be in a lot worse shape without
him.

The first patch is from Kirill's patchset to allow huge pages in the
page cache.  Patches 2-4 are the ones that touch the MM and I'd really
like reviewed.  Patch 5 is the DAX code that is easily critiqued, and
patches 6 & 7 are very boring, just hooking up the dax-hugepage code to
ext2 & ext4.

Kirill A. Shutemov (1):
  thp: vma_adjust_trans_huge(): adjust file-backed VMA too

Matthew Wilcox (6):
  mm: Prepare for DAX huge pages
  mm: Add vm_insert_pfn_pmd()
  mm: Add a pmd_fault handler
  dax: Add huge page fault support
  ext2: Huge page fault support
  ext4: Huge page fault support

 Documentation/filesystems/dax.txt |   7 +-
 arch/x86/include/asm/pgtable.h    |  10 +++
 fs/dax.c                          | 133 ++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c                    |   9 ++-
 fs/ext4/file.c                    |   9 ++-
 include/linux/fs.h                |   2 +
 include/linux/huge_mm.h           |  11 +---
 include/linux/mm.h                |   4 ++
 mm/huge_memory.c                  |  53 +++++++++------
 mm/memory.c                       |  63 ++++++++++++++++--
 10 files changed, 262 insertions(+), 39 deletions(-)

-- 
2.1.0


* [PATCH v1 1/7] thp: vma_adjust_trans_huge(): adjust file-backed VMA too
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 2/7] mm: Prepare for DAX huge pages Matthew Wilcox
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Kirill A. Shutemov, willy

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages in the page cache, we need to call
vma_adjust_trans_huge() for file-backed VMAs too, since they can
potentially contain huge pages.

For now we call it for all VMAs.

We will probably need to introduce a flag later to indicate that a VMA
may contain huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
---
 include/linux/huge_mm.h | 11 +----------
 mm/huge_memory.c        |  2 +-
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..c4e050d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -122,7 +122,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
 			    unsigned long *vm_flags, int advice);
-extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
@@ -138,15 +138,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 	else
 		return 0;
 }
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 long adjust_next)
-{
-	if (!vma->anon_vma || vma->vm_ops)
-		return;
-	__vma_adjust_trans_huge(vma, start, end, adjust_next);
-}
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..2a56ddd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2940,7 +2940,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 	split_huge_page_pmd_mm(mm, address, pmd);
 }
 
-void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+void vma_adjust_trans_huge(struct vm_area_struct *vma,
 			     unsigned long start,
 			     unsigned long end,
 			     long adjust_next)
-- 
2.1.0


* [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 1/7] thp: vma_adjust_trans_huge(): adjust file-backed VMA too Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 15:21   ` Kirill A. Shutemov
  2014-10-08 13:25 ` [PATCH v1 3/7] mm: Add vm_insert_pfn_pmd() Matthew Wilcox
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

DAX wants to use the 'special' bit to mark PMD entries that are not backed
by struct page, just as for PTEs.  Add pmd_special() and pmd_mkspecial()
for x86 (NB: these also need to be added for other architectures).
Prepare do_huge_pmd_wp_page(), zap_huge_pmd() and __split_huge_page_pmd()
to handle pmd_special entries.
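
For context, the special entries are created with pmd_mkspecial() when the
pfn is inserted (later in this series) and then recognised with
pmd_special() in the paths touched here; condensed, the pattern is:

	/* insertion side (see vm_insert_pfn_pmd() later in the series) */
	entry = pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot)));
	set_pmd_at(mm, addr, pmd, entry);

	/* tear-down / write-fault side */
	if (pmd_special(orig_pmd)) {
		/* no struct page behind this mapping: skip rmap/mapcount work */
	}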

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 10 +++++++++
 mm/huge_memory.c               | 51 ++++++++++++++++++++++++++----------------
 2 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a07..f4f42f2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -302,6 +302,11 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
@@ -504,6 +509,11 @@ static inline int pmd_none(pmd_t pmd)
 	return (unsigned long)native_pmd_val(pmd) == 0;
 }
 
+static inline int pmd_special(pmd_t pmd)
+{
+	return (pmd_flags(pmd) & _PAGE_SPECIAL) && pmd_present(pmd);
+}
+
 static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 {
 	return (unsigned long)__va(pmd_val(pmd) & PTE_PFN_MASK);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a56ddd..ad09fc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1096,7 +1096,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	ptl = pmd_lockptr(mm, pmd);
-	VM_BUG_ON(!vma->anon_vma);
 	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
@@ -1104,9 +1103,20 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_unlock;
 
-	page = pmd_page(orig_pmd);
-	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
-	if (page_mapcount(page) == 1) {
+	if (pmd_special(orig_pmd)) {
+		/* VM_MIXEDMAP !pfn_valid() case */
+		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
+				     (VM_WRITE|VM_SHARED)) {
+			pmdp_clear_flush(vma, haddr, pmd);
+			ret = VM_FAULT_FALLBACK;
+			goto out_unlock;
+		}
+	} else {
+		VM_BUG_ON(!vma->anon_vma);
+		page = pmd_page(orig_pmd);
+		VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
+	}
+	if (!page || page_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1391,7 +1401,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	int ret = 0;
 
 	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
-		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
 		/*
@@ -1402,13 +1411,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		if (pmd_special(orig_pmd)) {
+			spin_unlock(ptl);
+			return 1;
+		}
 		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 		if (is_huge_zero_pmd(orig_pmd)) {
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			spin_unlock(ptl);
 			put_huge_zero_page();
 		} else {
-			page = pmd_page(orig_pmd);
+			struct page *page = pmd_page(orig_pmd);
 			page_remove_rmap(page);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
@@ -2860,7 +2873,7 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 		pmd_t *pmd)
 {
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *page = NULL;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	unsigned long mmun_start;	/* For mmu_notifiers */
@@ -2873,25 +2886,25 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 again:
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
+	if (unlikely(!pmd_trans_huge(*pmd)))
+		goto unlock;
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
+	} else if (pmd_special(*pmd)) {
+		pmdp_clear_flush(vma, haddr, pmd);
+	} else {
+		page = pmd_page(*pmd);
+		VM_BUG_ON_PAGE(!page_count(page), page);
+		get_page(page);
 	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
+ unlock:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-	split_huge_page(page);
+	if (!page)
+		return;
 
+	split_huge_page(page);
 	put_page(page);
 
 	/*
-- 
2.1.0


* [PATCH v1 3/7] mm: Add vm_insert_pfn_pmd()
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 1/7] thp: vma_adjust_trans_huge(): adjust file-backed VMA too Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 2/7] mm: Prepare for DAX huge pages Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 4/7] mm: Add a pmd_fault handler Matthew Wilcox
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Similar to vm_insert_pfn(), but for PMDs rather than PTEs.  Should this
live in mm/huge_memory.c instead?
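
A pmd_fault handler (added later in this series) would use it roughly like
this -- an illustrative sketch only, where example_lookup_pfn() is a
made-up helper standing in for however the caller finds its 2MB-aligned
pfn:

	static int example_pmd_fault(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmd, unsigned int flags)
	{
		unsigned long pfn = example_lookup_pfn(vma, address & PMD_MASK);
		int error;

		/* The pfn itself must be aligned to a 2MB boundary */
		if (pfn & ((PMD_SIZE >> PAGE_SHIFT) - 1))
			return VM_FAULT_FALLBACK;

		error = vm_insert_pfn_pmd(vma, address, pmd, pfn);
		if (error == -ENOMEM)
			return VM_FAULT_OOM;
		/* -EBUSY means somebody else already installed this PMD */
		if (error && error != -EBUSY)
			return VM_FAULT_SIGBUS;
		return VM_FAULT_NOPAGE;
	}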

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a47817..d0de9fa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1960,6 +1960,8 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
+int vm_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
+			unsigned long pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index 3368785..993be2b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1648,6 +1648,54 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+			pmd_t *pmd, unsigned long pfn, pgprot_t prot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int retval;
+	pmd_t entry;
+	spinlock_t *ptl;
+
+	ptl = pmd_lock(mm, pmd);
+	retval = -EBUSY;
+	if (!pmd_none(*pmd))
+		goto out_unlock;
+
+	/* Ok, finally just insert the thing.. */
+	entry = pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot)));
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
+
+	retval = 0;
+ out_unlock:
+	spin_unlock(ptl);
+	return retval;
+}
+
+int vm_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+					pmd_t *pmd, unsigned long pfn)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+	/*
+	 * Technically, architectures with pte_special can avoid all these
+	 * restrictions (same for remap_pfn_range).  However we would like
+	 * consistency in testing and feature parity among all, so we should
+	 * try to keep these invariants in place for everybody.
+	 */
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+						(VM_PFNMAP|VM_MIXEDMAP));
+	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return -EFAULT;
+	if (track_pfn_insert(vma, &pgprot, pfn))
+		return -EINVAL;
+	return insert_pfn_pmd(vma, addr, pmd, pfn, pgprot);
+}
+EXPORT_SYMBOL(vm_insert_pfn_pmd);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
2.1.0

* [PATCH v1 4/7] mm: Add a pmd_fault handler
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
                   ` (2 preceding siblings ...)
  2014-10-08 13:25 ` [PATCH v1 3/7] mm: Add vm_insert_pfn_pmd() Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 5/7] dax: Add huge page fault support Matthew Wilcox
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Allow non-anonymous VMAs to provide huge pages in response to a page fault.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 15 +++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d0de9fa..c0b4f74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -229,6 +229,8 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
+						pmd_t *, unsigned int flags);
 	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
diff --git a/mm/memory.c b/mm/memory.c
index 993be2b..ec51b0f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3238,6 +3238,16 @@ out:
 	return 0;
 }
 
+static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+	if (!vma->vm_ops)
+		return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+	if (vma->vm_ops->pmd_fault)
+		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+	return VM_FAULT_FALLBACK;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3335,10 +3345,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
-		int ret = VM_FAULT_FALLBACK;
-		if (!vma->vm_ops)
-			ret = do_huge_pmd_anonymous_page(mm, vma, address,
-					pmd, flags);
+		int ret = create_huge_pmd(mm, vma, address, pmd, flags);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-- 
2.1.0


* [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
                   ` (3 preceding siblings ...)
  2014-10-08 13:25 ` [PATCH v1 4/7] mm: Add a pmd_fault handler Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 20:11   ` Kirill A. Shutemov
  2014-10-08 13:25 ` [PATCH v1 6/7] ext2: Huge " Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 7/7] ext4: " Matthew Wilcox
  6 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

This is the support code for DAX-enabled filesystems to allow them to
provide huge pages in response to faults.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 Documentation/filesystems/dax.txt |   7 +-
 fs/dax.c                          | 133 ++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |   2 +
 3 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index be376d9..f958b07 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -58,9 +58,10 @@ Filesystem support consists of
 - implementing the direct_IO address space operation, and calling
   dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
 - implementing an mmap file operation for DAX files which sets the
-  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
-  for fault and page_mkwrite (which should probably call dax_fault() and
-  dax_mkwrite(), passing the appropriate get_block() callback)
+  VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
+  include handlers for fault, pmd_fault and page_mkwrite (which should
+  probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the
+  appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
 - calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
diff --git a/fs/dax.c b/fs/dax.c
index 041d237..7be108b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -461,6 +461,139 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below function.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int do_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+			pmd_t *pmd, unsigned int flags, get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct buffer_head bh;
+	unsigned blkbits = inode->i_blkbits;
+	long length;
+	void *kaddr;
+	pgoff_t size, pgoff;
+	sector_t block, sector;
+	unsigned long pfn;
+	int major = 0;
+
+	/* Fall back to PTEs if we're going to COW */
+	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
+		return VM_FAULT_FALLBACK;
+	/* If the PMD would extend outside the VMA */
+	if ((address & PMD_MASK) < vma->vm_start)
+		return VM_FAULT_FALLBACK;
+	if (((address & PMD_MASK) + PMD_SIZE) > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+
+	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+	/* If the PMD would cover blocks out of the file */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		return VM_FAULT_FALLBACK;
+
+	memset(&bh, 0, sizeof(bh));
+	block = ((sector_t)pgoff & ~PG_PMD_COLOUR) << (PAGE_SHIFT - blkbits);
+
+	/* Start by seeing if we already have an allocated block */
+	bh.b_size = PMD_SIZE;
+	length = get_block(inode, block, &bh, 0);
+	if (length)
+		return VM_FAULT_SIGBUS;
+
+	if ((!buffer_mapped(&bh) && !buffer_unwritten(&bh)) ||
+						bh.b_size != PMD_SIZE) {
+		bh.b_size = PMD_SIZE;
+		length = get_block(inode, block, &bh, 1);
+		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+		major = VM_FAULT_MAJOR;
+		if (length)
+			return VM_FAULT_SIGBUS;
+		if (bh.b_size != PMD_SIZE)
+			return VM_FAULT_FALLBACK;
+	}
+
+	mutex_lock(&mapping->i_mmap_mutex);
+
+	/* Guard against a race with truncate */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= size)
+		goto sigbus;
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	sector = bh.b_blocknr << (blkbits - 9);
+	length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size);
+	if (length < 0)
+		goto sigbus;
+	if (length < PMD_SIZE)
+		goto fallback;
+	if (pfn & PG_PMD_COLOUR)
+		goto fallback;	/* not aligned */
+
+	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
+		int i;
+		for (i = 0; i < PTRS_PER_PMD; i++)
+			clear_page(kaddr + i * PAGE_SIZE);
+	}
+
+	length = vm_insert_pfn_pmd(vma, address, pmd, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (bh.b_end_io)
+		bh.b_end_io(&bh, 1);
+
+	if (length == -ENOMEM)
+		return VM_FAULT_OOM | major;
+	/* -EBUSY is fine, somebody else faulted on the same PMD */
+	if ((length < 0) && (length != -EBUSY))
+		return VM_FAULT_SIGBUS | major;
+	return VM_FAULT_NOPAGE | major;
+
+ fallback:
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return VM_FAULT_FALLBACK | major;
+
+ sigbus:
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return VM_FAULT_SIGBUS | major;
+}
+
+/**
+ * dax_pmd_fault - handle a PMD fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * pmd_fault handler for DAX files.
+ */
+int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+			pmd_t *pmd, unsigned int flags, get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	if (flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+	}
+	result = do_dax_pmd_fault(vma, address, pmd, flags, get_block);
+	if (flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_pmd_fault);
+
 /**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 105d0f0..3528597 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2495,6 +2495,8 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
+					unsigned int flags, get_block_t);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 #else
 static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
-- 
2.1.0


* [PATCH v1 6/7] ext2: Huge page fault support
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
                   ` (4 preceding siblings ...)
  2014-10-08 13:25 ` [PATCH v1 5/7] dax: Add huge page fault support Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  2014-10-08 13:25 ` [PATCH v1 7/7] ext4: " Matthew Wilcox
  6 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext2/file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 5b8cab5..4379393 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -31,6 +31,12 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return dax_fault(vma, vmf, ext2_get_block);
 }
 
+static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+						pmd_t *pmd, unsigned int flags)
+{
+	return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
+}
+
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_mkwrite(vma, vmf, ext2_get_block);
@@ -38,6 +44,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
+	.pmd_fault	= ext2_dax_pmd_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
@@ -49,7 +56,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
 	return 0;
 }
 #else
-- 
2.1.0


* [PATCH v1 7/7] ext4: Huge page fault support
  2014-10-08 13:25 [PATCH v1 0/7] Huge page support for DAX Matthew Wilcox
                   ` (5 preceding siblings ...)
  2014-10-08 13:25 ` [PATCH v1 6/7] ext2: Huge " Matthew Wilcox
@ 2014-10-08 13:25 ` Matthew Wilcox
  6 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 13:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext4/file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9c7bde5..9e3b4d3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -198,6 +198,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 					/* Is this the right get_block? */
 }
 
+static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+						pmd_t *pmd, unsigned int flags)
+{
+	return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block);
+}
+
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_mkwrite(vma, vmf, ext4_get_block);
@@ -205,6 +211,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
+	.pmd_fault	= ext4_dax_pmd_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
@@ -224,7 +231,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
-- 
2.1.0


* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 13:25 ` [PATCH v1 2/7] mm: Prepare for DAX huge pages Matthew Wilcox
@ 2014-10-08 15:21   ` Kirill A. Shutemov
  2014-10-08 15:57     ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-10-08 15:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> DAX wants to use the 'special' bit to mark PMD entries that are not backed
> by struct page, just as for PTEs. 

Hm. I don't see where you use a PMD without the special bit set.

> Add pmd_special() and pmd_mkspecial
> for x86 (nb: also need to be added for other architectures).  Prepare
> do_huge_pmd_wp_page(), zap_huge_pmd() and __split_huge_page_pmd() to
> handle pmd_special entries.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  arch/x86/include/asm/pgtable.h | 10 +++++++++
>  mm/huge_memory.c               | 51 ++++++++++++++++++++++++++----------------
>  2 files changed, 42 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index aa97a07..f4f42f2 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -302,6 +302,11 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
>  	return pmd_clear_flags(pmd, _PAGE_PRESENT);
>  }
>  
> +static inline pmd_t pmd_mkspecial(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_SPECIAL);
> +}
> +
>  #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
>  static inline int pte_soft_dirty(pte_t pte)
>  {
> @@ -504,6 +509,11 @@ static inline int pmd_none(pmd_t pmd)
>  	return (unsigned long)native_pmd_val(pmd) == 0;
>  }
>  
> +static inline int pmd_special(pmd_t pmd)
> +{
> +	return (pmd_flags(pmd) & _PAGE_SPECIAL) && pmd_present(pmd);
> +}
> +
>  static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>  {
>  	return (unsigned long)__va(pmd_val(pmd) & PTE_PFN_MASK);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a56ddd..ad09fc1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1096,7 +1096,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>  
>  	ptl = pmd_lockptr(mm, pmd);
> -	VM_BUG_ON(!vma->anon_vma);
>  	haddr = address & HPAGE_PMD_MASK;
>  	if (is_huge_zero_pmd(orig_pmd))
>  		goto alloc;
> @@ -1104,9 +1103,20 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
>  		goto out_unlock;
>  
> -	page = pmd_page(orig_pmd);
> -	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> -	if (page_mapcount(page) == 1) {
> +	if (pmd_special(orig_pmd)) {
> +		/* VM_MIXEDMAP !pfn_valid() case */
> +		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
> +				     (VM_WRITE|VM_SHARED)) {
> +			pmdp_clear_flush(vma, haddr, pmd);
> +			ret = VM_FAULT_FALLBACK;

No private THP pages with DAX? Why?
It should be trivial: we already have a code path for the !page case for
the zero page, and it shouldn't be too hard to modify do_dax_pmd_fault()
to support COW.

I remember I've mentioned that you don't think it's reasonable to allocate
a 2M page on COW, but that's what we do for anon memory...

-- 
 Kirill A. Shutemov

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 15:21   ` Kirill A. Shutemov
@ 2014-10-08 15:57     ` Matthew Wilcox
  2014-10-08 19:43       ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-08 15:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm,
	Matthew Wilcox

On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> > 
> > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > by struct page, just as for PTEs. 
> 
> Hm. I don't see where you use PMD without special set.

Right ... I don't currently insert PMDs that point to huge pages of DRAM,
only to huge pages of PMEM.

> > @@ -1104,9 +1103,20 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> >  		goto out_unlock;
> >  
> > -	page = pmd_page(orig_pmd);
> > -	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> > -	if (page_mapcount(page) == 1) {
> > +	if (pmd_special(orig_pmd)) {
> > +		/* VM_MIXEDMAP !pfn_valid() case */
> > +		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
> > +				     (VM_WRITE|VM_SHARED)) {
> > +			pmdp_clear_flush(vma, haddr, pmd);
> > +			ret = VM_FAULT_FALLBACK;
> 
> No private THP pages with THP? Why?
> It should be trivial: we already have a code path for !page case for zero
> page and it shouldn't be too hard to modify do_dax_pmd_fault() to support
> COW.
> 
> I remeber I've mentioned that you don't think it's reasonable to allocate
> 2M page on COW, but that's what we do for anon memory...

I agree that it shouldn't be too hard, but I have no evidence that it'll
be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
cautious for now and we can explore COWing 2MB chunks in a future patch.

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 15:57     ` Matthew Wilcox
@ 2014-10-08 19:43       ` Kirill A. Shutemov
  2014-10-09 20:40         ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-10-08 19:43 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <willy@linux.intel.com>
> > > 
> > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > by struct page, just as for PTEs. 
> > 
> > Hm. I don't see where you use PMD without special set.
> 
> Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> only to huge pages of PMEM.

Looks like you don't need pmd_{mk,}special() then.  It seems you have all
the information you need -- the vma -- to find out what's going on.  Right?

PMD bits are not something we can assign to a feature without a real need.

> > > @@ -1104,9 +1103,20 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> > >  		goto out_unlock;
> > >  
> > > -	page = pmd_page(orig_pmd);
> > > -	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> > > -	if (page_mapcount(page) == 1) {
> > > +	if (pmd_special(orig_pmd)) {
> > > +		/* VM_MIXEDMAP !pfn_valid() case */
> > > +		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
> > > +				     (VM_WRITE|VM_SHARED)) {
> > > +			pmdp_clear_flush(vma, haddr, pmd);
> > > +			ret = VM_FAULT_FALLBACK;
> > 
> > No private THP pages with THP? Why?
> > It should be trivial: we already have a code path for !page case for zero
> > page and it shouldn't be too hard to modify do_dax_pmd_fault() to support
> > COW.
> > 
> > I remeber I've mentioned that you don't think it's reasonable to allocate
> > 2M page on COW, but that's what we do for anon memory...
> 
> I agree that it shouldn't be too hard, but I have no evidence that it'll
> be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> cautious for now and we can explore COWing 2MB chunks in a future patch.

I would rather make it the other way around: use the same approach as for
anon memory until data shows it doesn't do any good.  Then consider
switching COW for *both* anon and file THP to the fallback path.
This way we get consistent behaviour for both types of mappings.

-- 
 Kirill A. Shutemov

* Re: [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-08 13:25 ` [PATCH v1 5/7] dax: Add huge page fault support Matthew Wilcox
@ 2014-10-08 20:11   ` Kirill A. Shutemov
  2014-10-09 20:47     ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-10-08 20:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> +
> +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +	/* If the PMD would cover blocks out of the file */
> +	if ((pgoff | PG_PMD_COLOUR) >= size)
> +		return VM_FAULT_FALLBACK;

IIUC, zero padding would work too.

> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = ((sector_t)pgoff & ~PG_PMD_COLOUR) << (PAGE_SHIFT - blkbits);
> +
> +	/* Start by seeing if we already have an allocated block */
> +	bh.b_size = PMD_SIZE;
> +	length = get_block(inode, block, &bh, 0);

This makes me confused: get_block() returns zero on success, right?
Why is the variable called 'length'?

> +	sector = bh.b_blocknr << (blkbits - 9);
> +	length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size);
> +	if (length < 0)
> +		goto sigbus;
> +	if (length < PMD_SIZE)
> +		goto fallback;
> +	if (pfn & PG_PMD_COLOUR)
> +		goto fallback;	/* not aligned */

So, are you relying on pure luck to make get_block() allocate a 2M-aligned
pfn?  That's not really productive.  You would need assistance from both
the filesystem and arch_get_unmapped_area() sides.

-- 
 Kirill A. Shutemov

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 19:43       ` Kirill A. Shutemov
@ 2014-10-09 20:40         ` Matthew Wilcox
  2014-10-13 20:36           ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-09 20:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm

On Wed, Oct 08, 2014 at 10:43:35PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> > On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > > From: Matthew Wilcox <willy@linux.intel.com>
> > > > 
> > > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > > by struct page, just as for PTEs. 
> > > 
> > > Hm. I don't see where you use PMD without special set.
> > 
> > Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> > only to huge pages of PMEM.
> 
> Looks like you don't need pmd_{mk,}special() then. It seems you have all
> inforamtion you need -- vma -- to find out what's going on. Right?

That would prevent us from putting huge pages of DRAM into a VM_MIXEDMAP |
VM_HUGEPAGE vma.  Is that acceptable to the wider peanut gallery?

> > > No private THP pages with THP? Why?
> > > It should be trivial: we already have a code path for !page case for zero
> > > page and it shouldn't be too hard to modify do_dax_pmd_fault() to support
> > > COW.
> > > 
> > > I remeber I've mentioned that you don't think it's reasonable to allocate
> > > 2M page on COW, but that's what we do for anon memory...
> > 
> > I agree that it shouldn't be too hard, but I have no evidence that it'll
> > be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> > cautious for now and we can explore COWing 2MB chunks in a future patch.
> 
> I would rather make it other way around: use the same apporoach as for
> anon memory until data shows it's doesn't make any good. Then consider
> switching COW for *both* anon and file THP to fallback path.
> This way we will get consistent behaviour for both types of mappings.

I'm not sure that we want consistent behaviour for both types of mappings.
My understanding is that they're used for different purposes, and having
different behaviour is acceptable.

* Re: [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-08 20:11   ` Kirill A. Shutemov
@ 2014-10-09 20:47     ` Matthew Wilcox
  2014-10-13  1:13       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2014-10-09 20:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm,
	Matthew Wilcox

On Wed, Oct 08, 2014 at 11:11:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> > +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (pgoff >= size)
> > +		return VM_FAULT_SIGBUS;
> > +	/* If the PMD would cover blocks out of the file */
> > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > +		return VM_FAULT_FALLBACK;
> 
> IIUC, zero pading would work too.

The blocks beyond the end of this file might already be allocated to
another file.  I suppose we could ask the filesystem if it wants to
allocate them to this file.

Dave, Jan, is it acceptable to call get_block() for blocks that extend
beyond the current i_size?

> > +
> > +	memset(&bh, 0, sizeof(bh));
> > +	block = ((sector_t)pgoff & ~PG_PMD_COLOUR) << (PAGE_SHIFT - blkbits);
> > +
> > +	/* Start by seeing if we already have an allocated block */
> > +	bh.b_size = PMD_SIZE;
> > +	length = get_block(inode, block, &bh, 0);
> 
> This makes me confused. get_block() return zero on success, right?
> Why the var called 'lenght'?

Historical reasons.  I can go back and change the name of the variable.

> > +	sector = bh.b_blocknr << (blkbits - 9);
> > +	length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size);
> > +	if (length < 0)
> > +		goto sigbus;
> > +	if (length < PMD_SIZE)
> > +		goto fallback;
> > +	if (pfn & PG_PMD_COLOUR)
> > +		goto fallback;	/* not aligned */
> 
> So, are you rely on pure luck to make get_block() allocate 2M aligned pfn?
> Not really productive. You would need assistance from fs and
> arch_get_unmapped_area() sides.

Certainly ext4 and XFS will align their allocations; if you ask them for a
2MB block, they will try to allocate a 2MB block aligned on a 2MB boundary.

I started looking into get_unmapped_area() (and have the code sitting
around to align specially marked files on special boundaries), but when
I mentioned it to the author of the NVM Library, he said "Oh, I'll just
pick a 1GB-aligned area to request it be mapped at", so I haven't taken
it any further.
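
The idea was something along these lines -- an untested, illustrative
sketch (the helper name and the padding strategy are not the actual code):

	static unsigned long dax_get_unmapped_area(struct file *filp,
			unsigned long addr, unsigned long len,
			unsigned long pgoff, unsigned long flags)
	{
		unsigned long off = pgoff << PAGE_SHIFT;
		unsigned long pad = len + PMD_SIZE;
		unsigned long ret;

		if (addr || (flags & MAP_FIXED) || len < PMD_SIZE || pad < len)
			goto fallback;

		/* Over-allocate, then slide up so that the virtual address
		 * and the file offset agree modulo PMD_SIZE */
		ret = current->mm->get_unmapped_area(filp, 0, pad, pgoff, flags);
		if (!IS_ERR_VALUE(ret))
			return ret + ((off - ret) & (PMD_SIZE - 1));
	 fallback:
		return current->mm->get_unmapped_area(filp, addr, len, pgoff,
						      flags);
	}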

The upshot is that (confirmed with debugging code), when the tests run,
they pretty much always get a correctly aligned block.

* Re: [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-09 20:47     ` Matthew Wilcox
@ 2014-10-13  1:13       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2014-10-13  1:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm

On Thu, Oct 09, 2014 at 04:47:16PM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 11:11:00PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> > > +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (pgoff >= size)
> > > +		return VM_FAULT_SIGBUS;
> > > +	/* If the PMD would cover blocks out of the file */
> > > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > > +		return VM_FAULT_FALLBACK;
> > 
> > IIUC, zero pading would work too.
> 
> The blocks after this file might be allocated to another file already.
> I suppose we could ask the filesystem if it wants to allocate them to
> this file.
> 
> Dave, Jan, is it acceptable to call get_block() for blocks that extend
> beyond the current i_size?

In what context? XFS basically does nothing for certain cases (e.g.
read mapping for direct IO) where zeroes are always going to be
returned, so essentially filesystems right now may actually just
return a "hole" for any read mapping request beyond EOF.

If "create" is set, then we'll either create or map existing blocks
beyond EOF because the we have to reserve space or allocate blocks
before the EOF gets extended when the write succeeds fully...

> > > +	if (length < PMD_SIZE)
> > > +		goto fallback;
> > > +	if (pfn & PG_PMD_COLOUR)
> > > +		goto fallback;	/* not aligned */
> > 
> > So, are you rely on pure luck to make get_block() allocate 2M aligned pfn?
> > Not really productive. You would need assistance from fs and
> > arch_get_unmapped_area() sides.
> 
> Certainly ext4 and XFS will align their allocations; if you ask it for a
> 2MB block, it will try to allocate a 2MB block aligned on a 2MB boundary.

As a sweeping generalisation, that's wrong. Empty filesystems might
behave that way, but we don't *guarantee* that this sort of
alignment will occur.

XFS has several different extent alignment strategies and
none of them will always work that way. Many of them are dependent
on mkfs parameters, and even then are used only as *guidelines*.
Further, alignment is dependent on the size of the write being done
- on some filesystem configs a 2MB write might be aligned, but on
others it won't be. More complex still is that mount options can
change alignment behaviour, as can per-file extent size hints, as
can truncation that removes post-eof blocks...

IOWs, if you want the filesystem to guarantee alignment to the
underlying hardware in this way for DAX, we're going to need to make
some modifications to the allocator alignment strategy.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-09 20:40         ` Matthew Wilcox
@ 2014-10-13 20:36           ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-10-13 20:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Thu, Oct 09, 2014 at 04:40:26PM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 10:43:35PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> > > On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > > > From: Matthew Wilcox <willy@linux.intel.com>
> > > > > 
> > > > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > > > by struct page, just as for PTEs. 
> > > > 
> > > > Hm. I don't see where you use PMD without special set.
> > > 
> > > Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> > > only to huge pages of PMEM.
> > 
> > Looks like you don't need pmd_{mk,}special() then. It seems you have all
> > inforamtion you need -- vma -- to find out what's going on. Right?
> 
> That would prevent us from putting huge pages of DRAM into a VM_MIXEDMAP |
> VM_HUGEPAGE vma.  Is that acceptable to the wider peanut gallery?

We didn't have huge pages in VM_MIXEDMAP | VM_HUGEPAGE VMAs before, and we
don't have them there after the patchset either.  Nothing has changed.

It's probably worth adding a VM_BUG_ON() in some code path to be able to
catch this situation.

> > > > No private THP pages with THP? Why?
> > > > It should be trivial: we already have a code path for !page case for zero
> > > > page and it shouldn't be too hard to modify do_dax_pmd_fault() to support
> > > > COW.
> > > > 
> > > > I remeber I've mentioned that you don't think it's reasonable to allocate
> > > > 2M page on COW, but that's what we do for anon memory...
> > > 
> > > I agree that it shouldn't be too hard, but I have no evidence that it'll
> > > be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> > > cautious for now and we can explore COWing 2MB chunks in a future patch.
> > 
> > I would rather make it other way around: use the same apporoach as for
> > anon memory until data shows it's doesn't make any good. Then consider
> > switching COW for *both* anon and file THP to fallback path.
> > This way we will get consistent behaviour for both types of mappings.
> 
> I'm not sure that we want consistent behaviour for both types of mappings.
> My understanding is that they're used for different purposes, and having
> different bahaviour is acceptable.

This should be described in the commit message, along with the other design
decisions (split wrt. mlock, etc.) and their pros and cons.

-- 
 Kirill A. Shutemov
