public inbox for linux-kernel@vger.kernel.org
* [patch] do_no_pfn handler
@ 2006-04-03 11:32 Jes Sorensen
  2006-04-03 11:46 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jes Sorensen @ 2006-04-03 11:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Hugh Dickins, Nick Piggin,
	bjorn_helgaas, cotte

Linus,

Attached is a repost of the do_no_pfn handler patch, which is needed
for the MSPEC driver. Bjorn and Carsten have also expressed strong
interest in using this interface for other things.

You mentioned earlier that you preferred an alternative approach. Do
you still feel that way, given the additional interest from Bjorn and
Carsten? If so, I'd love to get some guidance as to what that approach
should be.

I had hoped to get this in before 2.6.17 closed, but I guess I missed
that window. If we can agree on the interface it would be ok for the
-mm series for a while.

Thanks,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    1 +
 mm/memory.c        |   51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,51 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+	if (pfn == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (pfn == -EFAULT)
+		return VM_FAULT_SIGBUS;
+	if (pfn < 0)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	entry = pfn_pte(pfn, vma->vm_page_prot);
+	if (write_access)
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	set_pte_at(mm, address, page_table, entry);
+
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,9 +2252,13 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
+			if (!vma->vm_ops ||
+			    (!vma->vm_ops->nopage && !vma->vm_ops->nopfn))
 				return do_anonymous_page(mm, vma, address,
 					pte, pmd, write_access);
+			if (vma->vm_ops->nopfn)
+				return do_no_pfn(mm, vma, address,
+						 pte, pmd, write_access);
 			return do_no_page(mm, vma, address,
 					pte, pmd, write_access);
 		}


* Re: [patch] do_no_pfn handler
  2006-04-03 11:32 [patch] do_no_pfn handler Jes Sorensen
@ 2006-04-03 11:46 ` Nick Piggin
  2006-04-03 14:49   ` Jes Sorensen
  2006-04-04 10:58 ` Jes Sorensen
  2006-04-11 14:29 ` [patch] " Jes Sorensen
  2 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2006-04-03 11:46 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

Jes Sorensen wrote:
> +static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
> +		     unsigned long address, pte_t *page_table, pmd_t *pmd,
> +		     int write_access)
> +{
> +	spinlock_t *ptl;
> +	pte_t entry;
> +	long pfn;
> +	int ret = VM_FAULT_MINOR;
> +
> +	pte_unmap(page_table);
> +	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
> +
> +	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
> +	if (pfn == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	if (pfn == -EFAULT)
> +		return VM_FAULT_SIGBUS;
> +	if (pfn < 0)
> +		return VM_FAULT_SIGBUS;
> +
> +	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +
> +	entry = pfn_pte(pfn, vma->vm_page_prot);
> +	if (write_access)
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	set_pte_at(mm, address, page_table, entry);
> +

Should you recheck to make sure nobody else faulted this in
before it was relocked? Doesn't seem to matter in this case,
but it would be more consistent with the other fault handlers.

-- 
SUSE Labs, Novell Inc.


* Re: [patch] do_no_pfn handler
  2006-04-03 11:46 ` Nick Piggin
@ 2006-04-03 14:49   ` Jes Sorensen
  0 siblings, 0 replies; 18+ messages in thread
From: Jes Sorensen @ 2006-04-03 14:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

Nick Piggin wrote:
> Should you recheck to make sure nobody else faulted this in
> before it was relocked? Doesn't seem to matter in this case,
> but it would be more consistent with the other fault handlers.
> 

I'm fine either way. It didn't matter to the case I need it for, but
if you think it would make more sense I am fine with that.

Cheers,
Jes


* Re: [patch] do_no_pfn handler
  2006-04-03 11:32 [patch] do_no_pfn handler Jes Sorensen
  2006-04-03 11:46 ` Nick Piggin
@ 2006-04-04 10:58 ` Jes Sorensen
  2006-04-04 11:05   ` Nick Piggin
  2006-04-11 14:29 ` [patch] " Jes Sorensen
  2 siblings, 1 reply; 18+ messages in thread
From: Jes Sorensen @ 2006-04-04 10:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Hugh Dickins, Nick Piggin,
	bjorn_helgaas, cotte

Hi,

Ingo Oeser suggested reorganizing the handle_pte_fault code in a way
that simplifies the code deciding which fault handler to call. It
makes the call to ->nopfn and ->nopage a lot clearer.

It doesn't address Nick's suggestion to recheck whether someone else
has faulted the page in, as I didn't see a consensus on that yet.

Updated patch attached.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    1 
 mm/memory.c        |   61 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 57 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,51 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+	if (pfn == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (pfn == -EFAULT)
+		return VM_FAULT_SIGBUS;
+	if (pfn < 0)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	entry = pfn_pte(pfn, vma->vm_page_prot);
+	if (write_access)
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	set_pte_at(mm, address, page_table, entry);
+
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2252,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,


* Re: [patch] do_no_pfn handler
  2006-04-04 10:58 ` Jes Sorensen
@ 2006-04-04 11:05   ` Nick Piggin
  2006-04-05  9:34     ` Jes Sorensen
  2006-04-19 14:10     ` [patch - repost] " Jes Sorensen
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2006-04-04 11:05 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

Jes Sorensen wrote:
> Hi,
> 
> Ingo Oeser suggested reorganizing the handle_pte_fault code in a way
> that simplifies the code deciding which fault handler to call. It
> makes the call to ->nopfn and ->nopage a lot clearer.
> 

Probably doesn't make much difference, but I'd rather do the nopage
check first, as that will obviously be the most common.

> It doesn't address Nick's suggestion to recheck whether someone else
> has faulted the page in, as I didn't see a consensus on that yet.
> 

I first thought this might be a good idea because some archs have a
pretty heavy-weight set_pte_at (eg. powerpc, which is even heavier if
it is to replace an existing entry). This is not going to be very
common, but there have been cases where multiple threads all try to
fault in a particular page, which has caused performance problems.

Other than that, you never know what a nopfn handler will want to do,
so I think it is better to be consistent with other faults. Shouldn't
need much more than a `if (pte_none()) { /* do it */ }`.
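
A minimal userspace sketch of that recheck, under stated assumptions: a
pthread mutex stands in for the pte lock, a plain integer stands in for
the pte, and 0 plays the role of pte_none(). The names fake_pte_t,
fake_slot and install_if_none are hypothetical and are not kernel
interfaces.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a pte: 0 plays the role of pte_none(). */
typedef unsigned long fake_pte_t;

struct fake_slot {
	pthread_mutex_t lock;	/* stands in for the pte lock (ptl) */
	fake_pte_t pte;
};

/*
 * Install 'entry' only if nobody filled the slot while the lock was
 * dropped. Returns 1 if this caller installed it, 0 if it lost the race.
 */
int install_if_none(struct fake_slot *slot, fake_pte_t entry)
{
	int installed = 0;

	assert(entry != 0);	/* 0 is reserved for "none" in this sketch */

	pthread_mutex_lock(&slot->lock);
	if (slot->pte == 0) {
		slot->pte = entry;
		installed = 1;
	}
	pthread_mutex_unlock(&slot->lock);
	return installed;
}
```

The loser of the race simply leaves the existing entry alone, which also
avoids paying for a second, possibly heavy-weight, store.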

> Updated patch attached.

-- 
SUSE Labs, Novell Inc.


* Re: [patch] do_no_pfn handler
  2006-04-04 11:05   ` Nick Piggin
@ 2006-04-05  9:34     ` Jes Sorensen
  2006-04-19 14:10     ` [patch - repost] " Jes Sorensen
  1 sibling, 0 replies; 18+ messages in thread
From: Jes Sorensen @ 2006-04-05  9:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

>>>>> "Nick" == Nick Piggin <nickpiggin@yahoo.com.au> writes:

Nick> Jes Sorensen wrote:
>> Hi, Ingo Oeser suggested reorganizing the handle_pte_fault code in
>> a way that simplifies the code deciding which fault handler to
>> call. It makes the call to ->nopfn and ->nopage a lot clearer.

Nick> Probably doesn't make much difference, but I'd rather do the
Nick> nopage check first, as that will obviously be the most common.

Hi Nick,

Fair 'nuff. I guess I had had a more hierarchical approach in mind,
but then a driver really shouldn't provide both.
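
The resulting order of checks can be sketched in userspace as follows.
This is only an illustration: fake_vm_ops is a hypothetical, trimmed-down
stand-in for vm_operations_struct, and the enum values merely name which
of the three paths would be taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for vm_operations_struct. */
struct fake_vm_ops {
	void *(*nopage)(unsigned long address);
	long (*nopfn)(unsigned long address);
};

enum fault_path { PATH_ANONYMOUS, PATH_NO_PAGE, PATH_NO_PFN };

/*
 * Pick the handler for a fault on an empty pte: ->nopage first (the
 * common case), then ->nopfn, falling back to the anonymous path.
 */
enum fault_path pick_fault_path(const struct fake_vm_ops *ops)
{
	if (ops) {
		if (ops->nopage)
			return PATH_NO_PAGE;
		if (ops->nopfn)
			return PATH_NO_PFN;
	}
	return PATH_ANONYMOUS;
}

/* Dummy handlers so the usage check below has something to point at. */
void *demo_nopage(unsigned long address) { (void)address; return NULL; }
long demo_nopfn(unsigned long address) { (void)address; return 0; }

/* Usage: a driver that (wrongly) sets both still takes the nopage path. */
void demo_pick_fault_path(void)
{
	struct fake_vm_ops both = { demo_nopage, demo_nopfn };

	assert(pick_fault_path(&both) == PATH_NO_PAGE);
}
```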

Nick> Other than that, you never know what a nopfn handler will want
Nick> to do, so I think it is better to be consistent with other
Nick> faults. Shouldn't need much more than a `if (pte_none()) { /* do
Nick> it */ }`.

Like this?

Thanks for the input, updated patch attached.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    1 
 mm/memory.c        |   63 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,53 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+	if (pfn == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (pfn == -EFAULT)
+		return VM_FAULT_SIGBUS;
+	if (pfn < 0)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/* Only go through if we didn't race with anybody else... */
+	if (pte_none(*page_table)) {
+		entry = pfn_pte(pfn, vma->vm_page_prot);
+		if (write_access)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+	}
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2254,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,


* [patch] do_no_pfn handler
  2006-04-03 11:32 [patch] do_no_pfn handler Jes Sorensen
  2006-04-03 11:46 ` Nick Piggin
  2006-04-04 10:58 ` Jes Sorensen
@ 2006-04-11 14:29 ` Jes Sorensen
  2006-04-11 15:10   ` Linus Torvalds
  2 siblings, 1 reply; 18+ messages in thread
From: Jes Sorensen @ 2006-04-11 14:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Hugh Dickins, Nick Piggin,
	bjorn_helgaas, cotte

Linus,

Attached is a repost of the do_no_pfn handler patch. This version
includes all the changes that were suggested after my previous posting
(checking ->nopage before ->nopfn, and Nick's suggestion to recheck
whether someone else has already faulted the page in). The patch is
needed for the MSPEC driver, and Bjorn and Carsten have expressed
strong interest in using this interface for other things as well.

You mentioned earlier that you preferred an alternative approach. Do
you still feel that way, given the additional interest from Bjorn and
Carsten? If so, I'd love to get some guidance as to what that approach
should be.

I had hoped to get this in before 2.6.17. Is it too late for that, or
would you prefer to see it in -mm first?

Thanks,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    1 
 mm/memory.c        |   63 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,53 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+	if (pfn == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (pfn == -EFAULT)
+		return VM_FAULT_SIGBUS;
+	if (pfn < 0)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/* Only go through if we didn't race with anybody else... */
+	if (pte_none(*page_table)) {
+		entry = pfn_pte(pfn, vma->vm_page_prot);
+		if (write_access)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+	}
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2254,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,


* Re: [patch] do_no_pfn handler
  2006-04-11 14:29 ` [patch] " Jes Sorensen
@ 2006-04-11 15:10   ` Linus Torvalds
  2006-04-11 15:26     ` Carsten Otte
  2006-04-12  9:09     ` Jes Sorensen
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2006-04-11 15:10 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Andrew Morton, linux-kernel, Hugh Dickins, Nick Piggin,
	bjorn_helgaas, cotte



On Tue, 11 Apr 2006, Jes Sorensen wrote:
> 
> You mentioned earlier that you preferred an alternative approach. Do
> you still feel that way, given the additional interest from Bjorn and
> Carsten? If so, I'd love to get some guidance as to what that approach
> should be.

I'm still pretty unhappy with this. It's pretty much designed to screw up 
the system by letting the driver do random things that make little or no 
sense from a VM standpoint.

The 

	BUG_ON(!(vma->vm_flags & VM_PFNMAP));

certainly helps, but it still leaves the window open for other problems. 

At the very least, it would also need a

	BUG_ON(is_cow_mapping(vma->vm_flags));

(or at least make it return VM_FAULT_SIGBUS). Because a COW mapping _will_ 
confuse the VM and cause it to do random bad things in vm_normal_page(). 
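
A rough userspace sketch of the predicate, assuming it tests for a
private mapping that may become writable. The flag values below are
defined locally for illustration and should be checked against
include/linux/mm.h; nopfn_allowed() is a hypothetical helper showing the
"refuse instead of BUG" variant.

```c
#include <assert.h>

/* Illustrative flag values for this sketch; check include/linux/mm.h. */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/*
 * A mapping can give rise to COW pages when it is private (not
 * VM_SHARED) and may become writable (VM_MAYWRITE).
 */
int is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical guard: let the caller fail the fault for COW mappings. */
int nopfn_allowed(unsigned long vm_flags)
{
	return !is_cow_mapping(vm_flags);
}

/* Usage: a private writable mapping is COW, a shared one is not. */
void demo_is_cow_mapping(void)
{
	assert(is_cow_mapping(VM_MAYWRITE));
	assert(!is_cow_mapping(VM_SHARED | VM_MAYWRITE));
}
```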

It also assumes that a negative pfn is ok as an error return, but maybe 
that's fine. I can't think of any architecture that uses all bits of the 
PFN (x86 with PAE can have a full 32-bit PFN, but I don't think any actual 
CPU supports 48 bits of physical addressing?). Something to think about.
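
To make the trade-off concrete, here is a small userspace sketch of a
signed return value that carries either an errno or a pfn. The enum
fake_fault and both function names are hypothetical. With a 32-bit long
only 31 bits remain for the pfn, which at 4 KiB pages still covers 2^43
bytes of physical address space.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical fault codes standing in for VM_FAULT_*. */
enum fake_fault { FAKE_FAULT_OK, FAKE_FAULT_OOM, FAKE_FAULT_SIGBUS };

/*
 * Decode the signed return of a ->nopfn style handler: negative values
 * are errno codes, everything else is taken as a page frame number.
 */
enum fake_fault decode_nopfn(long ret, unsigned long *pfn)
{
	assert(pfn != NULL);

	if (ret == -ENOMEM)
		return FAKE_FAULT_OOM;
	if (ret < 0)
		return FAKE_FAULT_SIGBUS;
	*pfn = (unsigned long)ret;
	return FAKE_FAULT_OK;
}

/* A pfn is representable only if it fits in the non-negative half of long. */
int pfn_fits_in_signed_return(unsigned long pfn)
{
	return pfn <= (unsigned long)LONG_MAX;
}
```

Any pfn with the top bit set would be indistinguishable from an error,
which is the case worth thinking about.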

			Linus


* Re: [patch] do_no_pfn handler
  2006-04-11 15:10   ` Linus Torvalds
@ 2006-04-11 15:26     ` Carsten Otte
  2006-04-11 15:35       ` Linus Torvalds
  2006-04-12  9:09     ` Jes Sorensen
  1 sibling, 1 reply; 18+ messages in thread
From: Carsten Otte @ 2006-04-11 15:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jes Sorensen, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas

Linus Torvalds wrote:
> At the very least, it would also need a
> 
> 	BUG_ON(is_cow_mapping(vma->vm_flags));
> 
> (or at least make it return VM_FAULT_SIGBUS). Because a COW mapping _will_ 
> confuse the VM and cause it to do random bad things in vm_normal_page(). 
That leaves my use-case out for now. I will need COW for my mapping when
switching to this interface. Looks like a lot of things need rethinking
in memory.c for COW with no struct page behind.
-- 

Carsten Otte
IBM Linux technology center
ARCH=s390


* Re: [patch] do_no_pfn handler
  2006-04-11 15:26     ` Carsten Otte
@ 2006-04-11 15:35       ` Linus Torvalds
  2006-04-11 20:03         ` Carsten Otte
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2006-04-11 15:35 UTC (permalink / raw)
  To: carsteno
  Cc: Jes Sorensen, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas



On Tue, 11 Apr 2006, Carsten Otte wrote:

> Linus Torvalds wrote:
> > At the very least, it would also need a
> > 
> > 	BUG_ON(is_cow_mapping(vma->vm_flags));
> > 
> > (or at least make it return VM_FAULT_SIGBUS). Because a COW mapping _will_ 
> > confuse the VM and cause it to do random bad things in vm_normal_page(). 
>
> That leaves my use-case out for now. I will need COW for my mapping when
> switching to this interface. Looks like a lot of things need rethinking
> in memory.c for COW with no struct page behind.

You _really_ cannot do COW together with "random pfn filling".

You can do COW with a pure remap_pfn_range() (ie a /dev/mem kind of 
mapping, or a frame buffer etc), but that's only because it has a very 
magic special case that is used to distinguish between cow'ed pages and 
the pages that were inserted initially.
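
A rough userspace sketch of that special case, assuming it is a
linearity test along the lines of what vm_normal_page() does for
VM_PFNMAP vmas: an entry counts as "inserted initially" only if its pfn
matches the linear offset from the start of the mapping. The struct
fake_vma, FAKE_PAGE_SHIFT and the function name are hypothetical.

```c
#include <assert.h>

#define FAKE_PAGE_SHIFT 12	/* assumes 4 KiB pages for this sketch */

/* Hypothetical, trimmed-down vma: just the fields the check needs. */
struct fake_vma {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* pfn mapped at vm_start by the linear remap */
};

/*
 * Returns 1 if 'pfn' at 'addr' is what a linear remap would have
 * installed, 0 if it must be something else (e.g. a COW'ed copy).
 */
int is_original_linear_pfn(const struct fake_vma *vma,
			   unsigned long addr, unsigned long pfn)
{
	unsigned long off;

	assert(addr >= vma->vm_start);
	off = (addr - vma->vm_start) >> FAKE_PAGE_SHIFT;
	return pfn == vma->vm_pgoff + off;
}
```

A handler that fills in arbitrary pfns cannot satisfy this test, so the
same trick cannot tell its pages apart from COW'ed ones.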

We have no free bits in the page tables to say "this is a COW page" in 
general (on x86 we could do it, but some other architectures don't have 
any SW-usable bits). 

		Linus


* Re: [patch] do_no_pfn handler
  2006-04-11 15:35       ` Linus Torvalds
@ 2006-04-11 20:03         ` Carsten Otte
  2006-04-11 20:30           ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Carsten Otte @ 2006-04-11 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: carsteno, Jes Sorensen, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas

Linus Torvalds wrote:
> You _really_ cannot do COW together with "random pfn filling".
I still haven't found a good way to do so, even after discussing it with Nick and Hugh, but that's exactly where I intend to get to for the xip stuff.

Today, the _only_ code that uses the struct page behind those DCSS segments is aops->nopage (as return value) and do_wp_page. Those small servers have almost no local memory (kernel, libraries, and binaries are shared), and the mem_map array is a large overhead.

> You can do COW with a pure remap_pfn_range() (ie a /dev/mem kind of 
> mapping, or a frame buffer etc), but that's only because it has a very 
> magic special case that is used to distinguish between cow'ed pages and 
> the pages that were inserted initially.
> 
> We have no free bits in the page tables to say "this is a COW page" in 
> general (on x86 we could do it, but some other architectures don't have 
> any SW-usable bits). 
That's true. One can store that information in the vma flags, and split the vma into 3 vmas once we have a write fault. Although that would work in theory, I doubt it would save a lot of memory because of too many vmas, and I think we would burn precious CPU horsepower walking all those vmas.
I believe Hugh has already done an implementation of that which he does not consider nice. I have not found a feasible way to address that issue so far, and I promise to keep from coding until I find a reasonable non-intrusive way to get there.
-- 

Carsten Otte
IBM Linux technology center
ARCH=s390


* Re: [patch] do_no_pfn handler
  2006-04-11 20:03         ` Carsten Otte
@ 2006-04-11 20:30           ` Linus Torvalds
  2006-04-11 20:53             ` Carsten Otte
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2006-04-11 20:30 UTC (permalink / raw)
  To: carsteno
  Cc: Jes Sorensen, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas



On Tue, 11 Apr 2006, Carsten Otte wrote:
> 
> Today, the _only_ code that uses the struct page behind those DCSS 
> segments is aops->nopage (as return value) and do_wp_page. Those small 
> servers have almost no local memory (kernel, libraries, and binaries are 
> shared), and the mem_map array is a large overhead.

Quite frankly, I'm not in the least interested in designing for a niche 
market and for some strange niche usage. It really needs to make a lot of 
sense from a bigger picture.

I don't much like do_no_pfn either, but at least that one _can_ work 
within the rules of the bigger picture, as long as we limit it and make 
sure people can't mis-use it. 

Now, the kernel page table accessor macros are certainly generic enough 
that you could have your own "COW bits" macro, and make this all an 
architecture-specific feature (and simply not allow it on architectures 
that don't have a sw-usable COW bit)

It so happens that S390 seems to be one of the very few architectures that 
doesn't have room for that bit in its regular page table layout, and 
that's arguably a design problem for S390. But you _could_ just allocate 
extra memory for page tables, and put the COW bit there. The VM wouldn't 
care - at that point it would fit in the "larger picture" of just having 
the COW information directly in the page tables (even if the "page tables" 
would be partly just sw-defined).

But especially since it doesn't seem to have very wide use, I'd push back 
unless you can do it really cleanly. And you'd have to pay that extra page 
table cost all over, even if it's almost never used. I bet you're better 
off just having the "struct page"s.

> > We have no free bits in the page tables to say "this is a COW page" in 
> > general (on x86 we could do it, but some other architectures don't have 
> > any SW-usable bits). 
>
> That's true. One can store that information in the vma flags, and split 
> the vma into 3 vmas once we have a write fault.

Well, it's not exactly "3 vmas". It ends up being "one vma for every page" 
in the limit. Not to mention that you can't really split the vma at all at 
page fault time without screwing up locking (you'd have to take the VM 
lock for writing).
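
A small userspace sketch of the counting argument, assuming every
maximal run of pages with the same "already COW'ed" state would need its
own vma. The function name and the byte-per-page representation are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many vmas a range would need if every maximal run of pages
 * with the same "already COW'ed" state had to live in its own vma.
 * cowed[i] is nonzero once page i has taken a write fault.
 */
size_t count_vmas_needed(const unsigned char *cowed, size_t npages)
{
	size_t i, vmas;

	if (npages == 0)
		return 0;
	assert(cowed != NULL);

	vmas = 1;
	for (i = 1; i < npages; i++)
		if (!cowed[i] != !cowed[i - 1])
			vmas++;
	return vmas;
}
```

One write fault in the middle of a range gives three vmas; write faults
on every other page give one vma per page.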

Nasty.

		Linus


* Re: [patch] do_no_pfn handler
  2006-04-11 20:30           ` Linus Torvalds
@ 2006-04-11 20:53             ` Carsten Otte
  0 siblings, 0 replies; 18+ messages in thread
From: Carsten Otte @ 2006-04-11 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: carsteno, Jes Sorensen, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas

Linus Torvalds wrote:
> Now, the kernel page table accessor macros are certainly generic enough 
> that you could have your own "COW bits" macro, and make this all an 
> architecture-specific feature (and simply not allow it on architectures 
> that don't have a sw-usable COW bit)
> 
> It so happens that S390 seems to be one of the very few architectures that 
> doesn't have room for that bit in its regular page table layout, and 
> that's arguably a design problem for S390. But you _could_ just allocate 
> extra memory for page tables, and put the COW bit there. The VM wouldn't 
> care - at that point it would fit in the "larger picture" of just having 
> the COW information directly in the page tables (even if the "page tables" 
> would be partly just sw-defined).
Interesting idea. Sounds more feasible than splitting vmas; I am going to think about it. Thanks!
-- 

Carsten Otte
IBM Linux technology center
ARCH=s390


* Re: [patch] do_no_pfn handler
  2006-04-11 15:10   ` Linus Torvalds
  2006-04-11 15:26     ` Carsten Otte
@ 2006-04-12  9:09     ` Jes Sorensen
  2006-04-12  9:16       ` Carsten Otte
  1 sibling, 1 reply; 18+ messages in thread
From: Jes Sorensen @ 2006-04-12  9:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Hugh Dickins, Nick Piggin,
	bjorn_helgaas, cotte

>>>>> "Linus" == Linus Torvalds <torvalds@osdl.org> writes:

Linus> On Tue, 11 Apr 2006, Jes Sorensen wrote:
>>  You mentioned earlier that you preferred an alternative approach,
>> do you still feel that given the additional interest from other
>> Bjorn and Carsten? If this is still the case, I'd love to get some
>> guidance as to what that should be.

Linus> I'm still pretty unhappy with this. It's pretty much designed
Linus> to screw up the system by letting the driver do random things
Linus> that make little or no sense from a VM standpoint.

Hi Linus,

I'd love to come up with another way to do it, but this still comes
back as the cleanest to me :( I've tried to limit the damage based on
your comments.

Linus> At the very least, it would also need a

Linus> 	BUG_ON(is_cow_mapping(vma->vm_flags));

Added - I am not sure how this will affect Carsten's situation based
on the further discussion in this thread, however if something changes
at that point I assume we can modify the limitation at a later stage.

Linus> It also assumes that a negative pfn is ok as an error return,
Linus> but maybe that's fine. I can't think of any architecture that
Linus> uses all bits of the PFN (x86 with PAE can have a full 32-bit
Linus> PFN, but I don't think any actual CPU supports 48 bits of
Linus> physical addressing?). Something to think about.

Good catch! I think it goes back to when the mspec driver abused the
nopage interface. The previous version was passing `ret' to the driver
as well as looking at the pfn return value. I don't think the driver
should modify `ret' directly, so I have changed it to use two specific
pfn error values (NOPFN_OOM and NOPFN_SIGBUS), similar to how the
nopage handler works. I think the original intention was to catch all
potential errors a driver could stick in, but thinking about it again
I don't think it makes sense to try that.
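
A minimal userspace sketch of the difference between the two rules. The
NOPFN_* values are the ones from the patch below; the enum, its values
and both helper names are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>

/* Sentinel values as defined in the patch. */
#define NOPFN_SIGBUS	((unsigned long) -1)
#define NOPFN_OOM	((unsigned long) -2)

/* Stand-ins for the kernel fault codes; the numeric values are assumptions. */
enum model_fault { MODEL_FAULT_MINOR, MODEL_FAULT_SIGBUS, MODEL_FAULT_OOM };

/* Hypothetical helper: classify a ->nopfn return value as do_no_pfn() does. */
static enum model_fault classify_pfn(unsigned long pfn)
{
	if (pfn == NOPFN_OOM)
		return MODEL_FAULT_OOM;
	if (pfn == NOPFN_SIGBUS)
		return MODEL_FAULT_SIGBUS;
	return MODEL_FAULT_MINOR;	/* any other value is treated as a real pfn */
}

/* Hypothetical helper: the earlier rule, where any negative value was an error. */
static int old_rule_is_error(unsigned long pfn)
{
	return (long) pfn < 0;
}
```

Under the earlier rule every pfn with the top bit set is rejected; with
the two sentinels only those two values are reserved and everything
else is passed on to pfn_pte().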

Anyway, I hope you like this version of the nopfn code better! An updated
version of the mspec driver to match it will follow in a minute.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

This version uses specific NOPFN_{SIGBUS,OOM} return values, rather
than expecting all negative pfn values to be errors. It also BUGs on
COW mappings, as these would not work with the VM.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    7 +++++
 mm/memory.c        |   62 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 64 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
@@ -612,6 +613,12 @@
 #define NOPAGE_OOM	((struct page *) (-1))
 
 /*
+ * Error return values for the *_nopfn functions
+ */
+#define NOPFN_SIGBUS	((unsigned long) -1)
+#define NOPFN_OOM	((unsigned long) -2)
+
+/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,52 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct_page backing it
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	unsigned long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+	BUG_ON(is_cow_mapping(vma->vm_flags));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+	if (pfn == NOPFN_OOM)
+		return VM_FAULT_OOM;
+	if (pfn == NOPFN_SIGBUS)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/* Only go through if we didn't race with anybody else... */
+	if (pte_none(*page_table)) {
+		entry = pfn_pte(pfn, vma->vm_page_prot);
+		if (write_access)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+	}
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2253,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,


* Re: [patch] do_no_pfn handler
  2006-04-12  9:09     ` Jes Sorensen
@ 2006-04-12  9:16       ` Carsten Otte
  0 siblings, 0 replies; 18+ messages in thread
From: Carsten Otte @ 2006-04-12  9:16 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	Nick Piggin, bjorn_helgaas

Jes Sorensen wrote:
> Linus> At the very least, it would also need a
> Linus> 	BUG_ON(is_cow_mapping(vma->vm_flags));
> 
> Added - I am not sure how this will affect Carsten's situation based
> on the further discussion in this thread, however if something changes
> at that point I assume we can modify the limitation at a later stage.
Fine with me.


* [patch - repost] do_no_pfn handler
  2006-04-04 11:05   ` Nick Piggin
  2006-04-05  9:34     ` Jes Sorensen
@ 2006-04-19 14:10     ` Jes Sorensen
  2006-04-21 10:41       ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: Jes Sorensen @ 2006-04-19 14:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

Hi,

This is a repost of the lobotomized do_no_pfn patch from last week.

It adds the BUG_ON for is_cow_mapping() and addresses the issue
around negative pfn numbers that you pointed out.

No changes since the last posting as I haven't received any other
comments since.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

This version uses specific NOPFN_{SIGBUS,OOM} return values, rather
than expecting all negative pfn values to be errors. It also BUGs on
COW mappings, as these would not work with the VM.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    7 +++++
 mm/memory.c        |   62 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 64 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
@@ -612,6 +613,12 @@
 #define NOPAGE_OOM	((struct page *) (-1))
 
 /*
+ * Error return values for the *_nopfn functions
+ */
+#define NOPFN_SIGBUS	((unsigned long) -1)
+#define NOPFN_OOM	((unsigned long) -2)
+
+/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,52 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct_page backing it
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	unsigned long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+	BUG_ON(is_cow_mapping(vma->vm_flags));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+	if (pfn == NOPFN_OOM)
+		return VM_FAULT_OOM;
+	if (pfn == NOPFN_SIGBUS)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/* Only go through if we didn't race with anybody else... */
+	if (pte_none(*page_table)) {
+		entry = pfn_pte(pfn, vma->vm_page_prot);
+		if (write_access)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+	}
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2253,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,


* Re: [patch - repost] do_no_pfn handler
  2006-04-19 14:10     ` [patch - repost] " Jes Sorensen
@ 2006-04-21 10:41       ` Nick Piggin
  2006-04-24  7:55         ` Jes Sorensen
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2006-04-21 10:41 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

Jes Sorensen wrote:

> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -199,6 +199,7 @@
>  	void (*open)(struct vm_area_struct * area);
>  	void (*close)(struct vm_area_struct * area);
>  	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
> +	long (*nopfn)(struct vm_area_struct * area, unsigned long address);

Minor nit: make this unsigned long?

>  	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
>  #ifdef CONFIG_NUMA
>  	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
> @@ -612,6 +613,12 @@
>  #define NOPAGE_OOM	((struct page *) (-1))
>  
>  /*
> + * Error return values for the *_nopfn functions
> + */
> +#define NOPFN_SIGBUS	((unsigned long) -1)
> +#define NOPFN_OOM	((unsigned long) -2)

-- 
SUSE Labs, Novell Inc.


* Re: [patch - repost] do_no_pfn handler
  2006-04-21 10:41       ` Nick Piggin
@ 2006-04-24  7:55         ` Jes Sorensen
  0 siblings, 0 replies; 18+ messages in thread
From: Jes Sorensen @ 2006-04-24  7:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Hugh Dickins,
	bjorn_helgaas, cotte

>>>>> "Nick" == Nick Piggin <nickpiggin@yahoo.com.au> writes:

Nick> Jes Sorensen wrote:
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -199,6 +199,7 @@
>  	void (*open)(struct vm_area_struct * area);
>  	void (*close)(struct vm_area_struct * area);
>  	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
> +	long (*nopfn)(struct vm_area_struct * area, unsigned long address);

Nick> Minor nit: make this unsigned long?

Good catch! I had already changed all the code around it to be
unsigned long, but somehow missed this instance.

Updated patch attached.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it. This avoids creating fake page table entries for
regions which are not backed by real memory.

This version uses specific NOPFN_{SIGBUS,OOM} return values, rather
than expecting all negative pfn values to be errors. It also BUGs on
COW mappings, as these would not work with the VM.

Signed-off-by: Jes Sorensen <jes@sgi.com>

---
 include/linux/mm.h |    7 +++++
 mm/memory.c        |   62 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 64 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+	unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
@@ -612,6 +613,12 @@
 #define NOPAGE_OOM	((struct page *) (-1))
 
 /*
+ * Error return values for the *_nopfn functions
+ */
+#define NOPFN_SIGBUS	((unsigned long) -1)
+#define NOPFN_OOM	((unsigned long) -2)
+
+/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2146,6 +2146,52 @@
 }
 
 /*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct_page backing it
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, pte_t *page_table, pmd_t *pmd,
+		     int write_access)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	unsigned long pfn;
+	int ret = VM_FAULT_MINOR;
+
+	pte_unmap(page_table);
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+	BUG_ON(is_cow_mapping(vma->vm_flags));
+
+	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+	if (pfn == NOPFN_OOM)
+		return VM_FAULT_OOM;
+	if (pfn == NOPFN_SIGBUS)
+		return VM_FAULT_SIGBUS;
+
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/* Only go through if we didn't race with anybody else... */
+	if (pte_none(*page_table)) {
+		entry = pfn_pte(pfn, vma->vm_page_prot);
+		if (write_access)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+	}
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
+/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2207,11 +2253,17 @@
 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (!vma->vm_ops || !vma->vm_ops->nopage)
-				return do_anonymous_page(mm, vma, address,
-					pte, pmd, write_access);
-			return do_no_page(mm, vma, address,
-					pte, pmd, write_access);
+			if (vma->vm_ops) {
+				if (vma->vm_ops->nopage)
+					return do_no_page(mm, vma, address,
+							  pte, pmd,
+							  write_access);
+				if (vma->vm_ops->nopfn)
+					return do_no_pfn(mm, vma, address, pte,
+							 pmd, write_access);
+			}
+			return do_anonymous_page(mm, vma, address,
+						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
 			return do_file_page(mm, vma, address,
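
The branch order in the last hunk can be sketched in userspace as
follows. This is only a model under stated assumptions: struct
model_vm_ops, enum model_path and pick_fault_path() are hypothetical
names that record which hooks a vma provides, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_ops; only the presence of each hook matters. */
struct model_vm_ops {
	int has_nopage;
	int has_nopfn;
};

enum model_path { PATH_ANONYMOUS, PATH_NO_PAGE, PATH_NO_PFN };

/* Mirror the order the patch uses for an empty (pte_none) entry. */
static enum model_path pick_fault_path(const struct model_vm_ops *ops)
{
	if (ops) {
		if (ops->has_nopage)
			return PATH_NO_PAGE;	/* ->nopage is checked first */
		if (ops->has_nopfn)
			return PATH_NO_PFN;
	}
	return PATH_ANONYMOUS;		/* no ops, or neither hook set */
}
```

A vma that sets both hooks still goes through do_no_page(), and a vma
with vm_ops but neither hook falls back to do_anonymous_page(), as it
did before the patch.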


end of thread, other threads:[~2006-04-24  7:55 UTC | newest]

Thread overview: 18+ messages
2006-04-03 11:32 [patch] do_no_pfn handler Jes Sorensen
2006-04-03 11:46 ` Nick Piggin
2006-04-03 14:49   ` Jes Sorensen
2006-04-04 10:58 ` Jes Sorensen
2006-04-04 11:05   ` Nick Piggin
2006-04-05  9:34     ` Jes Sorensen
2006-04-19 14:10     ` [patch - repost] " Jes Sorensen
2006-04-21 10:41       ` Nick Piggin
2006-04-24  7:55         ` Jes Sorensen
2006-04-11 14:29 ` [patch] " Jes Sorensen
2006-04-11 15:10   ` Linus Torvalds
2006-04-11 15:26     ` Carsten Otte
2006-04-11 15:35       ` Linus Torvalds
2006-04-11 20:03         ` Carsten Otte
2006-04-11 20:30           ` Linus Torvalds
2006-04-11 20:53             ` Carsten Otte
2006-04-12  9:09     ` Jes Sorensen
2006-04-12  9:16       ` Carsten Otte
