public inbox for netdev@vger.kernel.org
* [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
       [not found] <20090414151204.C647.A69D9226@jp.fujitsu.com>
@ 2009-04-14  6:23 ` KOSAKI Motohiro
  2009-04-14  6:56   ` Nick Piggin
  2009-04-14  8:41 ` [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix Nick Piggin
  1 sibling, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2009-04-14  6:23 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, Nick Piggin,
	Andrea Arcangeli, Jeff Moyer, linux-mm, linux-fsdevel,
	Maciej Sosnowski, David S. Miller, Chris Leech, netdev

I don't have a usable NET-DMA device, so I hope to get expert review.

=========================
Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c

	down_read(mmap_sem)
	get_user_pages()
	up_read(mmap_sem)

is fork-unsafe.
Fix it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: netdev@vger.kernel.org
---
 drivers/dma/iovlock.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c	2009-02-21 16:53:23.000000000 +0900
+++ b/drivers/dma/iovlock.c	2009-04-13 04:46:02.000000000 +0900
@@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
 		pages += page_list->nr_pages;
 
 		/* pin pages down */
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(
-			current,
-			current->mm,
-			(unsigned long) iov[i].iov_base,
-			page_list->nr_pages,
-			1,	/* write */
-			0,	/* force */
-			page_list->pages,
-			NULL);
-		up_read(&current->mm->mmap_sem);
-
+		down_read(&current->mm->mm_pinned_sem);
+		ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
+					  page_list->nr_pages, 1,
+					  page_list->pages);
 		if (ret != page_list->nr_pages)
 			goto unpin;
 
@@ -127,6 +119,8 @@ void dma_unpin_iovec_pages(struct dma_pi
 	if (!pinned_list)
 		return;
 
+	up_read(&current->mm->mm_pinned_sem);
+
 	for (i = 0; i < pinned_list->nr_iovecs; i++) {
 		struct dma_page_list *page_list = &pinned_list->page_list[i];
 		for (j = 0; j < page_list->nr_pages; j++) {




* Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-14  6:23 ` [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c KOSAKI Motohiro
@ 2009-04-14  6:56   ` Nick Piggin
  2009-04-14  6:58     ` KOSAKI Motohiro
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2009-04-14  6:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Linus Torvalds, Andrew Morton, Andrea Arcangeli, Jeff Moyer,
	linux-mm, linux-fsdevel, Maciej Sosnowski, David S. Miller,
	Chris Leech, netdev

On Tuesday 14 April 2009 16:23:13 KOSAKI Motohiro wrote:
> I don't have a usable NET-DMA device, so I hope to get expert review.
> 
> =========================
> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> 
> 	down_read(mmap_sem)
> 	get_user_pages()
> 	up_read(mmap_sem)
> 
> is fork-unsafe.
> Fix it.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Chris Leech <christopher.leech@intel.com>
> Cc: netdev@vger.kernel.org
> ---
>  drivers/dma/iovlock.c |   18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> Index: b/drivers/dma/iovlock.c
> ===================================================================
> --- a/drivers/dma/iovlock.c	2009-02-21 16:53:23.000000000 +0900
> +++ b/drivers/dma/iovlock.c	2009-04-13 04:46:02.000000000 +0900
> @@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
>  		pages += page_list->nr_pages;
>  
>  		/* pin pages down */
> -		down_read(&current->mm->mmap_sem);
> -		ret = get_user_pages(
> -			current,
> -			current->mm,
> -			(unsigned long) iov[i].iov_base,
> -			page_list->nr_pages,
> -			1,	/* write */
> -			0,	/* force */
> -			page_list->pages,
> -			NULL);
> -		up_read(&current->mm->mmap_sem);
> -
> +		down_read(&current->mm->mm_pinned_sem);
> +		ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
> +					  page_list->nr_pages, 1,
> +					  page_list->pages);

I would perhaps not fold gup_fast conversions into the same patch as
the fix.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-14  6:56   ` Nick Piggin
@ 2009-04-14  6:58     ` KOSAKI Motohiro
  2009-04-15  8:48       ` KOSAKI Motohiro
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2009-04-14  6:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, LKML, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Jeff Moyer, linux-mm, linux-fsdevel,
	Maciej Sosnowski, David S. Miller, Chris Leech, netdev

> On Tuesday 14 April 2009 16:23:13 KOSAKI Motohiro wrote:
> > I don't have a usable NET-DMA device, so I hope to get expert review.
> > 
> > =========================
> > Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> > 
> > 	down_read(mmap_sem)
> > 	get_user_pages()
> > 	up_read(mmap_sem)
> > 
> > is fork-unsafe.
> > Fix it.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
> > Cc: David S. Miller <davem@davemloft.net>
> > Cc: Chris Leech <christopher.leech@intel.com>
> > Cc: netdev@vger.kernel.org
> > ---
> >  drivers/dma/iovlock.c |   18 ++++++------------
> >  1 file changed, 6 insertions(+), 12 deletions(-)
> > 
> > Index: b/drivers/dma/iovlock.c
> > ===================================================================
> > --- a/drivers/dma/iovlock.c	2009-02-21 16:53:23.000000000 +0900
> > +++ b/drivers/dma/iovlock.c	2009-04-13 04:46:02.000000000 +0900
> > @@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
> >  		pages += page_list->nr_pages;
> >  
> >  		/* pin pages down */
> > -		down_read(&current->mm->mmap_sem);
> > -		ret = get_user_pages(
> > -			current,
> > -			current->mm,
> > -			(unsigned long) iov[i].iov_base,
> > -			page_list->nr_pages,
> > -			1,	/* write */
> > -			0,	/* force */
> > -			page_list->pages,
> > -			NULL);
> > -		up_read(&current->mm->mmap_sem);
> > -
> > +		down_read(&current->mm->mm_pinned_sem);
> > +		ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
> > +					  page_list->nr_pages, 1,
> > +					  page_list->pages);
> 
> I would perhaps not fold gup_fast conversions into the same patch as
> the fix.

OK. I'll fix it.




* Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix
       [not found] <20090414151204.C647.A69D9226@jp.fujitsu.com>
  2009-04-14  6:23 ` [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c KOSAKI Motohiro
@ 2009-04-14  8:41 ` Nick Piggin
  2009-04-14  9:19   ` KOSAKI Motohiro
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2009-04-14  8:41 UTC (permalink / raw)
  To: KOSAKI Motohiro, David S. Miller, netdev
  Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Jeff Moyer,
	linux-mm, linux-fsdevel, LKML

On Tuesday 14 April 2009 16:15:43 KOSAKI Motohiro wrote:
> 
> Linux Device Drivers, Third Edition, Chapter 15: Memory Mapping and DMA says
> 
> 	get_user_pages is a low-level memory management function, with a suitably complex
> 	interface. It also requires that the mmap reader/writer semaphore for the address
> 	space be obtained in read mode before the call. As a result, calls to get_user_pages
> 	usually look something like:
> 
> 		down_read(&current->mm->mmap_sem);
> 		result = get_user_pages(current, current->mm, ...);
> 		up_read(&current->mm->mmap_sem);
> 
> 	The return value is the number of pages actually mapped, which could be fewer than
> 	the number requested (but greater than zero).
> 
> But this isn't true. mmap_sem isn't only used for vma traversal; it also prevents a race against fork.
> up_read(mmap_sem) marks the end of the critical section, IOW code after up_read() is fork-unsafe.
> (access_process_vm() shows the proper get_user_pages() usage.)
> 
> Oh well, we have many wrong callers now. What is the best way to fix them?

What indeed...


> Nick Piggin and Andrea Arcangeli proposed changing get_user_pages() semantics to match what callers expect;
>   see the "[PATCH] fork vs gup(-fast) fix" thread in linux-mm.
> But Linus NACKed it.
> 
> Thus I made this patch series using the caller-change approach. It is meant for discussion, to compare against Nick's approach;
> I don't intend to submit it yet.
> 
> Nick, this version fixes the vmsplice and aio issues you pointed out. I hope to hear your opinion ;)

I don't see how it fixes vmsplice. vmsplice can get_user_pages pages from one
process's address space and put them into a pipe, and I think they are released by
another process after it consumes them. So it's fairly hard to hold
a lock over this.

Apart from the vmsplice issue (unless I missed a clever fix), I guess
this *does* work. I can't see any races... I'd really still like to hear a good
reason why my proposed patch is so obviously crap.

Reasons proposed so far:
"No locking" (I think this is a good thing; no *bugs* have been pointed out)
"Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
has a flags shortage so we can easily overload a pagecache flag)
"Diffstat too large" (seems comparable when you factor in the fixes to callers,
but has the advantage of being contained within VM subsystem)
"Horrible code" (I still don't see it. Of course the code will be nicer if we
don't fix the issue _at all_, but I don't see this is so much worse than having
to fix callers.)

FWIW, I have attached my patch again (with simple function-movement hunks
moved into another patch so it is easier to see the real impact of this patch).


> ChangeLog:
>   V2 -> V3
>    o remove early decow logic
>    o introduce prevent unmap logic
>    o fix nfs-directio
>    o fix aio
>    o fix bio (only bandaid fix)
> 
>   V1 -> V2
>    o fix aio+dio case
> 
> TODO
>   o implement down_write_killable()
>   o fix kvm (need?)
>    o fix get_arg_page() (Why doesn't this function use mmap_sem?)

Probably someone was shooting for a crazy optimisation because no other
threads should be able to access this mm yet :)

Anyway, this is my proposal. It has the advantage that it fixes every
caller in the tree.

---
 arch/powerpc/mm/gup.c      |    2 
 arch/x86/mm/gup.c          |    3 
 include/linux/mm.h         |    2 
 include/linux/page-flags.h |    5 +
 kernel/fork.c              |    2 
 mm/memory.c                |  178 +++++++++++++++++++++++++++++++++++++++------
 6 files changed, 167 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -791,7 +791,7 @@ int walk_page_range(unsigned long addr,
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -601,12 +601,103 @@ void zap_pte(struct mm_struct *mm, struc
 	}
 }
 /*
+ * breaks COW of child pte that has been marked COW by fork().
+ * Must be called with the child's ptl held and pte mapped.
+ * Returns 0 on success with ptl held and pte mapped.
+ * -ENOMEM on OOM failure, or -EAGAIN if something changed under us.
+ * ptl dropped and pte unmapped on error cases.
+ */
+static noinline int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd,
+			spinlock_t *ptl, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pte_t pte = *ptep;
+	struct page *page, *new_page;
+	int ret;
+
+	BUG_ON(!pte_present(pte));
+	BUG_ON(pte_write(pte));
+
+	page = vm_normal_page(vma, address, pte);
+	BUG_ON(!page);
+	BUG_ON(!PageAnon(page));
+	BUG_ON(!PageDontCOW(page));
+
+	/* The following code comes from do_wp_page */
+	page_cache_get(page);
+	pte_unmap_unlock(pte, ptl);
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
+	VM_BUG_ON(page == ZERO_PAGE(0));
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		goto oom;
+	/*
+	 * Don't let another task, with possibly unlocked vma,
+	 * keep the mlocked page.
+	 */
+	if (vma->vm_flags & VM_LOCKED) {
+		lock_page(page);	/* for LRU manipulation */
+		clear_page_mlock(page);
+		unlock_page(page);
+	}
+	cow_user_page(new_page, page, address, vma);
+	__SetPageUptodate(new_page);
+
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
+		goto oom_free_new;
+
+	/*
+	 * Re-check the pte - we dropped the lock
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (pte_same(*ptep, pte)) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(pte));
+		entry = mk_pte(new_page, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		/*
+		 * Clear the pte entry and flush it first, before updating the
+		 * pte with the new entry. This will avoid a race condition
+		 * seen in the presence of one thread doing SMC and another
+		 * thread doing COW.
+		 */
+		ptep_clear_flush_notify(vma, address, ptep);
+		page_add_new_anon_rmap(new_page, vma, address);
+		set_pte_at(mm, address, ptep, entry);
+
+		/* See comment in do_wp_page */
+		page_remove_rmap(page);
+		page_cache_release(page);
+		ret = 0;
+	} else {
+		if (!pte_none(*ptep))
+			zap_pte(mm, vma, address, ptep);
+		pte_unmap_unlock(pte, ptl);
+		mem_cgroup_uncharge_page(new_page);
+		page_cache_release(new_page);
+		ret = -EAGAIN;
+	}
+	page_cache_release(page);
+
+	return ret;
+
+oom_free_new:
+	page_cache_release(new_page);
+oom:
+	page_cache_release(page);
+	return -ENOMEM;
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -614,6 +705,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int ret = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -665,20 +757,26 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
+		if (unlikely(PageDontCOW(page)) && PageAnon(page))
+			ret = 1;
 	}
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return ret;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int decow;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -705,7 +803,10 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+						src_vma, addr, rss);
+		if (unlikely(decow))
+			goto decow;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -714,14 +815,31 @@ again:
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
+next:
 	cond_resched();
 	if (addr != end)
 		goto again;
 	return 0;
+
+decow:
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(src_ptl);
+	pte_unmap_nested(src_pte);
+	add_mm_rss(dst_mm, rss[0], rss[1]);
+	decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr);
+	if (decow == -ENOMEM)
+		return -ENOMEM;
+	if (decow == -EAGAIN)
+		goto again;
+	pte_unmap_unlock(dst_pte, dst_ptl);
+	cond_resched();
+	addr += PAGE_SIZE;
+	goto next;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -736,14 +854,15 @@ static inline int copy_pmd_range(struct
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pgd_t *dst_pgd, pgd_t *src_pgd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -758,19 +877,19 @@ static inline int copy_pud_range(struct
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
 	int ret;
 
 	/*
@@ -779,20 +898,20 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
-		if (!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+		if (!src_vma->anon_vma)
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
 
-	if (unlikely(is_pfn_mapping(vma))) {
+	if (unlikely(is_pfn_mapping(src_vma))) {
 		/*
 		 * We do not free on error cases below as remove_vma
 		 * gets called on error from higher level routine
 		 */
-		ret = track_pfn_vma_copy(vma);
+		ret = track_pfn_vma_copy(src_vma);
 		if (ret)
 			return ret;
 	}
@@ -803,7 +922,7 @@ int copy_page_range(struct mm_struct *ds
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
 	ret = 0;
@@ -814,15 +933,16 @@ int copy_page_range(struct mm_struct *ds
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, src_vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+						  src_vma->vm_start, end);
+
 	return ret;
 }
 
@@ -1272,8 +1392,6 @@ static inline int use_zero_page(struct v
 	return !vma->vm_ops || !vma->vm_ops->fault;
 }
 
-
-
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -1298,6 +1416,7 @@ int __get_user_pages(struct task_struct
 	do {
 		struct vm_area_struct *vma;
 		unsigned int foll_flags;
+		int decow;
 
 		vma = find_extend_vma(mm, start);
 		if (!vma && in_gate_area(tsk, start)) {
@@ -1352,6 +1471,14 @@ int __get_user_pages(struct task_struct
 			continue;
 		}
 
+		/*
+		 * Except in special cases where the caller will not read to or
+		 * write from these pages, we must break COW for any pages
+		 * returned from get_user_pages, so that our caller does not
+		 * subsequently end up with the pages of a parent or child
+		 * process after a COW takes place.
+		 */
+		decow = (pages && is_cow_mapping(vma->vm_flags));
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
@@ -1372,7 +1499,7 @@ int __get_user_pages(struct task_struct
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
@@ -1415,6 +1542,9 @@ int __get_user_pages(struct task_struct
 			if (pages) {
 				pages[i] = page;
 
+				if (decow && !PageDontCOW(page) &&
+						PageAnon(page))
+					SetPageDontCOW(page);
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
 			}
@@ -1966,6 +2096,8 @@ static int do_wp_page(struct mm_struct *
 		}
 		reuse = reuse_swap_page(old_page);
 		unlock_page(old_page);
+		VM_BUG_ON(PageDontCOW(old_page) && !reuse);
+
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		/*
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+failed:
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
+			goto failed;
 		get_page(page);
 		pages[*nr] = page;
 		(*nr)++;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -106,6 +106,9 @@ enum pageflags {
 #endif
 	__NR_PAGEFLAGS,
 
+	/* Anonymous pages */
+	PG_dontcow = PG_owner_priv_1,	/* do not WP for COW optimisation */
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -225,6 +228,8 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 __PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(DontCOW, dontcow)
+SETPAGEFLAG(DontCOW, dontcow)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -359,7 +359,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/gup.c
+++ linux-2.6/arch/powerpc/mm/gup.c
@@ -41,6 +41,8 @@ static noinline int gup_pte_range(pmd_t
 			return 0;
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
+			return 0;
 		if (!page_cache_get_speculative(page))
 			return 0;
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {


* Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix
  2009-04-14  8:41 ` [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix Nick Piggin
@ 2009-04-14  9:19   ` KOSAKI Motohiro
  2009-04-14  9:37     ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2009-04-14  9:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, David S. Miller, netdev, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Jeff Moyer, linux-mm,
	linux-fsdevel, LKML

Hi

> > But this isn't true. mmap_sem isn't only used for vma traversal; it also prevents a race against fork.
> > up_read(mmap_sem) marks the end of the critical section, IOW code after up_read() is fork-unsafe.
> > (access_process_vm() shows the proper get_user_pages() usage.)
> > 
> > Oh well, we have many wrong callers now. What is the best way to fix them?
> 
> What indeed...

Yes. Changing the semantics versus changing the many callers is one of the most
important points in this discussion.
I hope everyone involved reaches the same conclusion.



> > Nick, this version fixes the vmsplice and aio issues you pointed out. I hope to hear your opinion ;)
> 
> I don't see how it fixes vmsplice. vmsplice can get_user_pages pages from one
> process's address space and put them into a pipe, and I think they are released by
> another process after it consumes them. So it's fairly hard to hold
> a lock over this.

I admit my explanation was poor.

Firstly, pipe_to_user() via vmsplice_to_user() uses copy_to_user(), so we don't need
to care about the receive side.
Secondly, get_iovec_page_array() via vmsplice_to_pipe() uses gup(read),
so we only need to prevent the page from being changed.

I changed reuse_swap_page() in [1/6]. So if any process touches the page before
the data has been received, that breaks COW and the toucher gets a copied page.
Thus nobody can change the original page.

So this patch series also fixes the vmsplice issue, I think.
Am I missing anything?

> Apart from the vmsplice issue (unless I missed a clever fix), I guess
> this *does* work. I can't see any races... I'd really still like to hear a good
> reason why my proposed patch is so obviously crap.
> 
> Reasons proposed so far:
> "No locking" (I think this is a good thing; no *bugs* have been pointed out)
> "Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
> has a flags shortage so we can easily overload a pagecache flag)
> "Diffstat too large" (seems comparable when you factor in the fixes to callers,
> but has the advantage of being contained within VM subsystem)
> "Horrible code" (I still don't see it. Of course the code will be nicer if we
> don't fix the issue _at all_, but I don't see this is so much worse than having
> to fix callers.)

Honestly, I don't dislike your approach,
but I really hope to fix this bug. If someone NAKs your patch, I'll seek another way.



> FWIW, I have attached my patch again (with simple function-movement hunks
> moved into another patch so it is easier to see real impact of this patch).

OK, I'll try to test your patch too.


 - kosaki

> +		decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> +						src_vma, addr, rss);
> +		if (unlikely(decow))
> +			goto decow;
>  		progress += 8;
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>  
> @@ -714,14 +815,31 @@ again:
>  	pte_unmap_nested(src_pte - 1);
>  	add_mm_rss(dst_mm, rss[0], rss[1]);
>  	pte_unmap_unlock(dst_pte - 1, dst_ptl);
> +next:
>  	cond_resched();
>  	if (addr != end)
>  		goto again;
>  	return 0;
> +
> +decow:
> +	arch_leave_lazy_mmu_mode();
> +	spin_unlock(src_ptl);
> +	pte_unmap_nested(src_pte);
> +	add_mm_rss(dst_mm, rss[0], rss[1]);
> +	decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr);
> +	if (decow == -ENOMEM)
> +		return -ENOMEM;
> +	if (decow == -EAGAIN)
> +		goto again;
> +	pte_unmap_unlock(dst_pte, dst_ptl);
> +	cond_resched();
> +	addr += PAGE_SIZE;
> +	goto next;
>  }
>  
>  static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
> +		pud_t *dst_pud, pud_t *src_pud,
> +		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		unsigned long addr, unsigned long end)
>  {
>  	pmd_t *src_pmd, *dst_pmd;
> @@ -736,14 +854,15 @@ static inline int copy_pmd_range(struct
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> -						vma, addr, next))
> +						dst_vma, src_vma, addr, next))
>  			return -ENOMEM;
>  	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
>  	return 0;
>  }
>  
>  static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
> +		pgd_t *dst_pgd, pgd_t *src_pgd,
> +		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		unsigned long addr, unsigned long end)
>  {
>  	pud_t *src_pud, *dst_pud;
> @@ -758,19 +877,19 @@ static inline int copy_pud_range(struct
>  		if (pud_none_or_clear_bad(src_pud))
>  			continue;
>  		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
> -						vma, addr, next))
> +						dst_vma, src_vma, addr, next))
>  			return -ENOMEM;
>  	} while (dst_pud++, src_pud++, addr = next, addr != end);
>  	return 0;
>  }
>  
>  int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		struct vm_area_struct *vma)
> +		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  {
>  	pgd_t *src_pgd, *dst_pgd;
>  	unsigned long next;
> -	unsigned long addr = vma->vm_start;
> -	unsigned long end = vma->vm_end;
> +	unsigned long addr = src_vma->vm_start;
> +	unsigned long end = src_vma->vm_end;
>  	int ret;
>  
>  	/*
> @@ -779,20 +898,20 @@ int copy_page_range(struct mm_struct *ds
>  	 * readonly mappings. The tradeoff is that copy_page_range is more
>  	 * efficient than faulting.
>  	 */
> -	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> -		if (!vma->anon_vma)
> +	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> +		if (!src_vma->anon_vma)
>  			return 0;
>  	}
>  
> -	if (is_vm_hugetlb_page(vma))
> -		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> +	if (is_vm_hugetlb_page(src_vma))
> +		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
>  
> -	if (unlikely(is_pfn_mapping(vma))) {
> +	if (unlikely(is_pfn_mapping(src_vma))) {
>  		/*
>  		 * We do not free on error cases below as remove_vma
>  		 * gets called on error from higher level routine
>  		 */
> -		ret = track_pfn_vma_copy(vma);
> +		ret = track_pfn_vma_copy(src_vma);
>  		if (ret)
>  			return ret;
>  	}
> @@ -803,7 +922,7 @@ int copy_page_range(struct mm_struct *ds
>  	 * parent mm. And a permission downgrade will only happen if
>  	 * is_cow_mapping() returns true.
>  	 */
> -	if (is_cow_mapping(vma->vm_flags))
> +	if (is_cow_mapping(src_vma->vm_flags))
>  		mmu_notifier_invalidate_range_start(src_mm, addr, end);
>  
>  	ret = 0;
> @@ -814,15 +933,16 @@ int copy_page_range(struct mm_struct *ds
>  		if (pgd_none_or_clear_bad(src_pgd))
>  			continue;
>  		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
> -					    vma, addr, next))) {
> +					    dst_vma, src_vma, addr, next))) {
>  			ret = -ENOMEM;
>  			break;
>  		}
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
> -	if (is_cow_mapping(vma->vm_flags))
> +	if (is_cow_mapping(src_vma->vm_flags))
>  		mmu_notifier_invalidate_range_end(src_mm,
> -						  vma->vm_start, end);
> +						  src_vma->vm_start, end);
> +
>  	return ret;
>  }
>  
> @@ -1272,8 +1392,6 @@ static inline int use_zero_page(struct v
>  	return !vma->vm_ops || !vma->vm_ops->fault;
>  }
>  
> -
> -
>  int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		     unsigned long start, int len, int flags,
>  		struct page **pages, struct vm_area_struct **vmas)
> @@ -1298,6 +1416,7 @@ int __get_user_pages(struct task_struct
>  	do {
>  		struct vm_area_struct *vma;
>  		unsigned int foll_flags;
> +		int decow;
>  
>  		vma = find_extend_vma(mm, start);
>  		if (!vma && in_gate_area(tsk, start)) {
> @@ -1352,6 +1471,14 @@ int __get_user_pages(struct task_struct
>  			continue;
>  		}
>  
> +		/*
> +		 * Except in special cases where the caller will not read from or
> +		 * write to these pages, we must break COW for any pages
> +		 * returned from get_user_pages, so that our caller does not
> +		 * subsequently end up with the pages of a parent or child
> +		 * process after a COW takes place.
> +		 */
> +		decow = (pages && is_cow_mapping(vma->vm_flags));
>  		foll_flags = FOLL_TOUCH;
>  		if (pages)
>  			foll_flags |= FOLL_GET;
> @@ -1372,7 +1499,7 @@ int __get_user_pages(struct task_struct
>  					fatal_signal_pending(current)))
>  				return i ? i : -ERESTARTSYS;
>  
> -			if (write)
> +			if (write || decow)
>  				foll_flags |= FOLL_WRITE;
>  
>  			cond_resched();
> @@ -1415,6 +1542,9 @@ int __get_user_pages(struct task_struct
>  			if (pages) {
>  				pages[i] = page;
>  
> +				if (decow && !PageDontCOW(page) &&
> +						PageAnon(page))
> +					SetPageDontCOW(page);
>  				flush_anon_page(vma, page, start);
>  				flush_dcache_page(page);
>  			}
> @@ -1966,6 +2096,8 @@ static int do_wp_page(struct mm_struct *
>  		}
>  		reuse = reuse_swap_page(old_page);
>  		unlock_page(old_page);
> +		VM_BUG_ON(PageDontCOW(old_page) && !reuse);
> +
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
>  		/*
> Index: linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/gup.c
> +++ linux-2.6/arch/x86/mm/gup.c
> @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t
>  		struct page *page;
>  
>  		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
> +failed:
>  			pte_unmap(ptep);
>  			return 0;
>  		}
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
> +			goto failed;
>  		get_page(page);
>  		pages[*nr] = page;
>  		(*nr)++;
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -106,6 +106,9 @@ enum pageflags {
>  #endif
>  	__NR_PAGEFLAGS,
>  
> +	/* Anonymous pages */
> +	PG_dontcow = PG_owner_priv_1,	/* do not WP for COW optimisation */
> +
>  	/* Filesystems */
>  	PG_checked = PG_owner_priv_1,
>  
> @@ -225,6 +228,8 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
>   */
>  TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
>  __PAGEFLAG(Buddy, buddy)
> +__PAGEFLAG(DontCOW, dontcow)
> +SETPAGEFLAG(DontCOW, dontcow)
>  PAGEFLAG(MappedToDisk, mappedtodisk)
>  
>  /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c
> +++ linux-2.6/kernel/fork.c
> @@ -359,7 +359,7 @@ static int dup_mmap(struct mm_struct *mm
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, mpnt);
> +		retval = copy_page_range(mm, oldmm, tmp, mpnt);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -41,6 +41,8 @@ static noinline int gup_pte_range(pmd_t
>  			return 0;
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
> +			return 0;
>  		if (!page_cache_get_speculative(page))
>  			return 0;
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix
  2009-04-14  9:19   ` KOSAKI Motohiro
@ 2009-04-14  9:37     ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2009-04-14  9:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David S. Miller, netdev, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Jeff Moyer, linux-mm, linux-fsdevel, LKML

On Tuesday 14 April 2009 19:19:10 KOSAKI Motohiro wrote:

> > I don't see how it fixes vmsplice? vmsplice can get_user_pages pages from one
> > process's address space and put them into a pipe, and they are released by
> > another process after consuming the pages I think. So it's fairly hard to hold
> > a lock over this.
> 
> I recognize my explanation was poor.
> 
> Firstly, pipe_to_user() via vmsplice_to_user() uses copy_to_user(), so we
> don't need to care about the receive side.
> Secondly, get_iovec_page_array() via vmsplice_to_pipe() uses gup(read),
> so we only need to prevent the page from being changed.
> 
> I changed reuse_swap_page() in [1/6]: if any process touches the page
> before it has been received, COW is broken and the toucher gets a copied
> page, so nobody can change the original page.
> 
> Thus, this patch series also fixes the vmsplice issue, I think.
> Am I missing anything?

Ah thanks, I see now. No I don't think you're missing anything.


> > I guess apart from the vmsplice issue (unless I missed a clever fix), I guess
> > this *does* work. I can't see any races... I'd really still like to hear a good
> > reason why my proposed patch is so obviously crap.
> > 
> > Reasons proposed so far:
> > "No locking" (I think this is a good thing; no *bugs* have been pointed out)
> > "Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
> > has a flags shortage so we can easily overload a pagecache flag)
> > "Diffstat too large" (seems comparable when you factor in the fixes to callers,
> > but has the advantage of being contained within VM subsystem)
> > "Horrible code" (I still don't see it. Of course the code will be nicer if we
> > don't fix the issue _at all_, but I don't see this is so much worse than having
> > to fix callers.)
> 
> Honestly, I don't dislike your approach,
> but I really hope to fix this bug. If someone NAKs your patch, I'll seek another way.

Yes, I appreciate you looking at alternatives, and you haven't been strongly
arguing against my patch. So this comment was not aimed at you :)


> > FWIW, I have attached my patch again (with simple function-movement hunks
> > moved into another patch so it is easier to see real impact of this patch).
> 
> OK. I try to test your patch too.

Well I split it out and it requires another patch to move functions around
(eg. zap_pte from fremap.c into memory.c). I just attached it here to
illustrate the core of my fix. If you would like to run any real tests, let
me know and I could send a proper rollup.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-14  6:58     ` KOSAKI Motohiro
@ 2009-04-15  8:48       ` KOSAKI Motohiro
  2009-04-17 15:07         ` Sosnowski, Maciej
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2009-04-15  8:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, LKML, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Jeff Moyer, linux-mm, linux-fsdevel,
	Maciej Sosnowski, David S. Miller, Chris Leech, netdev


> > I would perhaps not fold gup_fast conversions into the same patch as
> > the fix.
> 
> OK. I'll fix.

Done.



===================================
Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c

	down_read(mmap_sem)
	get_user_pages()
	up_read(mmap_sem)

is fork unsafe.
fix it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: netdev@vger.kernel.org
---
 drivers/dma/iovlock.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c	2009-04-13 22:58:36.000000000 +0900
+++ b/drivers/dma/iovlock.c	2009-04-14 20:27:16.000000000 +0900
@@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
 			0,	/* force */
 			page_list->pages,
 			NULL);
-		up_read(&current->mm->mmap_sem);
-
 		if (ret != page_list->nr_pages)
 			goto unpin;
 
@@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
 	if (!pinned_list)
 		return;
 
+	up_read(&current->mm->mmap_sem);
+
 	for (i = 0; i < pinned_list->nr_iovecs; i++) {
 		struct dma_page_list *page_list = &pinned_list->page_list[i];
 		for (j = 0; j < page_list->nr_pages; j++) {



^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-15  8:48       ` KOSAKI Motohiro
@ 2009-04-17 15:07         ` Sosnowski, Maciej
  2009-04-19 12:37           ` KOSAKI Motohiro
  0 siblings, 1 reply; 10+ messages in thread
From: Sosnowski, Maciej @ 2009-04-17 15:07 UTC (permalink / raw)
  To: KOSAKI Motohiro, Nick Piggin
  Cc: LKML, Linus Torvalds, Andrew Morton, Andrea Arcangeli, Jeff Moyer,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	David S. Miller, Leech, Christopher, netdev@vger.kernel.org

KOSAKI Motohiro wrote:
>>> I would perhaps not fold gup_fast conversions into the same patch as
>>> the fix.
>> 
>> OK. I'll fix.
> 
> Done.
> 
> 
> 
> ===================================
> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> 
> 	down_read(mmap_sem)
> 	get_user_pages()
> 	up_read(mmap_sem)
> 
> is fork unsafe.
> fix it.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Chris Leech <christopher.leech@intel.com>
> Cc: netdev@vger.kernel.org
> ---
>  drivers/dma/iovlock.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: b/drivers/dma/iovlock.c
> ===================================================================
> --- a/drivers/dma/iovlock.c	2009-04-13 22:58:36.000000000 +0900
> +++ b/drivers/dma/iovlock.c	2009-04-14 20:27:16.000000000 +0900
> @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
>  			0,	/* force */
>  			page_list->pages,
>  			NULL);
> -		up_read(&current->mm->mmap_sem);
> -
>  		if (ret != page_list->nr_pages)
>  			goto unpin;
> 
> @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
>  	if (!pinned_list)
>  		return;
> 
> +	up_read(&current->mm->mmap_sem);
> +
>  	for (i = 0; i < pinned_list->nr_iovecs; i++) {
>  		struct dma_page_list *page_list = &pinned_list->page_list[i];
>  		for (j = 0; j < page_list->nr_pages; j++) {

I have tried it with net_dma and here is what I've got.

Regards,
Maciej
---

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc2 #2
 -------------------------------------------------------
 iperf/10555 is trying to acquire lock:
  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff80450991>] sk_wait_data+0x90/0xc5

 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<ffffffff8043d0f6>] dma_pin_iovec_pages+0x122/0x1a0
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (&mm->mmap_sem){++++++}:
        [<ffffffffffffffff>] 0xffffffffffffffff
 
 -> #0 (sk_lock-AF_INET){+.+.+.}:
        [<ffffffffffffffff>] 0xffffffffffffffff
 
 other info that might help us debug this:
 
 1 lock held by iperf/10555:
  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8043d0f6>] dma_pin_iovec_pages+0x122/0x1a0
 
 stack backtrace:
 Pid: 10555, comm: iperf Tainted: G        W  2.6.30-rc2 #2
 Call Trace:
  [<ffffffff8025b1dc>] ? print_circular_bug_tail+0xc0/0xc9
  [<ffffffff8025aa2b>] ? print_circular_bug_header+0xc8/0xcf
  [<ffffffff8025b984>] ? validate_chain+0x67d/0xc7c
  [<ffffffff8025c6e6>] ? __lock_acquire+0x763/0x7ec
  [<ffffffff8025c835>] ? lock_acquire+0xc6/0xea
  [<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
  [<ffffffff8044ff01>] ? lock_sock_nested+0xee/0x100
  [<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
  [<ffffffff80259b07>] ? mark_held_locks+0x43/0x5b
  [<ffffffff802403ee>] ? local_bh_enable_ip+0xc4/0xc7
  [<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
  [<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
  [<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff80488ad3>] ? tcp_recvmsg+0x3bf/0xa21
  [<ffffffff8044f601>] ? sock_common_recvmsg+0x30/0x45
  [<ffffffff8044d847>] ? sock_recvmsg+0xf0/0x10f
  [<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff8025c704>] ? __lock_acquire+0x781/0x7ec
  [<ffffffff802b553c>] ? fget_light+0xd5/0xdf
  [<ffffffff802b54b0>] ? fget_light+0x49/0xdf
  [<ffffffff8044e91b>] ? sys_recvfrom+0xbc/0x119
  [<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
  [<ffffffff804e1a7f>] ? _spin_unlock_irq+0x24/0x27
  [<ffffffff802359ab>] ? finish_task_switch+0x7a/0xe4
  [<ffffffff80235967>] ? finish_task_switch+0x36/0xe4
  [<ffffffff804df188>] ? thread_return+0x3e/0x97
  [<ffffffff802718f7>] ? audit_syscall_entry+0x192/0x1bd
  [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b
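
Reading the report: with the patch applied, dma_pin_iovec_pages() now holds
mmap_sem across the receive, so tcp_recvmsg() acquires mmap_sem and then
sk_lock (in sk_wait_data()), while an existing path acquires them in the
opposite order — the socket lock is held while copying to/from userspace,
which can fault and take mmap_sem. A rough sketch of the inversion (the
second path's exact call site is an assumption; lockdep only prints the
dependency):

```
/* new ordering introduced by the patch: */
tcp_recvmsg()
  dma_pin_iovec_pages()
    down_read(&current->mm->mmap_sem)   /* held across the receive */
  sk_wait_data()
    lock_sock(sk)                       /* mmap_sem -> sk_lock */

/* pre-existing ordering, e.g. in tcp_sendmsg(): */
lock_sock(sk)
copy_from_user()                        /* may fault with sk_lock held */
  do_page_fault()
    down_read(&mm->mmap_sem)            /* sk_lock -> mmap_sem */
```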


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-17 15:07         ` Sosnowski, Maciej
@ 2009-04-19 12:37           ` KOSAKI Motohiro
  2009-04-23 12:48             ` Sosnowski, Maciej
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2009-04-19 12:37 UTC (permalink / raw)
  To: Sosnowski, Maciej
  Cc: kosaki.motohiro, Nick Piggin, LKML, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Jeff Moyer, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, David S. Miller,
	Leech, Christopher, netdev@vger.kernel.org

> KOSAKI Motohiro wrote:
> >>> I would perhaps not fold gup_fast conversions into the same patch as
> >>> the fix.
> >> 
> >> OK. I'll fix.
> > 
> > Done.
> > 
> > 
> > 
> > ===================================
> > Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> > 
> > 	down_read(mmap_sem)
> > 	get_user_pages()
> > 	up_read(mmap_sem)
> > 
> > is fork unsafe.
> > fix it.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
> > Cc: David S. Miller <davem@davemloft.net>
> > Cc: Chris Leech <christopher.leech@intel.com>
> > Cc: netdev@vger.kernel.org
> > ---
> >  drivers/dma/iovlock.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > Index: b/drivers/dma/iovlock.c
> > ===================================================================
> > --- a/drivers/dma/iovlock.c	2009-04-13 22:58:36.000000000 +0900
> > +++ b/drivers/dma/iovlock.c	2009-04-14 20:27:16.000000000 +0900
> > @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
> >  			0,	/* force */
> >  			page_list->pages,
> >  			NULL);
> > -		up_read(&current->mm->mmap_sem);
> > -
> >  		if (ret != page_list->nr_pages)
> >  			goto unpin;
> > 
> > @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
> >  	if (!pinned_list)
> >  		return;
> > 
> > +	up_read(&current->mm->mmap_sem);
> > +
> >  	for (i = 0; i < pinned_list->nr_iovecs; i++) {
> >  		struct dma_page_list *page_list = &pinned_list->page_list[i];
> >  		for (j = 0; j < page_list->nr_pages; j++) {
> 
> I have tried it with net_dma and here is what I've got.

Thanks.
How about this instead?


============================================
Subject: [Untested][RFC][PATCH v3] fix wrong get_user_pages usage in iovlock.c

	down_read(mmap_sem)
	get_user_pages()
	up_read(mmap_sem)

is fork unsafe.
mmap_sem shouldn't be released until dma_unpin_iovec_pages() is called.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: netdev@vger.kernel.org
---
 drivers/dma/iovlock.c |    5 ++---
 net/ipv4/tcp.c        |    9 +++++++++
 2 files changed, 11 insertions(+), 3 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c	2009-04-19 17:27:25.000000000 +0900
+++ b/drivers/dma/iovlock.c	2009-04-19 17:29:42.000000000 +0900
@@ -45,6 +45,8 @@ static int num_pages_spanned(struct iove
  * We are allocating a single chunk of memory, and then carving it up into
  * 3 sections, the latter 2 whose size depends on the number of iovecs and the
  * total number of pages, respectively.
+ *
+ * Caller must hold mm->mmap_sem
  */
 struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
 {
@@ -94,7 +96,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
 		pages += page_list->nr_pages;
 
 		/* pin pages down */
-		down_read(&current->mm->mmap_sem);
 		ret = get_user_pages(
 			current,
 			current->mm,
@@ -104,8 +105,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
 			0,	/* force */
 			page_list->pages,
 			NULL);
-		up_read(&current->mm->mmap_sem);
-
 		if (ret != page_list->nr_pages)
 			goto unpin;
 
Index: b/net/ipv4/tcp.c
===================================================================
--- a/net/ipv4/tcp.c	2009-04-19 17:27:25.000000000 +0900
+++ b/net/ipv4/tcp.c	2009-04-19 18:09:42.000000000 +0900
@@ -1322,6 +1322,9 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	int copied_early = 0;
 	struct sk_buff *skb;
 
+#ifdef CONFIG_NET_DMA
+	down_read(&current->mm->mmap_sem);
+#endif
 	lock_sock(sk);
 
 	TCP_CHECK_TIMER(sk);
@@ -1688,11 +1691,17 @@ skip_copy:
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+#ifdef CONFIG_NET_DMA
+	up_read(&current->mm->mmap_sem);
+#endif
 	return copied;
 
 out:
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+#ifdef CONFIG_NET_DMA
+	up_read(&current->mm->mmap_sem);
+#endif
 	return err;
 
 recv_urg:



^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c
  2009-04-19 12:37           ` KOSAKI Motohiro
@ 2009-04-23 12:48             ` Sosnowski, Maciej
  0 siblings, 0 replies; 10+ messages in thread
From: Sosnowski, Maciej @ 2009-04-23 12:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, LKML, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Jeff Moyer, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, David S. Miller,
	Leech, Christopher, netdev@vger.kernel.org

KOSAKI Motohiro wrote:
>> KOSAKI Motohiro wrote:
>>>>> I would perhaps not fold gup_fast conversions into the same patch as
>>>>> the fix.
>>>> 
>>>> OK. I'll fix.
>>> 
>>> Done.
>>> 
>>> 
>>> 
>>> ===================================
>>> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
>>> 
>>> 	down_read(mmap_sem)
>>> 	get_user_pages()
>>> 	up_read(mmap_sem)
>>> 
>>> is fork unsafe.
>>> fix it.
>>> 
>>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
>>> Cc: David S. Miller <davem@davemloft.net>
>>> Cc: Chris Leech <christopher.leech@intel.com>
>>> Cc: netdev@vger.kernel.org
>>> ---
>>>  drivers/dma/iovlock.c |    4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> Index: b/drivers/dma/iovlock.c
>>> ===================================================================
>>> --- a/drivers/dma/iovlock.c	2009-04-13 22:58:36.000000000 +0900
>>> +++ b/drivers/dma/iovlock.c	2009-04-14 20:27:16.000000000 +0900
>>> @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa  			0,	/* force */
>>>  			page_list->pages,
>>>  			NULL);
>>> -		up_read(&current->mm->mmap_sem);
>>> -
>>>  		if (ret != page_list->nr_pages)
>>>  			goto unpin;
>>> 
>>> @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi  	if (!pinned_list)
>>>  		return;
>>> 
>>> +	up_read(&current->mm->mmap_sem);
>>> +
>>>  	for (i = 0; i < pinned_list->nr_iovecs; i++) {
>>>  		struct dma_page_list *page_list = &pinned_list->page_list[i];
>>>  		for (j = 0; j < page_list->nr_pages; j++) {
>> 
>> I have tried it with net_dma and here is what I've got.
> 
> Thanks.
> Instead, How about this?
> 

Unfortunately, it still does not look good.

Regards,
Maciej

 =============================================
 [ INFO: possible recursive locking detected ]
 2.6.30-rc2 #14
 ---------------------------------------------
 iperf/9932 is trying to acquire lock:
  (&mm->mmap_sem){++++++}, at: [<ffffffff804e3d5e>] do_page_fault+0x170/0

 
 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<ffffffff80488722>] tcp_recvmsg+0x3a/0xa7

 
 other info that might help us debug this:
 2 locks held by iperf/9932:
  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff80488722>] tcp_recvmsg+0x3a

  #1:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff80450965>] sk_wait_data+0

 
 stack backtrace:
 Pid: 9932, comm: iperf Tainted: G        W  2.6.30-rc2 #14
 Call Trace:
  [<ffffffff8025b861>] ? validate_chain+0x55a/0xc7c
  [<ffffffff8025c6e6>] ? __lock_acquire+0x763/0x7ec
  [<ffffffff8025c835>] ? lock_acquire+0xc6/0xea
  [<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
  [<ffffffff804e0693>] ? down_read+0x46/0x77
  [<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
  [<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
  [<ffffffff804e1ebf>] ? page_fault+0x1f/0x30
  [<ffffffff803580ed>] ? copy_user_generic_string+0x2d/0x40
  [<ffffffff804562cc>] ? memcpy_toiovec+0x36/0x66
  [<ffffffff804569eb>] ? skb_copy_datagram_iovec+0x133/0x1f0
  [<ffffffff80490199>] ? tcp_rcv_established+0x297/0x71a
  [<ffffffff804953f8>] ? tcp_v4_do_rcv+0x2c/0x1d5
  [<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff80486239>] ? tcp_prequeue_process+0x6b/0x7e
  [<ffffffff80488b31>] ? tcp_recvmsg+0x449/0xa70
  [<ffffffff8025c704>] ? __lock_acquire+0x781/0x7ec
  [<ffffffff8044f5d5>] ? sock_common_recvmsg+0x30/0x45
  [<ffffffff8044d81b>] ? sock_recvmsg+0xf0/0x10f
  [<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
  [<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff8020c43c>] ? restore_args+0x0/0x30
  [<ffffffff802b553c>] ? fget_light+0xd5/0xdf
  [<ffffffff802b54b0>] ? fget_light+0x49/0xdf
  [<ffffffff8044e8ef>] ? sys_recvfrom+0xbc/0x119
  [<ffffffff802331cd>] ? try_to_wake_up+0x2ae/0x2c0
  [<ffffffff802718f7>] ? audit_syscall_entry+0x192/0x1bd
  [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b
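
This second report is the simpler one: the v3 patch makes tcp_recvmsg() take
mmap_sem for read up front, but copying the payload out can fault on the
destination page, and the fault handler takes mmap_sem for read again.
Sketching the path from the backtrace:

```
tcp_recvmsg()
  down_read(&current->mm->mmap_sem)     /* added by the v3 patch */
  ...
  skb_copy_datagram_iovec()
    memcpy_toiovec()
      copy_to_user()                    /* destination not mapped in */
        do_page_fault()
          down_read(&mm->mmap_sem)      /* second read acquisition */
```

Two read acquisitions by the same task are not safe on an rwsem: if a writer
(another thread in mmap() or fork(), say) queues between them, the second
down_read() blocks behind the writer and the task deadlocks against itself —
which is why lockdep complains even though both acquisitions are reads.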


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2009-04-23 12:48 UTC | newest]

Thread overview: 10+ messages
     [not found] <20090414151204.C647.A69D9226@jp.fujitsu.com>
2009-04-14  6:23 ` [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c KOSAKI Motohiro
2009-04-14  6:56   ` Nick Piggin
2009-04-14  6:58     ` KOSAKI Motohiro
2009-04-15  8:48       ` KOSAKI Motohiro
2009-04-17 15:07         ` Sosnowski, Maciej
2009-04-19 12:37           ` KOSAKI Motohiro
2009-04-23 12:48             ` Sosnowski, Maciej
2009-04-14  8:41 ` [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix Nick Piggin
2009-04-14  9:19   ` KOSAKI Motohiro
2009-04-14  9:37     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox