Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andi Kleen <andi@firstfloor.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
Date: Fri, 14 May 2010 10:54:50 +0100	[thread overview]
Message-ID: <20100514095449.GB21481@csn.ul.ie> (raw)
In-Reply-To: <20100514074641.GD10000@spritzerA.linux.bs1.fc.nec.co.jp>

On Fri, May 14, 2010 at 04:46:41PM +0900, Naoya Horiguchi wrote:
> On Thu, May 13, 2010 at 04:27:37PM +0100, Mel Gorman wrote:
> > On Thu, May 13, 2010 at 04:55:20PM +0900, Naoya Horiguchi wrote:
> > > While hugepage is not currently swappable, rmapping can be useful
> > > for memory error handler.
> > > Using rmap, memory error handler can collect processes affected
> > > by hugepage errors and unmap them to contain error's effect.
> > > 
> > 
> > As a verification point, can you ensure that the libhugetlbfs "make
> > func" tests complete successfully with this patch applied? It's also
> > important that there is no oddness in the Hugepage-related counters in
> > /proc/meminfo. I'm not in the position to test it now unfortunately as
> > I'm on the road.
> 
> Yes. Thanks for the good test-set.
> 
> Hmm. I failed libhugetlbfs test with a oops in "private mapped" test :(
> 

That's not a disaster - it's what the regression test is for. I haven't
restarted the review in this case. I'll wait for another version that
passes those regression tests.

> <OOPS SNIP>
> 
> Someone seems to call hugetlb_fault() with anon_vma == NULL.
> For more detail, I'm investigating it.
> 

Sure.

> > > Current status of hugepage rmap differs depending on mapping mode:
> > > - for shared hugepage:
> > >   we can collect processes using a hugepage through pagecache,
> > >   but can not unmap the hugepage because of the lack of mapcount.
> > > - for privately mapped hugepage:
> > >   we can neither collect processes nor unmap the hugepage.
> > > 
> > > To realize hugepage rmapping, this patch introduces mapcount for
> > > shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
> > > 
> > > This patch can be the replacement of the following bug fix.
> > > 
> > 
> > Actually, you replace chunks but not all of that fix with this patch.
> > After this patch HUGETLB_POISON is never assigned but the definition still
> > exists in poison.h. You should also remove it if it is unnecessary.
> 
> OK. I'll remove HUGETLB_POISON in the next post.
> 

Thanks

> <SNIP>
> > For ordinary anon_vma's, there
> > is a chain of related vma's chained together via the anon_vma's. It's so
> > in the event of an unmapping, all the PTEs related to the page can be
> > found. Where are we doing the same here?
> 
> Finding all processes using a hugepage is done by try_to_unmap() as usual.
> Among callers of this function, only memory error handler calls it for
> hugepage for now.

Ok, my bad, it's anon_vma_prepare that does most of the linkages.
However, there still appears to be logic missing between how anon rmap
pages are setup and hugetlb anon rmap pages. See __page_set_anon_rmap
for example and what it does with chains and compare it to
hugetlb_add_anon_rmap. There are some important differences.


> What this patch does is to enable try_to_unmap() to be called for hugepages
> by setting up anon_vma in hugetlb code.
> 
> > I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
> > from one process but if there are multiple processes, the second process could
> > still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
> > if it is, I guess it's still an improvement over what currently happens.
> 
> Try_to_unmap_anon() runs for each vma belonging to the anon_vma associated
> with the error hugepage. So it works for multiple processes.
> 

Yep, as long as anon_vma_prepare is called in all the correct cases. I
haven't double checked you have and will wait until you pin down why
anon_vma is NULL in the next version.

> > > +
> > >  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  			unsigned long address, pte_t *ptep, pte_t pte,
> > >  			struct page *pagecache_page)
> > > @@ -2348,6 +2371,12 @@ retry_avoidcopy:
> > >  		huge_ptep_clear_flush(vma, address, ptep);
> > >  		set_huge_pte_at(mm, address, ptep,
> > >  				make_huge_pte(vma, new_page, 1));
> > > +		page_remove_rmap(old_page);
> > > +		/*
> > > +		 * We need not call anon_vma_prepare() because anon_vma
> > > +		 * is already prepared when the process fork()ed.
> > > +		 */
> > > +		hugepage_add_anon_rmap(new_page, vma, address);
> > 
> > This means that the anon_vma is shared between parent and child even
> > after fork. Does this not mean that the behaviour of anon_vma differs
> > between the core VM and hugetlb?
> 
> No. IIUC, anon_vma associated with (non-huge) anonymous page is also shared
> between parent and child until COW.
> 

In the base page case, it does but where is a new anon_vma being
allocated here and the rmap moved with page_move_anon_rmap?

> > >  		/* Make the old page be freed below */
> > >  		new_page = old_page;
> > >  	}
> > > @@ -2450,7 +2479,11 @@ retry:
> > >  			spin_unlock(&inode->i_lock);
> > >  		} else {
> > >  			lock_page(page);
> > > -			page->mapping = HUGETLB_POISON;
> > > +			if (unlikely(anon_vma_prepare(vma))) {
> > > +				ret = VM_FAULT_OOM;
> > > +				goto backout_unlocked;
> > > +			}
> > > +			hugepage_add_anon_rmap(page, vma, address);
> > 
> > Seems ok for private pages at least.
> > 
> > >  		}
> > >  	}
> > >  
> > > @@ -2479,6 +2512,13 @@ retry:
> > >  				&& (vma->vm_flags & VM_SHARED)));
> > >  	set_huge_pte_at(mm, address, ptep, new_pte);
> > >  
> > > +	/*
> > > +	 * For privately mapped hugepage, _mapcount is incremented
> > > +	 * in hugetlb_cow(), so only increment for shared hugepage here.
> > > +	 */
> > > +	if (vma->vm_flags & VM_MAYSHARE)
> > > +		page_dup_rmap(page);
> > > +
> > 
> > What happens when try_to_unmap_file is called on a hugetlb page?
> 
> Try_to_unmap_file() is called for shared hugepages, so it tracks all vmas
> sharing one hugepage through pagecache pointed to by page->mapping,
> and sets all ptes into hwpoison swap entries instead of flushing them.
> Curiously file backed pte is changed to swap entry, but it's OK because
> hwpoison hugepage should not be touched afterward.
> 

Grand.

> > >  	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > >  		/* Optimization, do the COW without a second fault */
> > >  		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> > > diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
> > > index 0feeef8..58cd2f9 100644
> > > --- v2.6.34-rc7/mm/rmap.c
> > > +++ v2.6.34-rc7/mm/rmap.c
> > > @@ -56,6 +56,7 @@
> > >  #include <linux/memcontrol.h>
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/migrate.h>
> > > +#include <linux/hugetlb.h>
> > >  
> > >  #include <asm/tlbflush.h>
> > >  
> > > @@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> > >  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > >  	unsigned long address;
> > >  
> > > +	if (unlikely(is_vm_hugetlb_page(vma)))
> > > +		pgoff = page->index << compound_order(page);
> > 
> > Again, it would be nice to use hstate information if possible just so
> > how the pagesize is discovered is consistent.
> 
> OK.
> 
> > >  	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > >  	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > >  		/* page should be within @vma mapping range */
> > > @@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > >  	pte_t *pte;
> > >  	spinlock_t *ptl;
> > >  
> > > +	if (unlikely(PageHuge(page))) {
> > > +		pte = huge_pte_offset(mm, address);
> > > +		ptl = &mm->page_table_lock;
> > > +		goto check;
> > > +	}
> > > +
> > >  	pgd = pgd_offset(mm, address);
> > >  	if (!pgd_present(*pgd))
> > >  		return NULL;
> > > @@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > >  	}
> > >  
> > >  	ptl = pte_lockptr(mm, pmd);
> > > +check:
> > >  	spin_lock(ptl);
> > >  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
> > >  		*ptlp = ptl;
> > > @@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
> > >  		page_clear_dirty(page);
> > >  		set_page_dirty(page);
> > >  	}
> > > +	/*
> > > +	 * Mapping for Hugepages are not counted in NR_ANON_PAGES nor
> > > +	 * NR_FILE_MAPPED and no charged by memcg for now.
> > > +	 */
> > > +	if (unlikely(PageHuge(page)))
> > > +		return;
> > >  	if (PageAnon(page)) {
> > >  		mem_cgroup_uncharge_page(page);
> > >  		__dec_zone_page_state(page, NR_ANON_PAGES);
> > 
> > I don't see anything obviously wrong with this but it's a bit rushed and
> > there are a few snarls that I pointed out above. I'd like to hear it passed
> > the libhugetlbfs regression tests for different sizes without any oddness
> > in the counters.
> 
> Since there exists regression as described above, I'll fix it first of all.
> 

Sure. As it is, the hugetlb parts of this patch to my eye are not ready yet
with some snags that need ironing out. That said, I see nothing fundamentally
wrong with the approach as such.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-05-14  9:55 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
2010-05-13  9:18   ` Andi Kleen
2010-05-17  4:53     ` Naoya Horiguchi
2010-05-13 15:27   ` Mel Gorman
2010-05-13 16:14     ` Andi Kleen
2010-05-14  7:46     ` Naoya Horiguchi
2010-05-14  9:54       ` Mel Gorman [this message]
2010-05-24  7:15         ` Naoya Horiguchi
2010-05-25 10:59           ` Mel Gorman
2010-05-26  6:51             ` Naoya Horiguchi
2010-05-26  9:03               ` Mel Gorman
2010-05-26  9:19               ` Andi Kleen
2010-05-26  9:44                 ` Mel Gorman
2010-05-26  9:58                   ` Andi Kleen
2010-05-13  7:55 ` [PATCH 2/7] HWPOISON, hugetlb: enable error handling path " Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage Naoya Horiguchi
2010-05-13 14:27 ` [PATCH 0/7] HWPOISON for hugepage (v5) Mel Gorman
2010-05-14  7:35   ` Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100514095449.GB21481@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=fengguang.wu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).