linux-mm.kvack.org archive mirror
* [PATCH 0/7] HWPOISON for hugepage (v5)
@ 2010-05-13  7:55 Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi; +Cc: linux-kernel, linux-mm

This patchset enables memory error handling for hugepages by containing the
error within the affected hugepage.

Until now, a memory error (classified as SRAO in MCA terminology) on a hugepage
was simply ignored, which means that if someone accesses the error page later,
a second MCE (more severe than the first) occurs and the system panics.

It's useful for some aggressive hugepage users if only the affected processes
are killed.  Then other unrelated processes aren't disturbed by the error
and can continue operating.

Moreover, for other extensive hugetlb users which maintain their own
"pagecache" on hugepages, the most valuable feature would be the ability
to receive the early-kill signal BUS_MCEERR_AO, because such cache pages
can often be dropped on BUS_MCEERR_AO without side effects.


The design of hugepage error handling is based on that of non-hugepage
error handling, where we:
 1. mark the error page as hwpoison,
 2. unmap the hwpoisoned page from processes using it,
 3. invalidate the error page, and
 4. block later accesses to the hwpoisoned pages.

Similarities and differences between the huge and non-huge cases are
summarized below:

 1. (Difference) when an error occurs on a hugepage, PG_hwpoison bits on all
    pages in the hugepage are set, because we have no simple way to break up
    a hugepage into individual pages for now. This means there is some risk
    of being killed by touching non-guilty pages within the error hugepage.

 2. (Similarity) the hugetlb entry for the error hugepage is replaced by a
    hwpoison swap entry, with which we can detect hwpoisoned memory in VM code.
    This is accomplished by adding rmapping code for hugepage, which enables
    the use of try_to_unmap() on hugepages.

 3. (Difference) since a hugepage is not linked to the LRU list and is
    unswappable, there is not much to do for page invalidation (only dequeuing
    a free/reserved hugepage from the freelist; see patch 5/7).
    If we want to contain the error into one page, there may be more to do.

 4. (Similarity) we block later accesses by forcing page requests for the
    hwpoisoned hugepage to fail, as done in the non-hugepage case in do_wp_page().
    (See the sketch just below.)
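
To make the above concrete, here is a rough, simplified sketch (not the
actual patch code) of how __memory_failure() treats a hugepage after this
series; the helper names follow the patches below, error paths are omitted:

  /* rough sketch only -- see patches 1-6 for the real code */
  int __memory_failure(unsigned long pfn, int trapno, int flags)
  {
          struct page *p = pfn_to_page(pfn);
          struct page *hpage = compound_head(p);

          /* 1. mark the error page (for now, the whole hugepage) as hwpoison */
          if (TestSetPageHWPoison(p))
                  return 0;                       /* already poisoned */
          atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);

          lock_page_nosync(hpage);
          if (PageHuge(p))
                  set_page_hwpoison_huge_page(hpage);     /* patch 3/7 */

          /* 2. unmap it from all processes via the new hugepage rmap;
           *    the ptes become hwpoison swap entries (patches 1/7, 2/7) */
          hwpoison_user_mappings(p, pfn, trapno);

          /* 3. "invalidate": a free/reserved hugepage is dequeued from the
           *    freelist in me_huge_page() (patch 5/7) */
          /* 4. later faults see the hwpoison swap entry or PG_hwpoison and
           *    return VM_FAULT_HWPOISON (patch 6/7) */
          unlock_page(hpage);
          return 0;
  }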

ToDo:
- Narrow down the containment region into one raw page.
- Soft-offlining for hugepage is not supported due to the lack of migration
  for hugepage.
- Counting file-mapped/anonymous hugepage in NR_FILE_MAPPED/NR_ANON_PAGES.

 [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
 [PATCH 2/7] HWPOISON, hugetlb: enable error handling path for hugepage
 [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
 [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
 [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage
 [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code
 [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage

Dependency:
- patch 2 depends on patch 1.
- patch 3 to patch 6 depend on patch 2.

 include/linux/hugetlb.h |    3 +
 mm/hugetlb.c            |   98 ++++++++++++++++++++++++++++++++++++++-
 mm/hwpoison-inject.c    |   15 ++++--
 mm/memory-failure.c     |  120 +++++++++++++++++++++++++++++++++++------------
 mm/rmap.c               |   16 ++++++
 5 files changed, 215 insertions(+), 37 deletions(-)

ChangeLog from v4:
- rebased to 2.6.34-rc7
- add isolation code for free/reserved hugepage in me_huge_page()
- set/clear PG_hwpoison bits of all pages in hugepage.
- mce_bad_pages counts all pages in hugepage.
- rename __hugepage_set_anon_rmap() to hugepage_add_anon_rmap()
- add huge_pte_offset() dummy function in header file on !CONFIG_HUGETLBFS

ChangeLog from v3:
- rebased to 2.6.34-rc5
- support for privately mapped hugepage

ChangeLog from v2:
- rebase to 2.6.34-rc3
- consider mapcount of hugepage
- rename pointer "head" into "hpage"

ChangeLog from v1:
- rebase to 2.6.34-rc1
- add comment from Wu Fengguang

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  9:18   ` Andi Kleen
  2010-05-13 15:27   ` Mel Gorman
  2010-05-13  7:55 ` [PATCH 2/7] HWPOISON, hugetlb: enable error handling path " Naoya Horiguchi
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang,
	Mel Gorman

While hugepages are not currently swappable, reverse mapping can be useful
for the memory error handler.
Using rmap, the memory error handler can collect the processes affected
by a hugepage error and unmap it from them to contain the error's effect.

Current status of hugepage rmap differs depending on mapping mode:
- for shared hugepage:
  we can collect processes using a hugepage through pagecache,
  but can not unmap the hugepage because of the lack of mapcount.
- for privately mapped hugepage:
  we can neither collect processes nor unmap the hugepage.

To realize hugepage rmapping, this patch introduces mapcount for
shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
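
For background, the counting convention this reuses is the same as for
small pages (shown for reference, roughly as in 2.6.34): page->_mapcount
starts at -1, page_dup_rmap() records one more mapping, and
page_remove_rmap() detects when the last mapping goes away:

  /* include/linux/rmap.h / mm/rmap.c, trimmed for illustration */
  static inline void page_dup_rmap(struct page *page)
  {
          atomic_inc(&page->_mapcount);   /* one more pte maps the page */
  }

  void page_remove_rmap(struct page *page)
  {
          /* page still mapped by someone else? */
          if (!atomic_add_negative(-1, &page->_mapcount))
                  return;
          /* last mapping is gone: update statistics, etc. */
  }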

This patch can replace the following bug fix.

  commit 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703
  Author: Mel Gorman <mel@csn.ul.ie>
  Date:   Fri Apr 23 13:17:56 2010 -0400
  Subject: hugetlb: fix infinite loop in get_futex_key() when backed by huge pages

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/hugetlb.h |    1 +
 mm/hugetlb.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c               |   16 ++++++++++++++++
 3 files changed, 58 insertions(+), 1 deletions(-)

diff --git v2.6.34-rc7/include/linux/hugetlb.h v2.6.34-rc7/include/linux/hugetlb.h
index 78b4bc6..1d0c2a4 100644
--- v2.6.34-rc7/include/linux/hugetlb.h
+++ v2.6.34-rc7/include/linux/hugetlb.h
@@ -108,6 +108,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
+#define huge_pte_offset(mm, address)	0
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git v2.6.34-rc7/mm/hugetlb.c v2.6.34-rc7/mm/hugetlb.c
index ffbdfc8..149eb12 100644
--- v2.6.34-rc7/mm/hugetlb.c
+++ v2.6.34-rc7/mm/hugetlb.c
@@ -18,6 +18,7 @@
 #include <linux/bootmem.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
+#include <linux/rmap.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -2125,6 +2126,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(&src->page_table_lock);
@@ -2203,6 +2205,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
+		page_remove_rmap(page);
 		list_del(&page->lru);
 		put_page(page);
 	}
@@ -2268,6 +2271,26 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 1;
 }
 
+/*
+ * This is a counterpart of page_add_anon_rmap() for hugepage.
+ */
+static void hugepage_add_anon_rmap(struct page *page,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	int first;
+
+	BUG_ON(!anon_vma);
+	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	first = atomic_inc_and_test(&page->_mapcount);
+	if (first) {
+		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
+		page->mapping = (struct address_space *) anon_vma;
+		page->index = linear_page_index(vma, address)
+			>> compound_order(page);
+	}
+}
+
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
 			struct page *pagecache_page)
@@ -2348,6 +2371,12 @@ retry_avoidcopy:
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
+		page_remove_rmap(old_page);
+		/*
+		 * We need not call anon_vma_prepare() because anon_vma
+		 * is already prepared when the process fork()ed.
+		 */
+		hugepage_add_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
@@ -2450,7 +2479,11 @@ retry:
 			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
-			page->mapping = HUGETLB_POISON;
+			if (unlikely(anon_vma_prepare(vma))) {
+				ret = VM_FAULT_OOM;
+				goto backout_unlocked;
+			}
+			hugepage_add_anon_rmap(page, vma, address);
 		}
 	}
 
@@ -2479,6 +2512,13 @@ retry:
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	/*
+	 * For privately mapped hugepage, _mapcount is incremented
+	 * in hugetlb_cow(), so only increment for shared hugepage here.
+	 */
+	if (vma->vm_flags & VM_MAYSHARE)
+		page_dup_rmap(page);
+
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
index 0feeef8..58cd2f9 100644
--- v2.6.34-rc7/mm/rmap.c
+++ v2.6.34-rc7/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	unsigned long address;
 
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		pgoff = page->index << compound_order(page);
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		/* page should be within @vma mapping range */
@@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	pte_t *pte;
 	spinlock_t *ptl;
 
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		ptl = &mm->page_table_lock;
+		goto check;
+	}
+
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
 		return NULL;
@@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	}
 
 	ptl = pte_lockptr(mm, pmd);
+check:
 	spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
@@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
 		page_clear_dirty(page);
 		set_page_dirty(page);
 	}
+	/*
+	 * Mappings for hugepages are not counted in NR_ANON_PAGES nor
+	 * NR_FILE_MAPPED and are not charged by memcg for now.
+	 */
+	if (unlikely(PageHuge(page)))
+		return;
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
 		__dec_zone_page_state(page, NR_ANON_PAGES);
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/7] HWPOISON, hugetlb: enable error handling path for hugepage
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage Naoya Horiguchi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

This patch just enables the error handling path for hugepages.  The actual
containment and recovery operations will be implemented in the following patches.
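
(For reference, the key idea is to operate on the head page of the
compound hugepage; compound_head() in 2.6.34 is roughly:)

  /* include/linux/mm.h: a tail page points back at its head page */
  static inline struct page *compound_head(struct page *page)
  {
          if (unlikely(PageTail(page)))
                  return page->first_page;
          return page;
  }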

Dependency:
  "hugetlb, rmap: add reverse mapping for hugepage."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   39 ++++++++++++++++++++++-----------------
 1 files changed, 22 insertions(+), 17 deletions(-)

diff --git v2.6.34-rc7/mm/memory-failure.c v2.6.34-rc7/mm/memory-failure.c
index 620b0b4..1ec68c8 100644
--- v2.6.34-rc7/mm/memory-failure.c
+++ v2.6.34-rc7/mm/memory-failure.c
@@ -45,6 +45,7 @@
 #include <linux/page-isolation.h>
 #include <linux/suspend.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
 #include "internal.h"
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -837,6 +838,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	int ret;
 	int i;
 	int kill = 1;
+	struct page *hpage = compound_head(p);
 
 	if (PageReserved(p) || PageSlab(p))
 		return SWAP_SUCCESS;
@@ -845,10 +847,10 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * This check implies we don't kill processes if their pages
 	 * are in the swap cache early. Those are always late kills.
 	 */
-	if (!page_mapped(p))
+	if (!page_mapped(hpage))
 		return SWAP_SUCCESS;
 
-	if (PageCompound(p) || PageKsm(p))
+	if (PageKsm(p))
 		return SWAP_FAIL;
 
 	if (PageSwapCache(p)) {
@@ -863,10 +865,11 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * XXX: the dirty test could be racy: set_page_dirty() may not always
 	 * be called inside page lock (it's recommended but not enforced).
 	 */
-	mapping = page_mapping(p);
-	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
-		if (page_mkclean(p)) {
-			SetPageDirty(p);
+	mapping = page_mapping(hpage);
+	if (!PageDirty(hpage) && mapping &&
+	    mapping_cap_writeback_dirty(mapping)) {
+		if (page_mkclean(hpage)) {
+			SetPageDirty(hpage);
 		} else {
 			kill = 0;
 			ttu |= TTU_IGNORE_HWPOISON;
@@ -885,14 +888,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(p, &tokill);
+		collect_procs(hpage, &tokill);
 
 	/*
 	 * try_to_unmap can fail temporarily due to races.
 	 * Try a few times (RED-PEN better strategy?)
 	 */
 	for (i = 0; i < N_UNMAP_TRIES; i++) {
-		ret = try_to_unmap(p, ttu);
+		ret = try_to_unmap(hpage, ttu);
 		if (ret == SWAP_SUCCESS)
 			break;
 		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
@@ -900,7 +903,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 
 	if (ret != SWAP_SUCCESS)
 		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
-				pfn, page_mapcount(p));
+				pfn, page_mapcount(hpage));
 
 	/*
 	 * Now that the dirty bit has been propagated to the
@@ -911,7 +914,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+	kill_procs_ao(&tokill, !!PageDirty(hpage), trapno,
 		      ret != SWAP_SUCCESS, pfn);
 
 	return ret;
@@ -921,6 +924,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 {
 	struct page_state *ps;
 	struct page *p;
+	struct page *hpage;
 	int res;
 
 	if (!sysctl_memory_failure_recovery)
@@ -934,6 +938,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	}
 
 	p = pfn_to_page(pfn);
+	hpage = compound_head(p);
 	if (TestSetPageHWPoison(p)) {
 		printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
 		return 0;
@@ -953,7 +958,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!(flags & MF_COUNT_INCREASED) &&
-		!get_page_unless_zero(compound_head(p))) {
+		!get_page_unless_zero(hpage)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
@@ -971,9 +976,9 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * The check (unnecessarily) ignores LRU pages being isolated and
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
-	if (!PageLRU(p))
+	if (!PageLRU(p) && !PageHuge(p))
 		shake_page(p, 0);
-	if (!PageLRU(p)) {
+	if (!PageLRU(p) && !PageHuge(p)) {
 		/*
 		 * shake_page could have turned it free.
 		 */
@@ -991,7 +996,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * It's very difficult to mess with pages currently under IO
 	 * and in many cases impossible, so we just avoid it here.
 	 */
-	lock_page_nosync(p);
+	lock_page_nosync(hpage);
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
@@ -1004,8 +1009,8 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			atomic_long_dec(&mce_bad_pages);
-		unlock_page(p);
-		put_page(p);
+		unlock_page(hpage);
+		put_page(hpage);
 		return 0;
 	}
 
@@ -1038,7 +1043,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 		}
 	}
 out:
-	unlock_page(p);
+	unlock_page(hpage);
 	return res;
 }
 EXPORT_SYMBOL_GPL(__memory_failure);
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 2/7] HWPOISON, hugetlb: enable error handling path " Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error Naoya Horiguchi
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

To avoid a race condition between concurrent memory errors on the same
hugepage, we atomically test and set the PG_hwpoison bit on the head page.
All pages in the error hugepage are considered hwpoisoned for now, so we
set and clear all PG_hwpoison bits in the hugepage with the page lock of
the head page held.

Dependency:
  "HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 38 insertions(+), 0 deletions(-)

diff --git v2.6.34-rc7/mm/memory-failure.c v2.6.34-rc7/mm/memory-failure.c
index 1ec68c8..fee648b 100644
--- v2.6.34-rc7/mm/memory-failure.c
+++ v2.6.34-rc7/mm/memory-failure.c
@@ -920,6 +920,22 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	return ret;
 }
 
+static void set_page_hwpoison_huge_page(struct page *hpage)
+{
+	int i;
+	int nr_pages = 1 << compound_order(hpage);
+	for (i = 0; i < nr_pages; i++)
+		SetPageHWPoison(hpage + i);
+}
+
+static void clear_page_hwpoison_huge_page(struct page *hpage)
+{
+	int i;
+	int nr_pages = 1 << compound_order(hpage);
+	for (i = 0; i < nr_pages; i++)
+		ClearPageHWPoison(hpage + i);
+}
+
 int __memory_failure(unsigned long pfn, int trapno, int flags)
 {
 	struct page_state *ps;
@@ -1014,6 +1030,26 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
+	/*
+	 * For error on the tail page, we should set PG_hwpoison
+	 * on the head page to show that the hugepage is hwpoisoned
+	 */
+	if (PageTail(p) && TestSetPageHWPoison(hpage)) {
+		action_result(pfn, "hugepage already hardware poisoned",
+				IGNORED);
+		unlock_page(hpage);
+		put_page(hpage);
+		return 0;
+	}
+	/*
+	 * Set PG_hwpoison on all pages in an error hugepage,
+	 * because containment is done in hugepage unit for now.
+	 * Since we have done TestSetPageHWPoison() for the head page with
+	 * page lock held, we can safely set PG_hwpoison bits on tail pages.
+	 */
+	if (PageHuge(p))
+		set_page_hwpoison_huge_page(hpage);
+
 	wait_on_page_writeback(p);
 
 	/*
@@ -1118,6 +1154,8 @@ int unpoison_memory(unsigned long pfn)
 		atomic_long_dec(&mce_bad_pages);
 		freeit = 1;
 	}
+	if (PageHuge(p))
+		clear_page_hwpoison_huge_page(page);
 	unlock_page(page);
 
 	put_page(page);
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2010-05-13  7:55 ` [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage Naoya Horiguchi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

For now all pages in the error hugepage are considered hwpoisoned,
so count all of them in mce_bad_pages.
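
(For example -- page sizes assumed for x86_64, for illustration only:
a 2MB hugepage has compound_order() == 9 and a 1GB page has order 18, so)

          nr_pages = 1 << compound_order(hpage);  /* 2MB: 512, 1GB: 262144 */
          atomic_long_add(nr_pages, &mce_bad_pages);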

Dependency:
  "HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git v2.6.34-rc7/mm/memory-failure.c v2.6.34-rc7/mm/memory-failure.c
index fee648b..473f15a 100644
--- v2.6.34-rc7/mm/memory-failure.c
+++ v2.6.34-rc7/mm/memory-failure.c
@@ -942,6 +942,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	struct page *p;
 	struct page *hpage;
 	int res;
+	unsigned int nr_pages;
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure from trap %d on page %lx", trapno, pfn);
@@ -960,7 +961,8 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
-	atomic_long_add(1, &mce_bad_pages);
+	nr_pages = 1 << compound_order(hpage);
+	atomic_long_add(nr_pages, &mce_bad_pages);
 
 	/*
 	 * We need/can do nothing about count=0 pages.
@@ -1024,7 +1026,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
-			atomic_long_dec(&mce_bad_pages);
+			atomic_long_sub(nr_pages, &mce_bad_pages);
 		unlock_page(hpage);
 		put_page(hpage);
 		return 0;
@@ -1123,6 +1125,7 @@ int unpoison_memory(unsigned long pfn)
 	struct page *page;
 	struct page *p;
 	int freeit = 0;
+	unsigned int nr_pages;
 
 	if (!pfn_valid(pfn))
 		return -ENXIO;
@@ -1135,9 +1138,11 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
+	nr_pages = 1 << compound_order(page);
+
 	if (!get_page_unless_zero(page)) {
 		if (TestClearPageHWPoison(p))
-			atomic_long_dec(&mce_bad_pages);
+			atomic_long_sub(nr_pages, &mce_bad_pages);
 		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
 		return 0;
 	}
@@ -1149,9 +1154,9 @@ int unpoison_memory(unsigned long pfn)
 	 * the PG_hwpoison page will be caught and isolated on the entrance to
 	 * the free buddy page pool.
 	 */
-	if (TestClearPageHWPoison(p)) {
+	if (TestClearPageHWPoison(page)) {
 		pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
-		atomic_long_dec(&mce_bad_pages);
+		atomic_long_sub(nr_pages, &mce_bad_pages);
 		freeit = 1;
 	}
 	if (PageHuge(p))
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2010-05-13  7:55 ` [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code Naoya Horiguchi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

If the error hugepage is not in use, we can fully recover from the error
by dequeuing it from the freelist, so return RECOVERED.
Otherwise, whether we can recover depends on the user processes,
so return DELAYED.

Dependency:
  "HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/hugetlb.h |    2 ++
 mm/hugetlb.c            |   16 ++++++++++++++++
 mm/memory-failure.c     |   28 ++++++++++++++++++++--------
 3 files changed, 38 insertions(+), 8 deletions(-)

diff --git v2.6.34-rc7/include/linux/hugetlb.h v2.6.34-rc7/include/linux/hugetlb.h
index 1d0c2a4..7779ae6 100644
--- v2.6.34-rc7/include/linux/hugetlb.h
+++ v2.6.34-rc7/include/linux/hugetlb.h
@@ -47,6 +47,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+void __isolate_hwpoisoned_huge_page(struct page *page);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -109,6 +110,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
+#define __isolate_hwpoisoned_huge_page(page)	0
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git v2.6.34-rc7/mm/hugetlb.c v2.6.34-rc7/mm/hugetlb.c
index 149eb12..3c1232f 100644
--- v2.6.34-rc7/mm/hugetlb.c
+++ v2.6.34-rc7/mm/hugetlb.c
@@ -2821,3 +2821,19 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	hugetlb_put_quota(inode->i_mapping, (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
 }
+
+/*
+ * This function is called from memory failure code.
+ * Assume the caller holds page lock of the head page.
+ */
+void __isolate_hwpoisoned_huge_page(struct page *hpage)
+{
+	struct hstate *h = page_hstate(hpage);
+	int nid = page_to_nid(hpage);
+
+	spin_lock(&hugetlb_lock);
+	list_del(&hpage->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	spin_unlock(&hugetlb_lock);
+}
diff --git v2.6.34-rc7/mm/memory-failure.c v2.6.34-rc7/mm/memory-failure.c
index 473f15a..d0b420a 100644
--- v2.6.34-rc7/mm/memory-failure.c
+++ v2.6.34-rc7/mm/memory-failure.c
@@ -690,17 +690,29 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
 /*
  * Huge pages. Needs work.
  * Issues:
- * No rmap support so we cannot find the original mapper. In theory could walk
- * all MMs and look for the mappings, but that would be non atomic and racy.
- * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
- * like just walking the current process and hoping it has it mapped (that
- * should be usually true for the common "shared database cache" case)
- * Should handle free huge pages and dequeue them too, but this needs to
- * handle huge page accounting correctly.
+ * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
+ *   To narrow down kill region to one page, we need to break up pmd.
+ * - To support soft-offlining for hugepage, we need to support hugepage
+ *   migration.
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
-	return FAILED;
+	struct page *hpage = compound_head(p);
+	/*
+	 * We can safely recover from error on free or reserved (i.e.
+	 * not in-use) hugepage by dequeuing it from freelist.
+	 * To check whether a hugepage is in-use or not, we can't use
+	 * page->lru because it can be used in other hugepage operations,
+	 * such as __unmap_hugepage_range() and gather_surplus_pages().
+	 * So instead we use page_mapping() and PageAnon().
+	 * We assume that this function is called with page lock held,
+	 * so there is no race between isolation and mapping/unmapping.
+	 */
+	if (!(page_mapping(hpage) || PageAnon(hpage))) {
+		__isolate_hwpoisoned_huge_page(hpage);
+		return RECOVERED;
+	}
+	return DELAYED;
 }
 
 /*
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2010-05-13  7:55 ` [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13  7:55 ` [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage Naoya Horiguchi
  2010-05-13 14:27 ` [PATCH 0/7] HWPOISON for hugepage (v5) Mel Gorman
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

This patch blocks accesses to a hwpoisoned hugepage (the fault paths now
return VM_FAULT_HWPOISON) and skips such a page when unmapping, since the
error handler has already unmapped it.
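
(For reference, the hwpoison swap entry helpers this patch relies on,
from include/linux/swapops.h as of 2.6.34, shown here for context:)

  #ifdef CONFIG_MEMORY_FAILURE
  /* a poisoned pte encodes the pfn with the special type SWP_HWPOISON */
  static inline swp_entry_t make_hwpoison_entry(struct page *page)
  {
          BUG_ON(!PageLocked(page));
          return swp_entry(SWP_HWPOISON, page_to_pfn(page));
  }

  static inline int is_hwpoison_entry(swp_entry_t entry)
  {
          return swp_type(entry) == SWP_HWPOISON;
  }
  #endif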

Dependency:
  "HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hugetlb.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git v2.6.34-rc7/mm/hugetlb.c v2.6.34-rc7/mm/hugetlb.c
index 3c1232f..8bddaf0 100644
--- v2.6.34-rc7/mm/hugetlb.c
+++ v2.6.34-rc7/mm/hugetlb.c
@@ -19,6 +19,8 @@
 #include <linux/sysfs.h>
 #include <linux/slab.h>
 #include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -2138,6 +2140,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_hwpoisoned(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_hwpoison_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			    unsigned long end, struct page *ref_page)
 {
@@ -2196,6 +2211,12 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		if (huge_pte_none(pte))
 			continue;
 
+		/*
+		 * A hwpoisoned hugepage has already been unmapped and its refcount dropped
+		 */
+		if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
+			continue;
+
 		page = pte_page(pte);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
@@ -2488,6 +2509,18 @@ retry:
 	}
 
 	/*
+	 * Since the memory error handler replaces the pte with a hwpoison swap
+	 * entry at error handling time, a process which has reserved but not
+	 * yet mapped the error hugepage will not have a hwpoison swap entry.
+	 * So we need to block accesses from such a process by checking the
+	 * PG_hwpoison bit here.
+	 */
+	if (unlikely(PageHWPoison(page))) {
+		ret = VM_FAULT_HWPOISON;
+		goto backout_unlocked;
+	}
+
+	/*
 	 * If we are going to COW a private mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that
 	 * any allocations necessary to record that reservation occur outside
@@ -2547,6 +2580,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
+	ptep = huge_pte_offset(mm, address);
+	if (ptep) {
+		entry = huge_ptep_get(ptep);
+		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+			return VM_FAULT_HWPOISON;
+	}
+
 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2010-05-13  7:55 ` [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code Naoya Horiguchi
@ 2010-05-13  7:55 ` Naoya Horiguchi
  2010-05-13 14:27 ` [PATCH 0/7] HWPOISON for hugepage (v5) Mel Gorman
  7 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-13  7:55 UTC (permalink / raw)
  To: n-horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

This patch enables hwpoison injection through the debug/hwpoison interfaces,
with which we can test memory error handling for free or reserved
hugepages (which cannot be tested by the madvise() injector).
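
(Usage note: with this applied, an error can be injected into a free or
reserved hugepage by writing its pfn to the debugfs file
hwpoison/corrupt-pfn, and cleared via hwpoison/unpoison-pfn, assuming
debugfs is mounted at the usual /sys/kernel/debug location.)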

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hwpoison-inject.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git v2.6.34-rc7/mm/hwpoison-inject.c v2.6.34-rc7/mm/hwpoison-inject.c
index 10ea719..0948f10 100644
--- v2.6.34-rc7/mm/hwpoison-inject.c
+++ v2.6.34-rc7/mm/hwpoison-inject.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/hugetlb.h>
 #include "internal.h"
 
 static struct dentry *hwpoison_dir;
@@ -13,6 +14,7 @@ static int hwpoison_inject(void *data, u64 val)
 {
 	unsigned long pfn = val;
 	struct page *p;
+	struct page *hpage;
 	int err;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -24,18 +26,19 @@ static int hwpoison_inject(void *data, u64 val)
 		return -ENXIO;
 
 	p = pfn_to_page(pfn);
+	hpage = compound_head(p);
 	/*
 	 * This implies unable to support free buddy pages.
 	 */
-	if (!get_page_unless_zero(p))
+	if (!get_page_unless_zero(hpage))
 		return 0;
 
-	if (!PageLRU(p))
+	if (!PageLRU(p) && !PageHuge(p))
 		shake_page(p, 0);
 	/*
 	 * This implies unable to support non-LRU pages.
 	 */
-	if (!PageLRU(p))
+	if (!PageLRU(p) && !PageHuge(p))
 		return 0;
 
 	/*
@@ -44,9 +47,9 @@ static int hwpoison_inject(void *data, u64 val)
 	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * __memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(p);
-	err = hwpoison_filter(p);
-	unlock_page(p);
+	lock_page(hpage);
+	err = hwpoison_filter(hpage);
+	unlock_page(hpage);
 	if (err)
 		return 0;
 
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
@ 2010-05-13  9:18   ` Andi Kleen
  2010-05-17  4:53     ` Naoya Horiguchi
  2010-05-13 15:27   ` Mel Gorman
  1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2010-05-13  9:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Andrew Morton, Wu Fengguang, Mel Gorman,
	aarcange, lwoodman, Lee.Schermerhorn

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

Adding a few more recent hugetlb hackers in cc. Folks, please consider
reviewing the hugetlb.c parts of the original patch kit in linux-mm.

> While hugepages are not currently swappable, reverse mapping can be useful
> for the memory error handler.
> Using rmap, the memory error handler can collect the processes affected
> by a hugepage error and unmap it from them to contain the error's effect.

Thanks.

I reviewed all the patches and they look good to me. I can merge
them through the hwpoison git tree.

But before merging it there I would like to have some review
and acks from mm hackers on the mm/hugetlb.c parts, which
make (relatively minor) changes outside memory-failure.c.

I think you also had a patch for mce-test, can you send me that
one too?

BTW I wonder: did you verify that the 1GB page support works?
I would expect it does, but it would be good to double check.
One would need a Westmere server or AMD Family10h+ system to test that.

> Current status of hugepage rmap differs depending on mapping mode:
> - for shared hugepage:
>   we can collect processes using a hugepage through pagecache,
>   but can not unmap the hugepage because of the lack of mapcount.
> - for privately mapped hugepage:
>   we can neither collect processes nor unmap the hugepage.

I hope these points can be eventually addressed too, but this
is a good first step and closes an important hole in hwpoison
coverage.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] HWPOISON for hugepage (v5)
  2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2010-05-13  7:55 ` [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage Naoya Horiguchi
@ 2010-05-13 14:27 ` Mel Gorman
  2010-05-14  7:35   ` Naoya Horiguchi
  7 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2010-05-13 14:27 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: linux-kernel, linux-mm

On Thu, May 13, 2010 at 04:55:19PM +0900, Naoya Horiguchi wrote:
> This patchset enables memory error handling for hugepages by containing the
> error within the affected hugepage.
> 
> Until now, a memory error (classified as SRAO in MCA terminology) on a hugepage

What does SRAO stand for? It doesn't matter, I'm just curious.

> was simply ignored, which means that if someone accesses the error page later,
> a second MCE (more severe than the first) occurs and the system panics.
> 
> It's useful for some aggressive hugepage users if only the affected processes
> are killed.  Then other unrelated processes aren't disturbed by the error
> and can continue operating.
> 

Surely, it's useful for any user of huge pages?

> Moreover, for other extensive hugetlb users which maintain their own
> "pagecache" on hugepages, the most valuable feature would be the ability
> to receive the early-kill signal BUS_MCEERR_AO, because such cache pages
> can often be dropped on BUS_MCEERR_AO without side effects.
> 

Be careful here. The page cache that hugetlb uses is for MAP_SHARED
mappings. If the pages are discarded, they are gone and the result is data
loss. I think what you are suggesting is that a kill signal can instead be
translated into a harmless loss of page cache. That works for normal files
but not hugetlb.

> The design of hugepage error handling is based on that of non-hugepage
> error handling, where we:
>  1. mark the error page as hwpoison,
>  2. unmap the hwpoisoned page from processes using it,
>  3. invalidate the error page, and
>  4. block later accesses to the hwpoisoned pages.
> 
> Similarities and differences between the huge and non-huge cases are
> summarized below:
> 
>  1. (Difference) when an error occurs on a hugepage, PG_hwpoison bits on all
>     pages in the hugepage are set, because we have no simple way to break up
>     a hugepage into individual pages for now. This means there is some risk
>     of being killed by touching non-guilty pages within the error hugepage.
> 

You're right in that you cannot easily demote a hugepage. It is possible but
I cannot see the value of the required effort. If there is an error within
the hugepage and touching another part of it results in the process being
unnecessarily killed, then so be it.

>  2. (Similarity) the hugetlb entry for the error hugepage is replaced by a
>     hwpoison swap entry, with which we can detect hwpoisoned memory in VM code.
>     This is accomplished by adding rmapping code for hugepage, which enables
>     the use of try_to_unmap() on hugepages.
> 

This will be interesting. hugetlbfs pages could look like file or anon
depending on whether they have been mapped shared or private. It's odd.

>  3. (Difference) since a hugepage is not linked to the LRU list and is
>     unswappable, there is not much to do for page invalidation (only dequeuing
>     a free/reserved hugepage from the freelist; see patch 5/7).
>     If we want to contain the error into one page, there may be more to do.
> 
>  4. (Similarity) we block later accesses by forcing page requests for the
>     hwpoisoned hugepage to fail, as done in the non-hugepage case in do_wp_page().
> 
> ToDo:
> - Narrow down the containment region into one raw page.
> - Soft-offlining for hugepage is not supported due to the lack of migration
>   for hugepage.
> - Counting file-mapped/anonymous hugepage in NR_FILE_MAPPED/NR_ANON_PAGES.
> 
>  [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
>  [PATCH 2/7] HWPOISON, hugetlb: enable error handling path for hugepage
>  [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
>  [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
>  [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage
>  [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code
>  [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage
> 
> Dependency:
> - patch 2 depends on patch 1.
> - patch 3 to patch 6 depend on patch 2.
> 
>  include/linux/hugetlb.h |    3 +
>  mm/hugetlb.c            |   98 ++++++++++++++++++++++++++++++++++++++-
>  mm/hwpoison-inject.c    |   15 ++++--
>  mm/memory-failure.c     |  120 +++++++++++++++++++++++++++++++++++------------
>  mm/rmap.c               |   16 ++++++
>  5 files changed, 215 insertions(+), 37 deletions(-)
> 
> ChangeLog from v4:
> - rebased to 2.6.34-rc7
> - add isolation code for free/reserved hugepage in me_huge_page()
> - set/clear PG_hwpoison bits of all pages in hugepage.
> - mce_bad_pages counts all pages in hugepage.
> - rename __hugepage_set_anon_rmap() to hugepage_add_anon_rmap()
> - add huge_pte_offset() dummy function in header file on !CONFIG_HUGETLBFS
> 
> ChangeLog from v3:
> - rebased to 2.6.34-rc5
> - support for privately mapped hugepage
> 
> ChangeLog from v2:
> - rebase to 2.6.34-rc3
> - consider mapcount of hugepage
> - rename pointer "head" into "hpage"
> 
> ChangeLog from v1:
> - rebase to 2.6.34-rc1
> - add comment from Wu Fengguang
> 
> Thanks,
> Naoya Horiguchi
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
  2010-05-13  9:18   ` Andi Kleen
@ 2010-05-13 15:27   ` Mel Gorman
  2010-05-13 16:14     ` Andi Kleen
  2010-05-14  7:46     ` Naoya Horiguchi
  1 sibling, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2010-05-13 15:27 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

On Thu, May 13, 2010 at 04:55:20PM +0900, Naoya Horiguchi wrote:
> While hugepages are not currently swappable, reverse mapping can be useful
> for the memory error handler.
> Using rmap, the memory error handler can collect the processes affected
> by a hugepage error and unmap it from them to contain the error's effect.
> 

As a verification point, can you ensure that the libhugetlbfs "make
func" tests complete successfully with this patch applied? It's also
important that there is no oddness in the Hugepage-related counters in
/proc/meminfo. I'm not in the position to test it now unfortunately as
I'm on the road.

> Current status of hugepage rmap differs depending on mapping mode:
> - for shared hugepage:
>   we can collect processes using a hugepage through pagecache,
>   but can not unmap the hugepage because of the lack of mapcount.
> - for privately mapped hugepage:
>   we can neither collect processes nor unmap the hugepage.
> 
> To realize hugepage rmapping, this patch introduces mapcount for
> shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
> 
> This patch can replace the following bug fix.
> 

Actually, you replace chunks but not all of that fix with this patch.
After this patch HUGETLB_POISON is never assigned but the definition still
exists in poison.h. You should also remove it if it is unnecessary.

>   commit 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703
>   Author: Mel Gorman <mel@csn.ul.ie>
>   Date:   Fri Apr 23 13:17:56 2010 -0400
>   Subject: hugetlb: fix infinite loop in get_futex_key() when backed by huge pages
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/hugetlb.h |    1 +
>  mm/hugetlb.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
>  mm/rmap.c               |   16 ++++++++++++++++
>  3 files changed, 58 insertions(+), 1 deletions(-)
> 
> diff --git v2.6.34-rc7/include/linux/hugetlb.h v2.6.34-rc7/include/linux/hugetlb.h
> index 78b4bc6..1d0c2a4 100644
> --- v2.6.34-rc7/include/linux/hugetlb.h
> +++ v2.6.34-rc7/include/linux/hugetlb.h
> @@ -108,6 +108,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
>  #define is_hugepage_only_range(mm, addr, len)	0
>  #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
>  #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
> +#define huge_pte_offset(mm, address)	0
>  
>  #define hugetlb_change_protection(vma, address, end, newprot)
>  
> diff --git v2.6.34-rc7/mm/hugetlb.c v2.6.34-rc7/mm/hugetlb.c
> index ffbdfc8..149eb12 100644
> --- v2.6.34-rc7/mm/hugetlb.c
> +++ v2.6.34-rc7/mm/hugetlb.c
> @@ -18,6 +18,7 @@
>  #include <linux/bootmem.h>
>  #include <linux/sysfs.h>
>  #include <linux/slab.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -2125,6 +2126,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			entry = huge_ptep_get(src_pte);
>  			ptepage = pte_page(entry);
>  			get_page(ptepage);
> +			page_dup_rmap(ptepage);
>  			set_huge_pte_at(dst, addr, dst_pte, entry);
>  		}
>  		spin_unlock(&src->page_table_lock);
> @@ -2203,6 +2205,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  	flush_tlb_range(vma, start, end);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
>  	list_for_each_entry_safe(page, tmp, &page_list, lru) {
> +		page_remove_rmap(page);
>  		list_del(&page->lru);
>  		put_page(page);
>  	}
> @@ -2268,6 +2271,26 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return 1;
>  }
>  
> +/*
> + * This is a counterpart of page_add_anon_rmap() for hugepage.
> + */
> +static void hugepage_add_anon_rmap(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address)

So hugepage anon rmap is for MAP_PRIVATE mappings.

> +{
> +	struct anon_vma *anon_vma = vma->anon_vma;
> +	int first;
> +
> +	BUG_ON(!anon_vma);
> +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> +	first = atomic_inc_and_test(&page->_mapcount);
> +	if (first) {
> +		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> +		page->mapping = (struct address_space *) anon_vma;
> +		page->index = linear_page_index(vma, address)
> +			>> compound_order(page);

What was wrong with vma_hugecache_offset()? You can look up the necessary
hstate with hstate_vma(). Even if they are functionally similar, the
use of hstate would better match how other parts of hugetlbfs handle
multiple page sizes.
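
(For reference, vma_hugecache_offset() as of 2.6.34-rc7, which computes
the offset in huge-page-sized units:)

  /* mm/hugetlb.c */
  static pgoff_t vma_hugecache_offset(struct hstate *h,
                          struct vm_area_struct *vma, unsigned long address)
  {
          return ((address - vma->vm_start) >> huge_page_shift(h)) +
                          (vma->vm_pgoff >> huge_page_order(h));
  }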

> +	}
> +}

Ok, so this is against 2.6.34-rc7, right? For ordinary anon_vmas, there
is a chain of related vmas linked together via the anon_vmas. That is so
that, in the event of an unmapping, all the PTEs related to the page can
be found. Where are we doing the same here?
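
(For context, the structure that provides this chaining for ordinary
anon pages in 2.6.34, quoted here for reference:)

  /* include/linux/rmap.h */
  struct anon_vma_chain {
          struct vm_area_struct *vma;
          struct anon_vma *anon_vma;
          struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
          struct list_head same_anon_vma; /* locked by anon_vma->lock */
  };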

I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
from one process but if there are multiple processes, the second process could
still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
if it is, I guess it's still an improvement over what currently happens.

> +
>  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, pte_t *ptep, pte_t pte,
>  			struct page *pagecache_page)
> @@ -2348,6 +2371,12 @@ retry_avoidcopy:
>  		huge_ptep_clear_flush(vma, address, ptep);
>  		set_huge_pte_at(mm, address, ptep,
>  				make_huge_pte(vma, new_page, 1));
> +		page_remove_rmap(old_page);
> +		/*
> +		 * We need not call anon_vma_prepare() because anon_vma
> +		 * is already prepared when the process fork()ed.
> +		 */
> +		hugepage_add_anon_rmap(new_page, vma, address);

This means that the anon_vma is shared between parent and child even
after fork. Does this not mean that the behaviour of anon_vma differs
between the core VM and hugetlb?

>  		/* Make the old page be freed below */
>  		new_page = old_page;
>  	}
> @@ -2450,7 +2479,11 @@ retry:
>  			spin_unlock(&inode->i_lock);
>  		} else {
>  			lock_page(page);
> -			page->mapping = HUGETLB_POISON;
> +			if (unlikely(anon_vma_prepare(vma))) {
> +				ret = VM_FAULT_OOM;
> +				goto backout_unlocked;
> +			}
> +			hugepage_add_anon_rmap(page, vma, address);

Seems ok for private pages at least.

>  		}
>  	}
>  
> @@ -2479,6 +2512,13 @@ retry:
>  				&& (vma->vm_flags & VM_SHARED)));
>  	set_huge_pte_at(mm, address, ptep, new_pte);
>  
> +	/*
> +	 * For privately mapped hugepage, _mapcount is incremented
> +	 * in hugetlb_cow(), so only increment for shared hugepage here.
> +	 */
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		page_dup_rmap(page);
> +

What happens when try_to_unmap_file is called on a hugetlb page?

>  	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>  		/* Optimization, do the COW without a second fault */
>  		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
> index 0feeef8..58cd2f9 100644
> --- v2.6.34-rc7/mm/rmap.c
> +++ v2.6.34-rc7/mm/rmap.c
> @@ -56,6 +56,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>  	unsigned long address;
>  
> +	if (unlikely(is_vm_hugetlb_page(vma)))
> +		pgoff = page->index << compound_order(page);

Again, it would be nice to use hstate information if possible, just so
that the way the page size is discovered is consistent.

>  	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
>  		/* page should be within @vma mapping range */
> @@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> +	if (unlikely(PageHuge(page))) {
> +		pte = huge_pte_offset(mm, address);
> +		ptl = &mm->page_table_lock;
> +		goto check;
> +	}
> +
>  	pgd = pgd_offset(mm, address);
>  	if (!pgd_present(*pgd))
>  		return NULL;
> @@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>  	}
>  
>  	ptl = pte_lockptr(mm, pmd);
> +check:
>  	spin_lock(ptl);
>  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
>  		*ptlp = ptl;
> @@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
>  		page_clear_dirty(page);
>  		set_page_dirty(page);
>  	}
> +	/*
> +	 * Mappings for hugepages are not counted in NR_ANON_PAGES nor
> +	 * NR_FILE_MAPPED and are not charged by memcg for now.
> +	 */
> +	if (unlikely(PageHuge(page)))
> +		return;
>  	if (PageAnon(page)) {
>  		mem_cgroup_uncharge_page(page);
>  		__dec_zone_page_state(page, NR_ANON_PAGES);

I don't see anything obviously wrong with this but it's a bit rushed and
there are a few snarls that I pointed out above. I'd like to hear that it
passed the libhugetlbfs regression tests for different sizes without any
oddness in the counters.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13 15:27   ` Mel Gorman
@ 2010-05-13 16:14     ` Andi Kleen
  2010-05-14  7:46     ` Naoya Horiguchi
  1 sibling, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2010-05-13 16:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Naoya Horiguchi, linux-kernel, linux-mm, Andi Kleen,
	Andrew Morton, Wu Fengguang

> I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
> from one process but if there are multiple processes, the second process could
> still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
> if it is, I guess it's still an improvement over what currently happens.

The only real requirement is that all PTEs pointing to that page 
get replaced by poisoned PTEs.

(that's essentially always "late kill" mode)
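
(For context, this is how try_to_unmap_one() installs poisoned ptes for
small pages today; roughly as in 2.6.34, trimmed:)

          if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
                  if (PageAnon(page))
                          dec_mm_counter(mm, MM_ANONPAGES);
                  else
                          dec_mm_counter(mm, MM_FILEPAGES);
                  set_pte_at(mm, address, pte,
                             swp_entry_to_pte(make_hwpoison_entry(page)));
          }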

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] HWPOISON for hugepage (v5)
  2010-05-13 14:27 ` [PATCH 0/7] HWPOISON for hugepage (v5) Mel Gorman
@ 2010-05-14  7:35   ` Naoya Horiguchi
  0 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-14  7:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, Andi Kleen, Wu Fengguang

(Add Cc: Andi and Fengguang)

On Thu, May 13, 2010 at 03:27:50PM +0100, Mel Gorman wrote:
> On Thu, May 13, 2010 at 04:55:19PM +0900, Naoya Horiguchi wrote:
> > This patchset enables error handling for hugepage by containing error
> > in the affected hugepage.
> > 
> > Until now, memory error (classified as SRAO in MCA language) on hugepage
> 
> What does SRAO stand for? It doesn't matter, I'm just curious.

SRAO stands for "Software Recoverable Action Optional": the poisoned
data has been detected but not yet consumed, so software may contain
the error at its leisure, after which it becomes harmless.  (Its
counterpart SRAR, "Software Recoverable Action Required", is signaled
when the corrupted data is about to be consumed.)

> > was simply ignored, which means if someone accesses the error page later,
> > the second MCE (severer than the first one) occurs and the system panics.
> > 
> > It's useful for some aggressive hugepage users if only affected processes
> > are killed.  Then other unrelated processes aren't disturbed by the error
> > and can continue operation.
> > 
> 
> Surely, it's useful for any user of huge pages?

Yes.

> > Moreover, for other extensive hugetlb users which have own "pagecache"
> > on hugepage, the most valued feature would be being able to receive
> > the early kill signal BUS_MCEERR_AO, because the cache pages have
> > good opportunity to be dropped without side effects on BUS_MCEERR_AO.
> > 
> 
> Be careful here. The page cache that hugetlb uses is for MAP_SHARED
> mappings. If the pages are discarded, they are gone and the result is data
> loss. I think what you are suggesting is that a kill signal can instead be
> translated into a harmless loss of page cache. That works for normal files
> but not hugetlb.

"Pagecache" I meant here is not the page cache in Linux kernel,
but a cache managed by an application, e.g. the application reads/writes
the cache contents with direct I/O and manages clean/dirty status itself.
If HWPOISON-aware application catches signal BUS_MCEERR_AO, it can discard
hugepage used as a cache and can reread from the file.
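A minimal userspace sketch of what I mean (app_drop_and_refill() is a
hypothetical application hook, not part of any posted code):

#include <signal.h>
#include <stdlib.h>

#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5		/* not exposed by every libc */
#endif

/* Hypothetical: drop the cache hugepage covering addr and mark it
 * to be reread from the backing file. */
static void app_drop_and_refill(void *addr)
{
	(void)addr;
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AO) {
		/* "Action optional": the poisoned data has not been
		 * consumed yet, so the cache entry can be dropped
		 * harmlessly and refilled later. */
		app_drop_and_refill(si->si_addr);
		return;
	}
	abort();	/* BUS_MCEERR_AR etc.: no recovery here */
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);
	/* ... map hugepages and run the application cache ... */
	return 0;
}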

Thanks,
Naoya Horiguchi

> > The design of hugepage error handling is based on that of non-hugepage
> > error handling, where we:
> >  1. mark the error page as hwpoison,
> >  2. unmap the hwpoisoned page from processes using it,
> >  3. invalidate error page, and
> >  4. block later accesses to the hwpoisoned pages.
> > 
> > Similarities and differences between huge and non-huge case are
> > summarized below:
> > 
> >  1. (Difference) when error occurs on a hugepage, PG_hwpoison bits on all pages
> >     in the hugepage are set, because we have no simple way to break up
> >     hugepage into individual pages for now. This means there is a some
> >     risk to be killed by touching non-guilty pages within the error hugepage.
> > 
> 
> You're right in that you cannot easily demote a hugepage. It is possible but
> I cannot see the value of the required effort. If there is an error within
> the hugepage and touching another part of it results in the process being
> unnecessarily killed, then so be it.
> 
> >  2. (Similarity) hugetlb entry for the error hugepage is replaced by hwpoison
> >     swap entry, with which we can detect hwpoisoned memory in VM code.
> >     This is accomplished by adding rmapping code for hugepage, which enables
> >     to use try_to_unmap() for hugepage.
> > 
> 
> This will be interesting. hugetlbfs pages could look like a file or anon
> depending on whether it has been mapped shared or private. It's odd.
> 
> >  3. (Difference) since hugepage is not linked to LRU list and is unswappable,
> >     there are not many things to do for page invalidation (only dequeuing
> >     free/reserved hugepage from freelist. See patch 5/7.)
> >     If we want to contain the error into one page, there may be more to do.
> > 
> >  4. (Similarity) we block later accesses by forcing page requests for
> >     hwpoisoned hugepage to fail as done in non-hugepage case in do_wp_page().
> > 
> > ToDo:
> > - Narrow down the containment region into one raw page.
> > - Soft-offlining for hugepage is not supported due to the lack of migration
> >   for hugepage.
> > - Counting file-mapped/anonymous hugepage in NR_FILE_MAPPED/NR_ANON_PAGES.
> > 
> >  [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
> >  [PATCH 2/7] HWPOISON, hugetlb: enable error handling path for hugepage
> >  [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
> >  [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
> >  [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage
> >  [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code
> >  [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage
> > 
> > Dependency:
> > - patch 2 depends on patch 1.
> > - patch 3 to patch 6 depend on patch 2.
> > 
> >  include/linux/hugetlb.h |    3 +
> >  mm/hugetlb.c            |   98 ++++++++++++++++++++++++++++++++++++++-
> >  mm/hwpoison-inject.c    |   15 ++++--
> >  mm/memory-failure.c     |  120 +++++++++++++++++++++++++++++++++++------------
> >  mm/rmap.c               |   16 ++++++
> >  5 files changed, 215 insertions(+), 37 deletions(-)
> > 
> > ChangeLog from v4:
> > - rebased to 2.6.34-rc7
> > - add isolation code for free/reserved hugepage in me_huge_page()
> > - set/clear PG_hwpoison bits of all pages in hugepage.
> > - mce_bad_pages counts all pages in hugepage.
> > - rename __hugepage_set_anon_rmap() to hugepage_add_anon_rmap()
> > - add huge_pte_offset() dummy function in header file on !CONFIG_HUGETLBFS
> > 
> > ChangeLog from v3:
> > - rebased to 2.6.34-rc5
> > - support for privately mapped hugepage
> > 
> > ChangeLog from v2:
> > - rebase to 2.6.34-rc3
> > - consider mapcount of hugepage
> > - rename pointer "head" into "hpage"
> > 
> > ChangeLog from v1:
> > - rebase to 2.6.34-rc1
> > - add comment from Wu Fengguang
> > 
> > Thanks,
> > Naoya Horiguchi
> > 
> > 
> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13 15:27   ` Mel Gorman
  2010-05-13 16:14     ` Andi Kleen
@ 2010-05-14  7:46     ` Naoya Horiguchi
  2010-05-14  9:54       ` Mel Gorman
  1 sibling, 1 reply; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-14  7:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

On Thu, May 13, 2010 at 04:27:37PM +0100, Mel Gorman wrote:
> On Thu, May 13, 2010 at 04:55:20PM +0900, Naoya Horiguchi wrote:
> > While hugepage is not currently swappable, rmapping can be useful
> > for memory error handler.
> > Using rmap, memory error handler can collect processes affected
> > by hugepage errors and unmap them to contain error's effect.
> > 
> 
> As a verification point, can you ensure that the libhugetlbfs "make
> func" tests complete successfully with this patch applied? It's also
> important that there is no oddness in the Hugepage-related counters in
> /proc/meminfo. I'm not in the position to test it now unfortunately as
> I'm on the road.

Yes. Thanks for the good test-set.

Hmm. I failed the libhugetlbfs tests with an oops in the "private mapped" test :(

dm120ei2 login: [  693.471581] ------------[ cut here ]------------
[  693.472130] kernel BUG at mm/hugetlb.c:2305!
[  693.472130] invalid opcode: 0000 [#1] SMP 
[  693.472130] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:12:00.0/local_cpus
[  693.472130] CPU 2 
[  693.472130] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc autofs4 lockd sunrpc bonding ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath scsi_dh video output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport kvm_intel kvm e1000e option usbserial sr_mod cdrom sg ioatdma i2c_i801 shpchp i2c_core serio_raw button dca rtc_cmos rtc_core rtc_lib pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ata_piix ahci libata sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
[  693.472130] 
[  693.472130] Pid: 4896, comm: private Not tainted 2.6.34-rc7-hwpoison-hugetlb #702 ******************
[  693.472130] RIP: 0010:[<ffffffff810f7812>]  [<ffffffff810f7812>] hugepage_add_anon_rmap+0x12/0x6f
[  693.472130] RSP: 0000:ffff8801d92c1b98  EFLAGS: 00010246
[  693.472130] RAX: ffff8801d92c1fd8 RBX: ffff8801d9a76de8 RCX: 0000000000000000
[  693.472130] RDX: 00000000f7a00000 RSI: ffff8801d98ee840 RDI: ffffea00069a1000
[  693.472130] RBP: ffff8801d92c1b98 R08: ffff880000000000 R09: 00003ffffffff000
[  693.472130] R10: 0000000000000246 R11: 0000000000000000 R12: ffffea000699a000
[  693.472130] R13: 00000000069a8000 R14: ffffea000699a000 R15: 0000000000000201
[  693.472130] FS:  0000000000000000(0000) GS:ffff880002a00000(0063) knlGS:00000000f77bb6c0
[  693.472130] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[  693.472130] CR2: 00000000f7a00000 CR3: 00000001d9208000 CR4: 00000000000006e0
[  693.472130] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  693.472130] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  693.472130] Process private (pid: 4896, threadinfo ffff8801d92c0000, task ffff8801e243a390)
[  693.472130] Stack:
[  693.472130]  ffff8801d92c1cb8 ffffffff810f9561 ffff8801e243a390 000000000000013f
[  693.472130] <0> ffff8801d92c1c28 ffffffff8107851b ffff8801d92c1be8 ffffea000699a000
[  693.472130] <0> 80000001e2c000a5 ffff8801d9a76de8 00000000f7a00000 ffff8801d98ee840
[  693.472130] Call Trace:
[  693.472130]  [<ffffffff810f9561>] hugetlb_cow+0x645/0x677
[  693.472130]  [<ffffffff8107851b>] ? __lock_acquire+0x7b1/0x808
[  693.472130]  [<ffffffff810f9b90>] ? hugetlb_fault+0x5fd/0x6ac
[  693.472130]  [<ffffffff810f9bbd>] hugetlb_fault+0x62a/0x6ac
[  693.472130]  [<ffffffff810746a9>] ? trace_hardirqs_off+0xd/0xf
[  693.472130]  [<ffffffff8106a0fe>] ? cpu_clock+0x41/0x5b
[  693.472130]  [<ffffffff8107461d>] ? trace_hardirqs_off_caller+0x1f/0x9e
[  693.472130]  [<ffffffff810e5b28>] handle_mm_fault+0x61/0x8b4
[  693.472130]  [<ffffffff81392b12>] ? do_page_fault+0x1ef/0x3da
[  693.472130]  [<ffffffff81392c02>] do_page_fault+0x2df/0x3da
[  693.472130]  [<ffffffff8106a0fe>] ? cpu_clock+0x41/0x5b
[  693.472130]  [<ffffffff81073eff>] ? lock_release_holdtime+0xa4/0xa9
[  693.472130]  [<ffffffff8138ea3e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  693.472130]  [<ffffffff8138fdc6>] ? error_sti+0x5/0x6
[  693.472130]  [<ffffffff8138ea3e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  693.472130]  [<ffffffff8138fbb5>] page_fault+0x25/0x30
[  693.472130] Code: 00 00 00 00 41 c7 40 08 01 00 00 00 8b 77 08 4c 89 c7 e8 7a aa fd ff c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8b 4e 78 48 85 c9 75 04 <0f> 0b eb fe 48 3b 56 08 72 06 48 3b 56 10 72 04 0f 0b eb fe f0 
[  693.472130] RIP  [<ffffffff810f7812>] hugepage_add_anon_rmap+0x12/0x6f
[  693.472130]  RSP <ffff8801d92c1b98>
[  693.869837] ---[ end trace bd996c35d4583bca ]---

Someone seems to call hugetlb_fault() with anon_vma == NULL.
I'm investigating it in more detail.

> > Current status of hugepage rmap differs depending on mapping mode:
> > - for shared hugepage:
> >   we can collect processes using a hugepage through pagecache,
> >   but can not unmap the hugepage because of the lack of mapcount.
> > - for privately mapped hugepage:
> >   we can neither collect processes nor unmap the hugepage.
> > 
> > To realize hugepage rmapping, this patch introduces mapcount for
> > shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
> > 
> > This patch can be the replacement of the following bug fix.
> > 
> 
> Actually, you replace chunks but not all of that fix with this patch.
> After this patch HUGETLB_POISON is never assigned but the definition still
> exists in poison.h. You should also remove it if it is unnecessary.

OK. I'll remove HUGETLB_POISON in the next post.

> >   commit 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703
> >   Author: Mel Gorman <mel@csn.ul.ie>
> >   Date:   Fri Apr 23 13:17:56 2010 -0400
> >   Subject: hugetlb: fix infinite loop in get_futex_key() when backed by huge pages
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Andi Kleen <andi@firstfloor.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Wu Fengguang <fengguang.wu@intel.com>
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/hugetlb.h |    1 +
> >  mm/hugetlb.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
> >  mm/rmap.c               |   16 ++++++++++++++++
> >  3 files changed, 58 insertions(+), 1 deletions(-)
> > 
> > diff --git v2.6.34-rc7/include/linux/hugetlb.h v2.6.34-rc7/include/linux/hugetlb.h
> > index 78b4bc6..1d0c2a4 100644
> > --- v2.6.34-rc7/include/linux/hugetlb.h
> > +++ v2.6.34-rc7/include/linux/hugetlb.h
> > @@ -108,6 +108,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
> >  #define is_hugepage_only_range(mm, addr, len)	0
> >  #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
> >  #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
> > +#define huge_pte_offset(mm, address)	0
> >  
> >  #define hugetlb_change_protection(vma, address, end, newprot)
> >  
> > diff --git v2.6.34-rc7/mm/hugetlb.c v2.6.34-rc7/mm/hugetlb.c
> > index ffbdfc8..149eb12 100644
> > --- v2.6.34-rc7/mm/hugetlb.c
> > +++ v2.6.34-rc7/mm/hugetlb.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/bootmem.h>
> >  #include <linux/sysfs.h>
> >  #include <linux/slab.h>
> > +#include <linux/rmap.h>
> >  
> >  #include <asm/page.h>
> >  #include <asm/pgtable.h>
> > @@ -2125,6 +2126,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  			entry = huge_ptep_get(src_pte);
> >  			ptepage = pte_page(entry);
> >  			get_page(ptepage);
> > +			page_dup_rmap(ptepage);
> >  			set_huge_pte_at(dst, addr, dst_pte, entry);
> >  		}
> >  		spin_unlock(&src->page_table_lock);
> > @@ -2203,6 +2205,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> >  	flush_tlb_range(vma, start, end);
> >  	mmu_notifier_invalidate_range_end(mm, start, end);
> >  	list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > +		page_remove_rmap(page);
> >  		list_del(&page->lru);
> >  		put_page(page);
> >  	}
> > @@ -2268,6 +2271,26 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	return 1;
> >  }
> >  
> > +/*
> > + * This is a counterpart of page_add_anon_rmap() for hugepage.
> > + */
> > +static void hugepage_add_anon_rmap(struct page *page,
> > +			struct vm_area_struct *vma, unsigned long address)
> 
> So hugepage anon rmap is MAP_PRIVATE mappings.

Yes.

> > +{
> > +	struct anon_vma *anon_vma = vma->anon_vma;
> > +	int first;
> > +
> > +	BUG_ON(!anon_vma);
> > +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> > +	first = atomic_inc_and_test(&page->_mapcount);
> > +	if (first) {
> > +		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> > +		page->mapping = (struct address_space *) anon_vma;
> > +		page->index = linear_page_index(vma, address)
> > +			>> compound_order(page);
> 
> What was wrong with vma_hugecache_offset()? You can lookup the necessary
> hstate with hstate_vma(). Even if they are similar functionally, the
> use of hstate would match better how other parts of hugetlbfs handle
> multiple page sizes.

I understand.
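
(For reference, vma_hugecache_offset() in mm/hugetlb.c derives the
offset from the vma's hstate:

	static pgoff_t vma_hugecache_offset(struct hstate *h,
			struct vm_area_struct *vma, unsigned long address)
	{
		return ((address - vma->vm_start) >> huge_page_shift(h)) +
				(vma->vm_pgoff >> huge_page_order(h));
	}

The v2 patch below switches to this, wrapped as linear_hugepage_index().)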

> > +	}
> > +}
> 
> Ok, so this is against 2.6.34-rc7, right?

Yes.

> For ordinary anon_vma's, there
> is a chain of related vma's chained together via the anon_vma's. It's so
> in the event of an unmapping, all the PTEs related to the page can be
> found. Where are we doing the same here?

Finding all processes using a hugepage is done by try_to_unmap() as usual.
Among callers of this function, only memory error handler calls it for
hugepage for now.
What this patch does is to enable try_to_unmap() to be called for hugepages
by setting up anon_vma in hugetlb code.
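
(For reference, the core VM tells anon pages from file pages by tagging
the low bit of page->mapping; PageAnon() in include/linux/mm.h is
essentially:

	static inline int PageAnon(struct page *page)
	{
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}

so once hugetlb sets page->mapping this way, the generic rmap code can
tell private (anon) hugepages from shared (file-backed) ones.)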

> I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
> from one process but if there are multiple processes, the second process could
> still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
> if it is, I guess it's still an improvement over what currently happens.

Try_to_unmap_anon() runs for each vma belonging to the anon_vma associated
with the error hugepage. So it works for multiple processes.
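
(Roughly, from 2.6.34's try_to_unmap_anon(), the walk looks like:

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);
		ret = try_to_unmap_one(page, vma, address, flags);
		...
	}

so every process whose vma is chained to the hugepage's anon_vma gets
its pte replaced.)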

> > +
> >  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			unsigned long address, pte_t *ptep, pte_t pte,
> >  			struct page *pagecache_page)
> > @@ -2348,6 +2371,12 @@ retry_avoidcopy:
> >  		huge_ptep_clear_flush(vma, address, ptep);
> >  		set_huge_pte_at(mm, address, ptep,
> >  				make_huge_pte(vma, new_page, 1));
> > +		page_remove_rmap(old_page);
> > +		/*
> > +		 * We need not call anon_vma_prepare() because anon_vma
> > +		 * is already prepared when the process fork()ed.
> > +		 */
> > +		hugepage_add_anon_rmap(new_page, vma, address);
> 
> This means that the anon_vma is shared between parent and child even
> after fork. Does this not mean that the behaviour of anon_vma differs
> between the core VM and hugetlb?

No. IIUC, anon_vma associated with (non-huge) anonymous page is also shared
between parent and child until COW.

> >  		/* Make the old page be freed below */
> >  		new_page = old_page;
> >  	}
> > @@ -2450,7 +2479,11 @@ retry:
> >  			spin_unlock(&inode->i_lock);
> >  		} else {
> >  			lock_page(page);
> > -			page->mapping = HUGETLB_POISON;
> > +			if (unlikely(anon_vma_prepare(vma))) {
> > +				ret = VM_FAULT_OOM;
> > +				goto backout_unlocked;
> > +			}
> > +			hugepage_add_anon_rmap(page, vma, address);
> 
> Seems ok for private pages at least.
> 
> >  		}
> >  	}
> >  
> > @@ -2479,6 +2512,13 @@ retry:
> >  				&& (vma->vm_flags & VM_SHARED)));
> >  	set_huge_pte_at(mm, address, ptep, new_pte);
> >  
> > +	/*
> > +	 * For privately mapped hugepage, _mapcount is incremented
> > +	 * in hugetlb_cow(), so only increment for shared hugepage here.
> > +	 */
> > +	if (vma->vm_flags & VM_MAYSHARE)
> > +		page_dup_rmap(page);
> > +
> 
> What happens when try_to_unmap_file is called on a hugetlb page?

try_to_unmap_file() is called for shared hugepages: it tracks all vmas
sharing one hugepage through the pagecache pointed to by page->mapping,
and sets all ptes to hwpoison swap entries instead of flushing them.
Curiously, a file-backed pte is changed into a swap entry, but that's
OK because the hwpoisoned hugepage should not be touched afterward.
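
(Roughly, the walk in 2.6.34's try_to_unmap_file() is:

	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;

	spin_lock(&mapping->i_mmap_lock);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
		ret = try_to_unmap_one(page, vma,
				       vma_address(page, vma), flags);
	spin_unlock(&mapping->i_mmap_lock);

which is why vma_address() needs to learn about hugepage indexes.)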

> >  	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> >  		/* Optimization, do the COW without a second fault */
> >  		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> > diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
> > index 0feeef8..58cd2f9 100644
> > --- v2.6.34-rc7/mm/rmap.c
> > +++ v2.6.34-rc7/mm/rmap.c
> > @@ -56,6 +56,7 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/migrate.h>
> > +#include <linux/hugetlb.h>
> >  
> >  #include <asm/tlbflush.h>
> >  
> > @@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> >  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> >  	unsigned long address;
> >  
> > +	if (unlikely(is_vm_hugetlb_page(vma)))
> > +		pgoff = page->index << compound_order(page);
> 
> Again, it would be nice to use hstate information if possible just so
> how the pagesize is discovered is consistent.

OK.

> >  	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> >  	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> >  		/* page should be within @vma mapping range */
> > @@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >  
> > +	if (unlikely(PageHuge(page))) {
> > +		pte = huge_pte_offset(mm, address);
> > +		ptl = &mm->page_table_lock;
> > +		goto check;
> > +	}
> > +
> >  	pgd = pgd_offset(mm, address);
> >  	if (!pgd_present(*pgd))
> >  		return NULL;
> > @@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> >  	}
> >  
> >  	ptl = pte_lockptr(mm, pmd);
> > +check:
> >  	spin_lock(ptl);
> >  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
> >  		*ptlp = ptl;
> > @@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
> >  		page_clear_dirty(page);
> >  		set_page_dirty(page);
> >  	}
> > +	/*
> > +	 * Mappings for hugepages are not counted in NR_ANON_PAGES nor
> > +	 * NR_FILE_MAPPED and not charged by memcg for now.
> > +	 */
> > +	if (unlikely(PageHuge(page)))
> > +		return;
> >  	if (PageAnon(page)) {
> >  		mem_cgroup_uncharge_page(page);
> >  		__dec_zone_page_state(page, NR_ANON_PAGES);
> 
> I don't see anything obviously wrong with this but it's a bit rushed and
> there are a few snarls that I pointed out above. I'd like to hear it passed
> the libhugetlbfs regression tests for different sizes without any oddness
> in the counters.

Since there exists regression as described above, I'll fix it first of all.

Thanks,
Naoya Horiguchi


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-14  7:46     ` Naoya Horiguchi
@ 2010-05-14  9:54       ` Mel Gorman
  2010-05-24  7:15         ` Naoya Horiguchi
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2010-05-14  9:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang

On Fri, May 14, 2010 at 04:46:41PM +0900, Naoya Horiguchi wrote:
> On Thu, May 13, 2010 at 04:27:37PM +0100, Mel Gorman wrote:
> > On Thu, May 13, 2010 at 04:55:20PM +0900, Naoya Horiguchi wrote:
> > > While hugepage is not currently swappable, rmapping can be useful
> > > for memory error handler.
> > > Using rmap, memory error handler can collect processes affected
> > > by hugepage errors and unmap them to contain error's effect.
> > > 
> > 
> > As a verification point, can you ensure that the libhugetlbfs "make
> > func" tests complete successfully with this patch applied? It's also
> > important that there is no oddness in the Hugepage-related counters in
> > /proc/meminfo. I'm not in the position to test it now unfortunately as
> > I'm on the road.
> 
> Yes. Thanks for the good test-set.
> 
> Hmm. I failed the libhugetlbfs tests with an oops in the "private mapped" test :(
> 

That's not a disaster - it's what the regression test is for. I haven't
restarted the review in this case. I'll wait for another version that
passes those regression tests.

> <OOPS SNIP>
> 
> Someone seems to call hugetlb_fault() with anon_vma == NULL.
> I'm investigating it in more detail.
> 

Sure.

> > > Current status of hugepage rmap differs depending on mapping mode:
> > > - for shared hugepage:
> > >   we can collect processes using a hugepage through pagecache,
> > >   but can not unmap the hugepage because of the lack of mapcount.
> > > - for privately mapped hugepage:
> > >   we can neither collect processes nor unmap the hugepage.
> > > 
> > > To realize hugepage rmapping, this patch introduces mapcount for
> > > shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
> > > 
> > > This patch can be the replacement of the following bug fix.
> > > 
> > 
> > Actually, you replace chunks but not all of that fix with this patch.
> > After this patch HUGETLB_POISON is never assigned but the definition still
> > exists in poison.h. You should also remove it if it is unnecessary.
> 
> OK. I'll remove HUGETLB_POISON in the next post.
> 

Thanks

> <SNIP>
> > For ordinary anon_vma's, there
> > is a chain of related vma's chained together via the anon_vma's. It's so
> > in the event of an unmapping, all the PTEs related to the page can be
> > found. Where are we doing the same here?
> 
> Finding all processes using a hugepage is done by try_to_unmap() as usual.
> Among callers of this function, only memory error handler calls it for
> hugepage for now.

Ok, my bad, it's anon_vma_prepare that does most of the linkages.
However, there still appears to be logic missing between how ordinary
anon rmap pages are set up and how hugetlb anon rmap pages are. See
__page_set_anon_rmap for example, what it does with the chains, and
compare it to hugepage_add_anon_rmap; there are some important
differences (see the sketch below).
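
For comparison, 2.6.34's __page_set_anon_rmap() does this:

	static void __page_set_anon_rmap(struct page *page,
		struct vm_area_struct *vma, unsigned long address, int exclusive)
	{
		struct anon_vma *anon_vma = vma->anon_vma;

		BUG_ON(!anon_vma);
		/*
		 * If the page isn't exclusively mapped into this vma,
		 * we must use the _oldest_ possible anon_vma for the
		 * page mapping!
		 */
		if (!exclusive) {
			struct anon_vma_chain *avc;
			avc = list_entry(vma->anon_vma_chain.prev,
					 struct anon_vma_chain, same_vma);
			anon_vma = avc->anon_vma;
		}

		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
		page->mapping = (struct address_space *) anon_vma;
		page->index = linear_page_index(vma, address);
	}

The v1 hugepage version above has no equivalent of the !exclusive case.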


> What this patch does is to enable try_to_unmap() to be called for hugepages
> by setting up anon_vma in hugetlb code.
> 
> > I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
> > from one process but if there are multiple processes, the second process could
> > still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
> > if it is, I guess it's still an improvement over what currently happens.
> 
> Try_to_unmap_anon() runs for each vma belonging to the anon_vma associated
> with the error hugepage. So it works for multiple processes.
> 

Yep, as long as anon_vma_prepare is called in all the correct cases. I
haven't double-checked that you have, and will wait until you pin down
why anon_vma is NULL in the next version.

> > > +
> > >  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  			unsigned long address, pte_t *ptep, pte_t pte,
> > >  			struct page *pagecache_page)
> > > @@ -2348,6 +2371,12 @@ retry_avoidcopy:
> > >  		huge_ptep_clear_flush(vma, address, ptep);
> > >  		set_huge_pte_at(mm, address, ptep,
> > >  				make_huge_pte(vma, new_page, 1));
> > > +		page_remove_rmap(old_page);
> > > +		/*
> > > +		 * We need not call anon_vma_prepare() because anon_vma
> > > +		 * is already prepared when the process fork()ed.
> > > +		 */
> > > +		hugepage_add_anon_rmap(new_page, vma, address);
> > 
> > This means that the anon_vma is shared between parent and child even
> > after fork. Does this not mean that the behaviour of anon_vma differs
> > between the core VM and hugetlb?
> 
> No. IIUC, anon_vma associated with (non-huge) anonymous page is also shared
> between parent and child until COW.
> 

In the base page case it does, but where is a new anon_vma allocated
here, and where is the rmap moved with page_move_anon_rmap()?
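
(For reference, page_move_anon_rmap() simply retargets page->mapping at
the vma's own anon_vma once the page is exclusively owned, roughly:

	anon_vma = (void *) vma->anon_vma + PAGE_MAPPING_ANON;
	page->mapping = (struct address_space *) anon_vma;

The v2 patch below calls it from hugetlb_cow() once page_mapcount()
shows the hugepage is no longer shared.)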

> > >  		/* Make the old page be freed below */
> > >  		new_page = old_page;
> > >  	}
> > > @@ -2450,7 +2479,11 @@ retry:
> > >  			spin_unlock(&inode->i_lock);
> > >  		} else {
> > >  			lock_page(page);
> > > -			page->mapping = HUGETLB_POISON;
> > > +			if (unlikely(anon_vma_prepare(vma))) {
> > > +				ret = VM_FAULT_OOM;
> > > +				goto backout_unlocked;
> > > +			}
> > > +			hugepage_add_anon_rmap(page, vma, address);
> > 
> > Seems ok for private pages at least.
> > 
> > >  		}
> > >  	}
> > >  
> > > @@ -2479,6 +2512,13 @@ retry:
> > >  				&& (vma->vm_flags & VM_SHARED)));
> > >  	set_huge_pte_at(mm, address, ptep, new_pte);
> > >  
> > > +	/*
> > > +	 * For privately mapped hugepage, _mapcount is incremented
> > > +	 * in hugetlb_cow(), so only increment for shared hugepage here.
> > > +	 */
> > > +	if (vma->vm_flags & VM_MAYSHARE)
> > > +		page_dup_rmap(page);
> > > +
> > 
> > What happens when try_to_unmap_file is called on a hugetlb page?
> 
> try_to_unmap_file() is called for shared hugepages: it tracks all vmas
> sharing one hugepage through the pagecache pointed to by page->mapping,
> and sets all ptes to hwpoison swap entries instead of flushing them.
> Curiously, a file-backed pte is changed into a swap entry, but that's
> OK because the hwpoisoned hugepage should not be touched afterward.
> 

Grand.

> > >  	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > >  		/* Optimization, do the COW without a second fault */
> > >  		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> > > diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
> > > index 0feeef8..58cd2f9 100644
> > > --- v2.6.34-rc7/mm/rmap.c
> > > +++ v2.6.34-rc7/mm/rmap.c
> > > @@ -56,6 +56,7 @@
> > >  #include <linux/memcontrol.h>
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/migrate.h>
> > > +#include <linux/hugetlb.h>
> > >  
> > >  #include <asm/tlbflush.h>
> > >  
> > > @@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> > >  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > >  	unsigned long address;
> > >  
> > > +	if (unlikely(is_vm_hugetlb_page(vma)))
> > > +		pgoff = page->index << compound_order(page);
> > 
> > Again, it would be nice to use hstate information if possible just so
> > how the pagesize is discovered is consistent.
> 
> OK.
> 
> > >  	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > >  	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > >  		/* page should be within @vma mapping range */
> > > @@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > >  	pte_t *pte;
> > >  	spinlock_t *ptl;
> > >  
> > > +	if (unlikely(PageHuge(page))) {
> > > +		pte = huge_pte_offset(mm, address);
> > > +		ptl = &mm->page_table_lock;
> > > +		goto check;
> > > +	}
> > > +
> > >  	pgd = pgd_offset(mm, address);
> > >  	if (!pgd_present(*pgd))
> > >  		return NULL;
> > > @@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > >  	}
> > >  
> > >  	ptl = pte_lockptr(mm, pmd);
> > > +check:
> > >  	spin_lock(ptl);
> > >  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
> > >  		*ptlp = ptl;
> > > @@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
> > >  		page_clear_dirty(page);
> > >  		set_page_dirty(page);
> > >  	}
> > > +	/*
> > > +	 * Mappings for hugepages are not counted in NR_ANON_PAGES nor
> > > +	 * NR_FILE_MAPPED and not charged by memcg for now.
> > > +	 */
> > > +	if (unlikely(PageHuge(page)))
> > > +		return;
> > >  	if (PageAnon(page)) {
> > >  		mem_cgroup_uncharge_page(page);
> > >  		__dec_zone_page_state(page, NR_ANON_PAGES);
> > 
> > I don't see anything obviously wrong with this but it's a bit rushed and
> > there are a few snarls that I pointed out above. I'd like to hear it passed
> > the libhugetlbfs regression tests for different sizes without any oddness
> > in the counters.
> 
> Since there exists regression as described above, I'll fix it first of all.
> 

Sure. As it is, the hugetlb parts of this patch to my eye are not ready yet
with some snags that need ironing out. That said, I see nothing fundamentally
wrong with the approach as such.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-13  9:18   ` Andi Kleen
@ 2010-05-17  4:53     ` Naoya Horiguchi
  0 siblings, 0 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-17  4:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-mm, Andrew Morton, Wu Fengguang, Mel Gorman,
	aarcange, lwoodman, Lee.Schermerhorn

On Thu, May 13, 2010 at 11:18:04AM +0200, Andi Kleen wrote:
> ...
> I think you also had a patch for mce-test, can you send me that
> one too?

Yes.
I attached a stand-alone test program and a run script below.
They don't depend on other code in mce-test,
so we can just add them to /tsrc.

> BTW I wonder: did you verify that the 1GB page support works?
> I would expect it does, but it would be good to double check.
> One would need a Westmere server or AMD Family10h+ system to test that.

No, I didn't.

Thanks,
Naoya Horiguchi

---
thugetlb.c
---
/*
 * Test program for memory error handling for hugepages
 * Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <getopt.h>
#include <sys/mman.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/prctl.h>

#define FILE_BASE  "test"

#define HPAGE_SIZE (2UL*1024*1024)
#define BUF_SIZE   256
#define PROTECTION (PROT_READ | PROT_WRITE)

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

/*
 * Fatal-error helper so the program builds stand-alone
 * (assumed equivalent of the helper the mce-test harness provides).
 */
#define err(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* Control early_kill/late_kill */
#define PR_MCE_KILL 33
#define PR_MCE_KILL_CLEAR   0
#define PR_MCE_KILL_SET     1
#define PR_MCE_KILL_LATE    0
#define PR_MCE_KILL_EARLY   1
#define PR_MCE_KILL_DEFAULT 2
#define PR_MCE_KILL_GET 34

int PS; /* Page size */
int file_size; /* Memory allocation size (hugepage unit) */
/* Error injection position (page offset from the first hugepage head) */
int corrupt_page;
char filename[BUF_SIZE] = "/test";
char filepath[BUF_SIZE];

#define DEB printf("DEBUG [%d:%s:%d]\n", getpid(), __FILE__, __LINE__);

static void usage(void)
{
	printf(
"./thugetlb [-m memory] [-o offset] [-f file] [-xeSAaFpch] hugetlbfs_directory\n"
"            -m|--memory size(hugepage unit)    Size of hugetlbfs file\n"
"            -o|--offset offset(page unit)      Position of error injection\n"
"            -x|--inject                        Error injection switch\n"
"            -e|--early-kill                    Set PR_MCE_KILL_EARLY\n"
"            -S|--shm                           Use shmem with SHM_HUGETLB\n"
"            -A|--anonymous                     Use MAP_ANONYMOUS\n"
"            -a|--avoid-touch                   Avoid touching error page\n"
"            -F|--fork\n"
"            -p|--private\n"
"            -c|--cow\n"
"            -f|--filename string\n"
"            -h|--help\n"
"\n"
	);
}

/*
 * semaphore get/put wrapper
 */
int get_semaphore(int sem_id, struct sembuf *sembuffer)
{
	sembuffer->sem_num = 0;
	sembuffer->sem_op  = -1;
	sembuffer->sem_flg = SEM_UNDO;
	return semop(sem_id, sembuffer, 1);
}

int put_semaphore(int sem_id, struct sembuf *sembuffer)
{
	sembuffer->sem_num = 0;
	sembuffer->sem_op  = 1;
	sembuffer->sem_flg = SEM_UNDO;
	return semop(sem_id, sembuffer, 1);
}

static int avoid_hpage(void *addr, int flag, char *avoid)
{
	return flag == 1 && addr == avoid;
}

static void write_bytes(char *addr, int flag, char *avoid)
{
	int i, j;
	for (i = 0; i < file_size; i++) {
		if (avoid_hpage(addr + i * HPAGE_SIZE, flag, avoid))
			continue;
		for (j = 0; j < HPAGE_SIZE; j++) {
			*(addr + i * HPAGE_SIZE + j) = (char)('a' +
							      ((i + j) % 26));
		}
	}
}

static void read_bytes(char *addr, int flag, char *avoid)
{
	int i, j;

	for (i = 0; i < file_size; i++) {
		if (avoid_hpage(addr + i * HPAGE_SIZE, flag, avoid))
			continue;
		for (j = 0; j < HPAGE_SIZE; j++) {
			if (*(addr + i * HPAGE_SIZE + j) != (char)('a' +
							   ((i + j) % 26))) {
				printf("Mismatch at %lu\n", i + j);
				break;
			}
		}
	}
}

static struct option opts[] = {
	{ "memory"          , 1, NULL, 'm' },
	{ "offset"          , 1, NULL, 'o' },
	{ "inject"          , 0, NULL, 'x' },
	{ "early_kill"      , 0, NULL, 'e' },
	{ "shm"             , 0, NULL, 'S' },
	{ "anonymous"       , 0, NULL, 'A' },
	{ "avoid-touch"     , 0, NULL, 'a' },
	{ "fork"            , 0, NULL, 'F' },
	{ "private"         , 0, NULL, 'p' },
	{ "cow"             , 0, NULL, 'c' },
	{ "filename"        , 1, NULL, 'f' },
	{ "help"            , 0, NULL, 'h' },
	{ NULL              , 0, NULL,  0  }
};

int main(int argc, char *argv[])
{
	void *addr;
	int i;
	int ret;
	int fd = 0;
	int shmid;
	int semid;
	int semaphore;
	int inject = 0;
	int early_kill = 0;
	int avoid_touch = 0;
	int anonflag = 0;
	int shmflag = 0;
	int shmkey = 0;
	int forkflag = 0;
	int privateflag = 0;
	int cowflag = 0;
	int c;			/* getopt_long() returns an int */
	pid_t pid = 0;
	void *expected_addr = NULL;
	struct sembuf sembuffer;

	PS = getpagesize();
	file_size = 1;
	corrupt_page = -1;

	if (argc == 1) {
		usage();
		exit(EXIT_FAILURE);
	}

	while ((c = getopt_long(argc, argv,
				"m:o:xeSAaFpcf:h", opts, NULL)) != -1) {
		switch (c) {
		case 'm':
			file_size = strtol(optarg, NULL, 10);
			break;
		case 'o':
			corrupt_page = strtol(optarg, NULL, 10);
			break;
		case 'x':
			inject = 1;
			break;
		case 'e':
			early_kill = 1;
			break;
		case 'S':
			shmflag = 1;
			break;
		case 'A':
			anonflag = 1;
			break;
		case 'a':
			avoid_touch = 1;
			break;
		case 'F':
			forkflag = 1;
			break;
		case 'p':
			privateflag = 1;
			break;
		case 'c':
			cowflag = 1;
			break;
		case 'f':
			strcat(filename, optarg);
			shmkey = strtol(optarg, NULL, 10);
			break;
		case 'h':
			usage();
			exit(EXIT_SUCCESS);
		default:
			usage();
			exit(EXIT_FAILURE);
		}
	}

	if (inject && corrupt_page * PS > file_size * HPAGE_SIZE)
		err("Target page is out of range.\n");

	if (avoid_touch && corrupt_page == -1)
		err("Avoid which page?\n");

	/* Construct file name */
	if (access(argv[argc - 1], F_OK) == -1) {
		usage();
		exit(EXIT_FAILURE);
	} else {
		strcpy(filepath, argv[argc - 1]);
		strcat(filepath, filename);
	}

	if (shmflag) {
		if ((shmid = shmget(shmkey, file_size * HPAGE_SIZE,
				    SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0)
			err("shmget");
		addr = shmat(shmid, (void *)0x0UL, 0);
		if (addr == (char *)-1) {
			perror("Shared memory attach failure");
			shmctl(shmid, IPC_RMID, NULL);
			exit(2);
		}
	} else if (anonflag) {
		int mapflag = MAP_ANONYMOUS | 0x40000; /* MAP_HUGETLB */
		if (privateflag)
			mapflag |= MAP_PRIVATE;
		else
			mapflag |= MAP_SHARED;
		if ((addr = mmap(0, file_size * HPAGE_SIZE,
				 PROTECTION, mapflag, -1, 0)) == MAP_FAILED)
			err("mmap");
	} else {
		int mapflag = MAP_SHARED;
		if (privateflag)
			mapflag = MAP_PRIVATE;
		if ((fd = open(filepath, O_CREAT | O_RDWR, 0777)) < 0)
			err("Open failed");
		if ((addr = mmap(0, file_size * HPAGE_SIZE,
				 PROTECTION, mapflag, fd, 0)) == MAP_FAILED) {
			unlink(filepath);
			err("mmap");
		}
	}

	/* Head of the hugepage containing the error page
	 * (512 == HPAGE_SIZE / PS with 2MB hugepages and 4kB base pages) */
	if (corrupt_page != -1)
		expected_addr = (void *)(addr + corrupt_page / 512 * HPAGE_SIZE);

	if (forkflag) {
		semid = semget(IPC_PRIVATE, 1, 0666|IPC_CREAT);
		if (semid == -1) {
			perror("semget");
			goto cleanout;
		}
		semaphore = semctl(semid, 0, SETVAL, 1);
		if (semaphore == -1) {
			perror("semctl");
			goto cleanout;
		}
		if (get_semaphore(semid, &sembuffer)) {
			perror("get_semaphore");
			goto cleanout;
		}
	}

	write_bytes(addr, 0, 0);
	read_bytes(addr, 0, 0);

	if (early_kill)
		prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY,
		      NULL, NULL);

	/*
	 * Intended order:
	 *   1. Child COWs
	 *   2. Parent madvise()s
	 *   3. Child exit()s
	 */
	if (forkflag) {
		pid = fork();
		if (!pid) {
			/* Semaphore is already held */
			if (cowflag) {
				write_bytes(addr, 0, expected_addr);
				read_bytes(addr, 0, expected_addr);
			}
			if (put_semaphore(semid, &sembuffer))
				err("put_semaphore");
			usleep(1000);
			/* Wait for madvise() to be done */
			if (get_semaphore(semid, &sembuffer))
				err("put_semaphore");
			if (put_semaphore(semid, &sembuffer))
				err("put_semaphore");
			return 0;
		}
	}

	/* Wait for COW */
	if (forkflag && get_semaphore(semid, &sembuffer)) {
		perror("get_semaphore");
		goto cleanout;
	}

	if (inject && corrupt_page != -1) {
		/* 100 == MADV_HWPOISON: inject a memory error at this page */
		ret = madvise(addr + corrupt_page * PS, PS, 100);
		if (ret) {
			printf("madvise returned %d: ", ret);
			perror("madvise");
			goto cleanout;
		}
	}

	if (forkflag && put_semaphore(semid, &sembuffer)) {
		perror("put_semaphore");
		goto cleanout;
	}

	write_bytes(addr, avoid_touch, expected_addr);
	read_bytes(addr, avoid_touch, expected_addr);

	if (forkflag)
		if (wait(&i) == -1)
			err("wait");
cleanout:
	if (shmflag) {
		if (shmdt((const void *)addr) != 0) {
			err("Detach failure");
			shmctl(shmid, IPC_RMID, NULL);
			exit(EXIT_FAILURE);
		}
		shmctl(shmid, IPC_RMID, NULL);
	} else {
		if (munmap(addr, file_size * HPAGE_SIZE))
			err("munmap");
		if (close(fd))
			err("close");
		if (!anonflag && unlink(filepath))
			err("unlink");
	}

	return 0;
}

---
run-huge-test.sh
---
#!/bin/bash
#
# Test program for memory error handling for hugepages
# Usage: ./run-huge-test.sh hugetlbfs_directory
# Author: Naoya Horiguchi

usage()
{
    echo "Usage: ./run-hgue-test.sh hugetlbfs_directory" && exit 1
}

htdir=$1
[ $# -ne 1 ] && usage
[ ! -d $htdir ] && usage

rm -rf $htdir/test*
echo 1000 > /proc/sys/vm/nr_hugepages

num=0

exec_testcase() {
    error=0
    echo "TestCase $@"
    hpage_size=$1
    hpage_target=$2
    num=$7

    if [ "$3" = "head" ] ; then
	hpage_target_offset=0
    elif [ "$3" = "tail" ] ; then
	hpage_target_offset=1
    else
	error=1
    fi
    hpage_target=$((hpage_target * 512 + hpage_target_offset))

    if [ "$4" = "early" ] ; then
	process_type="-e"
    elif [ "$4" = "late_touch" ] ; then
	process_type=""
    elif [ "$4" = "late_avoid" ] ; then
	process_type="-a"
    else
	error=1
    fi

    if [ "$5" = "anonymous" ] ; then
	file_type="-A"
    elif [ "$5" = "file" ] ; then
	file_type="-f $num"
    elif [ "$5" = "shm" ] ; then
	file_type="-S"
    else
	error=1
    fi

    if [ "$6" = "fork_shared" ] ; then
	share_type="-F"
    elif [ "$6" = "fork_private_nocow" ] ; then
	share_type="-Fp"
    elif [ "$6" = "fork_private_cow" ] ; then
	share_type="-Fpc"
    else
	error=1
    fi

    # $error is set by the parameter checks above; skip bogus testcases.
    if [ $error -ne 0 ] ; then
	echo "invalid testcase parameters: $@"
	return 1
    fi

    command="./thugetlb -x -m $hpage_size -o $hpage_target $process_type $file_type $share_type $htdir &"
    echo $command
    eval $command
    wait $!
    echo ""

    return 0
}

num=$((num+1))
exec_testcase 2 1 "head" "early" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "early" "anonymous" "fork_private_cow" $num

num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_touch" "anonymous" "fork_private_cow" $num

num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "head" "late_avoid" "anonymous" "fork_private_cow" $num

num=$((num+1))
exec_testcase 2 1 "tail" "early" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "early" "anonymous" "fork_private_cow" $num

num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_touch" "anonymous" "fork_private_cow" $num

num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "file" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "file" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "file" "fork_private_cow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "shm" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "anonymous" "fork_shared" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "anonymous" "fork_private_nocow" $num
num=$((num+1))
exec_testcase 2 1 "tail" "late_avoid" "anonymous" "fork_private_cow" $num

# free IPC semaphores used by thugetlb.c
ipcs -s|grep $USER|cut -f2 -d' '|xargs ipcrm sem 


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-14  9:54       ` Mel Gorman
@ 2010-05-24  7:15         ` Naoya Horiguchi
  2010-05-25 10:59           ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-24  7:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang,
	Andrea Arcangeli, Larry Woodman, Lee Schermerhorn

Hi,

On Fri, May 14, 2010 at 10:54:50AM +0100, Mel Gorman wrote:
> ...
> > Hmm. I failed the libhugetlbfs tests with an oops in the "private mapped" test :(
> >
>
> That's not a disaster - it's what the regression test is for. I haven't
> restarted the review in this case. I'll wait for another version that
> passes those regression tests.

I have fixed the bugs in the "rmap for hugepage" patch in this latest
version, with which the 'make func' tests pass with the same results
as on a vanilla kernel.  The other HWPOISON patches are unchanged.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Mon, 24 May 2010 11:26:33 +0900
Subject: [PATCH] hugetlb, rmap: add reverse mapping for hugepage

This patch adds reverse mapping feature for hugepage by introducing
mapcount for shared/private-mapped hugepage and anon_vma for
private-mapped hugepage.

While hugepage is not currently swappable, reverse mapping can be useful
for memory error handler.

Without this patch, memory error handler cannot identify processes
using the bad hugepage nor unmap it from them. That is:
- for shared hugepage:
  we can collect processes using a hugepage through pagecache,
  but can not unmap the hugepage because of the lack of mapcount.
- for privately mapped hugepage:
  we can neither collect processes nor unmap the hugepage.
This patch solves these problems.

This patch includes the bug fix from commit 23be7468e8, so it reverts
that commit.

ChangeLog since May 13:
- rebased to 2.6.34
- fix logic error (in the case where private and shared mappings coexist)
- move is_vm_hugetlb_page() into include/linux/mm.h to use this function
  from linear_page_index()
- define and use linear_hugepage_index() instead of compound_order()
- use page_move_anon_rmap() in hugetlb_cow()
- copy the exclusive switch of __page_set_anon_rmap() into the hugepage
  counterpart
- revert commit 23be7468 completely

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
---
 include/linux/hugetlb.h |   11 +-----
 include/linux/mm.h      |    9 +++++
 include/linux/pagemap.h |    8 ++++-
 include/linux/poison.h  |    9 -----
 mm/hugetlb.c            |   85 +++++++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c               |   16 +++++++++
 6 files changed, 115 insertions(+), 23 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 78b4bc6..a574d09 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,11 +14,6 @@ struct user_struct;
 
 int PageHuge(struct page *page);
 
-static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
-{
-	return vma->vm_flags & VM_HUGETLB;
-}
-
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -77,11 +72,6 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
-static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
@@ -108,6 +98,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
+#define huge_pte_offset(mm, address)	0
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 462acaf..ee2c442 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -366,6 +366,15 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].lru.prev = (void *)order;
 }
 
+static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HUGETLBFS
+	return vma->vm_flags & VM_HUGETLB;
+#else
+	return 0;
+#endif
+}
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3c62ed4..9e0bd64 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -281,10 +281,16 @@ static inline loff_t page_offset(struct page *page)
 	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
 }
 
+extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
+				     unsigned long address);
+
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 					unsigned long address)
 {
-	pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
+	pgoff_t pgoff;
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		return linear_hugepage_index(vma, address);
+	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
 	pgoff += vma->vm_pgoff;
 	return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 }
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 34066ff..2110a81 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -48,15 +48,6 @@
 #define POISON_FREE	0x6b	/* for use-after-free poisoning */
 #define	POISON_END	0xa5	/* end-byte of poisoning */
 
-/********** mm/hugetlb.c **********/
-/*
- * Private mappings of hugetlb pages use this poisoned value for
- * page->mapping. The core VM should not be doing anything with this mapping
- * but futex requires the existence of some page->mapping value even though it
- * is unused if PAGE_MAPPING_ANON is set.
- */
-#define HUGETLB_POISON	((void *)(0x00300300 + POISON_POINTER_DELTA + PAGE_MAPPING_ANON))
-
 /********** arch/$ARCH/mm/init.c **********/
 #define POISON_FREE_INITMEM	0xcc
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4c9e6bb..b8b8ea4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -18,6 +18,7 @@
 #include <linux/bootmem.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
+#include <linux/rmap.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -220,6 +221,12 @@ static pgoff_t vma_hugecache_offset(struct hstate *h,
 			(vma->vm_pgoff >> huge_page_order(h));
 }
 
+pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
+				     unsigned long address)
+{
+	return vma_hugecache_offset(hstate_vma(vma), vma, address);
+}
+
 /*
  * Return the size of the pages allocated when backing a VMA. In the majority
  * cases this will be same size as used by the page table entries.
@@ -546,8 +553,8 @@ static void free_huge_page(struct page *page)
 
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
-	page->mapping = NULL;
 	BUG_ON(page_count(page));
+	BUG_ON(page_mapcount(page));
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
@@ -2125,6 +2132,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(&src->page_table_lock);
@@ -2203,6 +2211,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
+		page_remove_rmap(page);
 		list_del(&page->lru);
 		put_page(page);
 	}
@@ -2268,6 +2277,50 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 1;
 }
 
+/*
+ * The following three functions are counterparts of ones in mm/rmap.c.
+ * Unlike them, these functions don't have accounting code nor lru code,
+ * because we handle hugepages differently from common anonymous pages.
+ */
+static void __hugepage_set_anon_rmap(struct page *page,
+	struct vm_area_struct *vma, unsigned long address, int exclusive)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	BUG_ON(!anon_vma);
+	if (!exclusive) {
+		struct anon_vma_chain *avc;
+		avc = list_entry(vma->anon_vma_chain.prev,
+				 struct anon_vma_chain, same_vma);
+		anon_vma = avc->anon_vma;
+	}
+	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
+	page->mapping = (struct address_space *) anon_vma;
+	page->index = linear_page_index(vma, address);
+}
+
+static void hugepage_add_anon_rmap(struct page *page,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	int first;
+	BUG_ON(!anon_vma);
+	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	first = atomic_inc_and_test(&page->_mapcount);
+	if (first)
+		__hugepage_set_anon_rmap(page, vma, address, 0);
+}
+
+void hugepage_add_new_anon_rmap(struct page *page,
+	struct vm_area_struct *vma, unsigned long address)
+{
+	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	atomic_set(&page->_mapcount, 0);
+	__hugepage_set_anon_rmap(page, vma, address, 1);
+}
+
+/*
+ * Hugetlb_cow() should be called with page lock of the original hugepage held.
+ */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
 			struct page *pagecache_page)
@@ -2282,8 +2335,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 retry_avoidcopy:
 	/* If no-one else is actually using this page, avoid the copy
 	 * and just make the page writable */
-	avoidcopy = (page_count(old_page) == 1);
+	avoidcopy = (page_mapcount(old_page) == 1);
 	if (avoidcopy) {
+		if (PageAnon(old_page))
+			page_move_anon_rmap(old_page, vma, address);
 		set_huge_ptep_writable(vma, address, ptep);
 		return 0;
 	}
@@ -2334,6 +2389,13 @@ retry_avoidcopy:
 		return -PTR_ERR(new_page);
 	}
 
+	/*
+	 * When the original hugepage is a shared one, it does not have
+	 * an anon_vma prepared.
+	 */
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+
 	copy_huge_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
@@ -2348,6 +2410,8 @@ retry_avoidcopy:
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
+		page_remove_rmap(old_page);
+		hugepage_add_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
@@ -2448,10 +2512,17 @@ retry:
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
+			page_dup_rmap(page);
 		} else {
 			lock_page(page);
-			page->mapping = HUGETLB_POISON;
+			if (unlikely(anon_vma_prepare(vma))) {
+				ret = VM_FAULT_OOM;
+				goto backout_unlocked;
+			}
+			hugepage_add_new_anon_rmap(page, vma, address);
 		}
+	} else {
+		page_dup_rmap(page);
 	}
 
 	/*
@@ -2503,6 +2574,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
@@ -2544,6 +2616,11 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 								vma, address);
 	}
 
+	if (!pagecache_page) {
+		page = pte_page(entry);
+		lock_page(page);
+	}
+
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
@@ -2569,6 +2646,8 @@ out_page_table_lock:
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
 		put_page(pagecache_page);
+	} else {
+		unlock_page(page);
 	}
 
 out_mutex:
diff --git a/mm/rmap.c b/mm/rmap.c
index 0feeef8..4ec6b1c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	unsigned long address;
 
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		pgoff = page->index << huge_page_order(page_hstate(page));
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		/* page should be within @vma mapping range */
@@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	pte_t *pte;
 	spinlock_t *ptl;
 
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		ptl = &mm->page_table_lock;
+		goto check;
+	}
+
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
 		return NULL;
@@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	}
 
 	ptl = pte_lockptr(mm, pmd);
+check:
 	spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
@@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
 		page_clear_dirty(page);
 		set_page_dirty(page);
 	}
+	/*
+	 * Hugepages are not counted in NR_ANON_PAGES or NR_FILE_MAPPED,
+	 * and are not charged by memcg for now.
+	 */
+	if (unlikely(PageHuge(page)))
+		return;
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
 		__dec_zone_page_state(page, NR_ANON_PAGES);
-- 
1.7.0


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-24  7:15         ` Naoya Horiguchi
@ 2010-05-25 10:59           ` Mel Gorman
  2010-05-26  6:51             ` Naoya Horiguchi
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2010-05-25 10:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang,
	Andrea Arcangeli, Larry Woodman, Lee Schermerhorn

On Mon, May 24, 2010 at 04:15:16PM +0900, Naoya Horiguchi wrote:
> Hi,
> 
> On Fri, May 14, 2010 at 10:54:50AM +0100, Mel Gorman wrote:
> > ...
> > > Hmm. The libhugetlbfs "private mapped" test failed with an oops :(
> > >
> >
> > That's not a disaster - it's what the regression test is for. I haven't
> > restarted the review in this case. I'll wait for another version that
> > passes those regression tests.
> 
> I have fixed the bugs in the "rmap for hugepage" patch in the latest version;
> the 'make func' test now passes with the same results as on a vanilla kernel.
> Other HWPOISON patches have no change.
> 

I'd have preferred to see the whole series but still...

> <SNIP>
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Larry Woodman <lwoodman@redhat.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> ---
>  include/linux/hugetlb.h |   11 +-----
>  include/linux/mm.h      |    9 +++++
>  include/linux/pagemap.h |    8 ++++-
>  include/linux/poison.h  |    9 -----
>  mm/hugetlb.c            |   85 +++++++++++++++++++++++++++++++++++++++++++++--
>  mm/rmap.c               |   16 +++++++++
>  6 files changed, 115 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 78b4bc6..a574d09 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -14,11 +14,6 @@ struct user_struct;
>  
>  int PageHuge(struct page *page);
>  
> -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> -{
> -	return vma->vm_flags & VM_HUGETLB;
> -}
> -
>  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
>  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> @@ -77,11 +72,6 @@ static inline int PageHuge(struct page *page)
>  	return 0;
>  }
>  
> -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> -{
> -	return 0;
> -}
> -

You collapse two functions into one here and move them to another
header. Is there a reason why pagemap.h could not include hugetlb.h?
It adds another header dependency which is bad but moving hugetlb stuff
into mm.h seems bad too.

>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
>  {
>  }
> @@ -108,6 +98,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
>  #define is_hugepage_only_range(mm, addr, len)	0
>  #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
>  #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
> +#define huge_pte_offset(mm, address)	0
>  
>  #define hugetlb_change_protection(vma, address, end, newprot)
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 462acaf..ee2c442 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -366,6 +366,15 @@ static inline void set_compound_order(struct page *page, unsigned long order)
>  	page[1].lru.prev = (void *)order;
>  }
>  
> +static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_HUGETLBFS
> +	return vma->vm_flags & VM_HUGETLB;
> +#else
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * Multiple processes may "see" the same page. E.g. for untouched
>   * mappings of /dev/null, all processes see the same page full of
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3c62ed4..9e0bd64 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -281,10 +281,16 @@ static inline loff_t page_offset(struct page *page)
>  	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
>  }
>  
> +extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
> +				     unsigned long address);
> +
>  static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>  					unsigned long address)
>  {
> -	pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> +	pgoff_t pgoff;
> +	if (unlikely(is_vm_hugetlb_page(vma)))
> +		return linear_hugepage_index(vma, address);
> +	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>  	pgoff += vma->vm_pgoff;
>  	return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>  }
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 34066ff..2110a81 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -48,15 +48,6 @@
>  #define POISON_FREE	0x6b	/* for use-after-free poisoning */
>  #define	POISON_END	0xa5	/* end-byte of poisoning */
>  
> -/********** mm/hugetlb.c **********/
> -/*
> - * Private mappings of hugetlb pages use this poisoned value for
> - * page->mapping. The core VM should not be doing anything with this mapping
> - * but futex requires the existence of some page->mapping value even though it
> - * is unused if PAGE_MAPPING_ANON is set.
> - */
> -#define HUGETLB_POISON	((void *)(0x00300300 + POISON_POINTER_DELTA + PAGE_MAPPING_ANON))
> -
>  /********** arch/$ARCH/mm/init.c **********/
>  #define POISON_FREE_INITMEM	0xcc
>  
> <SNIP>
> 
> +/*
> + * The following three functions are counterparts of ones in mm/rmap.c.
> + * Unlike them, these functions don't have accounting or LRU code,
> + * because we handle hugepages differently from common anonymous pages.
> + */
> +static void __hugepage_set_anon_rmap(struct page *page,
> +	struct vm_area_struct *vma, unsigned long address, int exclusive)
> +{
> +	struct anon_vma *anon_vma = vma->anon_vma;
> +	BUG_ON(!anon_vma);
> +	if (!exclusive) {
> +		struct anon_vma_chain *avc;
> +		avc = list_entry(vma->anon_vma_chain.prev,
> +				 struct anon_vma_chain, same_vma);
> +		anon_vma = avc->anon_vma;
> +	}
> +	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> +	page->mapping = (struct address_space *) anon_vma;
> +	page->index = linear_page_index(vma, address);
> +}
> +
> +static void hugepage_add_anon_rmap(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct anon_vma *anon_vma = vma->anon_vma;
> +	int first;
> +	BUG_ON(!anon_vma);
> +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> +	first = atomic_inc_and_test(&page->_mapcount);
> +	if (first)
> +		__hugepage_set_anon_rmap(page, vma, address, 0);
> +}
> +
> +void hugepage_add_new_anon_rmap(struct page *page,
> +	struct vm_area_struct *vma, unsigned long address)
> +{
> +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> +	atomic_set(&page->_mapcount, 0);
> +	__hugepage_set_anon_rmap(page, vma, address, 1);
> +}
> +

Is it possible to move these to mm/rmap.c so all the anon rmap adding
code is in the same place? In the event that __page_set_anon_rmap() is
updated, there would be a greater chance the hugepage equivalent will be
noticed and updated.

I didn't spot anything particularly bad after this.  If these minor issues
could be addressed and the full series reposted, I'll test the hugetlb side
of things further just to be sure.

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-25 10:59           ` Mel Gorman
@ 2010-05-26  6:51             ` Naoya Horiguchi
  2010-05-26  9:03               ` Mel Gorman
  2010-05-26  9:19               ` Andi Kleen
  0 siblings, 2 replies; 23+ messages in thread
From: Naoya Horiguchi @ 2010-05-26  6:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang,
	Andrea Arcangeli, Larry Woodman, Lee Schermerhorn

Hi, Mel.

Thank you for the review.

On Tue, May 25, 2010 at 11:59:57AM +0100, Mel Gorman wrote:
> ...
> I'd have preferred to see the whole series but still...

OK.

> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 78b4bc6..a574d09 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -14,11 +14,6 @@ struct user_struct;
> >  
> >  int PageHuge(struct page *page);
> >  
> > -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > -{
> > -	return vma->vm_flags & VM_HUGETLB;
> > -}
> > -
> >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> >  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> >  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > @@ -77,11 +72,6 @@ static inline int PageHuge(struct page *page)
> >  	return 0;
> >  }
> >  
> > -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > -{
> > -	return 0;
> > -}
> > -
> 
> You collapse two functions into one here and move them to another
> header. Is there a reason why pagemap.h could not include hugetlb.h?

Yes, hugetlb.h includes pagemap.h through mempolicy.h.
I didn't make pagemap.h depend on hugetlb.h because that would create a cyclic
dependency among pagemap.h, mempolicy.h and hugetlb.h.
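
To illustrate, the cycle would look like this (a rough sketch only; the
headers between mempolicy.h and pagemap.h are elided):

	/* include/linux/hugetlb.h */
	#include <linux/mempolicy.h>	/* ... which eventually pulls in pagemap.h */

	/* include/linux/pagemap.h -- adding this line would close the loop */
	#include <linux/hugetlb.h>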

> It adds another header dependency which is bad but moving hugetlb stuff
> into mm.h seems bad too.

Another option is to move the definition of is_vm_hugetlb_page() into
mm/hugetlb.c and add a declaration of the function to pagemap.h,
but that needed some rather ugly #ifdefs and I didn't like it.
If putting hugetlb code in mm.h is worse, I'll take this second option
in the next post.
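
Roughly, that second option would look like this (only a sketch; the
!CONFIG_HUGETLBFS stub is what needs the ugly #ifdefs):

	/* include/linux/pagemap.h */
	struct vm_area_struct;
	extern int is_vm_hugetlb_page(struct vm_area_struct *vma);

	/* mm/hugetlb.c */
	int is_vm_hugetlb_page(struct vm_area_struct *vma)
	{
		return vma->vm_flags & VM_HUGETLB;
	}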

> > @@ -2268,6 +2277,50 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	return 1;
> >  }
> >  
> > +/*
> > + * The following three functions are counterparts of ones in mm/rmap.c.
> > + * Unlike them, these functions don't have accounting or LRU code,
> > + * because we handle hugepages differently from common anonymous pages.
> > + */
> > +static void __hugepage_set_anon_rmap(struct page *page,
> > +	struct vm_area_struct *vma, unsigned long address, int exclusive)
> > +{
> > +	struct anon_vma *anon_vma = vma->anon_vma;
> > +	BUG_ON(!anon_vma);
> > +	if (!exclusive) {
> > +		struct anon_vma_chain *avc;
> > +		avc = list_entry(vma->anon_vma_chain.prev,
> > +				 struct anon_vma_chain, same_vma);
> > +		anon_vma = avc->anon_vma;
> > +	}
> > +	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> > +	page->mapping = (struct address_space *) anon_vma;
> > +	page->index = linear_page_index(vma, address);
> > +}
> > +
> > +static void hugepage_add_anon_rmap(struct page *page,
> > +			struct vm_area_struct *vma, unsigned long address)
> > +{
> > +	struct anon_vma *anon_vma = vma->anon_vma;
> > +	int first;
> > +	BUG_ON(!anon_vma);
> > +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> > +	first = atomic_inc_and_test(&page->_mapcount);
> > +	if (first)
> > +		__hugepage_set_anon_rmap(page, vma, address, 0);
> > +}
> > +
> > +void hugepage_add_new_anon_rmap(struct page *page,
> > +	struct vm_area_struct *vma, unsigned long address)
> > +{
> > +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> > +	atomic_set(&page->_mapcount, 0);
> > +	__hugepage_set_anon_rmap(page, vma, address, 1);
> > +}
> > +
> 
> Is it possible to move these to mm/rmap.c so all the anon rmap adding
> code is in the same place? In the event that __page_set_anon_rmap() is
> updated, there would be a greater chance the hugepage equivalent will be
> noticed and updated.

Sounds reasonable, I'll do it.

> I didn't spot anything particularly bad after this.  If these minor issues
> could be addressed and the full series reposted, I'll test the hugetlb side
> of things further just to be sure.

OK, thank you :)

Thanks,
Naoya Horiguchi


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-26  6:51             ` Naoya Horiguchi
@ 2010-05-26  9:03               ` Mel Gorman
  2010-05-26  9:19               ` Andi Kleen
  1 sibling, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2010-05-26  9:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Andi Kleen, Andrew Morton, Wu Fengguang,
	Andrea Arcangeli, Larry Woodman, Lee Schermerhorn

On Wed, May 26, 2010 at 03:51:56PM +0900, Naoya Horiguchi wrote:
> Hi, Mel.
> 
> Thank you for the review.
> 
> On Tue, May 25, 2010 at 11:59:57AM +0100, Mel Gorman wrote:
> > ...
> > I'd have preferred to see the whole series but still...
> 
> OK.
> 
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index 78b4bc6..a574d09 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -14,11 +14,6 @@ struct user_struct;
> > >  
> > >  int PageHuge(struct page *page);
> > >  
> > > -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > > -{
> > > -	return vma->vm_flags & VM_HUGETLB;
> > > -}
> > > -
> > >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > >  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > >  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > @@ -77,11 +72,6 @@ static inline int PageHuge(struct page *page)
> > >  	return 0;
> > >  }
> > >  
> > > -static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > > -{
> > > -	return 0;
> > > -}
> > > -
> > 
> > You collapse two functions into one here and move them to another
> > header. Is there a reason why pagemap.h could not include hugetlb.h?
> 
> Yes, hugetlb.h includes pagemap.h through mempolicy.h.
> I didn't make pagemap.h depend on hugetlb.h because that would create a cyclic
> dependency among pagemap.h, mempolicy.h and hugetlb.h.
> 

Ok, that's a good reason.

> > It adds another header dependency which is bad but moving hugetlb stuff
> > into mm.h seems bad too.
> 
> Another option is to move the definition of is_vm_hugetlb_page() into
> mm/hugetlb.c and add a declaration of the function to pagemap.h,
> but that needed some rather ugly #ifdefs and I didn't like it.
> If putting hugetlb code in mm.h is worse, I'll take this second option
> in the next post.
> 

That would add function call overhead to page table teardown, which would
be very unfortunate. I'm still not very keen on moving hugetlb code to
mm.h though.

How about moving the definition of shared_policy under a CONFIG_NUMA
block in mm_types.h and removing the dependency between hugetlb.h and
mempolicy.h?
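
Something like the following, say (only sketching the idea; the fields are
whatever mempolicy.h defines today):

	/* include/linux/mm_types.h */
	#ifdef CONFIG_NUMA
	struct shared_policy {
		struct rb_root root;
		spinlock_t lock;
	};
	#else
	struct shared_policy {
	};
	#endif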

Does anyone else see a problem with this from a "clean" perspective?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-26  6:51             ` Naoya Horiguchi
  2010-05-26  9:03               ` Mel Gorman
@ 2010-05-26  9:19               ` Andi Kleen
  2010-05-26  9:44                 ` Mel Gorman
  1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2010-05-26  9:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mel Gorman, linux-kernel, linux-mm, Andi Kleen, Andrew Morton,
	Wu Fengguang, Andrea Arcangeli, Larry Woodman, Lee Schermerhorn


Mel, other than this nit are you happy with these changes now?

> > It adds another header dependency which is bad but moving hugetlb stuff
> > into mm.h seems bad too.
> 
> Another option is to move the definition of is_vm_hugetlb_page() into
> mm/hugetlb.c and add a declaration of the function to pagemap.h,
> but that needed some rather ugly #ifdefs and I didn't like it.
> If putting hugetlb code in mm.h is worse, I'll take this second option
> in the next post.

You could always create a new include file hugetlb-inlines.h

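A sketch of what such a file might hold, reusing the is_vm_hugetlb_page()
from this patch (the file name and contents are only illustrative):

	/* include/linux/hugetlb-inlines.h */
	#ifndef _LINUX_HUGETLB_INLINES_H
	#define _LINUX_HUGETLB_INLINES_H

	#include <linux/mm.h>	/* struct vm_area_struct, VM_HUGETLB */

	static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
	{
	#ifdef CONFIG_HUGETLBFS
		return vma->vm_flags & VM_HUGETLB;
	#else
		return 0;
	#endif
	}

	#endif /* _LINUX_HUGETLB_INLINES_H */
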
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-26  9:19               ` Andi Kleen
@ 2010-05-26  9:44                 ` Mel Gorman
  2010-05-26  9:58                   ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2010-05-26  9:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Naoya Horiguchi, linux-kernel, linux-mm, Andrew Morton,
	Wu Fengguang, Andrea Arcangeli, Larry Woodman, Lee Schermerhorn

On Wed, May 26, 2010 at 11:19:58AM +0200, Andi Kleen wrote:
> 
> Mel, other than this nit are you happy with these changes now?
> 

Pretty much, but I also want to test the series myself to be sure I haven't
missed something in review.

> > > It adds another header dependency which is bad but moving hugetlb stuff
> > > into mm.h seems bad too.
> > 
> > Another option is to move the definition of is_vm_hugetlb_page() into
> > mm/hugetlb.c and add a declaration of the function to pagemap.h,
> > but that needed some rather ugly #ifdefs and I didn't like it.
> > If putting hugetlb code in mm.h is worse, I'll take this second option
> > in the next post.
> 
> You could always create a new include file hugetlb-inlines.h
> 

That would be another option. What should move from hugetlb.h to
hugetlb-inlines.h would need to be figured out in the future, but it would
still be tidier than moving hugetlb stuff into mm.h (at least to me).

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage
  2010-05-26  9:44                 ` Mel Gorman
@ 2010-05-26  9:58                   ` Andi Kleen
  0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2010-05-26  9:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Naoya Horiguchi, linux-kernel, linux-mm,
	Andrew Morton, Wu Fengguang, Andrea Arcangeli, Larry Woodman,
	Lee Schermerhorn

On Wed, May 26, 2010 at 10:44:43AM +0100, Mel Gorman wrote:
> On Wed, May 26, 2010 at 11:19:58AM +0200, Andi Kleen wrote:
> > 
> > Mel, other than this nit are you happy with these changes now?
> > 
> 
> Pretty much but I also want to test the series myself to be sure I haven't
> missed something in review.

Thanks. I'll wait a bit more, and if there are no negative reviews
I'll start queueing it in the hwpoison tree.

I should add that this is probably only the first step; "early kill" support
and soft offline might need some more mm changes.

-Andi


Thread overview: 23+ messages:
2010-05-13  7:55 [PATCH 0/7] HWPOISON for hugepage (v5) Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage Naoya Horiguchi
2010-05-13  9:18   ` Andi Kleen
2010-05-17  4:53     ` Naoya Horiguchi
2010-05-13 15:27   ` Mel Gorman
2010-05-13 16:14     ` Andi Kleen
2010-05-14  7:46     ` Naoya Horiguchi
2010-05-14  9:54       ` Mel Gorman
2010-05-24  7:15         ` Naoya Horiguchi
2010-05-25 10:59           ` Mel Gorman
2010-05-26  6:51             ` Naoya Horiguchi
2010-05-26  9:03               ` Mel Gorman
2010-05-26  9:19               ` Andi Kleen
2010-05-26  9:44                 ` Mel Gorman
2010-05-26  9:58                   ` Andi Kleen
2010-05-13  7:55 ` [PATCH 2/7] HWPOISON, hugetlb: enable error handling path " Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 3/7] HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 4/7] HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 5/7] HWPOISON, hugetlb: isolate corrupted hugepage Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 6/7] HWPOISON, hugetlb: detect hwpoison in hugetlb code Naoya Horiguchi
2010-05-13  7:55 ` [PATCH 7/7] HWPOISON, hugetlb: support hwpoison injection for hugepage Naoya Horiguchi
2010-05-13 14:27 ` [PATCH 0/7] HWPOISON for hugepage (v5) Mel Gorman
2010-05-14  7:35   ` Naoya Horiguchi
