* Re: FAILED: patch "[PATCH] mm: migration: fix migration of huge PMD shared pages" failed to apply to 4.9-stable tree
2018-10-18 17:28 ` Mike Kravetz
@ 2018-10-18 18:23 ` Michal Hocko
2018-10-18 19:20 ` Jerome Glisse
2018-11-19 12:56 ` Greg KH
2 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2018-10-18 18:23 UTC
To: Mike Kravetz
Cc: gregkh, akpm, dave, jglisse, kirill.shutemov, n-horiguchi, stable,
vbabka
On Thu 18-10-18 10:28:12, Mike Kravetz wrote:
> On 10/10/18 11:04 PM, gregkh@linuxfoundation.org wrote:
> >
> > The patch below does not apply to the 4.9-stable tree.
> > If someone wants it applied there, or to any other stable or longterm
> > tree, then please email the backport, including the original git commit
> > id to <stable@vger.kernel.org>.
>
> From: Mike Kravetz <mike.kravetz@oracle.com>
>
> mm: migration: fix migration of huge PMD shared pages
>
> commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream
>
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages. A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page. After this, flush
> caches and TLB.
>
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked. Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case. The mmu notifier
> calls in this commit are different than upstream. That is because
> upstream went to a different model here. Instead of moving to the
> new model, we leave existing model unchanged and only use the
> mmu_*range* calls in this special case.
>
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
LGTM
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/hugetlb.h | 14 +++++++++++
>  include/linux/mm.h      |  6 +++++
>  mm/hugetlb.c            | 37 +++++++++++++++++++++++++--
>  mm/rmap.c               | 56 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 48c76d612d40..b699d59d0f4f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -109,6 +109,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  			unsigned long addr, unsigned long sz);
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> @@ -131,6 +133,18 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
> 
> +static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
> +					pte_t *ptep)
> +{
> +	return 0;
> +}
> +
> +static inline void adjust_range_if_pmd_sharing_possible(
> +				struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
> +
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 493d07931ea5..11a5a46ce72b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2187,6 +2187,12 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>  	return vma;
>  }
> 
> +static inline bool range_in_vma(struct vm_area_struct *vma,
> +				unsigned long start, unsigned long end)
> +{
> +	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> +}
> +
>  #ifdef CONFIG_MMU
>  pgprot_t vm_get_page_prot(unsigned long vm_flags);
>  void vma_set_page_prot(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f9e735537c37..6e9aed4079f7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4312,12 +4312,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	/*
>  	 * check on proper vm_flags and page table alignment
>  	 */
> -	if (vma->vm_flags & VM_MAYSHARE &&
> -	    vma->vm_start <= base && end <= vma->vm_end)
> +	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
>  		return true;
>  	return false;
>  }
> 
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 */
> +		if (range_in_vma(vma, a_start, a_end)) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +		}
> +	}
> +}
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4414,6 +4442,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> +
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 94488b0362f8..a7276d8c96f3 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1476,6 +1476,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
> +	unsigned long sh_address;
> +	bool pmd_sharing_possible = false;
> +	unsigned long spmd_start, spmd_end;
>  	struct rmap_private *rp = arg;
>  	enum ttu_flags flags = rp->flags;
> 
> @@ -1491,6 +1494,32 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			goto out;
>  	}
> 
> +	/*
> +	 * Only use the range_start/end mmu notifiers if huge pmd sharing
> +	 * is possible. In the normal case, mmu_notifier_invalidate_page
> +	 * is sufficient as we only unmap a page. However, if we unshare
> +	 * a pmd, we will unmap a PUD_SIZE range.
> +	 */
> +	if (PageHuge(page)) {
> +		spmd_start = address;
> +		spmd_end = spmd_start + vma_mmu_pagesize(vma);
> +
> +		/*
> +		 * Check if pmd sharing is possible. If possible, we could
> +		 * unmap a PUD_SIZE range. spmd_start/spmd_end will be
> +		 * modified if sharing is possible.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &spmd_start,
> +								&spmd_end);
> +		if (spmd_end - spmd_start != vma_mmu_pagesize(vma)) {
> +			sh_address = address;
> +
> +			pmd_sharing_possible = true;
> +			mmu_notifier_invalidate_range_start(vma->vm_mm,
> +							spmd_start, spmd_end);
> +		}
> +	}
> +
>  	pte = page_check_address(page, mm, address, &ptl,
>  				 PageTransCompound(page));
>  	if (!pte)
> @@ -1524,6 +1553,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  	}
> 
> +	/*
> +	 * Call huge_pmd_unshare to potentially unshare a huge pmd. Pass
> +	 * sh_address as it will be modified if unsharing is successful.
> +	 */
> +	if (PageHuge(page) && huge_pmd_unshare(mm, &sh_address, pte)) {
> +		/*
> +		 * huge_pmd_unshare unmapped an entire PMD page. There is
> +		 * no way of knowing exactly which PMDs may be cached for
> +		 * this mm, so flush them all. spmd_start/spmd_end cover
> +		 * this PUD_SIZE range.
> +		 */
> +		flush_cache_range(vma, spmd_start, spmd_end);
> +		flush_tlb_range(vma, spmd_start, spmd_end);
> +
> +		/*
> +		 * The ref count of the PMD page was dropped which is part
> +		 * of the way map counting is done for shared PMDs. When
> +		 * there is no other sharing, huge_pmd_unshare returns false
> +		 * and we will unmap the actual page and drop map count
> +		 * to zero.
> +		 */
> +		goto out_unmap;
> +	}
> +
>  	/* Nuke the page table entry. */
>  	flush_cache_page(vma, address, page_to_pfn(page));
>  	if (should_defer_flush(mm, flags)) {
> @@ -1621,6 +1674,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>  		mmu_notifier_invalidate_page(mm, address);
>  out:
> +	if (pmd_sharing_possible)
> +		mmu_notifier_invalidate_range_end(vma->vm_mm,
> +							spmd_start, spmd_end);
>  	return ret;
>  }
>
> --
> 2.17.2
>
--
Michal Hocko
SUSE Labs
* Re: FAILED: patch "[PATCH] mm: migration: fix migration of huge PMD shared pages" failed to apply to 4.9-stable tree
2018-10-18 17:28 ` Mike Kravetz
2018-10-18 18:23 ` Michal Hocko
@ 2018-10-18 19:20 ` Jerome Glisse
2018-11-19 12:56 ` Greg KH
2 siblings, 0 replies; 5+ messages in thread
From: Jerome Glisse @ 2018-10-18 19:20 UTC
To: Mike Kravetz
Cc: gregkh, akpm, dave, kirill.shutemov, mhocko, n-horiguchi, stable,
vbabka
On Thu, Oct 18, 2018 at 10:28:12AM -0700, Mike Kravetz wrote:
> On 10/10/18 11:04 PM, gregkh@linuxfoundation.org wrote:
> >
> > The patch below does not apply to the 4.9-stable tree.
> > If someone wants it applied there, or to any other stable or longterm
> > tree, then please email the backport, including the original git commit
> > id to <stable@vger.kernel.org>.
>
> From: Mike Kravetz <mike.kravetz@oracle.com>
>
> mm: migration: fix migration of huge PMD shared pages
>
> commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream
>
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages. A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page. After this, flush
> caches and TLB.
>
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked. Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case. The mmu notifier
> calls in this commit are different than upstream. That is because
> upstream went to a different model here. Instead of moving to the
> new model, we leave existing model unchanged and only use the
> mmu_*range* calls in this special case.
>
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/hugetlb.h | 14 +++++++++++
>  include/linux/mm.h      |  6 +++++
>  mm/hugetlb.c            | 37 +++++++++++++++++++++++++--
>  mm/rmap.c               | 56 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 2 deletions(-)
>
* Re: FAILED: patch "[PATCH] mm: migration: fix migration of huge PMD shared pages" failed to apply to 4.9-stable tree
2018-10-18 17:28 ` Mike Kravetz
2018-10-18 18:23 ` Michal Hocko
2018-10-18 19:20 ` Jerome Glisse
@ 2018-11-19 12:56 ` Greg KH
2 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2018-11-19 12:56 UTC
To: Mike Kravetz
Cc: akpm, dave, jglisse, kirill.shutemov, mhocko, n-horiguchi, stable,
vbabka
On Thu, Oct 18, 2018 at 10:28:12AM -0700, Mike Kravetz wrote:
> On 10/10/18 11:04 PM, gregkh@linuxfoundation.org wrote:
> >
> > The patch below does not apply to the 4.9-stable tree.
> > If someone wants it applied there, or to any other stable or longterm
> > tree, then please email the backport, including the original git commit
> > id to <stable@vger.kernel.org>.
>
> From: Mike Kravetz <mike.kravetz@oracle.com>
>
> mm: migration: fix migration of huge PMD shared pages
>
> commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream
>
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages. A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page. After this, flush
> caches and TLB.
>
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked. Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case. The mmu notifier
> calls in this commit are different than upstream. That is because
> upstream went to a different model here. Instead of moving to the
> new model, we leave existing model unchanged and only use the
> mmu_*range* calls in this special case.
>
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h | 14 +++++++++++
>  include/linux/mm.h      |  6 +++++
>  mm/hugetlb.c            | 37 +++++++++++++++++++++++++--
>  mm/rmap.c               | 56 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 2 deletions(-)
Now queued up, thanks.
greg k-h