Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Zi Yan <zi.yan@cs.rutgers.edu>, Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
	vbabka@suse.cz, mgorman@techsingularity.net,
	n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com,
	Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Mon, 13 Feb 2017 13:59:06 +0300	[thread overview]
Message-ID: <20170213105906.GA16419@node.shutemov.name> (raw)
In-Reply-To: <44001748-05AC-49B2-88F5-371618C12AD9@cs.rutgers.edu>

On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> Hi Kirill,
> 
> >>>> The crash scenario I guess is like:
> >>>> 1. A huge page pmd entry is in the middle of being changed into either a
> >>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>>>
> >>>> 2. At the same time, the application frees the vma this page belongs to.
> >>>
> >>> Em... no.
> >>>
> >>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> >>> And we only be able to remove vma under down_write(mmap_sem), so the
> >>> scenario should be excluded.
> >>>
> >>> What do I miss?
> >>
> >> You are right. This problem will not happen in the upstream kernel.
> >>
> >> The problem comes from my customized kernel, where I migrate pages away
> >> instead of reclaiming them when memory is under pressure. I did not take
> >> any mmap_sem when I migrate pages. So I got this error.
> >>
> >> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
> >
> > I think there's still a race between MADV_DONTNEED and
> > change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> > zap_pmd_range(). It need to be addressed.
> >
> > And MADV_FREE requires a fix.
> >
> > So, minus one non-bug, plus two bugs.
> >
> 
> You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
> It is only true for huge pages, right?

mmap_sem is a way to make sure that the VMA will not go away under you.
Besides mmap_sem, anon_vma_lock/i_mmap_lock can be used for this.

> Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> Am I right about this?
> 
> If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> possible.
> 
> Do you agree?

I *think* we can get away with speculative (without ptl) check in
zap_pmd_range() if we make page fault the only place that can turn
pmd_none() into something else.

It means all other sides that change pmd must not clear it intermittently
during pmd change, unless run under down_write(mmap_sem).

I found two such problematic places in kernel:

 - change_huge_pmd(.prot_numa=1);

 - madvise_free_huge_pmd();

Both clear pmd before setting up a modified version. Both under
down_read(mmap_sem).

The migration path also would need to establish migration pmd atomically
to make this work.

Once all these cases will be fixed, zap_pmd_range() would only be able to
race with page fault if it called from MADV_DONTNEED.
This case is not a problem.

Andrea, does it sound reasonable to you?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Zi Yan <zi.yan@cs.rutgers.edu>, Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
	vbabka@suse.cz, mgorman@techsingularity.net,
	n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com,
	Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Mon, 13 Feb 2017 13:59:06 +0300	[thread overview]
Message-ID: <20170213105906.GA16419@node.shutemov.name> (raw)
In-Reply-To: <44001748-05AC-49B2-88F5-371618C12AD9@cs.rutgers.edu>

On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> Hi Kirill,
> 
> >>>> The crash scenario I guess is like:
> >>>> 1. A huge page pmd entry is in the middle of being changed into either a
> >>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>>>
> >>>> 2. At the same time, the application frees the vma this page belongs to.
> >>>
> >>> Em... no.
> >>>
> >>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> >>> And we only be able to remove vma under down_write(mmap_sem), so the
> >>> scenario should be excluded.
> >>>
> >>> What do I miss?
> >>
> >> You are right. This problem will not happen in the upstream kernel.
> >>
> >> The problem comes from my customized kernel, where I migrate pages away
> >> instead of reclaiming them when memory is under pressure. I did not take
> >> any mmap_sem when I migrate pages. So I got this error.
> >>
> >> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
> >
> > I think there's still a race between MADV_DONTNEED and
> > change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> > zap_pmd_range(). It need to be addressed.
> >
> > And MADV_FREE requires a fix.
> >
> > So, minus one non-bug, plus two bugs.
> >
> 
> You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
> It is only true for huge pages, right?

mmap_sem is a way to make sure that the VMA will not go away under you.
Besides mmap_sem, anon_vma_lock/i_mmap_lock can be used for this.

> Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> Am I right about this?
> 
> If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> possible.
> 
> Do you agree?

I *think* we can get away with speculative (without ptl) check in
zap_pmd_range() if we make page fault the only place that can turn
pmd_none() into something else.

It means all other sides that change pmd must not clear it intermittently
during pmd change, unless run under down_write(mmap_sem).

I found two such problematic places in kernel:

 - change_huge_pmd(.prot_numa=1);

 - madvise_free_huge_pmd();

Both clear pmd before setting up a modified version. Both under
down_read(mmap_sem).

The migration path also would need to establish migration pmd atomically
to make this work.

Once all these cases will be fixed, zap_pmd_range() would only be able to
race with page fault if it called from MADV_DONTNEED.
This case is not a problem.

Andrea, does it sound reasonable to you?

-- 
 Kirill A. Shutemov

next prev parent reply	other threads:[~2017-02-13 10:59 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-05 16:12 [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-06  6:12   ` Naoya Horiguchi
2017-02-06  6:12     ` Naoya Horiguchi
2017-02-06 12:10     ` Zi Yan
2017-02-06 15:02   ` Matthew Wilcox
2017-02-06 15:02     ` Matthew Wilcox
2017-02-06 15:03     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range() Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-06  4:02   ` Hillf Danton
2017-02-06  4:02     ` Hillf Danton
2017-02-06  4:14     ` Zi Yan
2017-02-06  4:14       ` Zi Yan
2017-02-06  7:43   ` Naoya Horiguchi
2017-02-06  7:43     ` Naoya Horiguchi
2017-02-06 13:02     ` Zi Yan
2017-02-06 23:22       ` Naoya Horiguchi
2017-02-06 23:22         ` Naoya Horiguchi
2017-02-06 16:07   ` Kirill A. Shutemov
2017-02-06 16:07     ` Kirill A. Shutemov
2017-02-06 16:32     ` Zi Yan
2017-02-06 17:35       ` Kirill A. Shutemov
2017-02-06 17:35         ` Kirill A. Shutemov
2017-02-07 13:55     ` Aneesh Kumar K.V
2017-02-07 13:55       ` Aneesh Kumar K.V
2017-02-07 14:12       ` Zi Yan
2017-02-07 14:19   ` Kirill A. Shutemov
2017-02-07 14:19     ` Kirill A. Shutemov
2017-02-07 15:11     ` Zi Yan
2017-02-07 15:11       ` Zi Yan
2017-02-07 16:37       ` Kirill A. Shutemov
2017-02-07 16:37         ` Kirill A. Shutemov
2017-02-07 17:14         ` Zi Yan
2017-02-07 17:14           ` Zi Yan
2017-02-07 17:45           ` Kirill A. Shutemov
2017-02-07 17:45             ` Kirill A. Shutemov
2017-02-13  0:25             ` Zi Yan
2017-02-13  0:25               ` Zi Yan
2017-02-13 10:59               ` Kirill A. Shutemov [this message]
2017-02-13 10:59                 ` Kirill A. Shutemov
2017-02-13 14:40                 ` Andrea Arcangeli
2017-02-13 14:40                   ` Andrea Arcangeli
2017-02-05 16:12 ` [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:14   ` Naoya Horiguchi
2017-02-09  9:14     ` Naoya Horiguchi
2017-02-09 15:07     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 08/14] mm: thp: enable thp migration in generic path Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:15   ` Naoya Horiguchi
2017-02-09  9:15     ` Naoya Horiguchi
2017-02-09 15:17     ` Zi Yan
2017-02-09 23:04       ` Naoya Horiguchi
2017-02-09 23:04         ` Naoya Horiguchi
2017-02-14 20:13   ` Zi Yan
2017-02-14 20:13     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 09/14] mm: thp: check pmd migration entry in common path Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09  9:16     ` Naoya Horiguchi
2017-02-09 17:36     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 11/14] mm: hwpoison: soft offline supports " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 13/14] mm: migrate: move_pages() supports " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09  9:16     ` Naoya Horiguchi
2017-02-09 17:37     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 14/14] mm: memory_hotplug: memory hotremove " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-23 16:12 ` [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-23 16:12   ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170213105906.GA16419@node.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=khandual@linux.vnet.ibm.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=minchan@kernel.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=vbabka@suse.cz \
    --cc=zi.yan@cs.rutgers.edu \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.