From: Jiri Slaby <jslaby@suse.cz>
To: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Mel Gorman <mgorman@suse.de>, Bob Liu <bob.liu@oracle.com>,
	Christoph Lameter <cl@gentwo.org>, Dave Jones <davej@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 3.12 78/78] mm: let mm_find_pmd fix buggy race with THP fault
Date: Mon, 12 Jan 2015 11:01:46 +0100
Message-ID: <54B39B8A.7000002@suse.cz>
In-Reply-To: <alpine.LSU.2.11.1501092048090.2283@eggly.anvils>

On 01/10/2015, 06:01 AM, Hugh Dickins wrote:
> On Fri, 9 Jan 2015, Jiri Slaby wrote:
> 
>> From: Hugh Dickins <hughd@google.com>
>>
>> 3.12-stable review patch.  If anyone has any objections, please let me know.
>>
>> ===============
>>
>> commit f72e7dcdd25229446b102e587ef2f826f76bff28 upstream.
...
> Fine for this to go in, but there is one catch, which I discovered when
> backporting to v3.11: it needed one more hunk.  I haven't checked your
> base tree, but if this applies then I believe you need it - most of the
> time no problem, but it can cause page migration to fail to find a
> migration entry it inserted earlier, then BUG_ON(!PageLocked(p)) in
> migration_entry_to_page() soon after.  Here's what I wrote back then:
> 
> Note on rebase to v3.11: added a hunk to replace the use of mm_find_pmd()
> in page_check_address_pmd().  This call had been similarly replaced by
> the time of my v3.16 commit, in Kirill Shutemov's v3.15 b5a8cad376ee
> ("thp: close race between split and zap huge pages"): which we do not
> need as such, since it's fixing v3.13 117b0791ac42 ("mm, thp: move ptl
> taking inside page_check_address_pmd()"), from a split page-table-lock
> series we are not backporting.  But without this additional hunk, rmap
> sometimes broke when the new semantic for mm_find_pmd() was used here.
> 
> (Adding Kirill to Cc: shouldn't he have been Cc'ed already?)
> 
> Hugh

Thanks, I see. So there are two differences between the hunk below and
b5a8cad376ee:

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1584,12 +1584,20 @@ pmd_t *page_check_address_pmd(struct page *page,
>  			      unsigned long address,
>  			      enum page_check_address_pmd_flag flag)
>  {
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd, *ret = NULL;
>  
>  	if (address & ~HPAGE_PMD_MASK)
>  		goto out;
>  
> -	pmd = mm_find_pmd(mm, address);
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +	pmd = pmd_offset(pud, address);
>  	if (!pmd)
>  		goto out;

This check is removed by b5a8cad376ee. Can the pmd returned from
pmd_offset() actually be NULL?

>  	if (pmd_none(*pmd))

pmd_none() is replaced by !pmd_present().

My question is: is it OK to take the attached backport of b5a8cad376ee
(to stay with what upstream has)?
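
For reference on the NULL question above, here is a minimal userspace
sketch of how an offset lookup of this kind behaves when it is plain
pointer arithmetic on a table base. Everything prefixed toy_ or TOY_ is
hypothetical and not kernel code; the constants 512 and 21 are assumed
x86-64 values for 4K base pages. If pmd_offset() has this shape, its
result is non-NULL for any valid table, and emptiness has to be tested on
the entry it points to rather than on the pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 values with 4K base pages; other configurations differ. */
#define TOY_PTRS_PER_PMD 512
#define TOY_PMD_SHIFT    21

/* Hypothetical stand-in for pmd_t: one 64-bit entry. */
typedef struct { uint64_t val; } toy_pmd_t;

/* Index of the entry covering 'address' within one PMD table. */
size_t toy_pmd_index(uint64_t address)
{
	return (address >> TOY_PMD_SHIFT) & (TOY_PTRS_PER_PMD - 1);
}

/*
 * Models an offset lookup as base + index. For a valid 'table' the
 * result always points inside the table, so it is never NULL.
 */
toy_pmd_t *toy_pmd_offset(toy_pmd_t *table, uint64_t address)
{
	return table + toy_pmd_index(address);
}

/* Emptiness is a property of the entry, not of the pointer to it. */
int toy_pmd_none(toy_pmd_t pmd)
{
	return pmd.val == 0;
}
```

In this model a NULL test on the returned pointer can never fire, while
toy_pmd_none() on the dereferenced entry still can.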

thanks,
-- 
js
suse labs

[-- Attachment #2: 0001-thp-close-race-between-split-and-zap-huge-pages.patch --]
[-- Type: text/x-patch; name="0001-thp-close-race-between-split-and-zap-huge-pages.patch", Size: 4479 bytes --]

From f43340a2b0a461572ed53284148f9eb67d93733b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 18 Apr 2014 15:07:25 -0700
Subject: [PATCH 1/1] thp: close race between split and zap huge pages

commit b5a8cad376eebbd8598642697e92a27983aee802 upstream.

Sasha Levin has reported two THP BUGs[1][2].  I believe both of them
have the same root cause.  Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".  It's
BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().  From my
testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to a race between zap_huge_pmd() and
page_check_address_pmd().  page_check_address_pmd() misses a PMD which is
being zapped:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding is that the problem is the pmd_present() check in
mm_find_pmd() being done without taking the page table lock.

The bug was introduced by my commit 117b0791ac42. Sorry for that. :(

Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.
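
As an illustration of the first race described above, here is a
deterministic userspace sketch. It is a toy model under stated
assumptions, not kernel code: all toy_ names are hypothetical, the
mapcount is a plain int starting at 1, and the page table lock is modeled
only by which interleavings of the two zap steps are allowed.

```c
#include <stdbool.h>

/* Toy state: one huge PMD mapping of one page. */
struct toy_state {
	bool pmd_present;	/* models the huge PMD entry */
	int  page_mapcount;	/* models the page's map count as a plain int */
};

/* Zap step 1: models pmdp_get_and_clear(). */
void toy_zap_clear(struct toy_state *s) { s->pmd_present = false; }

/* Zap step 2: models page_remove_rmap(). */
void toy_zap_unmap(struct toy_state *s) { s->page_mapcount--; }

/* Splitter: number of mappings its walk can see. */
int toy_split_count(const struct toy_state *s)
{
	return s->pmd_present ? 1 : 0;
}

/*
 * Unlocked check: the splitter may run between the two zap steps.
 * Returns true if the counts disagree, i.e. the BUG_ON() would fire.
 */
bool toy_unlocked_mismatch(void)
{
	struct toy_state s = { .pmd_present = true, .page_mapcount = 1 };

	toy_zap_clear(&s);				/* CPU1 */
	int seen = toy_split_count(&s);			/* CPU0 */
	bool mismatch = (seen != s.page_mapcount);	/* CPU0 */
	toy_zap_unmap(&s);				/* CPU1, too late */
	return mismatch;
}

/*
 * Check under the lock: the splitter runs either before or after the
 * whole zap, never in the middle. 'zap_first' selects the order.
 */
bool toy_locked_mismatch(bool zap_first)
{
	struct toy_state s = { .pmd_present = true, .page_mapcount = 1 };

	if (zap_first) {
		toy_zap_clear(&s);
		toy_zap_unmap(&s);
	}
	int seen = toy_split_count(&s);
	bool mismatch = (seen != s.page_mapcount);
	if (!zap_first) {
		toy_zap_clear(&s);
		toy_zap_unmap(&s);
	}
	return mismatch;
}
```

In this model only the unlocked interleaving leaves the splitter's count
different from the page's map count; both serialized orders agree.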

I've stress tested the split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before, it took <20 min to
trigger the first bug and a few hours for the second one (if we ignore
the first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[3.13+]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 mm/huge_memory.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04d17ba00893..04535b64119c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1541,15 +1541,22 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      unsigned long address,
 			      enum page_check_address_pmd_flag flag)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd, *ret = NULL;
 
 	if (address & ~HPAGE_PMD_MASK)
 		goto out;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		goto out;
-	if (pmd_none(*pmd))
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+	pmd = pmd_offset(pud, address);
+
+	if (!pmd_present(*pmd))
 		goto out;
 	if (pmd_page(*pmd) != page)
 		goto out;
-- 
2.2.1

