[ 37/44] mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg KH <gregkh@linuxfoundation.org>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, Hugh Dickins <hughd@google.com>,
	Michal Hocko <mhocko@suse.cz>, Mel Gorman <mgorman@suse.de>
Subject: [ 37/44] mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
Date: Mon, 13 Aug 2012 15:02:44 -0700	[thread overview]
Message-ID: <20120813220145.425780799@linuxfoundation.org> (raw)
In-Reply-To: <20120813220142.113186818@linuxfoundation.org>

From: Greg KH <gregkh@linuxfoundation.org>

3.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mel Gorman <mgorman@suse.de>

commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log.  The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37.  It was probably was introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
look like this;

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce.  To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted.  The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
hard reset or a sysrq-b.

The basic problem is a race between page table sharing and teardown.  For
the most part page table sharing depends on i_mmap_mutex.  In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.

Unfortunately it appears to be also insufficient. Consider the following
situation

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  						LockWrite(mmap_sem)
    						  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
  huge_pte_alloc				      ...
    Lock(i_mmap_mutex)				      ...
    vma_prio_walk, find svma, spte		      ...
    Lock(mm->page_table_lock)			      ...
    share spte					      ...
    Unlock(mm->page_table_lock)			      ...
    Unlock(i_mmap_mutex)			      ...
  hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables.  At (1) above,
the page tables are not shared yet so it unmaps the PMDs.  Process A sets
up page table sharing and at (2) faults a new entry.  Process B then trips
up on it in free_pgtables.

This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed.  This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
ineligible for sharing and avoids the race.  Superficially this looks like
it would then be vunerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).

This should be treated as a -stable candidate if it is merged.

Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.

==== CUT HERE ====

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could itterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}

[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


---
 mm/hugetlb.c |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2301,6 +2301,22 @@ void unmap_hugepage_range(struct vm_area
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	__unmap_hugepage_range(vma, start, end, ref_page);
+	/*
+	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
+	 * test will fail on a vma being torn down, and not grab a page table
+	 * on its way out.  We're lucky that the flag has such an appropriate
+	 * name, and can in fact be safely cleared here. We could clear it
+	 * before the __unmap_hugepage_range above, but all that's necessary
+	 * is to clear it before releasing the i_mmap_mutex below.
+	 *
+	 * This works because in the contexts this is called, the VMA is
+	 * going to be destroyed. It is not vunerable to madvise(DONTNEED)
+	 * because madvise is not supported on hugetlbfs. The same applies
+	 * for direct IO. unmap_hugepage_range() is only being called just
+	 * before free_pgtables() so clearing VM_MAYSHARE will not cause
+	 * surprises later.
+	 */
+	vma->vm_flags &= ~VM_MAYSHARE;
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
@@ -2853,9 +2869,14 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-
+	/*
+	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * may have cleared our pud entry and done put_page on the page table:
+	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * and that page table be reused and filled with junk.
+	 */
 	flush_tlb_range(vma, start, end);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
 int hugetlb_reserve_pages(struct inode *inode,

next prev parent reply	other threads:[~2012-08-13 22:05 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-13 22:02 [ 00/44] 3.0.41-stable review Greg Kroah-Hartman
2012-08-13 22:02 ` [ 01/44] [IA64] Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts Greg Kroah-Hartman
2012-08-13 22:02 ` [ 02/44] SUNRPC: return negative value in case rpcbind client creation error Greg Kroah-Hartman
2012-08-13 22:02 ` [ 03/44] nilfs2: fix deadlock issue between chcp and thaw ioctls Greg Kroah-Hartman
2012-08-13 22:02 ` [ 04/44] pcdp: use early_ioremap/early_iounmap to access pcdp table Greg Kroah-Hartman
2012-08-13 22:02 ` [ 05/44] mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 06/44] ARM: 7467/1: mutex: use generic xchg-based implementation for ARMv6+ Greg Kroah-Hartman
2012-08-15 14:02   ` Ben Hutchings
2012-08-13 22:02 ` [ 07/44] ARM: 7477/1: vfp: Always save VFP state in vfp_pm_suspend on UP Greg Kroah-Hartman
2012-08-14 20:01   ` Herton Ronaldo Krzesinski
2012-08-15 14:05     ` Greg Kroah-Hartman
2012-08-15 14:50       ` Herton Ronaldo Krzesinski
2012-08-13 22:02 ` [ 08/44] ARM: 7478/1: errata: extend workaround for erratum #720789 Greg Kroah-Hartman
2012-08-13 22:02 ` [ 09/44] ARM: 7479/1: mm: avoid NULL dereference when flushing gate_vma with VIVT caches Greg Kroah-Hartman
2012-08-13 22:02 ` [ 10/44] ALSA: hda - remove quirk for Dell Vostro 1015 Greg Kroah-Hartman
2012-08-14  5:17   ` David Henningsson
2012-08-14  5:43     ` Takashi Iwai
2012-08-15 14:03     ` Greg Kroah-Hartman
2012-08-13 22:02 ` [ 11/44] mm: mmu_notifier: fix freed page still mapped in secondary MMU Greg Kroah-Hartman
2012-08-13 22:02 ` [ 12/44] mac80211: cancel mesh path timer Greg Kroah-Hartman
2012-08-13 22:02 ` [ 13/44] x86, nops: Missing break resulting in incorrect selection on Intel Greg Kroah-Hartman
2012-08-13 22:02 ` [ 14/44] random: Add support for architectural random hooks Greg Kroah-Hartman
2012-08-13 22:02 ` [ 15/44] fix typo/thinko in get_random_bytes() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 16/44] random: Use arch_get_random_int instead of cycle counter if avail Greg Kroah-Hartman
2012-08-13 22:02 ` [ 17/44] random: Use arch-specific RNG to initialize the entropy store Greg Kroah-Hartman
2012-08-13 22:02 ` [ 18/44] random: Adjust the number of loops when initializing Greg Kroah-Hartman
2012-08-13 22:02 ` [ 19/44] drivers/char/random.c: fix boot id uniqueness race Greg Kroah-Hartman
2012-08-13 22:02 ` [ 20/44] random: make add_interrupt_randomness() do something sane Greg Kroah-Hartman
2012-08-13 22:02 ` [ 21/44] random: use lockless techniques in the interrupt path Greg Kroah-Hartman
2012-08-13 22:02 ` [ 22/44] random: create add_device_randomness() interface Greg Kroah-Hartman
2012-08-13 22:02 ` [ 23/44] usb: feed USB device information to the /dev/random driver Greg Kroah-Hartman
2012-08-13 22:02 ` [ 24/44] net: feed /dev/random with the MAC address when registering a device Greg Kroah-Hartman
2012-08-13 22:02 ` [ 25/44] random: use the arch-specific rng in xfer_secondary_pool Greg Kroah-Hartman
2012-08-13 22:02 ` [ 26/44] random: add new get_random_bytes_arch() function Greg Kroah-Hartman
2012-08-13 22:02 ` [ 27/44] random: add tracepoints for easier debugging and verification Greg Kroah-Hartman
2012-08-13 22:02 ` [ 28/44] MAINTAINERS: Theodore Tso is taking over the random driver Greg Kroah-Hartman
2012-08-13 22:02 ` [ 29/44] rtc: wm831x: Feed the write counter into device_add_randomness() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 30/44] mfd: wm831x: Feed the device UUID " Greg Kroah-Hartman
2012-08-13 22:02 ` [ 31/44] random: remove rand_initialize_irq() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 32/44] random: Add comment to random_initialize() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 33/44] dmi: Feed DMI table to /dev/random driver Greg Kroah-Hartman
2012-08-13 22:02 ` [ 34/44] random: mix in architectural randomness in extract_buf() Greg Kroah-Hartman
2012-08-13 22:02 ` [ 35/44] x86, microcode: microcode_core.c simple_strtoul cleanup Greg Kroah-Hartman
2012-08-13 22:02 ` [ 36/44] x86, microcode: Sanitize per-cpu microcode reloading interface Greg Kroah-Hartman
2012-08-15  0:26   ` Henrique de Moraes Holschuh
2012-08-15 14:06     ` Greg Kroah-Hartman
2012-08-15 16:30       ` Henrique de Moraes Holschuh
2012-08-15 18:26         ` Greg Kroah-Hartman
2012-08-13 22:02 ` Greg Kroah-Hartman [this message]
2012-08-13 22:02 ` [ 38/44] ARM: mxs: Remove MMAP_MIN_ADDR setting from mxs_defconfig Greg Kroah-Hartman
2012-08-13 22:02 ` [ 39/44] ARM: pxa: remove irq_to_gpio from ezx-pcap driver Greg Kroah-Hartman
2012-08-13 22:02 ` [ 40/44] cfg80211: process pending events when unregistering net device Greg Kroah-Hartman
2012-08-13 22:02 ` [ 41/44] cfg80211: fix interface combinations check for ADHOC(IBSS) Greg Kroah-Hartman
2012-08-13 22:02 ` [ 42/44] e1000e: NIC goes up and immediately goes down Greg Kroah-Hartman
2012-08-13 22:02 ` [ 43/44] Input: wacom - Bamboo One 1024 pressure fix Greg Kroah-Hartman
2012-08-13 22:02 ` [ 44/44] rt61pci: fix NULL pointer dereference in config_lna_gain Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120813220145.425780799@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).