From: Michal Hocko <mhocko@suse.cz>
To: Mel Gorman <mgorman@suse.de>
Cc: Linux-MM <linux-mm@kvack.org>, Hugh Dickins <hughd@google.com>,
David Gibson <david@gibson.dropbear.id.au>,
Ken Chen <kenchen@google.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Date: Fri, 20 Jul 2012 16:29:20 +0200 [thread overview]
Message-ID: <20120720142920.GD12434@tiehlicka.suse.cz> (raw)
In-Reply-To: <20120720141108.GH9222@suse.de>
On Fri 20-07-12 15:11:08, Mel Gorman wrote:
> Sorry for the resend, I did not properly refresh Cong Wang's suggested
> fix. This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.
>
> Changelog since V1
> o Correct cut&paste error in race description (hugh)
> o Handle potential deadlock during fork (mhocko)
> o Reorder unlocking (wangcong)
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably was introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..........] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: 0000 [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [ 127.186827] Call Trace:
> [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
> [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
> [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
> [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
> [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
> [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
> [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
> [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP <ffff8804144b5c08>
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
>
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing proc like "ps aux" will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
>
> The test case was mostly written by Michal Hocko with a few minor changes
> by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
> possible sources of the race and saved me the embarrassment of posting a
> completely broken patch yesterday. He did not see this patch before
> going to the lists so any critical flaws are all mine!
>
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
>
> Unfortunately it appears to be also insufficient. Consider the following
> situation
>
> Process A Process B
> --------- ---------
> hugetlb_fault shmdt
> LockWrite(mmap_sem)
> do_munmap
> unmap_region
> unmap_vmas
> unmap_single_vma
> unmap_hugepage_range
> Lock(i_mmap_mutex)
> Lock(mm->page_table_lock)
> huge_pmd_unshare/unmap tables <--- (1)
> Unlock(mm->page_table_lock)
> Unlock(i_mmap_mutex)
> huge_pte_alloc ...
> Lock(i_mmap_mutex) ...
> vma_prio_walk, find svma, spte ...
> Lock(mm->page_table_lock) ...
> share spte ...
> Unlock(mm->page_table_lock) ...
> Unlock(i_mmap_mutex) ...
> hugetlb_no_page <--- (2)
> free_pgtables
> unlink_file_vma
> hugetlb_free_pgd_range
> remove_vma_list
>
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down. The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
>
> This patch takes the mmap_sem for read and then the page_table_lock of
> address spaces being considered for page table sharing. I verified that
> page table sharing still occurs using the awesome technology of printk
> to spit out a message when huge_pmd_share is successful. libhugetlbfs
> regression test suite passed.
>
> I strongly suggest this be treated as a -stable candidate if it is merged.
>
> Test program is as follows.
>
> ==== CUT HERE ====
>
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
>
> unsigned int get_random(unsigned int max)
> {
> struct timeval tv;
>
> gettimeofday(&tv, NULL);
> srandom(tv.tv_usec);
> return random() % max;
> }
>
> static void play(void *addr, size_t size)
> {
> unsigned char *start = addr,
> *end = start + size,
> *a;
> start += get_random(size/2);
>
> /* we could itterate on huge pages but let's give it more time. */
> for (a = start; a < end; a += 4096)
> *a = 0;
> }
>
> int main(int argc, char **argv)
> {
> key_t key = IPC_PRIVATE;
> size_t sizeA = nr_huge_page_A * huge_page_size;
> size_t sizeB = nr_huge_page_B * huge_page_size;
> int shmidA, shmidB;
> void *addrA = NULL, *addrB = NULL;
> int nr_children = 300, n = 0;
>
> if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
> if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
>
> fork_child:
> switch(fork()) {
> case 0:
> switch (n%3) {
> case 0:
> play(addrA, sizeA);
> break;
> case 1:
> play(addrB, sizeB);
> break;
> case 2:
> break;
> }
> break;
> case -1:
> perror("fork:");
> break;
> default:
> if (++n < nr_children)
> goto fork_child;
> play(addrA, sizeA);
> break;
> }
> shmdt(addrA);
> shmdt(addrB);
> do {
> wait(NULL);
> } while (--n > 0);
> shmctl(shmidA, IPC_RMID, NULL);
> shmctl(shmidB, IPC_RMID, NULL);
> return 0;
> }
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Yes this looks correct. mmap_sem will make sure that unmap_vmas and
free_pgtables are executed atomicaly wrt. huge_pmd_share so it doesn't
see non-NULL spte on the way out. I am just wondering whether we need
the page_table_lock as well. It is not harmful but I guess we can drop
it because both exit_mmap and shmdt are not taking it and mmap_sem is
sufficient for them.
One more nit bellow.
I will send my version of the fix which took another path as a reply to
this email to have something to compare with.
[...]
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..944b2df 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> /*
> * search for a shareable pmd page for hugetlb.
> */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> + unsigned long addr, pud_t *pud)
> {
> struct vm_area_struct *vma = find_vma(mm, addr);
> struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> struct vm_area_struct *svma;
> unsigned long saddr;
> pte_t *spte = NULL;
> + spinlock_t *spage_table_lock = NULL;
> + struct rw_semaphore *smmap_sem = NULL;
>
> if (!vma_shareable(vma, addr))
> return;
>
> +retry:
> mutex_lock(&mapping->i_mmap_mutex);
> vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> if (svma == vma)
> continue;
> + if (svma->vm_mm == vma->vm_mm)
> + continue;
> +
> + /*
> + * The target mm could be in the process of tearing down
> + * its page tables and the i_mmap_mutex on its own is
> + * not sufficient. To prevent races against teardown and
> + * pagetable updates, we acquire the mmap_sem and pagetable
> + * lock of the remote address space. down_read_trylock()
> + * is necessary as the other process could also be trying
> + * to share pagetables with the current mm. In the fork
> + * case, we are already both mm's so check for that
> + */
> + if (locked_mm != svma->vm_mm) {
> + if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + goto retry;
> + }
> + smmap_sem = &svma->vm_mm->mmap_sem;
> + }
> +
> + spage_table_lock = &svma->vm_mm->page_table_lock;
> + spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>
> saddr = page_table_shareable(svma, vma, addr, idx);
> if (saddr) {
[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae8f708..4832277 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> src_pte = huge_pte_offset(src, addr);
> if (!src_pte)
> continue;
> - dst_pte = huge_pte_alloc(dst, addr, sz);
> + dst_pte = huge_pte_alloc(dst, src, addr, sz);
> if (!dst_pte)
> goto nomem;
>
> @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> VM_FAULT_SET_HINDEX(h - hstates);
> }
>
> - ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> + ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
strictly speaking we should provide current->mm here because we are in
the page fault path and mmap_sem is held for reading. This doesn't
matter here though because huge_pmd_share will take it for reading so
nesting wouldn't hurt. Maybe a small comment that this is intentional
and correct would be nice.
> if (!ptep)
> return VM_FAULT_OOM;
>
>
> --
> Mel Gorman
> SUSE Labs
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@suse.cz>
To: Mel Gorman <mgorman@suse.de>
Cc: Linux-MM <linux-mm@kvack.org>, Hugh Dickins <hughd@google.com>,
David Gibson <david@gibson.dropbear.id.au>,
Ken Chen <kenchen@google.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Date: Fri, 20 Jul 2012 16:29:20 +0200 [thread overview]
Message-ID: <20120720142920.GD12434@tiehlicka.suse.cz> (raw)
In-Reply-To: <20120720141108.GH9222@suse.de>
On Fri 20-07-12 15:11:08, Mel Gorman wrote:
> Sorry for the resend, I did not properly refresh Cong Wang's suggested
> fix. This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.
>
> Changelog since V1
> o Correct cut&paste error in race description (hugh)
> o Handle potential deadlock during fork (mhocko)
> o Reorder unlocking (wangcong)
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably was introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..........] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: 0000 [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [ 127.186827] Call Trace:
> [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
> [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
> [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
> [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
> [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
> [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
> [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
> [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP <ffff8804144b5c08>
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
>
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing proc like "ps aux" will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
>
> The test case was mostly written by Michal Hocko with a few minor changes
> by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
> possible sources of the race and saved me the embarrassment of posting a
> completely broken patch yesterday. He did not see this patch before
> going to the lists so any critical flaws are all mine!
>
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
>
> Unfortunately it appears to be also insufficient. Consider the following
> situation
>
> Process A Process B
> --------- ---------
> hugetlb_fault shmdt
> LockWrite(mmap_sem)
> do_munmap
> unmap_region
> unmap_vmas
> unmap_single_vma
> unmap_hugepage_range
> Lock(i_mmap_mutex)
> Lock(mm->page_table_lock)
> huge_pmd_unshare/unmap tables <--- (1)
> Unlock(mm->page_table_lock)
> Unlock(i_mmap_mutex)
> huge_pte_alloc ...
> Lock(i_mmap_mutex) ...
> vma_prio_walk, find svma, spte ...
> Lock(mm->page_table_lock) ...
> share spte ...
> Unlock(mm->page_table_lock) ...
> Unlock(i_mmap_mutex) ...
> hugetlb_no_page <--- (2)
> free_pgtables
> unlink_file_vma
> hugetlb_free_pgd_range
> remove_vma_list
>
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down. The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
>
> This patch takes the mmap_sem for read and then the page_table_lock of
> address spaces being considered for page table sharing. I verified that
> page table sharing still occurs using the awesome technology of printk
> to spit out a message when huge_pmd_share is successful. libhugetlbfs
> regression test suite passed.
>
> I strongly suggest this be treated as a -stable candidate if it is merged.
>
> Test program is as follows.
>
> ==== CUT HERE ====
>
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
>
> unsigned int get_random(unsigned int max)
> {
> struct timeval tv;
>
> gettimeofday(&tv, NULL);
> srandom(tv.tv_usec);
> return random() % max;
> }
>
> static void play(void *addr, size_t size)
> {
> unsigned char *start = addr,
> *end = start + size,
> *a;
> start += get_random(size/2);
>
> /* we could itterate on huge pages but let's give it more time. */
> for (a = start; a < end; a += 4096)
> *a = 0;
> }
>
> int main(int argc, char **argv)
> {
> key_t key = IPC_PRIVATE;
> size_t sizeA = nr_huge_page_A * huge_page_size;
> size_t sizeB = nr_huge_page_B * huge_page_size;
> int shmidA, shmidB;
> void *addrA = NULL, *addrB = NULL;
> int nr_children = 300, n = 0;
>
> if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
> if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
>
> fork_child:
> switch(fork()) {
> case 0:
> switch (n%3) {
> case 0:
> play(addrA, sizeA);
> break;
> case 1:
> play(addrB, sizeB);
> break;
> case 2:
> break;
> }
> break;
> case -1:
> perror("fork:");
> break;
> default:
> if (++n < nr_children)
> goto fork_child;
> play(addrA, sizeA);
> break;
> }
> shmdt(addrA);
> shmdt(addrB);
> do {
> wait(NULL);
> } while (--n > 0);
> shmctl(shmidA, IPC_RMID, NULL);
> shmctl(shmidB, IPC_RMID, NULL);
> return 0;
> }
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Yes this looks correct. mmap_sem will make sure that unmap_vmas and
free_pgtables are executed atomicaly wrt. huge_pmd_share so it doesn't
see non-NULL spte on the way out. I am just wondering whether we need
the page_table_lock as well. It is not harmful but I guess we can drop
it because both exit_mmap and shmdt are not taking it and mmap_sem is
sufficient for them.
One more nit bellow.
I will send my version of the fix which took another path as a reply to
this email to have something to compare with.
[...]
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..944b2df 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> /*
> * search for a shareable pmd page for hugetlb.
> */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> + unsigned long addr, pud_t *pud)
> {
> struct vm_area_struct *vma = find_vma(mm, addr);
> struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> struct vm_area_struct *svma;
> unsigned long saddr;
> pte_t *spte = NULL;
> + spinlock_t *spage_table_lock = NULL;
> + struct rw_semaphore *smmap_sem = NULL;
>
> if (!vma_shareable(vma, addr))
> return;
>
> +retry:
> mutex_lock(&mapping->i_mmap_mutex);
> vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> if (svma == vma)
> continue;
> + if (svma->vm_mm == vma->vm_mm)
> + continue;
> +
> + /*
> + * The target mm could be in the process of tearing down
> + * its page tables and the i_mmap_mutex on its own is
> + * not sufficient. To prevent races against teardown and
> + * pagetable updates, we acquire the mmap_sem and pagetable
> + * lock of the remote address space. down_read_trylock()
> + * is necessary as the other process could also be trying
> + * to share pagetables with the current mm. In the fork
> + * case, we are already both mm's so check for that
> + */
> + if (locked_mm != svma->vm_mm) {
> + if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + goto retry;
> + }
> + smmap_sem = &svma->vm_mm->mmap_sem;
> + }
> +
> + spage_table_lock = &svma->vm_mm->page_table_lock;
> + spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>
> saddr = page_table_shareable(svma, vma, addr, idx);
> if (saddr) {
[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae8f708..4832277 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> src_pte = huge_pte_offset(src, addr);
> if (!src_pte)
> continue;
> - dst_pte = huge_pte_alloc(dst, addr, sz);
> + dst_pte = huge_pte_alloc(dst, src, addr, sz);
> if (!dst_pte)
> goto nomem;
>
> @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> VM_FAULT_SET_HINDEX(h - hstates);
> }
>
> - ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> + ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
strictly speaking we should provide current->mm here because we are in
the page fault path and mmap_sem is held for reading. This doesn't
matter here though because huge_pmd_share will take it for reading so
nesting wouldn't hurt. Maybe a small comment that this is intentional
and correct would be nice.
> if (!ptep)
> return VM_FAULT_OOM;
>
>
> --
> Mel Gorman
> SUSE Labs
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
next prev parent reply other threads:[~2012-07-20 14:29 UTC|newest]
Thread overview: 100+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
2012-07-20 13:49 ` Mel Gorman
2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
2012-07-20 14:11 ` Mel Gorman
2012-07-20 14:29 ` Michal Hocko [this message]
2012-07-20 14:29 ` Michal Hocko
2012-07-20 14:37 ` Mel Gorman
2012-07-20 14:37 ` Mel Gorman
2012-07-20 14:40 ` Michal Hocko
2012-07-20 14:40 ` Michal Hocko
2012-07-20 14:36 ` [PATCH -alternative] " Michal Hocko
2012-07-20 14:36 ` Michal Hocko
2012-07-20 14:51 ` Mel Gorman
2012-07-20 14:51 ` Mel Gorman
2012-07-23 4:04 ` Hugh Dickins
2012-07-23 4:04 ` Hugh Dickins
2012-07-23 11:40 ` Mel Gorman
2012-07-23 11:40 ` Mel Gorman
2012-07-24 1:08 ` Hugh Dickins
2012-07-24 1:08 ` Hugh Dickins
2012-07-24 8:32 ` Michal Hocko
2012-07-24 8:32 ` Michal Hocko
2012-07-24 9:34 ` Mel Gorman
2012-07-24 9:34 ` Mel Gorman
2012-07-24 10:04 ` Michal Hocko
2012-07-24 10:04 ` Michal Hocko
2012-07-24 19:23 ` Hugh Dickins
2012-07-24 19:23 ` Hugh Dickins
2012-07-25 8:36 ` Mel Gorman
2012-07-25 8:36 ` Mel Gorman
2012-07-26 17:42 ` Rik van Riel
2012-07-26 17:42 ` Rik van Riel
2012-07-26 18:04 ` Larry Woodman
2012-07-26 18:04 ` Larry Woodman
2012-07-27 8:42 ` Mel Gorman
2012-07-27 8:42 ` Mel Gorman
2012-07-26 18:37 ` Rik van Riel
2012-07-26 18:37 ` Rik van Riel
2012-07-26 21:03 ` Larry Woodman
2012-07-26 21:03 ` Larry Woodman
2012-07-27 3:48 ` Larry Woodman
2012-07-27 3:48 ` Larry Woodman
2012-07-27 10:10 ` Larry Woodman
2012-07-27 10:10 ` Larry Woodman
2012-07-27 10:23 ` Mel Gorman
2012-07-27 10:23 ` Mel Gorman
2012-07-27 10:36 ` Larry Woodman
2012-07-27 10:36 ` Larry Woodman
2012-07-30 19:11 ` Larry Woodman
2012-07-30 19:11 ` Larry Woodman
2012-07-31 12:16 ` Hillf Danton
2012-07-31 12:16 ` Hillf Danton
2012-07-31 12:46 ` Mel Gorman
2012-07-31 12:46 ` Mel Gorman
2012-07-31 13:07 ` Larry Woodman
2012-07-31 13:07 ` Larry Woodman
2012-07-31 13:29 ` Mel Gorman
2012-07-31 13:29 ` Mel Gorman
2012-07-31 13:21 ` Michal Hocko
2012-07-31 13:21 ` Michal Hocko
2012-07-31 17:49 ` Larry Woodman
2012-07-31 17:49 ` Larry Woodman
2012-07-31 20:06 ` Michal Hocko
2012-07-31 20:06 ` Michal Hocko
2012-07-31 20:57 ` Larry Woodman
2012-07-31 20:57 ` Larry Woodman
2012-08-01 2:45 ` Larry Woodman
2012-08-01 2:45 ` Larry Woodman
2012-08-01 8:20 ` Michal Hocko
2012-08-01 8:20 ` Michal Hocko
2012-08-01 12:32 ` Michal Hocko
2012-08-01 12:32 ` Michal Hocko
2012-08-01 15:06 ` Larry Woodman
2012-08-01 15:06 ` Larry Woodman
2012-08-02 7:19 ` Michal Hocko
2012-08-02 7:19 ` Michal Hocko
2012-08-02 7:37 ` Mel Gorman
2012-08-02 7:37 ` Mel Gorman
2012-08-02 12:36 ` Michal Hocko
2012-08-02 12:36 ` Michal Hocko
2012-08-02 13:33 ` Mel Gorman
2012-08-02 13:33 ` Mel Gorman
2012-08-02 13:53 ` Michal Hocko
2012-08-02 13:53 ` Michal Hocko
2012-07-31 18:03 ` Rik van Riel
2012-07-31 18:03 ` Rik van Riel
2012-07-26 18:31 ` Rik van Riel
2012-07-26 18:31 ` Rik van Riel
2012-07-27 9:02 ` Michal Hocko
2012-07-27 9:02 ` Michal Hocko
2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
2012-07-26 16:01 ` Larry Woodman
2012-07-27 8:47 ` Mel Gorman
2012-07-27 8:47 ` Mel Gorman
2012-07-26 21:00 ` Rik van Riel
2012-07-26 21:00 ` Rik van Riel
2012-07-26 21:54 ` Hugh Dickins
2012-07-26 21:54 ` Hugh Dickins
2012-07-27 8:52 ` Mel Gorman
2012-07-27 8:52 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120720142920.GD12434@tiehlicka.suse.cz \
--to=mhocko@suse.cz \
--cc=david@gibson.dropbear.id.au \
--cc=hughd@google.com \
--cc=kenchen@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.