Re: mm: memcg: A infinite loop in __handle_mm_fault()

From: "Mizuma, Masayoshi" <m.mizuma@jp.fujitsu.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Balbir Singh <bsingharora@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: mm: memcg: A infinite loop in __handle_mm_fault()
Date: Wed, 12 Feb 2014 10:04:34 +0900
Message-ID: <52FAC8A2.1080607@jp.fujitsu.com>
In-Reply-To: <20140210125655.4AB48E0090@blue.fi.intel.com>

On Mon, 10 Feb 2014 14:56:55 +0200 Kirill A. Shutemov wrote:
> Michal Hocko wrote:
>> [CCing Kirill]
>>
>> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
>>> Hi,
>>
>> Hi,
>>
>>> This is a bug report for a memory cgroup hang.
>>> I reproduced it on 3.14-rc1 but could not reproduce it on 3.7.
>>>
>>> When I ran a program (see below) under a memcg limit, the process hung.
>>> Using a kprobe trace, I located the hang in __handle_mm_fault().
>>> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
>>> VM_FAULT_OOM, so it keeps taking "goto retry" and the task can't be killed.
>>
>> Thanks a lot for this very good report. I would bet the issue is related
>> to the THP zero page.
>>
>> The __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
>> expects that the pmd is marked for splitting so that it can break out
>> and retry the fault. This is not the case for the THP zero page, though.
>> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes on to allocate a new
>> huge page, which will succeed in your case because you are hitting the
>> memcg limit, not global memory pressure. But then the new page is charged
>> by mem_cgroup_newpage_charge, which fails. An existing page would then be
>> split before we return VM_FAULT_OOM, but page is not initialized in that
>> path because page = pmd_page(orig_pmd) is called after the
>> is_huge_zero_pmd check.
>>
>> I am not very familiar with the THP zero page code, but I guess splitting
>> such a zero page is not the way to go. Instead we should simply drop the
>> zero page and retry the fault. I would assume that one of
>> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
>> do the trick but both of them try to charge new page(s) before the
>> current zero page is uncharged. That makes it prone to the same issue
>> AFAICS.
>>
>> But maybe Kirill has a better idea.
> 
> Your analysis looks accurate, although I was not able to reproduce the
> hang.
> 
> The problem with do_huge_pmd_wp_zero_page_fallback() is that it can return
> VM_FAULT_OOM if it fails to allocate a new *small* page, so that is a real OOM.
> 
> The untested patch below tries to fix this. Masayoshi, could you test it?

I applied the patch to 3.14-rc2 and confirmed that the issue no longer
occurs: the process is now killed by the oom-killer as expected.
Thank you for analyzing the root cause and providing the fix!

Regards,
Masayoshi Mizuma

> 
> BTW, Michal, I've triggered a sleep-in-atomic bug in
> mem_cgroup_print_oom_info():
> 
> [    2.386563] Task in /test killed as a result of limit of /test
> [    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
> [    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
> [    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> [    2.389640] Memory cgroup stats for /test:
> [    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
> [    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
> [    2.392416] 2 locks held by memcg_test/66:
> [    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.394233]  #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
> [    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
> [    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> [    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
> [    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
> [    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
> [    2.400588] Call Trace:
> [    2.400908]  [<ffffffff81776ea3>] dump_stack+0x4d/0x66
> [    2.401578]  [<ffffffff8108418a>] __might_sleep+0x16a/0x210
> [    2.402295]  [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
> [    2.403005]  [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
> [    2.403769]  [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
> [    2.404653]  [<ffffffff8177314e>] dump_header+0x88/0x251
> [    2.405342]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.406098]  [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
> [    2.406833]  [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
> [    2.407674]  [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
> [    2.408553]  [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.409354]  [<ffffffff817712f7>] mm_fault_error+0x91/0x189
> [    2.410069]  [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
> [    2.410791]  [<ffffffff8108f656>] ? local_clock+0x16/0x30
> [    2.411467]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.412248]  [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
> [    2.413039]  [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
> [    2.413821]  [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [    2.414652]  [<ffffffff81783fae>] do_page_fault+0xe/0x10
> [    2.415330]  [<ffffffff81780552>] page_fault+0x22/0x30
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 82166bf974e1..974eb9eea2c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1166,8 +1166,10 @@ alloc:
>   		} else {
>   			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>   					pmd, orig_pmd, page, haddr);
> -			if (ret & VM_FAULT_OOM)
> +			if (ret & VM_FAULT_OOM) {
>   				split_huge_page(page);
> +				ret |= VM_FAULT_FALLBACK;
> +			}
>   			put_page(page);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1179,9 +1181,12 @@ alloc:
>   		if (page) {
>   			split_huge_page(page);
>   			put_page(page);
> +			ret |= VM_FAULT_FALLBACK;
> +		} else {
> +			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
> +					address, pmd, orig_pmd, haddr);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
> -		ret |= VM_FAULT_OOM;
>   		goto out;
>   	}
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index be6a0c0d4ae0..3b57b7864667 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (unlikely(is_vm_hugetlb_page(vma)))
>   		return hugetlb_fault(mm, vma, address, flags);
>   
> -retry:
>   	pgd = pgd_offset(mm, address);
>   	pud = pud_alloc(mm, pgd, address);
>   	if (!pud)
> @@ -3741,20 +3740,13 @@ retry:
>   			if (dirty && !pmd_write(orig_pmd)) {
>   				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>   							  orig_pmd);
> -				/*
> -				 * If COW results in an oom, the huge pmd will
> -				 * have been split, so retry the fault on the
> -				 * pte for a smaller charge.
> -				 */
> -				if (unlikely(ret & VM_FAULT_OOM))
> -					goto retry;
> -				return ret;
> +				if (!(ret & VM_FAULT_FALLBACK))
> +					return ret;
>   			} else {
>   				huge_pmd_set_accessed(mm, vma, address, pmd,
>   						      orig_pmd, dirty);
> +				return 0;
>   			}
> -
> -			return 0;
>   		}
>   	}
>   
>
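
The control-flow change in the quoted patch can be sketched as a small
user-space model. This is only an illustration under stated assumptions, not
kernel code: struct model, its fields, the flag values, and the functions
wp_page_old(), wp_page_new(), handle_fault_old() and handle_fault_new() are
hypothetical names, and only the memcg charge-failure path of
do_huge_pmd_wp_page() is modeled. The old retry loop is bounded by
max_retries so that the livelock shows up as a return value instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; these are NOT the kernel's definitions. */
#define VM_FAULT_OOM      0x1
#define VM_FAULT_FALLBACK 0x2

/* Hypothetical state seen by the write-protect fault path. */
struct model {
	bool huge_zero_pmd;      /* pmd maps the THP zero page (page == NULL) */
	bool pmd_is_huge;        /* pmd still maps a huge page */
	bool huge_charge_fails;  /* memcg refuses the huge-page charge */
	bool small_charge_fails; /* memcg refuses even a small-page charge */
};

/* Before the patch: OOM on charge failure, split only if a page exists. */
static int wp_page_old(struct model *m)
{
	if (!m->huge_charge_fails)
		return 0;
	if (!m->huge_zero_pmd)
		m->pmd_is_huge = false; /* models split_huge_page(page) */
	return VM_FAULT_OOM;
}

/* Before the patch: "goto retry" on OOM. Returns -1 if the bound is hit. */
static int handle_fault_old(struct model *m, int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (!m->pmd_is_huge) /* pte path, small charge */
			return m->small_charge_fails ? VM_FAULT_OOM : 0;
		int ret = wp_page_old(m);
		if (!(ret & VM_FAULT_OOM))
			return ret;
	}
	return -1; /* pmd never changed: the loop would spin forever */
}

/* After the patch: split sets FALLBACK; zero page uses the small-page fallback. */
static int wp_page_new(struct model *m)
{
	if (!m->huge_charge_fails)
		return 0;
	if (!m->huge_zero_pmd) {
		m->pmd_is_huge = false;
		return VM_FAULT_FALLBACK;
	}
	/* models do_huge_pmd_wp_zero_page_fallback() */
	if (m->small_charge_fails)
		return VM_FAULT_OOM; /* real OOM, reported to the caller */
	m->pmd_is_huge = false;
	return 0;
}

/* After the patch: no retry label; fall through to the pte path on FALLBACK. */
static int handle_fault_new(struct model *m)
{
	if (m->pmd_is_huge) {
		int ret = wp_page_new(m);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	}
	return m->small_charge_fails ? VM_FAULT_OOM : 0;
}
```

In this model, a zero-page pmd with a failing huge charge makes
handle_fault_old() exhaust its retry bound, while handle_fault_new() either
completes through the small-page fallback or returns VM_FAULT_OOM once, which
is consistent with the test result reported above (the task is killed by the
oom-killer instead of looping).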
