Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

From: Nick Piggin <npiggin@suse.de>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Mika Penttilä" <mika.penttila@kolumbus.fi>,
	"Tony Battersby" <tonyb@cybernetics.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)
Date: Wed, 30 Apr 2008 07:19:32 +0200
Message-ID: <20080430051932.GD27652@wotan.suse.de>
In-Reply-To: <20080430141738.e6b80d4b.kamezawa.hiroyu@jp.fujitsu.com>

On Wed, Apr 30, 2008 at 02:17:38PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 30 Apr 2008 08:03:33 +0300
> Mika Penttilä <mika.penttila@kolumbus.fi> wrote:
> 
> > > ==
> > > @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > >         spinlock_t *ptl;
> > >  {
> > > <snip>
> > > -               page_add_new_anon_rmap(page, vma, address);
> > > -       } else {
> > > -               /* Map the ZERO_PAGE - vm_page_prot is readonly */
> > > -               page = ZERO_PAGE(address);
> > > -               page_cache_get(page);
> > > -               entry = mk_pte(page, vma->vm_page_prot);
> > > +       if (unlikely(anon_vma_prepare(vma)))
> > > +               goto oom;
> > > +       page = alloc_zeroed_user_highpage_movable(vma, address);
> > > ==
> > >
> > > above change is for avoiding to use ZERO_PAGE at read-page-fault to anonymous
> > > vma. This is reasonable I think. But at coredump, tons of read-but-never-written 
> > > pages can be allocated.
> > > ==
> > > coredump
> > >   -> get_user_pages()
> > >        -> follow_page() returns NULL
> > >             -> handle mm fault
> > >                  -> do_anonymous page.
> > > ==
> > > follow_page() returns ZERO_PAGE only when page table is not available.
> > >
> > > So, making follow_page() return ZERO_PAGE can be a fix of extra memory
> > > consumption at core dump. (Maybe someone can think of other fix.)
> > >
> > > how about this patch ? Could you try ?
> > >
> > > (I'm sorry but I'll not be active for a week because my servers are powered off.)
> > >
> > > -Kame
> > >
> > >   
> > 
> > 
> > But sure we still have to handle the fault for instance swapped pages, 
> > for other uses of get_user_pages();
> > 
> Ah, my bad.....how about this ? I changed !pte_present() to pte_none().
> 
> -Kame
> ==
> follow_page() returns ZERO_PAGE if a page table is not available.
> but returns NULL if a page table exists. If NULL, handle_mm_fault()
> allocates a new page.
> 
> This behavior increases page consumption at coredump, which tends
> to do read-once-but-never-written page faults.  This patch is
> for avoiding this.

I think you still need the pte_present test too, otherwise !present and
!none ptes can slip through and be treated as present.

Something like this should do:
if (!pte_present(pte)) {
	if (pte_none(pte)) {
		pte_unmap_unlock(ptep, ptl);
		goto null_or_zeropage;
	}
	goto unlock;
}
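
To make the case that slips through concrete, here is a rough userspace
sketch, not kernel code. It assumes a pte is in exactly one of three states
(none, swapped out, present) and only models read lookups, i.e. FOLL_WRITE
is not set. Every name in it (pte_state, lookup_result, lookup_none_only,
lookup_nested, and so on) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a pte is in exactly one of these three states. */
enum pte_state { PTE_NONE, PTE_SWAPPED, PTE_PRESENT };

/* What a read lookup decides to do with the entry. */
enum lookup_result {
	LOOKUP_ZERO_PAGE,	/* take the null_or_zeropage path */
	LOOKUP_FAULT,		/* return NULL so the caller faults the page in */
	LOOKUP_USE_AS_PRESENT	/* fall through and treat the pte as mapped */
};

bool model_pte_none(enum pte_state s)
{
	return s == PTE_NONE;
}

bool model_pte_present(enum pte_state s)
{
	return s == PTE_PRESENT;
}

/* Read lookup that only tests pte_none(), as in the quoted patch. */
enum lookup_result lookup_none_only(enum pte_state s)
{
	if (model_pte_none(s))
		return LOOKUP_ZERO_PAGE;
	/* A swapped pte is !none, so it lands here as well. */
	return LOOKUP_USE_AS_PRESENT;
}

/* Read lookup that tests pte_present() first and pte_none() inside it. */
enum lookup_result lookup_nested(enum pte_state s)
{
	if (!model_pte_present(s)) {
		if (model_pte_none(s))
			return LOOKUP_ZERO_PAGE;
		return LOOKUP_FAULT;	/* !present and !none, e.g. swapped */
	}
	return LOOKUP_USE_AS_PRESENT;
}

/* Small self-check a reader can call from main(). */
void model_self_check(void)
{
	assert(lookup_none_only(PTE_SWAPPED) == LOOKUP_USE_AS_PRESENT);
	assert(lookup_nested(PTE_SWAPPED) == LOOKUP_FAULT);
}
```

In this model the two variants agree for none and present ptes and differ
only for the swapped one, which is the !present and !none case above.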


> 
> Changelog:
>   - fixed to check pte_none() not !pte_present().
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Index: linux-2.6.25/mm/memory.c
> ===================================================================
> --- linux-2.6.25.orig/mm/memory.c
> +++ linux-2.6.25/mm/memory.c
> @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
>  	page = NULL;
>  	pgd = pgd_offset(mm, address);
>  	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> -		goto no_page_table;
> +		goto null_or_zeropage;
>  
>  	pud = pud_offset(pgd, address);
>  	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> -		goto no_page_table;
> +		goto null_or_zeropage;
>  	
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> -		goto no_page_table;
> +		goto null_or_zeropage;
>  
>  	if (pmd_huge(*pmd)) {
>  		BUG_ON(flags & FOLL_GET);
> @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
>  		goto out;
>  
>  	pte = *ptep;
> -	if (!pte_present(pte))
> -		goto unlock;
> +	if (!(flags & FOLL_WRITE) && pte_none(pte)) {
> +		pte_unmap_unlock(ptep, ptl);
> +		goto null_or_zeropage;
> +	}
>  	if ((flags & FOLL_WRITE) && !pte_write(pte))
>  		goto unlock;
>  	page = vm_normal_page(vma, address, pte);
> @@ -968,7 +970,7 @@ unlock:
>  out:
>  	return page;
>  
> -no_page_table:
> +null_or_zeropage:
>  	/*
>  	 * When core dumping an enormous anonymous area that nobody
>  	 * has touched so far, we don't want to allocate page tables.
> 
>
