public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Mel Gorman <mel@csn.ul.ie>
Cc: torvalds@linux-foundation.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com, hugh@veritas.com,
	Lee.Schermerhorn@hp.com, gregkh@suse.de,
	maksim.yevmenkin@gmail.com, npiggin@suse.de,
	will@crowder-design.com, riel@redhat.com,
	kamezawa.hiroyu@jp.fujitsu.com, miklos@szeredi.hu,
	apw@canonical.com, wli@movementarian.org
Subject: Re: [PATCH] Do not account for the address space used by hugetlbfs using VM_ACCOUNT V2 (Was Linus 2.6.29-rc4)
Date: Tue, 10 Feb 2009 15:45:49 -0800	[thread overview]
Message-ID: <20090210154549.2bbb1a6e.akpm@linux-foundation.org> (raw)
In-Reply-To: <20090210140227.GC4023@csn.ul.ie>

On Tue, 10 Feb 2009 14:02:27 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Sun, Feb 08, 2009 at 01:01:04PM -0800, Linus Torvalds wrote:
> > 
> > Another week (and a half), another -rc.
> > 
> > Arch updates (sparc, blackfin), driver updates (dvb, mmc, ide), ACPI 
> > updates, FS updates (ubifs, btrfs), you name it. It's all there.
> > 
> > But more importantly, people really have been working on regressions, and 
> > hopefully this closes a nice set of the top one, and hopefully without 
> > introducing too many new ones.
> > 
> 
> Hugepages are currently busted due to commit
> fc8744adc870a8d4366908221508bb113d8b72ee and the problem is in
> 2.6.29-rc4. There was a bit of a discussion but it didn't get very far and
> then I went offline for the weekend. Here is a revised patch that tries to
> bring hugetlbfs more in line with what the core VM is doing by dealing with
> VM_ACCOUNT and VM_NORESERVE.
> 
> =============
> Do not account for the address space used by hugetlbfs using VM_ACCOUNT
> 
> When overcommit is disabled, the core VM accounts for pages used by anonymous
> shared, private mappings and special mappings. It keeps track of VMAs that
> should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
> with VM_NORESERVE.
> 
> Overcommit for hugetlbfs is much riskier than overcommit for base pages
> due to contiguity requirements. It avoids overcommiting on both shared and
> private mappings using reservation counters that are checked and updated
> during mmap(). This ensures (within limits) that hugepages exist in the
> future when faults occurs or it is too easy to applications to be SIGKILLed.
> 
> As hugetlbfs makes its own reservations of a different unit to the base page
> size, VM_ACCOUNT should never be set. Even if the units were correct, we would
> double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
> be set because an application can request no reserves be made for hugetlbfs
> at the risk of getting killed later.
> 
> With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
> VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
> breaks the accounting for both the core VM and hugetlbfs, can trigger an
> OOM storm when hugepage pools are too small lockups and corrupted counters
> otherwise are used. This patch brings hugetlbfs more in line with how the
> core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
> 
> ...
>
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -108,7 +108,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  
>  	if (hugetlb_reserve_pages(inode,
>  				vma->vm_pgoff >> huge_page_order(h),
> -				len >> huge_page_shift(h), vma))
> +				len >> huge_page_shift(h), vma,
> +				vma->vm_flags))
>  		goto out;
>  
>  	ret = 0;
> @@ -947,7 +948,7 @@ static int can_do_hugetlb_shm(void)
>  			can_do_mlock());
>  }
>  
> -struct file *hugetlb_file_setup(const char *name, size_t size)
> +struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag)

This superundocumented acctflag looks like a poorly-named boolean.  But
it is fact a composition of VM_foo bits.

Would it not be clearer to name it `vm_flags'?

>  {
>  	int error = -ENOMEM;
>  	struct file *file;
> @@ -981,7 +982,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
>  
>  	error = -ENOMEM;
>  	if (hugetlb_reserve_pages(inode, 0,
> -			size >> huge_page_shift(hstate_inode(inode)), NULL))
> +			size >> huge_page_shift(hstate_inode(inode)), NULL,
> +			acctflag))
>  		goto out_inode;
>  
>  	d_instantiate(dentry, inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index f1d2fba..af09660 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -33,7 +33,8 @@ unsigned long hugetlb_total_pages(void);
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, int write_access);
>  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
> -						struct vm_area_struct *vma);
> +						struct vm_area_struct *vma,
> +						int acctflags);

here it went plural.

>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -138,7 +139,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
>  
>  extern const struct file_operations hugetlbfs_file_operations;
>  extern struct vm_operations_struct hugetlb_vm_ops;
> -struct file *hugetlb_file_setup(const char *name, size_t);
> +struct file *hugetlb_file_setup(const char *name, size_t, int);

Here it is omitted altogether.  We now have one named parameter and two
secret ones.


Also, the patch forgot to update this:

#define hugetlb_file_setup(name,size)   ERR_PTR(-ENOSYS)

Which I assume breaks CONFIG_HUGETLBFS=n.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e8ddc98..3235615 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
>  	unsigned long flag, unsigned long pgoff);
>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>  	unsigned long len, unsigned long flags,
> -	unsigned int vm_flags, unsigned long pgoff,
> -	int accountable);
> +	unsigned int vm_flags, unsigned long pgoff);

And there is a vm_flags which has type `unsigned int'.

Your newly-added should-have-been-called-vm_flags has type `int'.

The vm_flagses in `struct vm_region' and `struct vm_area_struct' have type
`unsigned long'.

It'd be nice to get these consistent.  We only have two bits left in
the vm_flags namespace so arguably we could add a vm_flags_t while
fixing this up, with a view to makeing it u64 later.  Maybe.



  reply	other threads:[~2009-02-10 23:47 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-02-08 21:01 Linus 2.6.29-rc4 Linus Torvalds
2009-02-09  1:21 ` Arjan van de Ven
2009-02-09  8:28   ` Ingo Molnar
2009-02-09 12:18     ` Avi Kivity
2009-02-09 13:34 ` Gerd Hoffmann
2009-02-09 15:04   ` Steven Noonan
2009-02-09 18:26 ` Oopses and ACPI problems (Linus 2.6.29-rc4) Darren Salt
2009-02-09 23:49   ` Ingo Molnar
2009-02-10 14:12     ` Darren Salt
2009-02-10 14:54       ` [PATCH 2.6.29-rc4] Restore ACPI reporting via /proc/acpi/events for EeePC & other Asus laptops Darren Salt
2009-02-24 11:31         ` Corentin Chary
2009-02-10 15:04       ` Oopses and ACPI problems (Linus 2.6.29-rc4) Matthew Garrett
2009-02-10 15:15         ` Darren Salt
2009-02-10 15:45           ` Matthew Garrett
2009-02-10 16:03             ` Darren Salt
2009-02-23 16:39               ` Matthew Garrett
2009-02-24 15:29                 ` Darren Salt
2009-02-24 16:00                   ` Matthew Garrett
2009-02-24 19:45                     ` Darren Salt
2009-02-10 16:06             ` Corentin Chary
2009-02-10 19:16               ` Darren Salt
2009-02-11  2:03                 ` Matthew Garrett
2009-02-11  1:23               ` yakui_zhao
2009-04-19  1:56           ` [PATCH] eee-laptop: Register as a pci-hotplug device Matthew Garrett
2009-04-19  7:20             ` Corentin Chary
2009-04-19 15:13               ` Matthew Garrett
2009-04-25 14:12                 ` Corentin Chary
2009-04-26 17:16                   ` Matthew Garrett
2009-04-26 20:51                     ` Corentin Chary
2009-02-10  1:06   ` Oopses and ACPI problems (Linus 2.6.29-rc4) yakui_zhao
2009-02-10 14:02 ` [PATCH] Do not account for the address space used by hugetlbfs using VM_ACCOUNT V2 (Was Linus 2.6.29-rc4) Mel Gorman
2009-02-10 23:45   ` Andrew Morton [this message]
2009-02-11 11:15     ` Mel Gorman
2009-02-11  9:43   ` Andy Whitcroft
2009-02-11 10:30     ` Mel Gorman
2009-02-11 12:03       ` Andy Whitcroft
2009-02-11 14:20         ` Mel Gorman
2009-02-11 16:03           ` Andy Whitcroft
2009-02-11 16:34             ` Mel Gorman
2009-02-11 16:43               ` Andy Whitcroft

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090210154549.2bbb1a6e.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=apw@canonical.com \
    --cc=gregkh@suse.de \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maksim.yevmenkin@gmail.com \
    --cc=mel@csn.ul.ie \
    --cc=miklos@szeredi.hu \
    --cc=npiggin@suse.de \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will@crowder-design.com \
    --cc=wli@movementarian.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox