* Hugepage cleanup and bugfix
@ 2012-02-16 4:23 David Gibson
2012-02-16 4:23 ` [PATCH 1/2] Cleanup to hugetlb.h David Gibson
2012-02-16 4:24 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
0 siblings, 2 replies; 16+ messages in thread
From: David Gibson @ 2012-02-16 4:23 UTC
To: akpm, abarry; +Cc: hughd, mgorman, minchan.kim, paulus, linux-kernel
Hi, this is a revised version of Andrew Barry's fix for the hugepage
race between unmount and quota update. In fact it's not just that
race which is a problem; there is a more general use-after-free problem on
fs inodes in the hugepage code.
Despite all the abstract talk in the patch comments about layering
violations, this fixes a real, exploitable crash bug which has been
in the kernel for a very long time now.
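For readers joining the thread, the failing sequence is roughly the
following - a sketch reconstructed from the patch description below, not a
literal trace:

	/*
	 * Task A (e.g. KVM)                     Task B
	 * ----------------------------------    ---------------------------
	 * get_user_pages() takes an extra
	 * reference on a huge page
	 *                                        unmount of the hugetlbfs
	 *                                        instance frees the inode,
	 *                                        then the superblock
	 * put_page() drops the last reference;
	 * free_huge_page() runs and calls
	 * hugetlb_put_quota(mapping, 1), which
	 * walks from the stale mapping to the
	 * freed superblock: use after free
	 */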
Andrew, please apply.
* [PATCH 1/2] Cleanup to hugetlb.h
2012-02-16 4:23 Hugepage cleanup and bugfix David Gibson
@ 2012-02-16 4:23 ` David Gibson
2012-02-16 4:24 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
1 sibling, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-02-16 4:23 UTC
To: akpm, abarry; +Cc: hughd, mgorman, minchan.kim, paulus, linux-kernel
This patch makes a couple of small cleanups to include/linux/hugetlb.h.
The set_file_hugepages() function, which was not used anywhere, is removed,
and the hugetlbfs_config and hugetlbfs_inode_info structures, along with
the HUGETLBFS_I() helper function, are moved into inode.c, the only place
they were used.
These structures are really linked to the hugetlbfs filesystem specifically,
not to hugepage mm handling in general, so they belong in the filesystem
code, not in a generally available header. It would be nice to move the
hugetlbfs_sb_info (superblock) structure in there as well, but it's
currently needed in a number of places via the hstate_vma() and
hstate_inode() helpers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
fs/hugetlbfs/inode.c | 19 +++++++++++++++++++
include/linux/hugetlb.h | 25 -------------------------
2 files changed, 19 insertions(+), 25 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1e85a7a..bb0e366 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -41,6 +41,25 @@ const struct file_operations hugetlbfs_file_operations;
static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;
+struct hugetlbfs_config {
+ uid_t uid;
+ gid_t gid;
+ umode_t mode;
+ long nr_blocks;
+ long nr_inodes;
+ struct hstate *hstate;
+};
+
+struct hugetlbfs_inode_info {
+ struct shared_policy policy;
+ struct inode vfs_inode;
+};
+
+static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
+{
+ return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
+}
+
static struct backing_dev_info hugetlbfs_backing_dev_info = {
.name = "hugetlbfs",
.ra_pages = 0, /* No readahead */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9d6c86..7adc492 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -128,15 +128,6 @@ enum {
};
#ifdef CONFIG_HUGETLBFS
-struct hugetlbfs_config {
- uid_t uid;
- gid_t gid;
- umode_t mode;
- long nr_blocks;
- long nr_inodes;
- struct hstate *hstate;
-};
-
struct hugetlbfs_sb_info {
long max_blocks; /* blocks allowed */
long free_blocks; /* blocks free */
@@ -146,17 +137,6 @@ struct hugetlbfs_sb_info {
struct hstate *hstate;
};
-
-struct hugetlbfs_inode_info {
- struct shared_policy policy;
- struct inode vfs_inode;
-};
-
-static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
-{
- return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
-}
-
static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
{
return sb->s_fs_info;
@@ -179,14 +159,9 @@ static inline int is_file_hugepages(struct file *file)
return 0;
}
-static inline void set_file_hugepages(struct file *file)
-{
- file->f_op = &hugetlbfs_file_operations;
-}
#else /* !CONFIG_HUGETLBFS */
#define is_file_hugepages(file) 0
-#define set_file_hugepages(file) BUG()
static inline struct file *hugetlb_file_setup(const char *name, size_t size,
vm_flags_t acctflag, struct user_struct **user, int creat_flags)
{
--
1.7.9
* [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-02-16 4:23 Hugepage cleanup and bugfix David Gibson
2012-02-16 4:23 ` [PATCH 1/2] Cleanup to hugetlb.h David Gibson
@ 2012-02-16 4:24 ` David Gibson
2012-02-16 12:33 ` Hillf Danton
1 sibling, 1 reply; 16+ messages in thread
From: David Gibson @ 2012-02-16 4:24 UTC
To: akpm, abarry; +Cc: hughd, mgorman, minchan.kim, paulus, linux-kernel
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the kernel
does a get_user_pages() on hugepages (which can happen from KVM amongst
others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages store
pointers directly to the superblock, bumping the reference count as
appropriate to avoid it being freed, instead of storing a pointer to their
address_space and reaching the superblock from there. Andrew Morton
rejected that version, however, on the grounds that it made the existing
layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is,
a finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a pointer
to the subpool gets added to each allocated hugepage, instead of the
address_space pointer used now. The subpool has its own lifetime and is
only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
Subpools are optional - a NULL subpool pointer is taken by the code to mean
that no subpool limits are in effect.
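To make the lifecycle concrete, here is a sketch based on the code in the
patch below (locking context and error handling elided; nr_blocks and chg
stand in for real values):

	/* at mount time, for an instance with a page limit set: */
	struct hugepage_subpool *spool = hugepage_new_subpool(nr_blocks);
	/* spool->count == 1: the reference held by the superblock */

	/* when reserving or allocating pages: */
	if (hugepage_subpool_get_pages(spool, chg))	/* -ENOMEM if over limit */
		return -ENOSPC;

	/* when a huge page is freed: */
	hugepage_subpool_put_pages(spool, 1);

	/* at unmount: drop the superblock's reference; the subpool itself is
	 * only kfree()d once count == 0 and used_hpages == 0 */
	hugepage_put_subpool(spool);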
Previous discussion of this bug can be found in "Fix refcounting in
hugetlbfs quota handling". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
fs/hugetlbfs/inode.c | 54 +++++++-----------
include/linux/hugetlb.h | 14 +++-
mm/hugetlb.c | 140 +++++++++++++++++++++++++++++++++++++----------
3 files changed, 142 insertions(+), 66 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bb0e366..74c6ba2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
* blocks, like simple_statfs() */
- if (sbinfo->max_blocks >= 0) {
- buf->f_blocks = sbinfo->max_blocks;
- buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
+ if (sbinfo->spool) {
+ long free_pages;
+
+ spin_lock(&sbinfo->spool->lock);
+ buf->f_blocks = sbinfo->spool->max_hpages;
+ free_pages = sbinfo->spool->max_hpages
+ - sbinfo->spool->used_hpages;
+ buf->f_bavail = buf->f_bfree = free_pages;
+ spin_unlock(&sbinfo->spool->lock);
buf->f_files = sbinfo->max_inodes;
buf->f_ffree = sbinfo->free_inodes;
}
@@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb)
if (sbi) {
sb->s_fs_info = NULL;
+
+ if (sbi->spool)
+ hugepage_put_subpool(sbi->spool);
+
kfree(sbi);
}
}
@@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_fs_info = sbinfo;
sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_blocks = config.nr_blocks;
- sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
+ sbinfo->spool = NULL;
+ if (config.nr_blocks != -1) {
+ sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
+ if (!sbinfo->spool)
+ goto out_free;
+ }
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize = huge_page_size(config.hstate);
sb->s_blocksize_bits = huge_page_shift(config.hstate);
@@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = root;
return 0;
out_free:
+ if (sbinfo->spool)
+ kfree(sbinfo->spool);
kfree(sbinfo);
return -ENOMEM;
}
-int hugetlb_get_quota(struct address_space *mapping, long delta)
-{
- int ret = 0;
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks - delta >= 0)
- sbinfo->free_blocks -= delta;
- else
- ret = -ENOMEM;
- spin_unlock(&sbinfo->stat_lock);
- }
-
- return ret;
-}
-
-void hugetlb_put_quota(struct address_space *mapping, long delta)
-{
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
- sbinfo->free_blocks += delta;
- spin_unlock(&sbinfo->stat_lock);
- }
-}
-
static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7adc492..cf01817 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,6 +14,15 @@ struct user_struct;
#include <linux/shm.h>
#include <asm/tlbflush.h>
+struct hugepage_subpool {
+ spinlock_t lock;
+ long count;
+ long max_hpages, used_hpages;
+};
+
+struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
+void hugepage_put_subpool(struct hugepage_subpool *spool);
+
int PageHuge(struct page *page);
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
@@ -129,12 +138,11 @@ enum {
#ifdef CONFIG_HUGETLBFS
struct hugetlbfs_sb_info {
- long max_blocks; /* blocks allowed */
- long free_blocks; /* blocks free */
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
struct hstate *hstate;
+ struct hugepage_subpool *spool;
};
static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
@@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
static inline int is_file_hugepages(struct file *file)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..fac25b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
*/
static DEFINE_SPINLOCK(hugetlb_lock);
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+{
+ bool free = (spool->count == 0) && (spool->used_hpages == 0);
+
+ spin_unlock(&spool->lock);
+
+ /* If no pages are used, and no other handles to the subpool
+ * remain, free the subpool */
+ if (free)
+ kfree(spool);
+}
+
+struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
+{
+ struct hugepage_subpool *spool;
+
+ spool = kmalloc(sizeof(*spool), GFP_KERNEL);
+ if (!spool)
+ return NULL;
+
+ spin_lock_init(&spool->lock);
+ spool->count = 1;
+ spool->max_hpages = nr_blocks;
+ spool->used_hpages = 0;
+
+ return spool;
+}
+
+void hugepage_put_subpool(struct hugepage_subpool *spool)
+{
+ spin_lock(&spool->lock);
+ BUG_ON(!spool->count);
+ spool->count--;
+ unlock_or_release_subpool(spool);
+}
+
+static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
+ long delta)
+{
+ int ret = 0;
+
+ if (!spool)
+ return 0;
+
+ spin_lock(&spool->lock);
+ if ((spool->used_hpages + delta) <= spool->max_hpages) {
+ spool->used_hpages += delta;
+ } else {
+ ret = -ENOMEM;
+ }
+ spin_unlock(&spool->lock);
+
+ return ret;
+}
+
+static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
+ long delta)
+{
+ if (!spool)
+ return;
+
+ spin_lock(&spool->lock);
+ spool->used_hpages -= delta;
+ /* If hugetlbfs_put_super couldn't free spool due to
+ * an outstanding quota reference, free it now. */
+ unlock_or_release_subpool(spool);
+}
+
+static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
+{
+ return HUGETLBFS_SB(inode->i_sb)->spool;
+}
+
+static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
+{
+ return subpool_inode(vma->vm_file->f_dentry->d_inode);
+}
+
/*
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
@@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
+ struct hugepage_subpool *spool =
+ (struct hugepage_subpool *)page_private(page);
- mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
+ hugepage_subpool_put_pages(spool, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -966,11 +1043,12 @@ static void return_unused_surplus_pages(struct hstate *h,
/*
* Determine if the huge page at addr within the vma has an associated
* reservation. Where it does not we will need to logically increase
- * reservation and actually increase quota before an allocation can occur.
- * Where any new reservation would be required the reservation change is
- * prepared, but not committed. Once the page has been quota'd allocated
- * an instantiated the change should be committed via vma_commit_reservation.
- * No action is required on failure.
+ * reservation and actually increase subpool usage before an allocation
+ * can occur. Where any new reservation would be required the
+ * reservation change is prepared, but not committed. Once the page
+ * has been allocated from the subpool and instantiated the change should
+ * be committed via vma_commit_reservation. No action is required on
+ * failure.
*/
static long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
@@ -1017,26 +1095,26 @@ static void vma_commit_reservation(struct hstate *h,
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
+ unsigned long addr, int avoid_reserve,
+ struct hugepage_subpool *spool)
{
struct hstate *h = hstate_vma(vma);
struct page *page;
- struct address_space *mapping = vma->vm_file->f_mapping;
- struct inode *inode = mapping->host;
long chg;
/*
- * Processes that did not create the mapping will have no reserves and
- * will not have accounted against quota. Check that the quota can be
- * made before satisfying the allocation
- * MAP_NORESERVE mappings may also need pages and quota allocated
- * if no reserve mapping overlaps.
+ * Processes that did not create the mapping will have no
+ * reserves and will not have accounted against subpool
+ * limit. Check that the subpool limit can be made before
+ * satisfying the allocation. MAP_NORESERVE mappings may also
+ * need pages and subpool limit allocated if no reserve
+ * mapping overlaps.
*/
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
return ERR_PTR(-VM_FAULT_OOM);
if (chg)
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ if (hugepage_subpool_get_pages(spool, chg))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
@@ -1046,12 +1124,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugepage_subpool_put_pages(spool, chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
- set_page_private(page, (unsigned long) mapping);
+ set_page_private(page, (unsigned long)spool);
vma_commit_reservation(h, vma, addr);
@@ -2072,6 +2150,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
struct resv_map *reservations = vma_resv_map(vma);
+ struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve;
unsigned long start;
unsigned long end;
@@ -2087,7 +2166,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
if (reserve) {
hugetlb_acct_memory(h, -reserve);
- hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+ hugepage_subpool_put_pages(spool, reserve);
}
}
}
@@ -2392,7 +2471,8 @@ retry_avoidcopy:
/* Drop page_table_lock as buddy allocator may be called */
spin_unlock(&mm->page_table_lock);
- new_page = alloc_huge_page(vma, address, outside_reserve);
+ new_page = alloc_huge_page(vma, address, outside_reserve,
+ subpool_vma(vma));
if (IS_ERR(new_page)) {
page_cache_release(old_page);
@@ -2540,7 +2620,7 @@ retry:
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
- page = alloc_huge_page(vma, address, 0);
+ page = alloc_huge_page(vma, address, 0, subpool_vma(vma));
if (IS_ERR(page)) {
ret = -PTR_ERR(page);
goto out;
@@ -2869,11 +2949,12 @@ int hugetlb_reserve_pages(struct inode *inode,
{
long ret, chg;
struct hstate *h = hstate_inode(inode);
+ struct hugepage_subpool *spool = subpool_inode(inode);
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
- * and filesystem quota without using reserves
+ * without using reserves
*/
if (vm_flags & VM_NORESERVE)
return 0;
@@ -2900,17 +2981,17 @@ int hugetlb_reserve_pages(struct inode *inode,
if (chg < 0)
return chg;
- /* There must be enough filesystem quota for the mapping */
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ /* There must be enough pages in the subpool for the mapping */
+ if (hugepage_subpool_get_pages(spool, chg))
return -ENOSPC;
/*
* Check enough hugepages are available for the reservation.
- * Hand back the quota if there are not
+ * Hand the pages back to the subpool if there are not
*/
ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugepage_subpool_put_pages(spool, chg);
return ret;
}
@@ -2934,12 +3015,13 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);
+ struct hugepage_subpool *spool = subpool_inode(inode);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugetlb_put_quota(inode->i_mapping, (chg - freed));
+ hugepage_subpool_put_pages(spool, (chg - freed));
hugetlb_acct_memory(h, -(chg - freed));
}
--
1.7.9
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-02-16 4:24 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
@ 2012-02-16 12:33 ` Hillf Danton
2012-03-06 2:37 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: Hillf Danton @ 2012-02-16 12:33 UTC
To: David Gibson
Cc: akpm, abarry, hughd, mgorman, minchan.kim, paulus, linux-kernel
On Thu, Feb 16, 2012 at 12:24 PM, David Gibson
<david@gibson.dropbear.id.au> wrote:
> @@ -1046,12 +1124,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> if (!page) {
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> if (!page) {
> - hugetlb_put_quota(inode->i_mapping, chg);
> + hugepage_subpool_put_pages(spool, chg);
> return ERR_PTR(-VM_FAULT_SIGBUS);
> }
> }
>
> - set_page_private(page, (unsigned long) mapping);
> + set_page_private(page, (unsigned long)spool);
>
Page mapping is used in unmap_ref_private(), and I am
wondering if it no longer works :-(
> @@ -2392,7 +2471,8 @@ retry_avoidcopy:
>
> /* Drop page_table_lock as buddy allocator may be called */
> spin_unlock(&mm->page_table_lock);
> - new_page = alloc_huge_page(vma, address, outside_reserve);
> + new_page = alloc_huge_page(vma, address, outside_reserve,
> + subpool_vma(vma));
Change in the number of parameters of alloc_huge_page()
looks unnecessary.
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-02-16 12:33 ` Hillf Danton
@ 2012-03-06 2:37 ` David Gibson
0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-06 2:37 UTC
To: Hillf Danton
Cc: akpm, abarry, hughd, mgorman, minchan.kim, paulus, linux-kernel
On Thu, Feb 16, 2012 at 08:33:51PM +0800, Hillf Danton wrote:
> On Thu, Feb 16, 2012 at 12:24 PM, David Gibson
> <david@gibson.dropbear.id.au> wrote:
> > @@ -1046,12 +1124,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > if (!page) {
> > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > if (!page) {
> > - hugetlb_put_quota(inode->i_mapping, chg);
> > + hugepage_subpool_put_pages(spool, chg);
> > return ERR_PTR(-VM_FAULT_SIGBUS);
> > }
> > }
> >
> > - set_page_private(page, (unsigned long) mapping);
> > + set_page_private(page, (unsigned long)spool);
Sorry for the very delayed reply. Somehow I never spotted this when
it first came in. I think I must have accidentally deleted it when
cleaning out automated messages and spam.
> Page mapping is used in unmap_ref_private(), and I am
> wondering if it no longer works :-(
Good point. But unmap_ref_private() does take the vma, so it should
be able to get to the mapping from there. I'll respin doing that,
instead of using page_private().
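A minimal sketch of that approach - the helper name here is hypothetical,
and the respin below simply inlines the expression in unmap_ref_private():

	/* resolve the address_space from the vma instead of page_private() */
	static struct address_space *mapping_of_vma(struct vm_area_struct *vma)
	{
		return vma->vm_file->f_dentry->d_inode->i_mapping;
	}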
> > @@ -2392,7 +2471,8 @@ retry_avoidcopy:
> >
> > /* Drop page_table_lock as buddy allocator may be called */
> > spin_unlock(&mm->page_table_lock);
> > - new_page = alloc_huge_page(vma, address, outside_reserve);
> > + new_page = alloc_huge_page(vma, address, outside_reserve,
> > + subpool_vma(vma));
>
> Change in the number of parameters of alloc_huge_page()
> looks unnecessary.
Because alloc_huge_page() also takes vma. Yeah, fair enough. I'll
change that in the respin too.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 4:48 [PATCH 1/2] Cleanup to hugetlb.h David Gibson
@ 2012-03-07 4:48 ` David Gibson
2012-03-07 12:28 ` Hillf Danton
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: David Gibson @ 2012-03-07 4:48 UTC
To: akpm, hughd
Cc: paulus, linux-kernel, David Gibson, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the kernel
does a get_user_pages() on hugepages (which can happen from KVM amongst
others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages store
pointers directly to the superblock, bumping the reference count as
appropriate to avoid it being freed, instead of storing a pointer to their
address_space and reaching the superblock from there. Andrew Morton
rejected that version, however, on the grounds that it made the existing
layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is,
a finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a pointer
to the subpool gets added to each allocated hugepage, instead of the
address_space pointer used now. The subpool has its own lifetime and is
only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
Subpools are optional - a NULL subpool pointer is taken by the code to mean
that no subpool limits are in effect.
Previous discussion of this bug can be found in "Fix refcounting in
hugetlbfs quota handling". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hillf Danton <dhillf@gmail.com>
---
fs/hugetlbfs/inode.c | 54 +++++++-----------
include/linux/hugetlb.h | 14 ++++--
mm/hugetlb.c | 135 +++++++++++++++++++++++++++++++++++++---------
3 files changed, 139 insertions(+), 64 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bb0e366..74c6ba2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
* blocks, like simple_statfs() */
- if (sbinfo->max_blocks >= 0) {
- buf->f_blocks = sbinfo->max_blocks;
- buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
+ if (sbinfo->spool) {
+ long free_pages;
+
+ spin_lock(&sbinfo->spool->lock);
+ buf->f_blocks = sbinfo->spool->max_hpages;
+ free_pages = sbinfo->spool->max_hpages
+ - sbinfo->spool->used_hpages;
+ buf->f_bavail = buf->f_bfree = free_pages;
+ spin_unlock(&sbinfo->spool->lock);
buf->f_files = sbinfo->max_inodes;
buf->f_ffree = sbinfo->free_inodes;
}
@@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb)
if (sbi) {
sb->s_fs_info = NULL;
+
+ if (sbi->spool)
+ hugepage_put_subpool(sbi->spool);
+
kfree(sbi);
}
}
@@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_fs_info = sbinfo;
sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_blocks = config.nr_blocks;
- sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
+ sbinfo->spool = NULL;
+ if (config.nr_blocks != -1) {
+ sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
+ if (!sbinfo->spool)
+ goto out_free;
+ }
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize = huge_page_size(config.hstate);
sb->s_blocksize_bits = huge_page_shift(config.hstate);
@@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = root;
return 0;
out_free:
+ if (sbinfo->spool)
+ kfree(sbinfo->spool);
kfree(sbinfo);
return -ENOMEM;
}
-int hugetlb_get_quota(struct address_space *mapping, long delta)
-{
- int ret = 0;
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks - delta >= 0)
- sbinfo->free_blocks -= delta;
- else
- ret = -ENOMEM;
- spin_unlock(&sbinfo->stat_lock);
- }
-
- return ret;
-}
-
-void hugetlb_put_quota(struct address_space *mapping, long delta)
-{
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
- sbinfo->free_blocks += delta;
- spin_unlock(&sbinfo->stat_lock);
- }
-}
-
static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7adc492..cf01817 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,6 +14,15 @@ struct user_struct;
#include <linux/shm.h>
#include <asm/tlbflush.h>
+struct hugepage_subpool {
+ spinlock_t lock;
+ long count;
+ long max_hpages, used_hpages;
+};
+
+struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
+void hugepage_put_subpool(struct hugepage_subpool *spool);
+
int PageHuge(struct page *page);
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
@@ -129,12 +138,11 @@ enum {
#ifdef CONFIG_HUGETLBFS
struct hugetlbfs_sb_info {
- long max_blocks; /* blocks allowed */
- long free_blocks; /* blocks free */
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
struct hstate *hstate;
+ struct hugepage_subpool *spool;
};
static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
@@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
static inline int is_file_hugepages(struct file *file)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..36b38b3a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
*/
static DEFINE_SPINLOCK(hugetlb_lock);
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+{
+ bool free = (spool->count == 0) && (spool->used_hpages == 0);
+
+ spin_unlock(&spool->lock);
+
+ /* If no pages are used, and no other handles to the subpool
+ * remain, free the subpool */
+ if (free)
+ kfree(spool);
+}
+
+struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
+{
+ struct hugepage_subpool *spool;
+
+ spool = kmalloc(sizeof(*spool), GFP_KERNEL);
+ if (!spool)
+ return NULL;
+
+ spin_lock_init(&spool->lock);
+ spool->count = 1;
+ spool->max_hpages = nr_blocks;
+ spool->used_hpages = 0;
+
+ return spool;
+}
+
+void hugepage_put_subpool(struct hugepage_subpool *spool)
+{
+ spin_lock(&spool->lock);
+ BUG_ON(!spool->count);
+ spool->count--;
+ unlock_or_release_subpool(spool);
+}
+
+static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
+ long delta)
+{
+ int ret = 0;
+
+ if (!spool)
+ return 0;
+
+ spin_lock(&spool->lock);
+ if ((spool->used_hpages + delta) <= spool->max_hpages) {
+ spool->used_hpages += delta;
+ } else {
+ ret = -ENOMEM;
+ }
+ spin_unlock(&spool->lock);
+
+ return ret;
+}
+
+static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
+ long delta)
+{
+ if (!spool)
+ return;
+
+ spin_lock(&spool->lock);
+ spool->used_hpages -= delta;
+ /* If hugetlbfs_put_super couldn't free spool due to
+ * an outstanding quota reference, free it now. */
+ unlock_or_release_subpool(spool);
+}
+
+static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
+{
+ return HUGETLBFS_SB(inode->i_sb)->spool;
+}
+
+static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
+{
+ return subpool_inode(vma->vm_file->f_dentry->d_inode);
+}
+
/*
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
@@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
+ struct hugepage_subpool *spool =
+ (struct hugepage_subpool *)page_private(page);
- mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
+ hugepage_subpool_put_pages(spool, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -966,11 +1043,12 @@ static void return_unused_surplus_pages(struct hstate *h,
/*
* Determine if the huge page at addr within the vma has an associated
* reservation. Where it does not we will need to logically increase
- * reservation and actually increase quota before an allocation can occur.
- * Where any new reservation would be required the reservation change is
- * prepared, but not committed. Once the page has been quota'd allocated
- * an instantiated the change should be committed via vma_commit_reservation.
- * No action is required on failure.
+ * reservation and actually increase subpool usage before an allocation
+ * can occur. Where any new reservation would be required the
+ * reservation change is prepared, but not committed. Once the page
+ * has been allocated from the subpool and instantiated the change should
+ * be committed via vma_commit_reservation. No action is required on
+ * failure.
*/
static long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
@@ -1019,24 +1097,24 @@ static void vma_commit_reservation(struct hstate *h,
static struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
+ struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- struct address_space *mapping = vma->vm_file->f_mapping;
- struct inode *inode = mapping->host;
long chg;
/*
- * Processes that did not create the mapping will have no reserves and
- * will not have accounted against quota. Check that the quota can be
- * made before satisfying the allocation
- * MAP_NORESERVE mappings may also need pages and quota allocated
- * if no reserve mapping overlaps.
+ * Processes that did not create the mapping will have no
+ * reserves and will not have accounted against subpool
+ * limit. Check that the subpool limit can be made before
+ * satisfying the allocation. MAP_NORESERVE mappings may also
+ * need pages and subpool limit allocated if no reserve
+ * mapping overlaps.
*/
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
return ERR_PTR(-VM_FAULT_OOM);
if (chg)
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ if (hugepage_subpool_get_pages(spool, chg))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
@@ -1046,12 +1124,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugepage_subpool_put_pages(spool, chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
- set_page_private(page, (unsigned long) mapping);
+ set_page_private(page, (unsigned long)spool);
vma_commit_reservation(h, vma, addr);
@@ -2072,6 +2150,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
struct resv_map *reservations = vma_resv_map(vma);
+ struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve;
unsigned long start;
unsigned long end;
@@ -2087,7 +2166,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
if (reserve) {
hugetlb_acct_memory(h, -reserve);
- hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+ hugepage_subpool_put_pages(spool, reserve);
}
}
}
@@ -2316,7 +2395,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
address = address & huge_page_mask(h);
pgoff = vma_hugecache_offset(h, vma, address);
- mapping = (struct address_space *)page_private(page);
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
/*
* Take the mapping lock for the duration of the table walk. As
@@ -2869,11 +2948,12 @@ int hugetlb_reserve_pages(struct inode *inode,
{
long ret, chg;
struct hstate *h = hstate_inode(inode);
+ struct hugepage_subpool *spool = subpool_inode(inode);
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
- * and filesystem quota without using reserves
+ * without using reserves
*/
if (vm_flags & VM_NORESERVE)
return 0;
@@ -2900,17 +2980,17 @@ int hugetlb_reserve_pages(struct inode *inode,
if (chg < 0)
return chg;
- /* There must be enough filesystem quota for the mapping */
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ /* There must be enough pages in the subpool for the mapping */
+ if (hugepage_subpool_get_pages(spool, chg))
return -ENOSPC;
/*
* Check enough hugepages are available for the reservation.
- * Hand back the quota if there are not
+ * Hand the pages back to the subpool if there are not
*/
ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugepage_subpool_put_pages(spool, chg);
return ret;
}
@@ -2934,12 +3014,13 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);
+ struct hugepage_subpool *spool = subpool_inode(inode);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugetlb_put_quota(inode->i_mapping, (chg - freed));
+ hugepage_subpool_put_pages(spool, (chg - freed));
hugetlb_acct_memory(h, -(chg - freed));
}
--
1.7.9.1
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 4:48 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
@ 2012-03-07 12:28 ` Hillf Danton
2012-03-08 0:57 ` David Gibson
2012-03-08 4:17 ` Aneesh Kumar K.V
2012-03-08 0:27 ` Andrew Morton
2012-03-08 4:30 ` Aneesh Kumar K.V
2 siblings, 2 replies; 16+ messages in thread
From: Hillf Danton @ 2012-03-07 12:28 UTC
To: David Gibson
Cc: akpm, hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Aneesh Kumar K.V
On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
<david@gibson.dropbear.id.au> wrote:
> hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> general quota handling code, and they don't much resemble its behaviour.
>
> ...
>
> @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
> + struct hugepage_subpool *spool =
> + (struct hugepage_subpool *)page_private(page);
>
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> + hugepage_subpool_put_pages(spool, 1);
Like current code, quota is handed back *unconditionally*, but ...
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -966,11 +1043,12 @@ static void return_unused_surplus_pages(struct hstate *h,
> /*
> * Determine if the huge page at addr within the vma has an associated
> * reservation. Where it does not we will need to logically increase
> - * reservation and actually increase quota before an allocation can occur.
> - * Where any new reservation would be required the reservation change is
> - * prepared, but not committed. Once the page has been quota'd allocated
> - * an instantiated the change should be committed via vma_commit_reservation.
> - * No action is required on failure.
> + * reservation and actually increase subpool usage before an allocation
> + * can occur. Where any new reservation would be required the
> + * reservation change is prepared, but not committed. Once the page
> + * has been allocated from the subpool and instantiated the change should
> + * be committed via vma_commit_reservation. No action is required on
> + * failure.
> */
> static long vma_needs_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> @@ -1019,24 +1097,24 @@ static void vma_commit_reservation(struct hstate *h,
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve)
> {
> + struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> long chg;
>
> /*
> - * Processes that did not create the mapping will have no reserves and
> - * will not have accounted against quota. Check that the quota can be
> - * made before satisfying the allocation
> - * MAP_NORESERVE mappings may also need pages and quota allocated
> - * if no reserve mapping overlaps.
> + * Processes that did not create the mapping will have no
> + * reserves and will not have accounted against subpool
> + * limit. Check that the subpool limit can be made before
> + * satisfying the allocation. MAP_NORESERVE mappings may also
> + * need pages and subpool limit allocated if no reserve
> + * mapping overlaps.
> */
> chg = vma_needs_reservation(h, vma, addr);
> if (chg < 0)
> return ERR_PTR(-VM_FAULT_OOM);
> if (chg)
> - if (hugetlb_get_quota(inode->i_mapping, chg))
> + if (hugepage_subpool_get_pages(spool, chg))
> return ERR_PTR(-VM_FAULT_SIGBUS);
... quota is allocated only if needed.
Given that mismatch, are you fixing the use-after-free bug, or
quota maintenance, or both?
>
> ...
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 4:48 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
2012-03-07 12:28 ` Hillf Danton
@ 2012-03-08 0:27 ` Andrew Morton
2012-03-08 2:09 ` David Gibson
2012-03-09 3:25 ` David Gibson
2012-03-08 4:30 ` Aneesh Kumar K.V
2 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2012-03-08 0:27 UTC
To: David Gibson
Cc: hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
On Wed, 7 Mar 2012 15:48:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> general quota handling code, and they don't much resemble its behaviour.
> Rather than being about maintaining limits on on-disk block usage by
> particular users, they are instead about maintaining limits on in-memory
> page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> associated with a particular hugetlbfs filesystem instance.
>
> Worse, they work by having callbacks to the hugetlbfs filesystem code from
> the low-level page handling code, in particular from free_huge_page().
> This is a layering violation of itself, but more importantly, if the kernel
> does a get_user_pages() on hugepages (which can happen from KVM amongst
> others), then the free_huge_page() can be delayed until after the
> associated inode has already been freed. If an unmount occurs at the
> wrong time, even the hugetlbfs superblock where the "quota" limits are
> stored may have been freed.
>
> Andrew Barry proposed a patch to fix this by having hugepages, instead of
> storing a pointer to their address_space and reaching the superblock from
> there, store pointers directly to the superblock, bumping
> the reference count as appropriate to avoid it being freed. Andrew Morton
> rejected that version, however, on the grounds that it made the existing
> layering violation worse.
>
> This is a reworked version of Andrew's patch, which removes the extra, and
> some of the existing, layering violation. It works by introducing the
> concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> a finite logical pool of hugepages to allocate from. hugetlbfs now creates
> a subpool for each filesystem instance with a page limit set, and a pointer
> to the subpool gets added to each allocated hugepage, instead of the
> address_space pointer used now. The subpool has its own lifetime and is
> only freed once all pages in it _and_ all other references to it (i.e.
> superblocks) are gone.
>
> subpools are optional - a NULL subpool pointer is taken by the code to mean
> that no subpool limits are in effect.
>
> Previous discussion of this bug can be found in: "Fix refcounting in hugetlbfs
> quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> http://marc.info/?l=linux-mm&m=126928970510627&w=1
>
> v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> alloc_huge_page() - since it already takes the vma, it is not necessary.
Looks good - thanks for doing this.
Some comments, nothing serious:
>
> ...
>
> @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_root = root;
> return 0;
> out_free:
> + if (sbinfo->spool)
> + kfree(sbinfo->spool);
kfree(NULL) is OK, and omitting the NULL test is idiomatic.
> kfree(sbinfo);
> return -ENOMEM;
> }
>
>
> ...
>
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -14,6 +14,15 @@ struct user_struct;
> #include <linux/shm.h>
> #include <asm/tlbflush.h>
>
> +struct hugepage_subpool {
> + spinlock_t lock;
> + long count;
> + long max_hpages, used_hpages;
> +};
Could we get this struct documented please? Why it exists, what it is
used for. Basically a copy-n-paste from the (nice) changelog.
Also please document the fields. I'm sitting here reading the code
trying to work out what these three fields semantically *mean*. Armed
with enough information to understand the code, I then get to read it
all again :(
I don't think any of these things can go negative, and a negative page
count is absurd. So perhaps they should have an unsigned type.
And maybe a 32-bit type? How much memory is 2^32 hugepages, and does
the present code even work correctly in that situation?
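For illustration, something along these lines (a sketch only; the field
meanings are inferred from the code below, and the wording that finally
lands may differ):

	/*
	 * A hugepage "subpool": a finite logical pool of hugepages to
	 * allocate from, created per hugetlbfs mount with a size limit.
	 * Freed only when both counters below reach zero, so it can
	 * safely outlive the superblock that created it.
	 */
	struct hugepage_subpool {
		spinlock_t lock;	/* protects the counters below */
		long count;		/* references (e.g. superblocks) */
		long max_hpages;	/* limit set at mount, in hugepages */
		long used_hpages;	/* hugepages currently charged */
	};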
> +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
> +void hugepage_put_subpool(struct hugepage_subpool *spool);
> +
> int PageHuge(struct page *page);
>
>
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
Strange name. unlock_and_release() would be closer?
> +{
> + bool free = (spool->count == 0) && (spool->used_hpages == 0);
I wish I knew what those fields do.
> + spin_unlock(&spool->lock);
> +
> + /* If no pages are used, and no other handles to the subpool
> + * remain, free the subpool the subpool remain */
Comment is mangled.
Please use the normal multiline comment layout? There are examples in
this file.
> + if (free)
> + kfree(spool);
> +}
> +
> +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
nr_blocks is logically an unsigned thing. And maybe 32-bit.
> +{
> + struct hugepage_subpool *spool;
> +
> + spool = kmalloc(sizeof(*spool), GFP_KERNEL);
> + if (!spool)
> + return NULL;
> +
> + spin_lock_init(&spool->lock);
> + spool->count = 1;
> + spool->max_hpages = nr_blocks;
> + spool->used_hpages = 0;
> +
> + return spool;
> +}
hm, I wonder why we didn't support a modular hugetlbfs driver. Perhaps
because of core hugetlb code which fiddles around in hugetlbfs. I
wonder if this patch fixed that?
> +void hugepage_put_subpool(struct hugepage_subpool *spool)
> +{
> + spin_lock(&spool->lock);
> + BUG_ON(!spool->count);
> + spool->count--;
> + unlock_or_release_subpool(spool);
> +}
ah-hah! ->count is in fact a refcount? Then why didn't we call it
"refcount" to save me all that head scratching? Here I was thinking it
must count hugepages.
> +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
> + long delta)
> +{
> + int ret = 0;
> +
> + if (!spool)
> + return 0;
> +
> + spin_lock(&spool->lock);
> + if ((spool->used_hpages + delta) <= spool->max_hpages) {
> + spool->used_hpages += delta;
> + } else {
> + ret = -ENOMEM;
> + }
Can lose all the braces here.
> + spin_unlock(&spool->lock);
> +
> + return ret;
> +}
And some documentation would be nice. It's strange to have a
foo_get_pages() which doesn't get any pages! And what does it mean
when spool==0? Why would anyone call this function without a subpool?
`delta' could be an unsigned int?
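Something like this would answer both questions (a sketch, inferred from
the code rather than anyone's final text):

	/*
	 * hugepage_subpool_get_pages() does not allocate pages; it
	 * charges @delta pages against the subpool's limit, reserving
	 * headroom for an allocation made elsewhere.  Returns 0 on
	 * success, -ENOMEM if the charge would exceed max_hpages.  A
	 * NULL @spool means no limit is in effect, so the charge
	 * trivially succeeds.
	 */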
> +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> + long delta)
> +{
> + if (!spool)
> + return;
> +
> + spin_lock(&spool->lock);
> + spool->used_hpages -= delta;
> + /* If hugetlbfs_put_super couldn't free spool due to
> + * an outstanding quota reference, free it now. */
This is the sole remaining use of the term "quota" in hugetlb (yay).
Can we make this one go away as well?
> + unlock_or_release_subpool(spool);
> +}
> +
> +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> +{
> + return HUGETLBFS_SB(inode->i_sb)->spool;
> +}
> +
> +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> +{
> + return subpool_inode(vma->vm_file->f_dentry->d_inode);
> +}
> +
> /*
> * Region tracking -- allows tracking of reservations and instantiated pages
> * across the pages in a mapping.
> @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
> + struct hugepage_subpool *spool =
> + (struct hugepage_subpool *)page_private(page);
Could do
struct hugepage_subpool *spool;
spool = (struct hugepage_subpool *)page_private(page);
and save the Ingo-upsetting 80-col trickery.
I'm shocked that we weren't already using the .private field of hugetlb
head pages.
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> + hugepage_subpool_put_pages(spool, 1);
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>
> ...
>
> @@ -2316,7 +2395,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> address = address & huge_page_mask(h);
> pgoff = vma_hugecache_offset(h, vma, address);
> - mapping = (struct address_space *)page_private(page);
> + mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
Does
mapping = vma->vm_file->f_mapping;
work? Avoiding all this pointer hopping is why we added f_mapping,
actually.
> /*
> * Take the mapping lock for the duration of the table walk. As
>
> ...
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 12:28 ` Hillf Danton
@ 2012-03-08 0:57 ` David Gibson
2012-03-08 4:17 ` Aneesh Kumar K.V
1 sibling, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-08 0:57 UTC (permalink / raw)
To: Hillf Danton
Cc: akpm, hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Aneesh Kumar K.V
On Wed, Mar 07, 2012 at 08:28:39PM +0800, Hillf Danton wrote:
> On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
> <david@gibson.dropbear.id.au> wrote:
[snip]
> > /*
> > * Region tracking -- allows tracking of reservations and instantiated pages
> > * across the pages in a mapping.
> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> > */
> > struct hstate *h = page_hstate(page);
> > int nid = page_to_nid(page);
> > - struct address_space *mapping;
> > + struct hugepage_subpool *spool =
> > + (struct hugepage_subpool *)page_private(page);
> >
> > - mapping = (struct address_space *) page_private(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> > BUG_ON(page_count(page));
> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> > enqueue_huge_page(h, page);
> > }
> > spin_unlock(&hugetlb_lock);
> > - if (mapping)
> > - hugetlb_put_quota(mapping, 1);
> > + hugepage_subpool_put_pages(spool, 1);
>
> Like current code, quota is handed back *unconditionally*, but ...
[snip]
> > /*
> > - * Processes that did not create the mapping will have no reserves and
> > - * will not have accounted against quota. Check that the quota can be
> > - * made before satisfying the allocation
> > - * MAP_NORESERVE mappings may also need pages and quota allocated
> > - * if no reserve mapping overlaps.
> > + * Processes that did not create the mapping will have no
> > + * reserves and will not have accounted against subpool
> > + * limit. Check that the subpool limit can be made before
> > + * satisfying the allocation. MAP_NORESERVE mappings may also
> > + * need pages and subpool limit allocated if no reserve
> > + * mapping overlaps.
> > */
> > chg = vma_needs_reservation(h, vma, addr);
> > if (chg < 0)
> > return ERR_PTR(-VM_FAULT_OOM);
> > if (chg)
> > - if (hugetlb_get_quota(inode->i_mapping, chg))
> > + if (hugepage_subpool_get_pages(spool, chg))
> > return ERR_PTR(-VM_FAULT_SIGBUS);
>
> ... quota is allocated only if needed.
>
> Given that mismatch, are you fixing the use-after-free bug, or
> quota maintenance, or both?
As you say, this is as before. So I'm only fixing the use-after-free
bug.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-08 0:27 ` Andrew Morton
@ 2012-03-08 2:09 ` David Gibson
2012-03-09 3:25 ` David Gibson
1 sibling, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-08 2:09 UTC (permalink / raw)
To: Andrew Morton
Cc: hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
On Wed, Mar 07, 2012 at 04:27:20PM -0800, Andrew Morton wrote:
> On Wed, 7 Mar 2012 15:48:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> > general quota handling code, and they don't much resemble its behaviour.
> > Rather than being about maintaining limits on on-disk block usage by
> > particular users, they are instead about maintaining limits on in-memory
> > page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> > associated with a particular hugetlbfs filesystem instance.
> >
> > Worse, they work by having callbacks to the hugetlbfs filesystem code from
> > the low-level page handling code, in particular from free_huge_page().
> > This is a layering violation in itself, but more importantly, if the kernel
> > does a get_user_pages() on hugepages (which can happen from KVM amongst
> > others), then the free_huge_page() can be delayed until after the
> > associated inode has already been freed. If an unmount occurs at the
> > wrong time, even the hugetlbfs superblock where the "quota" limits are
> > stored may have been freed.
> >
> > Andrew Barry proposed a patch to fix this by having hugepages, instead of
> > storing a pointer to their address_space and reaching the superblock from
> > there, store pointers directly to the superblock, bumping
> > the reference count as appropriate to avoid it being freed. Andrew Morton
> > rejected that version, however, on the grounds that it made the existing
> > layering violation worse.
> >
> > This is a reworked version of Andrew's patch, which removes the extra, and
> > some of the existing, layering violation. It works by introducing the
> > concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> > a finite logical pool of hugepages to allocate from. hugetlbfs now creates
> > a subpool for each filesystem instance with a page limit set, and a pointer
> > to the subpool gets added to each allocated hugepage, instead of the
> > address_space pointer used now. The subpool has its own lifetime and is
> > only freed once all pages in it _and_ all other references to it (i.e.
> > superblocks) are gone.
> >
> > subpools are optional - a NULL subpool pointer is taken by the code to mean
> > that no subpool limits are in effect.
> >
> > Previous discussion of this bug can be found in: "Fix refcounting in hugetlbfs
> > quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> > http://marc.info/?l=linux-mm&m=126928970510627&w=1
> >
> > v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> > alloc_huge_page() - since it already takes the vma, it is not necessary.
>
> Looks good - thanks for doing this.
>
> Some comments, nothing serious:
Ok. Since you've pulled this into -mm, do you want these addressed as a
further fixup patch, or as a new patch obsoleting the original?
> > @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> > sb->s_root = root;
> > return 0;
> > out_free:
> > + if (sbinfo->spool)
> > + kfree(sbinfo->spool);
>
> kfree(NULL) is OK, and omitting the NULL test is idiomatic.
Noted.
> > kfree(sbinfo);
> > return -ENOMEM;
> > }
> >
> >
> > ...
> >
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -14,6 +14,15 @@ struct user_struct;
> > #include <linux/shm.h>
> > #include <asm/tlbflush.h>
> >
> > +struct hugepage_subpool {
> > + spinlock_t lock;
> > + long count;
> > + long max_hpages, used_hpages;
> > +};
>
> Could we get this struct documented please? Why it exists, what it is
> used for. Basically a copy-n-paste from the (nice) changelog.
>
> Also please document the fields. I'm sitting here reading the code
> trying to work out what these three fields semantically *mean*. Armed
> with enough information to understand the code, I then get to read it
> all again :(
Ok, I'll add that.
>
> I don't think any of these things can go negative, and a negative page
> count is absurd. So perhaps they should have an unsigned type.
Ah, true. The 'long' type was a hangover from when this was in the
hugetlbfs superblock and -1 was used to mean "no limit" (now we use
subpool == NULL for that).
> And maybe a 32-bit type? How much memory is 2^32 hugepages, and does
> the present code even work correctly in that situation?
Well, with 4MB or 16MB hugepages, 2^32 is an awful lot of memory (2^32
16MB pages is 64PiB). But powerpc can also have 64kB "huge" pages, and
potentially even smaller on embedded, where 2^32 pages is only 256TiB,
within reach of large systems. I don't see an immediate reason that >2^32
hugepages would be impossible elsewhere, so using longs seems safer to
me.
I've made the refcount just an unsigned, though.
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
> > +void hugepage_put_subpool(struct hugepage_subpool *spool);
> > +
> > int PageHuge(struct page *page);
> >
> >
> > ...
> >
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> > */
> > static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
>
> Strange name. unlock_and_release() would be closer?
True, that's because earlier versions didn't bother with the unlock
when the subpool was freed, until I realised that would screw up the
pre-empt count and probably horribly confuse various lock analysis
things too.
unlock_and_release_subpool() isn't entirely accurate either, but I
couldn't think of anything better, so I've renamed it as such.
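To illustrate the hazard (a minimal sketch, assuming the caller enters
with spool->lock held; this is not code from any posted version):

	/* Broken: frees memory that still contains a held spinlock and
	 * leaves the preempt count elevated on the free path. */
	static void release_subpool_broken(struct hugepage_subpool *spool)
	{
		if (spool->count == 0 && spool->used_hpages == 0)
			kfree(spool);		/* never unlocked! */
		else
			spin_unlock(&spool->lock);
	}

	/* What the patch does instead: decide, unlock unconditionally,
	 * then free.  Safe because nothing else can reach the subpool
	 * once both counts are zero. */
	static void unlock_and_release_subpool(struct hugepage_subpool *spool)
	{
		bool free = (spool->count == 0) && (spool->used_hpages == 0);

		spin_unlock(&spool->lock);
		if (free)
			kfree(spool);
	}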
>
> > +{
> > + bool free = (spool->count == 0) && (spool->used_hpages == 0);
>
> I wish I knew what those fields do.
>
> > + spin_unlock(&spool->lock);
> > +
> > + /* If no pages are used, and no other handles to the subpool
> > + * remain, free the subpool the subpool remain */
>
> Comment is mangled.
>
> Please use the normal multiline comment layout? There are examples in
> this file.
Updated.
> > + if (free)
> > + kfree(spool);
> > +}
> > +
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
>
> nr_blocks is logically an unsigned thing. And maybe 32-bit.
>
> > +{
> > + struct hugepage_subpool *spool;
> > +
> > + spool = kmalloc(sizeof(*spool), GFP_KERNEL);
> > + if (!spool)
> > + return NULL;
> > +
> > + spin_lock_init(&spool->lock);
> > + spool->count = 1;
> > + spool->max_hpages = nr_blocks;
> > + spool->used_hpages = 0;
> > +
> > + return spool;
> > +}
>
> hm, I wonder why we didn't support a modular hugetlbfs driver. Perhaps
> because of core hugetlb code which fiddles around in hugetlbfs. I
> wonder if this patch fixed that?
Not completely. It does reduce the number of places where we fiddle
around with the hugetlbfs superblock, but there are still some in the
fault paths.
> > +void hugepage_put_subpool(struct hugepage_subpool *spool)
> > +{
> > + spin_lock(&spool->lock);
> > + BUG_ON(!spool->count);
> > + spool->count--;
> > + unlock_or_release_subpool(spool);
> > +}
>
> ah-hah! ->count is in fact a refcount? Then why didn't we call it
> "refcount" to save me all that head scratching? Here I was thinking it
> must count hugepages.
Ah, sorry. I was thinking in analogy to page->count. I've renamed it
refcount now.
> > +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
> > + long delta)
> > +{
> > + int ret = 0;
> > +
> > + if (!spool)
> > + return 0;
> > +
> > + spin_lock(&spool->lock);
> > + if ((spool->used_hpages + delta) <= spool->max_hpages) {
> > + spool->used_hpages += delta;
> > + } else {
> > + ret = -ENOMEM;
> > + }
>
> Can lose all the braces here.
Heh, been doing too much qemu work lately, where they insist on the
braces.
> > + spin_unlock(&spool->lock);
> > +
> > + return ret;
> > +}
>
> And some documentation would be nice. It's strange to have a
> foo_get_pages() which doesn't get any pages!
Yeah, it "get"s the pool not the page, in a sense.
I've added some comments, and renamed these to alloc_pages() and
release_pages(), though I'm not entirely sure if those names are
better, or just in danger of confusion with other things (suggestions
welcome).
> And what does it mean
> when spool==0? Why would anyone call this function without a subpool?
When they've pulled a maybe-NULL pool pointer from page_private or
elsewhere. Saves the callers having to do an if (spool) ...
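In other words, a caller-side sketch of the convention:

	/* without the guard in the helper, every caller would need: */
	if (spool)
		hugepage_subpool_put_pages(spool, 1);

	/* with it, a maybe-NULL pointer from page_private() is fine: */
	hugepage_subpool_put_pages(spool, 1);	/* no-op when spool == NULL */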
> `delta' could be an unsigned int?
Ok, changed to unsigned long to match the new pool counts.
> > +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> > + long delta)
> > +{
> > + if (!spool)
> > + return;
> > +
> > + spin_lock(&spool->lock);
> > + spool->used_hpages -= delta;
> > + /* If hugetlbfs_put_super couldn't free spool due to
> > + * an outstanding quota reference, free it now. */
>
> This is the sole remaining use of the term "quota" in hugetlb (yay).
> Can we make this one go away as well?
Updated.
> > + unlock_or_release_subpool(spool);
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> > +{
> > + return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > +{
> > + return subpool_inode(vma->vm_file->f_dentry->d_inode);
> > +}
> > +
> > /*
> > * Region tracking -- allows tracking of reservations and instantiated pages
> > * across the pages in a mapping.
> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> > */
> > struct hstate *h = page_hstate(page);
> > int nid = page_to_nid(page);
> > - struct address_space *mapping;
> > + struct hugepage_subpool *spool =
> > + (struct hugepage_subpool *)page_private(page);
>
> Could do
>
> struct hugepage_subpool *spool;
>
> spool = (struct hugepage_subpool *)page_private(page);
>
> and save the Ingo-upsetting 80-col trickery.
Ok.
> I'm shocked that we weren't already using the .private field of hugetlb
> head pages.
We were; they used to contain a reference to the address_space, which
I've replaced with the subpool. Unlike the mapping, the subpool has
its lifetime constructed to be longer than that of the page private
pointer - that is the essence of the bugfix.
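A compact way to see the lifetime difference (sketch, names as in the
patch):

	/*
	 * Before: page_private(page) -> address_space -> inode -> super_block
	 *	   any link in that chain can be freed while the page is
	 *	   still pinned by get_user_pages(): use after free.
	 *
	 * After:  page_private(page) -> hugepage_subpool
	 *	   refcounted, and not freed until used_hpages reaches 0,
	 *	   so it strictly outlives every page pointing at it.
	 */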
> > - mapping = (struct address_space *) page_private(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> > BUG_ON(page_count(page));
> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> > enqueue_huge_page(h, page);
> > }
> > spin_unlock(&hugetlb_lock);
> > - if (mapping)
> > - hugetlb_put_quota(mapping, 1);
> > + hugepage_subpool_put_pages(spool, 1);
> > }
> >
> > static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> >
> > ...
> >
> > @@ -2316,7 +2395,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> > */
> > address = address & huge_page_mask(h);
> > pgoff = vma_hugecache_offset(h, vma, address);
> > - mapping = (struct address_space *)page_private(page);
> > + mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
>
> Does
>
> mapping = vma->vm_file->f_mapping;
>
> work? Avoiding all this pointer hopping is why we added f_mapping,
> actually.
Should do. It's just been a while since I worked in this area, so I
didn't know about f_mapping (and didn't spot the other uses in here,
obviously).
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 12:28 ` Hillf Danton
2012-03-08 0:57 ` David Gibson
@ 2012-03-08 4:17 ` Aneesh Kumar K.V
2012-03-08 11:59 ` Hillf Danton
1 sibling, 1 reply; 16+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-08 4:17 UTC (permalink / raw)
To: Hillf Danton, David Gibson
Cc: akpm, hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim
On Wed, 7 Mar 2012 20:28:39 +0800, Hillf Danton <dhillf@gmail.com> wrote:
> On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
> <david@gibson.dropbear.id.au> wrote:
> > hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> > general quota handling code, and they don't much resemble its behaviour.
> > Rather than being about maintaining limits on on-disk block usage by
> > particular users, they are instead about maintaining limits on in-memory
> > page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> > associated with a particular hugetlbfs filesystem instance.
> >
> > Worse, they work by having callbacks to the hugetlbfs filesystem code from
> > the low-level page handling code, in particular from free_huge_page().
> > This is a layering violation in itself, but more importantly, if the kernel
> > does a get_user_pages() on hugepages (which can happen from KVM amongst
> > others), then the free_huge_page() can be delayed until after the
> > associated inode has already been freed. If an unmount occurs at the
> > wrong time, even the hugetlbfs superblock where the "quota" limits are
> > stored may have been freed.
> >
> > Andrew Barry proposed a patch to fix this by having hugepages, instead of
> > storing a pointer to their address_space and reaching the superblock from
> > there, store pointers directly to the superblock, bumping
> > the reference count as appropriate to avoid it being freed. Andrew Morton
> > rejected that version, however, on the grounds that it made the existing
> > layering violation worse.
> >
> > This is a reworked version of Andrew's patch, which removes the extra, and
> > some of the existing, layering violation. It works by introducing the
> > concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> > a finite logical pool of hugepages to allocate from. hugetlbfs now creates
> > a subpool for each filesystem instance with a page limit set, and a pointer
> > to the subpool gets added to each allocated hugepage, instead of the
> > address_space pointer used now. The subpool has its own lifetime and is
> > only freed once all pages in it _and_ all other references to it (i.e.
> > superblocks) are gone.
> >
> > subpools are optional - a NULL subpool pointer is taken by the code to mean
> > that no subpool limits are in effect.
> >
> > Previous discussion of this bug can be found in: "Fix refcounting in hugetlbfs
> > quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> > http://marc.info/?l=linux-mm&m=126928970510627&w=1
> >
> > v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> > alloc_huge_page() - since it already takes the vma, it is not necessary.
> >
> > Signed-off-by: Andrew Barry <abarry@cray.com>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Hillf Danton <dhillf@gmail.com>
> > ---
> > fs/hugetlbfs/inode.c | 54 +++++++-----------
> > include/linux/hugetlb.h | 14 ++++--
> > mm/hugetlb.c | 135 +++++++++++++++++++++++++++++++++++++---------
> > 3 files changed, 139 insertions(+), 64 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index bb0e366..74c6ba2 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> > spin_lock(&sbinfo->stat_lock);
> > /* If no limits set, just report 0 for max/free/used
> > * blocks, like simple_statfs() */
> > - if (sbinfo->max_blocks >= 0) {
> > - buf->f_blocks = sbinfo->max_blocks;
> > - buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
> > + if (sbinfo->spool) {
> > + long free_pages;
> > +
> > + spin_lock(&sbinfo->spool->lock);
> > + buf->f_blocks = sbinfo->spool->max_hpages;
> > + free_pages = sbinfo->spool->max_hpages
> > + - sbinfo->spool->used_hpages;
> > + buf->f_bavail = buf->f_bfree = free_pages;
> > + spin_unlock(&sbinfo->spool->lock);
> > buf->f_files = sbinfo->max_inodes;
> > buf->f_ffree = sbinfo->free_inodes;
> > }
> > @@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb)
> >
> > if (sbi) {
> > sb->s_fs_info = NULL;
> > +
> > + if (sbi->spool)
> > + hugepage_put_subpool(sbi->spool);
> > +
> > kfree(sbi);
> > }
> > }
> > @@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> > sb->s_fs_info = sbinfo;
> > sbinfo->hstate = config.hstate;
> > spin_lock_init(&sbinfo->stat_lock);
> > - sbinfo->max_blocks = config.nr_blocks;
> > - sbinfo->free_blocks = config.nr_blocks;
> > sbinfo->max_inodes = config.nr_inodes;
> > sbinfo->free_inodes = config.nr_inodes;
> > + sbinfo->spool = NULL;
> > + if (config.nr_blocks != -1) {
> > + sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
> > + if (!sbinfo->spool)
> > + goto out_free;
> > + }
> > sb->s_maxbytes = MAX_LFS_FILESIZE;
> > sb->s_blocksize = huge_page_size(config.hstate);
> > sb->s_blocksize_bits = huge_page_shift(config.hstate);
> > @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> > sb->s_root = root;
> > return 0;
> > out_free:
> > + if (sbinfo->spool)
> > + kfree(sbinfo->spool);
> > kfree(sbinfo);
> > return -ENOMEM;
> > }
> >
> > -int hugetlb_get_quota(struct address_space *mapping, long delta)
> > -{
> > - int ret = 0;
> > - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> > -
> > - if (sbinfo->free_blocks > -1) {
> > - spin_lock(&sbinfo->stat_lock);
> > - if (sbinfo->free_blocks - delta >= 0)
> > - sbinfo->free_blocks -= delta;
> > - else
> > - ret = -ENOMEM;
> > - spin_unlock(&sbinfo->stat_lock);
> > - }
> > -
> > - return ret;
> > -}
> > -
> > -void hugetlb_put_quota(struct address_space *mapping, long delta)
> > -{
> > - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> > -
> > - if (sbinfo->free_blocks > -1) {
> > - spin_lock(&sbinfo->stat_lock);
> > - sbinfo->free_blocks += delta;
> > - spin_unlock(&sbinfo->stat_lock);
> > - }
> > -}
> > -
> > static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
> > int flags, const char *dev_name, void *data)
> > {
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 7adc492..cf01817 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -14,6 +14,15 @@ struct user_struct;
> > #include <linux/shm.h>
> > #include <asm/tlbflush.h>
> >
> > +struct hugepage_subpool {
> > + spinlock_t lock;
> > + long count;
> > + long max_hpages, used_hpages;
> > +};
> > +
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
> > +void hugepage_put_subpool(struct hugepage_subpool *spool);
> > +
> > int PageHuge(struct page *page);
> >
> > void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > @@ -129,12 +138,11 @@ enum {
> >
> > #ifdef CONFIG_HUGETLBFS
> > struct hugetlbfs_sb_info {
> > - long max_blocks; /* blocks allowed */
> > - long free_blocks; /* blocks free */
> > long max_inodes; /* inodes allowed */
> > long free_inodes; /* inodes free */
> > spinlock_t stat_lock;
> > struct hstate *hstate;
> > + struct hugepage_subpool *spool;
> > };
> >
> > static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
> > @@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations;
> > extern const struct vm_operations_struct hugetlb_vm_ops;
> > struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
> > struct user_struct **user, int creat_flags);
> > -int hugetlb_get_quota(struct address_space *mapping, long delta);
> > -void hugetlb_put_quota(struct address_space *mapping, long delta);
> >
> > static inline int is_file_hugepages(struct file *file)
> > {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5f34bd8..36b38b3a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> > */
> > static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> > +{
> > + bool free = (spool->count == 0) && (spool->used_hpages == 0);
> > +
> > + spin_unlock(&spool->lock);
> > +
> > + /* If no pages are used, and no other handles to the subpool
> > + * remain, free the subpool the subpool remain */
> > + if (free)
> > + kfree(spool);
> > +}
> > +
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
> > +{
> > + struct hugepage_subpool *spool;
> > +
> > + spool = kmalloc(sizeof(*spool), GFP_KERNEL);
> > + if (!spool)
> > + return NULL;
> > +
> > + spin_lock_init(&spool->lock);
> > + spool->count = 1;
> > + spool->max_hpages = nr_blocks;
> > + spool->used_hpages = 0;
> > +
> > + return spool;
> > +}
> > +
> > +void hugepage_put_subpool(struct hugepage_subpool *spool)
> > +{
> > + spin_lock(&spool->lock);
> > + BUG_ON(!spool->count);
> > + spool->count--;
> > + unlock_or_release_subpool(spool);
> > +}
> > +
> > +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
> > + long delta)
> > +{
> > + int ret = 0;
> > +
> > + if (!spool)
> > + return 0;
> > +
> > + spin_lock(&spool->lock);
> > + if ((spool->used_hpages + delta) <= spool->max_hpages) {
> > + spool->used_hpages += delta;
> > + } else {
> > + ret = -ENOMEM;
> > + }
> > + spin_unlock(&spool->lock);
> > +
> > + return ret;
> > +}
> > +
> > +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> > + long delta)
> > +{
> > + if (!spool)
> > + return;
> > +
> > + spin_lock(&spool->lock);
> > + spool->used_hpages -= delta;
> > + /* If hugetlbfs_put_super couldn't free spool due to
> > + * an outstanding quota reference, free it now. */
> > + unlock_or_release_subpool(spool);
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> > +{
> > + return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > +{
> > + return subpool_inode(vma->vm_file->f_dentry->d_inode);
> > +}
> > +
> > /*
> > * Region tracking -- allows tracking of reservations and instantiated pages
> > * across the pages in a mapping.
> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> > */
> > struct hstate *h = page_hstate(page);
> > int nid = page_to_nid(page);
> > - struct address_space *mapping;
> > + struct hugepage_subpool *spool =
> > + (struct hugepage_subpool *)page_private(page);
> >
> > - mapping = (struct address_space *) page_private(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> > BUG_ON(page_count(page));
> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> > enqueue_huge_page(h, page);
> > }
> > spin_unlock(&hugetlb_lock);
> > - if (mapping)
> > - hugetlb_put_quota(mapping, 1);
> > + hugepage_subpool_put_pages(spool, 1);
>
> Like current code, quota is handed back *unconditionally*, but ...
We will end up doing get_quota for every allocated page. get_quota
happens either during mmap(), if MAP_NORESERVE is not specified,
or during alloc_huge_page() if we haven't done a quota reservation during
mmap for that range. Are you finding any part of the code where we miss that?
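Schematically, the charge/uncharge paths in question (a simplified call
sketch of the code quoted above, not new behaviour):

	/*
	 * mmap() time, no MAP_NORESERVE:
	 *	hugetlb_reserve_pages()
	 *	  -> hugepage_subpool_get_pages(spool, chg);  [whole range]
	 *
	 * fault time, range not previously reserved:
	 *	alloc_huge_page()
	 *	  -> chg = vma_needs_reservation(h, vma, addr);
	 *	  -> if (chg)
	 *		hugepage_subpool_get_pages(spool, chg);
	 *
	 * free path, one unit back per page, unconditionally:
	 *	free_huge_page()
	 *	  -> hugepage_subpool_put_pages(spool, 1);
	 */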
-aneesh
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-07 4:48 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
2012-03-07 12:28 ` Hillf Danton
2012-03-08 0:27 ` Andrew Morton
@ 2012-03-08 4:30 ` Aneesh Kumar K.V
2012-03-08 14:18 ` David Gibson
2 siblings, 1 reply; 16+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-08 4:30 UTC (permalink / raw)
To: David Gibson, akpm, hughd
Cc: paulus, linux-kernel, David Gibson, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
On Wed, 7 Mar 2012 15:48:14 +1100, David Gibson <david@gibson.dropbear.id.au> wrote:
> hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> general quota handling code, and they don't much resemble its behaviour.
> Rather than being about maintaining limits on on-disk block usage by
> particular users, they are instead about maintaining limits on in-memory
> page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> associated with a particular hugetlbfs filesystem instance.
>
> Worse, they work by having callbacks to the hugetlbfs filesystem code from
> the low-level page handling code, in particular from free_huge_page().
> This is a layering violation in itself, but more importantly, if the kernel
> does a get_user_pages() on hugepages (which can happen from KVM amongst
> others), then the free_huge_page() can be delayed until after the
> associated inode has already been freed. If an unmount occurs at the
> wrong time, even the hugetlbfs superblock where the "quota" limits are
> stored may have been freed.
>
> Andrew Barry proposed a patch to fix this by having hugepages, instead of
> storing a pointer to their address_space and reaching the superblock from
> there, store pointers directly to the superblock, bumping
> the reference count as appropriate to avoid it being freed. Andrew Morton
> rejected that version, however, on the grounds that it made the existing
> layering violation worse.
>
> This is a reworked version of Andrew's patch, which removes the extra, and
> some of the existing, layering violation. It works by introducing the
> concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> a finite logical pool of hugepages to allocate from. hugetlbfs now creates
> a subpool for each filesystem instance with a page limit set, and a pointer
> to the subpool gets added to each allocated hugepage, instead of the
> address_space pointer used now. The subpool has its own lifetime and is
> only freed once all pages in it _and_ all other references to it (i.e.
> superblocks) are gone.
>
> subpools are optional - a NULL subpool pointer is taken by the code to mean
> that no subpool limits are in effect.
>
> Previous discussion of this bug can be found in: "Fix refcounting in hugetlbfs
> quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> http://marc.info/?l=linux-mm&m=126928970510627&w=1
>
> v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> alloc_huge_page() - since it already takes the vma, it is not necessary.
>
> Signed-off-by: Andrew Barry <abarry@cray.com>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hillf Danton <dhillf@gmail.com>
> ---
> fs/hugetlbfs/inode.c | 54 +++++++-----------
> include/linux/hugetlb.h | 14 ++++--
> mm/hugetlb.c | 135 +++++++++++++++++++++++++++++++++++++---------
> 3 files changed, 139 insertions(+), 64 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index bb0e366..74c6ba2 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> spin_lock(&sbinfo->stat_lock);
> /* If no limits set, just report 0 for max/free/used
> * blocks, like simple_statfs() */
> - if (sbinfo->max_blocks >= 0) {
> - buf->f_blocks = sbinfo->max_blocks;
> - buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
> + if (sbinfo->spool) {
> + long free_pages;
> +
> + spin_lock(&sbinfo->spool->lock);
> + buf->f_blocks = sbinfo->spool->max_hpages;
> + free_pages = sbinfo->spool->max_hpages
> + - sbinfo->spool->used_hpages;
> + buf->f_bavail = buf->f_bfree = free_pages;
> + spin_unlock(&sbinfo->spool->lock);
> buf->f_files = sbinfo->max_inodes;
> buf->f_ffree = sbinfo->free_inodes;
> }
> @@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb)
>
> if (sbi) {
> sb->s_fs_info = NULL;
> +
> + if (sbi->spool)
> + hugepage_put_subpool(sbi->spool);
> +
> kfree(sbi);
> }
> }
> @@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_fs_info = sbinfo;
> sbinfo->hstate = config.hstate;
> spin_lock_init(&sbinfo->stat_lock);
> - sbinfo->max_blocks = config.nr_blocks;
> - sbinfo->free_blocks = config.nr_blocks;
> sbinfo->max_inodes = config.nr_inodes;
> sbinfo->free_inodes = config.nr_inodes;
> + sbinfo->spool = NULL;
> + if (config.nr_blocks != -1) {
> + sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
> + if (!sbinfo->spool)
> + goto out_free;
> + }
> sb->s_maxbytes = MAX_LFS_FILESIZE;
> sb->s_blocksize = huge_page_size(config.hstate);
> sb->s_blocksize_bits = huge_page_shift(config.hstate);
> @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_root = root;
> return 0;
> out_free:
> + if (sbinfo->spool)
> + kfree(sbinfo->spool);
kfree() should handle NULL
> kfree(sbinfo);
> return -ENOMEM;
> }
>
> -int hugetlb_get_quota(struct address_space *mapping, long delta)
> -{
> - int ret = 0;
> - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> -
> - if (sbinfo->free_blocks > -1) {
> - spin_lock(&sbinfo->stat_lock);
> - if (sbinfo->free_blocks - delta >= 0)
> - sbinfo->free_blocks -= delta;
> - else
> - ret = -ENOMEM;
> - spin_unlock(&sbinfo->stat_lock);
> - }
> -
> - return ret;
> -}
> -
> -void hugetlb_put_quota(struct address_space *mapping, long delta)
> -{
> - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> -
> - if (sbinfo->free_blocks > -1) {
> - spin_lock(&sbinfo->stat_lock);
> - sbinfo->free_blocks += delta;
> - spin_unlock(&sbinfo->stat_lock);
> - }
> -}
> -
> static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data)
> {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 7adc492..cf01817 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -14,6 +14,15 @@ struct user_struct;
> #include <linux/shm.h>
> #include <asm/tlbflush.h>
>
> +struct hugepage_subpool {
> + spinlock_t lock;
> + long count;
> + long max_hpages, used_hpages;
> +};
> +
> +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
> +void hugepage_put_subpool(struct hugepage_subpool *spool);
> +
> int PageHuge(struct page *page);
>
> void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> @@ -129,12 +138,11 @@ enum {
>
> #ifdef CONFIG_HUGETLBFS
> struct hugetlbfs_sb_info {
> - long max_blocks; /* blocks allowed */
> - long free_blocks; /* blocks free */
> long max_inodes; /* inodes allowed */
> long free_inodes; /* inodes free */
> spinlock_t stat_lock;
> struct hstate *hstate;
> + struct hugepage_subpool *spool;
> };
>
> static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
> @@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations;
> extern const struct vm_operations_struct hugetlb_vm_ops;
> struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
> struct user_struct **user, int creat_flags);
> -int hugetlb_get_quota(struct address_space *mapping, long delta);
> -void hugetlb_put_quota(struct address_space *mapping, long delta);
>
> static inline int is_file_hugepages(struct file *file)
> {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f34bd8..36b38b3a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> +{
> + bool free = (spool->count == 0) && (spool->used_hpages == 0);
> +
I have not seen that style in other parts of the kernel. Maybe make it explicit:
if (spool->count == 0 && spool->used_hpages == 0)
	free = true;
> + spin_unlock(&spool->lock);
> +
Having the spin_lock held across functions is also strange. Since there
are only two callers, maybe this can be inlined in the callers?
> + /* If no pages are used, and no other handles to the subpool
> + * remain, free the subpool the subpool remain */
> + if (free)
> + kfree(spool);
> +}
> +
> +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
> +{
> + struct hugepage_subpool *spool;
> +
> + spool = kmalloc(sizeof(*spool), GFP_KERNEL);
> + if (!spool)
> + return NULL;
> +
> + spin_lock_init(&spool->lock);
> + spool->count = 1;
> + spool->max_hpages = nr_blocks;
> + spool->used_hpages = 0;
> +
> + return spool;
> +}
> +
> +void hugepage_put_subpool(struct hugepage_subpool *spool)
> +{
> + spin_lock(&spool->lock);
> + BUG_ON(!spool->count);
> + spool->count--;
> + unlock_or_release_subpool(spool);
> +}
> +
> +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
> + long delta)
s/long delta/long hpage/ or something like that ?
> +{
> + int ret = 0;
> +
> + if (!spool)
> + return 0;
> +
> + spin_lock(&spool->lock);
> + if ((spool->used_hpages + delta) <= spool->max_hpages) {
> + spool->used_hpages += delta;
> + } else {
> + ret = -ENOMEM;
> + }
> + spin_unlock(&spool->lock);
> +
> + return ret;
> +}
> +
> +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> + long delta)
> +{
> + if (!spool)
> + return;
> +
> + spin_lock(&spool->lock);
> + spool->used_hpages -= delta;
> + /* If hugetlbfs_put_super couldn't free spool due to
> + * an outstanding quota reference, free it now. */
> + unlock_or_release_subpool(spool);
> +}
> +
> +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
s/subpool_inode/subpool_from_inode/ ?
> +{
> + return HUGETLBFS_SB(inode->i_sb)->spool;
> +}
> +
> +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
s/subpool_vma/subpool_from_vma/ ?
> +{
> + return subpool_inode(vma->vm_file->f_dentry->d_inode);
> +}
> +
> /*
> * Region tracking -- allows tracking of reservations and instantiated pages
> * across the pages in a mapping.
> @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
> + struct hugepage_subpool *spool =
> + (struct hugepage_subpool *)page_private(page);
>
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> + hugepage_subpool_put_pages(spool, 1);
We would still need the if () check there. When we do echo x >
/proc/sys/vm/nr_hugepages we would call free_huge_page() to prepare the new
huge page pool, but we don't have page_private set for those pages.
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -966,11 +1043,12 @@ static void return_unused_surplus_pages(struct hstate *h,
> /*
> * Determine if the huge page at addr within the vma has an associated
> * reservation. Where it does not we will need to logically increase
> - * reservation and actually increase quota before an allocation can occur.
> - * Where any new reservation would be required the reservation change is
> - * prepared, but not committed. Once the page has been quota'd allocated
> - * an instantiated the change should be committed via vma_commit_reservation.
> - * No action is required on failure.
> + * reservation and actually increase subpool usage before an allocation
> + * can occur. Where any new reservation would be required the
> + * reservation change is prepared, but not committed. Once the page
> + * has been allocated from the subpool and instantiated the change should
> + * be committed via vma_commit_reservation. No action is required on
> + * failure.
> */
> static long vma_needs_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> @@ -1019,24 +1097,24 @@ static void vma_commit_reservation(struct hstate *h,
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve)
> {
> + struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> long chg;
>
> /*
> - * Processes that did not create the mapping will have no reserves and
> - * will not have accounted against quota. Check that the quota can be
> - * made before satisfying the allocation
> - * MAP_NORESERVE mappings may also need pages and quota allocated
> - * if no reserve mapping overlaps.
> + * Processes that did not create the mapping will have no
> + * reserves and will not have accounted against subpool
> + * limit. Check that the subpool limit can be made before
> + * satisfying the allocation. MAP_NORESERVE mappings may also
> + * need pages and subpool limit allocated if no reserve
> + * mapping overlaps.
> */
> chg = vma_needs_reservation(h, vma, addr);
> if (chg < 0)
> return ERR_PTR(-VM_FAULT_OOM);
> if (chg)
> - if (hugetlb_get_quota(inode->i_mapping, chg))
> + if (hugepage_subpool_get_pages(spool, chg))
> return ERR_PTR(-VM_FAULT_SIGBUS);
>
> spin_lock(&hugetlb_lock);
> @@ -1046,12 +1124,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> if (!page) {
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> if (!page) {
> - hugetlb_put_quota(inode->i_mapping, chg);
> + hugepage_subpool_put_pages(spool, chg);
> return ERR_PTR(-VM_FAULT_SIGBUS);
> }
> }
>
> - set_page_private(page, (unsigned long) mapping);
> + set_page_private(page, (unsigned long)spool);
>
> vma_commit_reservation(h, vma, addr);
>
> @@ -2072,6 +2150,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> {
> struct hstate *h = hstate_vma(vma);
> struct resv_map *reservations = vma_resv_map(vma);
> + struct hugepage_subpool *spool = subpool_vma(vma);
> unsigned long reserve;
> unsigned long start;
> unsigned long end;
> @@ -2087,7 +2166,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>
> if (reserve) {
> hugetlb_acct_memory(h, -reserve);
> - hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
> + hugepage_subpool_put_pages(spool, reserve);
> }
> }
> }
> @@ -2316,7 +2395,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> address = address & huge_page_mask(h);
> pgoff = vma_hugecache_offset(h, vma, address);
> - mapping = (struct address_space *)page_private(page);
> + mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
>
> /*
> * Take the mapping lock for the duration of the table walk. As
> @@ -2869,11 +2948,12 @@ int hugetlb_reserve_pages(struct inode *inode,
> {
> long ret, chg;
> struct hstate *h = hstate_inode(inode);
> + struct hugepage_subpool *spool = subpool_inode(inode);
>
> /*
> * Only apply hugepage reservation if asked. At fault time, an
> * attempt will be made for VM_NORESERVE to allocate a page
> - * and filesystem quota without using reserves
> + * without using reserves
> */
> if (vm_flags & VM_NORESERVE)
> return 0;
> @@ -2900,17 +2980,17 @@ int hugetlb_reserve_pages(struct inode *inode,
> if (chg < 0)
> return chg;
>
> - /* There must be enough filesystem quota for the mapping */
> - if (hugetlb_get_quota(inode->i_mapping, chg))
> + /* There must be enough pages in the subpool for the mapping */
> + if (hugepage_subpool_get_pages(spool, chg))
> return -ENOSPC;
>
> /*
> * Check enough hugepages are available for the reservation.
> - * Hand back the quota if there are not
> + * Hand the pages back to the subpool if there are not
> */
> ret = hugetlb_acct_memory(h, chg);
> if (ret < 0) {
> - hugetlb_put_quota(inode->i_mapping, chg);
> + hugepage_subpool_put_pages(spool, chg);
> return ret;
> }
>
> @@ -2934,12 +3014,13 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> struct hstate *h = hstate_inode(inode);
> long chg = region_truncate(&inode->i_mapping->private_list, offset);
> + struct hugepage_subpool *spool = subpool_inode(inode);
>
> spin_lock(&inode->i_lock);
> inode->i_blocks -= (blocks_per_huge_page(h) * freed);
> spin_unlock(&inode->i_lock);
>
> - hugetlb_put_quota(inode->i_mapping, (chg - freed));
> + hugepage_subpool_put_pages(spool, (chg - freed));
> hugetlb_acct_memory(h, -(chg - freed));
> }
>
-aneesh
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-08 4:17 ` Aneesh Kumar K.V
@ 2012-03-08 11:59 ` Hillf Danton
2012-03-08 14:19 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: Hillf Danton @ 2012-03-08 11:59 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: David Gibson, akpm, hughd, paulus, linux-kernel, Andrew Barry,
Mel Gorman, Minchan Kim
On Thu, Mar 8, 2012 at 12:17 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Wed, 7 Mar 2012 20:28:39 +0800, Hillf Danton <dhillf@gmail.com> wrote:
>> On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
>> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
>> > */
>> > struct hstate *h = page_hstate(page);
>> > int nid = page_to_nid(page);
>> > - struct address_space *mapping;
>> > + struct hugepage_subpool *spool =
>> > + (struct hugepage_subpool *)page_private(page);
>> >
>> > - mapping = (struct address_space *) page_private(page);
>> > set_page_private(page, 0);
>> > page->mapping = NULL;
>> > BUG_ON(page_count(page));
>> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
>> > enqueue_huge_page(h, page);
>> > }
>> > spin_unlock(&hugetlb_lock);
>> > - if (mapping)
>> > - hugetlb_put_quota(mapping, 1);
>> > + hugepage_subpool_put_pages(spool, 1);
>>
>> Like current code, quota is handed back *unconditionally*, but ...
>
>
> We will end up doing get_quota for every allocated page. get_quota
> happens either during mmap(), if MAP_NORESERVE is not specified,
> or during alloc_huge_page() if we haven't done a quota reservation during
> mmap for that range. Are you finding any part of the code where we miss that?
>
>
Thank you, Aneesh; I worked it out along the direction of your question.
In this alternative to David's approach, quota is no longer reclaimed when
pages are freed but when they are truncated, and we end up with a smaller
.text size, with the use-after-free bug fixed as well.
-hd
--- a/mm/hugetlb.c Mon Mar 5 20:20:34 2012
+++ b/mm/hugetlb.c Thu Mar 8 19:47:26 2012
@@ -533,9 +533,7 @@ static void free_huge_page(struct page *
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
- mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -551,8 +549,6 @@ static void free_huge_page(struct page *
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -2944,8 +2940,8 @@ void hugetlb_unreserve_pages(struct inod
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugetlb_put_quota(inode->i_mapping, (chg - freed));
- hugetlb_acct_memory(h, -(chg - freed));
+ hugetlb_put_quota(inode->i_mapping, chg);
+ hugetlb_acct_memory(h, -chg);
}
#ifdef CONFIG_MEMORY_FAILURE
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-08 4:30 ` Aneesh Kumar K.V
@ 2012-03-08 14:18 ` David Gibson
0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-08 14:18 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: akpm, hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
On Thu, Mar 08, 2012 at 10:00:57AM +0530, Aneesh Kumar K.V wrote:
> On Wed, 7 Mar 2012 15:48:14 +1100, David Gibson <david@gibson.dropbear.id.au> wrote:
[snip]
> > out_free:
> > + if (sbinfo->spool)
> > + kfree(sbinfo->spool);
>
> kfree() should handle NULL
I've already changed that at akpm's suggestion.
> > @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> > */
> > static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> > +{
> > + bool free = (spool->count == 0) && (spool->used_hpages == 0);
> > +
>
> I have not seen that style in other parts of the kernel. Maybe make it explicit:
>
> if (spool->count == 0 && spool->used_hpages == 0)
> free = true;
>
> > + spin_unlock(&spool->lock);
> > +
>
>
> Having the spin_lock held across functions is also strange. Since there
> are only two callers, maybe this can be inlined in the callers?
It's a bit unusual, yes, but it seems to me clearer to outline this
operation in a function.
> > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
>
> s/subpool_inode/subpool_from_inode/ ?
>
> > +{
> > + return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>
> s/subpool_vma/subpool_from_vma/ ?
These functions are named to match hstate_inode() and hstate_vma().
> > +{
> > + return subpool_inode(vma->vm_file->f_dentry->d_inode);
> > +}
> > +
> > /*
> > * Region tracking -- allows tracking of reservations and instantiated pages
> > * across the pages in a mapping.
> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> > */
> > struct hstate *h = page_hstate(page);
> > int nid = page_to_nid(page);
> > - struct address_space *mapping;
> > + struct hugepage_subpool *spool =
> > + (struct hugepage_subpool *)page_private(page);
> >
> > - mapping = (struct address_space *) page_private(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> > BUG_ON(page_count(page));
> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> > enqueue_huge_page(h, page);
> > }
> > spin_unlock(&hugetlb_lock);
> > - if (mapping)
> > - hugetlb_put_quota(mapping, 1);
> > + hugepage_subpool_put_pages(spool, 1);
>
>
> We would still need the if () check there. When we do echo x >
> /proc/sys/vm/nr_hugepages we would call free_huge_page() to prepare the new
> huge page pool, but we don't have page_private set for those pages.
No, this is correct: hugepage_subpool_put_pages() is deliberately a
no-op if spool == NULL.
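For reference, a sketch of the guard (as in the patch posted upthread):

	static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
					       long delta)
	{
		if (!spool)	/* NULL means no subpool limits in effect */
			return;

		spin_lock(&spool->lock);
		spool->used_hpages -= delta;
		unlock_or_release_subpool(spool);
	}

Pages that were never charged to a subpool keep page_private == 0, so
free_huge_page() passes NULL here and the call degenerates to a return.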
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-08 11:59 ` Hillf Danton
@ 2012-03-08 14:19 ` David Gibson
0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-08 14:19 UTC (permalink / raw)
To: Hillf Danton
Cc: Aneesh Kumar K.V, akpm, hughd, paulus, linux-kernel, Andrew Barry,
Mel Gorman, Minchan Kim
On Thu, Mar 08, 2012 at 07:59:27PM +0800, Hillf Danton wrote:
> On Thu, Mar 8, 2012 at 12:17 PM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > On Wed, 7 Mar 2012 20:28:39 +0800, Hillf Danton <dhillf@gmail.com> wrote:
> >> On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
> >> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> >> > */
> >> > struct hstate *h = page_hstate(page);
> >> > int nid = page_to_nid(page);
> >> > - struct address_space *mapping;
> >> > + struct hugepage_subpool *spool =
> >> > + (struct hugepage_subpool *)page_private(page);
> >> >
> >> > - mapping = (struct address_space *) page_private(page);
> >> > set_page_private(page, 0);
> >> > page->mapping = NULL;
> >> > BUG_ON(page_count(page));
> >> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> >> > enqueue_huge_page(h, page);
> >> > }
> >> > spin_unlock(&hugetlb_lock);
> >> > - if (mapping)
> >> > - hugetlb_put_quota(mapping, 1);
> >> > + hugepage_subpool_put_pages(spool, 1);
> >>
> >> Like the current code, quota is handed back *unconditionally*, but ...
> >
> >
> > We will end up doing get_quota for every allocated page. get_quota
> > happens either during mmap(), when MAP_NORESERVE is not specified, or
> > during alloc_huge_page() if we haven't done a quota reservation during
> > mmap for that range. Do you see any part of the code where we miss that?
> >
> >
> Thank you, Aneesh, I worked it out along the lines of your question.
>
> Departing from David's approach, quota is no longer reclaimed when pages are
> freed, but when they are truncated; we end up with a smaller .text size, and
> with the use-after-free bug fixed as well.
Sorry, I don't follow what you're saying here.
> --- a/mm/hugetlb.c Mon Mar 5 20:20:34 2012
> +++ b/mm/hugetlb.c Thu Mar 8 19:47:26 2012
> @@ -533,9 +533,7 @@ static void free_huge_page(struct page *
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
>
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -551,8 +549,6 @@ static void free_huge_page(struct page *
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -2944,8 +2940,8 @@ void hugetlb_unreserve_pages(struct inod
> inode->i_blocks -= (blocks_per_huge_page(h) * freed);
> spin_unlock(&inode->i_lock);
>
> - hugetlb_put_quota(inode->i_mapping, (chg - freed));
> - hugetlb_acct_memory(h, -(chg - freed));
> + hugetlb_put_quota(inode->i_mapping, chg);
> + hugetlb_acct_memory(h, -chg);
> }
>
> #ifdef CONFIG_MEMORY_FAILURE
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling
2012-03-08 0:27 ` Andrew Morton
2012-03-08 2:09 ` David Gibson
@ 2012-03-09 3:25 ` David Gibson
1 sibling, 0 replies; 16+ messages in thread
From: David Gibson @ 2012-03-09 3:25 UTC (permalink / raw)
To: Andrew Morton
Cc: hughd, paulus, linux-kernel, Andrew Barry, Mel Gorman,
Minchan Kim, Hillf Danton
On Wed, Mar 07, 2012 at 04:27:20PM -0800, Andrew Morton wrote:
> On Wed, 7 Mar 2012 15:48:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
> > general quota handling code, and they don't much resemble its behaviour.
> > Rather than being about maintaining limits on on-disk block usage by
> > particular users, they are instead about maintaining limits on in-memory
> > page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> > associated with a particular hugetlbfs filesystem instance.
> >
> > Worse, they work by having callbacks to the hugetlbfs filesystem code from
> > the low-level page handling code, in particular from free_huge_page().
> > This is a layering violation in itself, but more importantly, if the kernel
> > does a get_user_pages() on hugepages (which can happen from KVM amongst
> > others), then the free_huge_page() can be delayed until after the
> > associated inode has already been freed. If an unmount occurs at the
> > wrong time, even the hugetlbfs superblock where the "quota" limits are
> > stored may have been freed.
> >
> > Andrew Barry proposed a patch to fix this by having hugepages, instead of
> > storing a pointer to their address_space and reaching the superblock from
> > there, store pointers directly to the superblock, bumping the reference
> > count as appropriate to avoid it being freed. Andrew Morton
> > rejected that version, however, on the grounds that it made the existing
> > layering violation worse.
> >
> > This is a reworked version of Andrew's patch, which removes the extra, and
> > some of the existing, layering violation. It works by introducing the
> > concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> > a finite logical pool of hugepages to allocate from. hugetlbfs now creates
> > a subpool for each filesystem instance with a page limit set, and a pointer
> > to the subpool gets added to each allocated hugepage, instead of the
> > address_space pointer used now. The subpool has its own lifetime and is
> > only freed once all pages in it _and_ all other references to it (i.e.
> > superblocks) are gone.
> >
> > subpools are optional - a NULL subpool pointer is taken by the code to mean
> > that no subpool limits are in effect.
> >
> > Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
> > quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> > http://marc.info/?l=linux-mm&m=126928970510627&w=1
> >
> > v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> > alloc_huge_page() - since it already takes the vma, it is not necessary.
>
> Looks good - thanks for doing this.
>
> Some comments, nothing serious:
Cleanup patch addressing these comments below. I've done this as a
diff applied on top of the original patch.
From 756a0901a21cce59cc82d0727f6463c3f71eecbf Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 8 Mar 2012 13:09:57 +1100
Subject: [PATCH] Cleanups for "hugepages: Fix use after free bug in 'quota'
handling"
This patch makes some cleanups to an earlier patch of mine fixing a
use after free bug in the hugetlbfs "quota" handling (actually
per-filesystem page limits, not related to normal use of quotas).
These cleanups and extra documentation were mostly suggested by Andrew
Morton.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
fs/hugetlbfs/inode.c | 3 +-
include/linux/hugetlb.h | 16 +++++++++--
mm/hugetlb.c | 69 +++++++++++++++++++++++++++-------------------
3 files changed, 54 insertions(+), 34 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 74c6ba2..536672a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -910,8 +910,7 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = root;
return 0;
out_free:
- if (sbinfo->spool)
- kfree(sbinfo->spool);
+ kfree(sbinfo->spool);
kfree(sbinfo);
return -ENOMEM;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cf01817..8fdb595 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,13 +14,23 @@ struct user_struct;
#include <linux/shm.h>
#include <asm/tlbflush.h>
+/*
+ * A hugepage subpool represents a notional finite bucket of
+ * hugepages. They're used by the hugetlbfs code to implement
+ * per-filesystem-instance limits on hugepage usage.
+ */
struct hugepage_subpool {
spinlock_t lock;
- long count;
- long max_hpages, used_hpages;
+ /* Total number of hugepages in the subpool */
+ unsigned long max_hpages;
+ /* Number of currently allocated hugepages in the subpool */
+ unsigned long used_hpages;
+ /* Reference count of anything else keeping the subpool in existence */
+ /* (e.g. hugetlbfs superblocks) */
+ unsigned refcount;
};
-struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
+struct hugepage_subpool *hugepage_new_subpool(unsigned long nr_blocks);
void hugepage_put_subpool(struct hugepage_subpool *spool);
int PageHuge(struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36b38b3a..aa6316b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -53,19 +53,22 @@ static unsigned long __initdata default_hstate_size;
*/
static DEFINE_SPINLOCK(hugetlb_lock);
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_and_release_subpool(struct hugepage_subpool *spool)
{
- bool free = (spool->count == 0) && (spool->used_hpages == 0);
+ bool free = (spool->refcount == 0) && (spool->used_hpages == 0);
spin_unlock(&spool->lock);
- /* If no pages are used, and no other handles to the subpool
- * remain, free the subpool the subpool remain */
+ /*
+ * If there are no pages left still in the subpool, _and_
+ * there are no other references to it, we can free the
+ * subpool.
+ */
if (free)
kfree(spool);
}
-struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
+struct hugepage_subpool *hugepage_new_subpool(unsigned long nr_blocks)
{
struct hugepage_subpool *spool;
@@ -74,7 +77,7 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
return NULL;
spin_lock_init(&spool->lock);
- spool->count = 1;
+ spool->refcount = 1;
spool->max_hpages = nr_blocks;
spool->used_hpages = 0;
@@ -84,13 +87,17 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
void hugepage_put_subpool(struct hugepage_subpool *spool)
{
spin_lock(&spool->lock);
- BUG_ON(!spool->count);
- spool->count--;
- unlock_or_release_subpool(spool);
+ BUG_ON(!spool->refcount);
+ spool->refcount--;
+ unlock_and_release_subpool(spool);
}
-static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
- long delta)
+/*
+ * Allocate some pages from a subpool, or fail if there aren't enough
+ * pages left
+ */
+static int hugepage_subpool_alloc_pages(struct hugepage_subpool *spool,
+ unsigned long delta)
{
int ret = 0;
@@ -98,27 +105,31 @@ static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
return 0;
spin_lock(&spool->lock);
- if ((spool->used_hpages + delta) <= spool->max_hpages) {
+ if ((spool->used_hpages + delta) <= spool->max_hpages)
spool->used_hpages += delta;
- } else {
+ else
ret = -ENOMEM;
- }
spin_unlock(&spool->lock);
return ret;
}
-static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
- long delta)
+/*
+ * Release some pages back to a subpool
+ */
+static void hugepage_subpool_release_pages(struct hugepage_subpool *spool,
+ unsigned long delta)
{
if (!spool)
return;
spin_lock(&spool->lock);
spool->used_hpages -= delta;
- /* If hugetlbfs_put_super couldn't free spool due to
- * an outstanding quota reference, free it now. */
- unlock_or_release_subpool(spool);
+ /*
+ * If hugetlbfs_put_super couldn't free the subpool due to
+ * pages remaining allocated from it, free it now.
+ */
+ unlock_and_release_subpool(spool);
}
static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
@@ -611,9 +622,9 @@ static void free_huge_page(struct page *page)
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct hugepage_subpool *spool =
- (struct hugepage_subpool *)page_private(page);
+ struct hugepage_subpool *spool;
+ spool = (struct hugepage_subpool *)page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -629,7 +640,7 @@ static void free_huge_page(struct page *page)
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- hugepage_subpool_put_pages(spool, 1);
+ hugepage_subpool_release_pages(spool, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1114,7 +1125,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (chg < 0)
return ERR_PTR(-VM_FAULT_OOM);
if (chg)
- if (hugepage_subpool_get_pages(spool, chg))
+ if (hugepage_subpool_alloc_pages(spool, chg))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
@@ -1124,7 +1135,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugepage_subpool_put_pages(spool, chg);
+ hugepage_subpool_release_pages(spool, chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
@@ -2166,7 +2177,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
if (reserve) {
hugetlb_acct_memory(h, -reserve);
- hugepage_subpool_put_pages(spool, reserve);
+ hugepage_subpool_release_pages(spool, reserve);
}
}
}
@@ -2395,7 +2406,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
address = address & huge_page_mask(h);
pgoff = vma_hugecache_offset(h, vma, address);
- mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ mapping = vma->vm_file->f_mapping;
/*
* Take the mapping lock for the duration of the table walk. As
@@ -2981,7 +2992,7 @@ int hugetlb_reserve_pages(struct inode *inode,
return chg;
/* There must be enough pages in the subpool for the mapping */
- if (hugepage_subpool_get_pages(spool, chg))
+ if (hugepage_subpool_alloc_pages(spool, chg))
return -ENOSPC;
/*
@@ -2990,7 +3001,7 @@ int hugetlb_reserve_pages(struct inode *inode,
*/
ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
- hugepage_subpool_put_pages(spool, chg);
+ hugepage_subpool_release_pages(spool, chg);
return ret;
}
@@ -3020,7 +3031,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugepage_subpool_put_pages(spool, (chg - freed));
+ hugepage_subpool_release_pages(spool, (chg - freed));
hugetlb_acct_memory(h, -(chg - freed));
}
--
1.7.9.1
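For completeness, the subpool lifecycle as seen from the filesystem side (a
sketch following the hunks in the original patch, error paths elided;
nr_blocks == -1 means no size= limit was given at mount time):

	/* hugetlbfs_fill_super(): create a subpool only if a limit is set */
	sbinfo->spool = NULL;
	if (config.nr_blocks != -1) {
		sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
		if (!sbinfo->spool)
			goto out_free;
	}

	/* hugetlbfs_put_super(): drop the superblock's reference; the
	 * subpool itself lives on until its last page is freed back */
	if (sbinfo->spool)
		hugepage_put_subpool(sbinfo->spool);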
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply related [flat|nested] 16+ messages in thread
end of thread, other threads:[~2012-03-09 3:40 UTC | newest]
Thread overview: 16+ messages
2012-02-16 4:23 Hugepage cleanup and bugfix David Gibson
2012-02-16 4:23 ` [PATCH 1/2] Cleanup to hugetlb.h David Gibson
2012-02-16 4:24 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
2012-02-16 12:33 ` Hillf Danton
2012-03-06 2:37 ` David Gibson
-- strict thread matches above, loose matches on Subject: below --
2012-03-07 4:48 [PATCH 1/2] Cleanup to hugetlb.h David Gibson
2012-03-07 4:48 ` [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling David Gibson
2012-03-07 12:28 ` Hillf Danton
2012-03-08 0:57 ` David Gibson
2012-03-08 4:17 ` Aneesh Kumar K.V
2012-03-08 11:59 ` Hillf Danton
2012-03-08 14:19 ` David Gibson
2012-03-08 0:27 ` Andrew Morton
2012-03-08 2:09 ` David Gibson
2012-03-09 3:25 ` David Gibson
2012-03-08 4:30 ` Aneesh Kumar K.V
2012-03-08 14:18 ` David Gibson