* [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
@ 2011-08-19 19:14 Andrew Barry
2011-08-19 21:51 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Barry @ 2011-08-19 19:14 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
Mel Gorman, David Gibson, Hugh Dickins,
linux-kernel@vger.kernel.org, Andrew Hastings
This patch fixes a use-after-free problem in free_huge_page, with a quota update
happening after hugetlbfs umount. The problem results when a device driver,
which has mapped a hugepage, does a put_page. Put_page, calls free_huge_page,
which does a hugetlb_put_quota. As written, hugetlb_put_quota takes an
address_space struct pointer "mapping" as an argument. If the put_page occurs
after the hugetlbfs filesystem is unmounted, mapping points to freed memory.
Hugetlb_put_quota doesn't need the address_space pointer, it only uses it to
find the quota-record. Rather than an address-space struct pointer, this patched
code puts a hugetlbfs_sb_info struct pointer into page_private of the page
struct. In order to prevent the quota record from being freed at unmount, a
reference count and an active bit are added to the hugetlbfs_sb_info struct; the
reference count is increased by hugetlb_get_quota and decreased by
hugetlb_put_quota. When hugetlbfs is unmounted, it frees the hugetlbfs_sb_info
struct, but only if the reference count is zero. If the reference count is not
zero, it clears the active bit, and the last hugetlb_put_quota then frees the
hugetlbfs_sb_info struct.
Discussion was titled: Fix refcounting in hugetlbfs quota handling.
See: https://lkml.org/lkml/2011/8/11/28
See also: http://marc.info/?l=linux-mm&m=126928970510627&w=1
Version 2: Adding better description of how the race happens, removing
unnecessary inode->i_mapping->host->i_sb indirection when inode->i_sb has the
same meaning.
Signed-off-by: Andrew Barry <abarry@cray.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
fs/hugetlbfs/inode.c | 40 ++++++++++++++++++++++++++--------------
include/linux/hugetlb.h | 9 +++++++--
mm/hugetlb.c | 23 ++++++++++++-----------
3 files changed, 45 insertions(+), 27 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 87b6e04..2ed1cca 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -615,8 +615,12 @@ static void hugetlbfs_put_super(struct s
struct hugetlbfs_sb_info *sbi = HUGETLBFS_SB(sb);
if (sbi) {
+ sbi->active = HPAGE_INACTIVE;
sb->s_fs_info = NULL;
- kfree(sbi);
+
+ /*Free only if used quota is zero. */
+ if (sbi->used_blocks == 0)
+ kfree(sbi);
}
}
@@ -851,6 +855,8 @@ hugetlbfs_fill_super(struct super_block
sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
+ sbinfo->used_blocks = 0;
+ sbinfo->active = HPAGE_ACTIVE;
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize = huge_page_size(config.hstate);
sb->s_blocksize_bits = huge_page_shift(config.hstate);
@@ -874,30 +880,36 @@ out_free:
return -ENOMEM;
}
-int hugetlb_get_quota(struct address_space *mapping, long delta)
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
{
int ret = 0;
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks - delta >= 0)
+ spin_lock(&sbinfo->stat_lock);
+ if ((sbinfo->free_blocks == -1) || (sbinfo->free_blocks - delta >= 0)) {
+ if (sbinfo->free_blocks != -1)
sbinfo->free_blocks -= delta;
- else
- ret = -ENOMEM;
- spin_unlock(&sbinfo->stat_lock);
+ sbinfo->used_blocks += delta;
+ sbinfo->active = HPAGE_ACTIVE;
+ } else {
+ ret = -ENOMEM;
}
+ spin_unlock(&sbinfo->stat_lock);
return ret;
}
-void hugetlb_put_quota(struct address_space *mapping, long delta)
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
{
- struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
- if (sbinfo->free_blocks > -1) {
- spin_lock(&sbinfo->stat_lock);
+ spin_lock(&sbinfo->stat_lock);
+ if (sbinfo->free_blocks > -1)
sbinfo->free_blocks += delta;
+ sbinfo->used_blocks -= delta;
+ /* If hugetlbfs_put_super couldn't free sbinfo due to
+ * an outstanding quota reference, free it now. */
+ if ((sbinfo->used_blocks == 0) && (sbinfo->active == HPAGE_INACTIVE)) {
+ spin_unlock(&sbinfo->stat_lock);
+ kfree(sbinfo);
+ } else {
spin_unlock(&sbinfo->stat_lock);
}
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 19644e0..8780a91 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -142,11 +142,16 @@ struct hugetlbfs_config {
struct hstate *hstate;
};
+#define HPAGE_INACTIVE 0
+#define HPAGE_ACTIVE 1
+
struct hugetlbfs_sb_info {
long max_blocks; /* blocks allowed */
long free_blocks; /* blocks free */
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
+ long used_blocks; /* blocks used */
+ long active; /* active bit */
spinlock_t stat_lock;
struct hstate *hstate;
};
@@ -171,8 +176,8 @@ extern const struct file_operations huge
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
static inline int is_file_hugepages(struct file *file)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dae27ba..0f39cde 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -533,9 +533,9 @@ static void free_huge_page(struct page *
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
+ struct hugetlbfs_sb_info *sbinfo;
- mapping = (struct address_space *) page_private(page);
+ sbinfo = ( struct hugetlbfs_sb_info *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -551,8 +551,8 @@ static void free_huge_page(struct page *
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
+ if (sbinfo)
+ hugetlb_put_quota(sbinfo, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1035,7 +1035,7 @@ static struct page *alloc_huge_page(stru
if (chg < 0)
return ERR_PTR(-VM_FAULT_OOM);
if (chg)
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_sb), chg))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
@@ -1045,12 +1045,12 @@ static struct page *alloc_huge_page(stru
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugetlb_put_quota(HUGETLBFS_SB(inode->i_sb), chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
- set_page_private(page, (unsigned long) mapping);
+ set_page_private(page, (unsigned long) HUGETLBFS_SB(inode->i_sb));
vma_commit_reservation(h, vma, addr);
@@ -2086,7 +2086,8 @@ static void hugetlb_vm_op_close(struct v
if (reserve) {
hugetlb_acct_memory(h, -reserve);
- hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+ hugetlb_put_quota(HUGETLBFS_SB(
+ vma->vm_file->f_mapping->host->i_sb), reserve);
}
}
}
@@ -2884,7 +2885,7 @@ int hugetlb_reserve_pages(struct inode *
return chg;
/* There must be enough filesystem quota for the mapping */
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_sb), chg))
return -ENOSPC;
/*
@@ -2893,7 +2894,7 @@ int hugetlb_reserve_pages(struct inode *
*/
ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ hugetlb_put_quota(HUGETLBFS_SB(inode->i_sb), chg);
return ret;
}
@@ -2922,7 +2923,7 @@ void hugetlb_unreserve_pages(struct inod
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugetlb_put_quota(inode->i_mapping, (chg - freed));
+ hugetlb_put_quota(HUGETLBFS_SB(inode->i_sb), (chg - freed));
hugetlb_acct_memory(h, -(chg - freed));
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-08-19 19:14 [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
@ 2011-08-19 21:51 ` Andrew Morton
2011-08-22 20:07 ` Andrew Barry
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Andrew Morton @ 2011-08-19 21:51 UTC (permalink / raw)
To: Andrew Barry
Cc: linux-mm, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
David Gibson, Hugh Dickins, linux-kernel@vger.kernel.org,
Andrew Hastings
On Fri, 19 Aug 2011 14:14:11 -0500
Andrew Barry <abarry@cray.com> wrote:
> This patch fixes a use-after-free problem in free_huge_page, with a quota update
> happening after hugetlbfs umount. The problem results when a device driver,
> which has mapped a hugepage, does a put_page. Put_page, calls free_huge_page,
> which does a hugetlb_put_quota. As written, hugetlb_put_quota takes an
> address_space struct pointer "mapping" as an argument. If the put_page occurs
> after the hugetlbfs filesystem is unmounted, mapping points to freed memory.
OK. This sounds screwed up. If a device driver is currently using a
page from a hugetlbfs file then the unmount shouldn't have succeeded in
the first place!
Or is it the case that the device driver got a reference to the page by
other means, bypassing hugetlbfs? And there's undesirable/incorrect
interaction between the non-hugetlbfs operation and hugetlbfs?
Or something else?
<starts reading the mailing list>
OK, important missing information from the above is that the driver got
at this page via get_user_pages() and happened to stumble across a
hugetlbfs page. So it's indeed an incorrect interaction between a
non-hugetlbfs operation and hugetlbfs.
What's different about hugetlbfs? Why don't other filesystems hit this?
<investigates further>
OK so the incorrect interaction happened in free_huge_page(), which is
called via the compound page destructor (this dtor is "what's different
about hugetlbfs"). What is incorrect about this is
a) that we're doing fs operations in response to a
get_user_pages()/put_page() operation which has *nothing* to do with
filesystems!
b) that we continue to try to do that fs operation against an fs
which was unmounted and freed three days ago. duh.
So I hereby pronounce that
a) It was wrong to manipulate hugetlbfs quotas within
free_huge_page(). Because free_huge_page() is a low-level
page-management function which shouldn't know about one of its
specific clients (in this case, hugetlbfs).
In fact it's wrong for there to be *any* mention of hugetlbfs
within hugetlb.c.
b) I shouldn't have merged that hugetlbfs quota code. whodidthat.
Mel, Adam, Dave, at least...
c) The proper fix here is to get that hugetlbfs quota code out of
free_huge_page() and do it all where it belongs: within hugetlbfs
code.
Regular filesystems don't need to diddle quota counts within
page_cache_release(). Why should hugetlbfs need to?
>
> ...
>
> + /*Free only if used quota is zero. */
Missing a space there.
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -142,11 +142,16 @@ struct hugetlbfs_config {
> struct hstate *hstate;
> };
>
> +#define HPAGE_INACTIVE 0
> +#define HPAGE_ACTIVE 1
The above need documenting, please. That documentation would perhaps
help me understand why we need both an "active" flag *and* a refcount.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-08-19 21:51 ` Andrew Morton
@ 2011-08-22 20:07 ` Andrew Barry
2011-08-23 4:10 ` David Gibson
2011-09-01 5:28 ` David Gibson
2011-10-12 4:43 ` Paul Mackerras
2 siblings, 1 reply; 8+ messages in thread
From: Andrew Barry @ 2011-08-22 20:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
David Gibson, Hugh Dickins, Andrew Hastings
On 08/19/2011 04:51 PM, Andrew Morton wrote:
> What's different about hugetlbfs? Why don't other filesystems hit this?
>
> <investigates further>
>
> OK so the incorrect interaction happened in free_huge_page(), which is
> called via the compound page destructor (this dtor is "what's different
> about hugetlbfs"). What is incorrect about this is
>
> a) that we're doing fs operations in response to a
> get_user_pages()/put_page() operation which has *nothing* to do with
> filesystems!
>
> b) that we continue to try to do that fs operation against an fs
> which was unmounted and freed three days ago. duh.
Yes.
> So I hereby pronounce that
>
> a) It was wrong to manipulate hugetlbfs quotas within
> free_huge_page(). Because free_huge_page() is a low-level
> page-management function which shouldn't know about one of its
> specific clients (in this case, hugetlbfs).
>
> In fact it's wrong for there to be *any* mention of hugetlbfs
> within hugetlb.c.
>
> b) I shouldn't have merged that hugetlbfs quota code. whodidthat.
> Mel, Adam, Dave, at least...
>
> c) The proper fix here is to get that hugetlbfs quota code out of
> free_huge_page() and do it all where it belongs: within hugetlbfs
> code.
>
>
> Regular filesystems don't need to diddle quota counts within
> page_cache_release(). Why should hugetlbfs need to?
Is there anyone, more expert in hugetlbfs code than I, who can/should/will take
that on?
>> +#define HPAGE_INACTIVE 0
>> +#define HPAGE_ACTIVE 1
>
> The above need documenting, please. That documentation would perhaps
> help me understand why we need both an "active" flag *and* a refcount.
It doesn't need both. Now that you mention it, it would be simpler to put it all
in the refcount. I'd send an updated patch, but it sounds like things will be
going in a different direction.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-08-22 20:07 ` Andrew Barry
@ 2011-08-23 4:10 ` David Gibson
0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2011-08-23 4:10 UTC (permalink / raw)
To: Andrew Barry
Cc: Andrew Morton, linux-mm, Rik van Riel, Minchan Kim,
KOSAKI Motohiro, Mel Gorman, Hugh Dickins, Andrew Hastings
On Mon, Aug 22, 2011 at 03:07:54PM -0500, Andrew Barry wrote:
> On 08/19/2011 04:51 PM, Andrew Morton wrote:
> > What's different about hugetlbfs? Why don't other filesystems hit this?
> >
> > <investigates further>
> >
> > OK so the incorrect interaction happened in free_huge_page(), which is
> > called via the compound page destructor (this dtor is "what's different
> > about hugetlbfs"). What is incorrect about this is
> >
> > a) that we're doing fs operations in response to a
> > get_user_pages()/put_page() operation which has *nothing* to do with
> > filesystems!
> >
> > b) that we continue to try to do that fs operation against an fs
> > which was unmounted and freed three days ago. duh.
>
> Yes.
>
> > So I hereby pronounce that
> >
> > a) It was wrong to manipulate hugetlbfs quotas within
> > free_huge_page(). Because free_huge_page() is a low-level
> > page-management function which shouldn't know about one of its
> > specific clients (in this case, hugetlbfs).
> >
> > In fact it's wrong for there to be *any* mention of hugetlbfs
> > within hugetlb.c.
> >
> > b) I shouldn't have merged that hugetlbfs quota code. whodidthat.
> > Mel, Adam, Dave, at least...
> >
> > c) The proper fix here is to get that hugetlbfs quota code out of
> > free_huge_page() and do it all where it belongs: within hugetlbfs
> > code.
> >
> >
> > Regular filesystems don't need to diddle quota counts within
> > page_cache_release(). Why should hugetlbfs need to?
>
> Is there anyone, more expert in hugetlbfs code than I, who can/should/will take
> that on?
As far as I can tell the hugetlbfs "quota" counts that are updated
here don't share much with the normal quota mechanisms. The way they
operate, they logically divide the pool of free huge pages between
different hugetlbfs instances. This means that you can give different
hugepage mounts to different applications and they won't be able to
exhaust each others resources.
I can't see how that can be done without updating the count somewhere
at free_huge_page() time.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-08-19 21:51 ` Andrew Morton
2011-08-22 20:07 ` Andrew Barry
@ 2011-09-01 5:28 ` David Gibson
2011-10-12 4:43 ` Paul Mackerras
2 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2011-09-01 5:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrew Barry, linux-mm, Rik van Riel, Minchan Kim,
KOSAKI Motohiro, Mel Gorman, Hugh Dickins,
linux-kernel@vger.kernel.org, Andrew Hastings
On Fri, Aug 19, 2011 at 02:51:09PM -0700, Andrew Morton wrote:
> On Fri, 19 Aug 2011 14:14:11 -0500
> Andrew Barry <abarry@cray.com> wrote:
>
> > This patch fixes a use-after-free problem in free_huge_page, with a quota update
> > happening after hugetlbfs umount. The problem results when a device driver,
> > which has mapped a hugepage, does a put_page. Put_page, calls free_huge_page,
> > which does a hugetlb_put_quota. As written, hugetlb_put_quota takes an
> > address_space struct pointer "mapping" as an argument. If the put_page occurs
> > after the hugetlbfs filesystem is unmounted, mapping points to freed memory.
>
> OK. This sounds screwed up. If a device driver is currently using a
> page from a hugetlbfs file then the unmount shouldn't have succeeded in
> the first place!
>
> Or is it the case that the device driver got a reference to the page by
> other means, bypassing hugetlbfs? And there's undesirable/incorrect
> interaction between the non-hugetlbfs operation and hugetlbfs?
>
> Or something else?
>
> <starts reading the mailing list>
>
> OK, important missing information from the above is that the driver got
> at this page via get_user_pages() and happened to stumble across a
> hugetlbfs page. So it's indeed an incorrect interaction between a
> non-hugetlbfs operation and hugetlbfs.
>
> What's different about hugetlbfs? Why don't other filesystems hit this?
>
> <investigates further>
>
> OK so the incorrect interaction happened in free_huge_page(), which is
> called via the compound page destructor (this dtor is "what's different
> about hugetlbfs"). What is incorrect about this is
>
> a) that we're doing fs operations in response to a
> get_user_pages()/put_page() operation which has *nothing* to do with
> filesystems!
>
> b) that we continue to try to do that fs operation against an fs
> which was unmounted and freed three days ago. duh.
>
>
> So I hereby pronounce that
>
> a) It was wrong to manipulate hugetlbfs quotas within
> free_huge_page(). Because free_huge_page() is a low-level
> page-management function which shouldn't know about one of its
> specific clients (in this case, hugetlbfs).
>
> In fact it's wrong for there to be *any* mention of hugetlbfs
> within hugetlb.c.
>
> b) I shouldn't have merged that hugetlbfs quota code. whodidthat.
> Mel, Adam, Dave, at least...
>
> c) The proper fix here is to get that hugetlbfs quota code out of
> free_huge_page() and do it all where it belongs: within hugetlbfs
> code.
>
> Regular filesystems don't need to diddle quota counts within
> page_cache_release(). Why should hugetlbfs need to?
Regular filesystems can assume there's a few spare pages that can
buffer quota transitions. Hugepages on the other hand are scarce, and
it's common practice to want to actively use every single one of the
system.
I really can't see how to avoid poking the counts from
free_huge_page(), whether or not it's directly or via some sort of
callback.
Andrew (Morton) or Hugh, if you can suggest a more correct way to fix
this, I'm all ears, but at present we have a real bug and Andrew
Barry's patch is the best fix we have.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-08-19 21:51 ` Andrew Morton
2011-08-22 20:07 ` Andrew Barry
2011-09-01 5:28 ` David Gibson
@ 2011-10-12 4:43 ` Paul Mackerras
2011-10-14 20:59 ` Andrew Morton
2 siblings, 1 reply; 8+ messages in thread
From: Paul Mackerras @ 2011-10-12 4:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrew Barry, linux-mm, Rik van Riel, Minchan Kim,
KOSAKI Motohiro, Mel Gorman, David Gibson, Hugh Dickins,
linux-kernel@vger.kernel.org, Andrew Hastings
On Fri, Aug 19, 2011 at 02:51:09PM -0700, Andrew Morton wrote:
>
> OK. This sounds screwed up. If a device driver is currently using a
> page from a hugetlbfs file then the unmount shouldn't have succeeded in
> the first place!
>
> Or is it the case that the device driver got a reference to the page by
> other means, bypassing hugetlbfs? And there's undesirable/incorrect
> interaction between the non-hugetlbfs operation and hugetlbfs?
>
> Or something else?
>
> <starts reading the mailing list>
>
> OK, important missing information from the above is that the driver got
> at this page via get_user_pages() and happened to stumble across a
> hugetlbfs page. So it's indeed an incorrect interaction between a
> non-hugetlbfs operation and hugetlbfs.
>
> What's different about hugetlbfs? Why don't other filesystems hit this?
What's different about hugetlbfs, as I understand it, is that the
"quota" mechanism is there to restrict memory usage, rather than disk
usage.
> <investigates further>
>
> OK so the incorrect interaction happened in free_huge_page(), which is
> called via the compound page destructor (this dtor is "what's different
> about hugetlbfs"). What is incorrect about this is
>
> a) that we're doing fs operations in response to a
> get_user_pages()/put_page() operation which has *nothing* to do with
> filesystems!
The hugetlbfs quota thing is more like an RSS limit than a disk
quota. Perhaps the "quota" name is misleading.
I assume we update RSS counts for ordinary pages when allocating and
freeing pages?
> b) that we continue to try to do that fs operation against an fs
> which was unmounted and freed three days ago. duh.
>
>
> So I hereby pronounce that
>
> a) It was wrong to manipulate hugetlbfs quotas within
> free_huge_page(). Because free_huge_page() is a low-level
> page-management function which shouldn't know about one of its
> specific clients (in this case, hugetlbfs).
>
> In fact it's wrong for there to be *any* mention of hugetlbfs
> within hugetlb.c.
>
> b) I shouldn't have merged that hugetlbfs quota code. whodidthat.
> Mel, Adam, Dave, at least...
>
> c) The proper fix here is to get that hugetlbfs quota code out of
> free_huge_page() and do it all where it belongs: within hugetlbfs
> code.
That doesn't sound right to me, if we need to limit usage of huge
memory pages in memory, rather than back out on the "filesystem".
An ordinary filesystem doesn't worry about memory consumption, it
worries about how its blocks of backing store are allocated.
Hugetlbfs is unusual here in that the "backing store" and the memory
pages that get mapped into userspace are one and the same thing.
> Regular filesystems don't need to diddle quota counts within
> page_cache_release(). Why should hugetlbfs need to?
In a regular filesystem you can reclaim a block of backing store, and
thus decrement a quota count, while there might still be a page of
memory in use that contains its old contents. That's problematic with
hugetlbfs.
In the meantime we have a user-triggerable kernel crash. As far as I
can see, if we did what you suggest, we would end up with a situation
where we could run out of huge pages even though everyone was within
quota. Which is arguably better than a kernel crash, but still less
than ideal. What do you suggest?
Paul.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-10-12 4:43 ` Paul Mackerras
@ 2011-10-14 20:59 ` Andrew Morton
2011-10-17 5:14 ` David Gibson
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2011-10-14 20:59 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Barry, linux-mm, Rik van Riel, Minchan Kim,
KOSAKI Motohiro, Mel Gorman, David Gibson, Hugh Dickins,
linux-kernel@vger.kernel.org, Andrew Hastings
On Wed, 12 Oct 2011 15:43:17 +1100
Paul Mackerras <paulus@samba.org> wrote:
> In the meantime we have a user-triggerable kernel crash. As far as I
> can see, if we did what you suggest, we would end up with a situation
> where we could run out of huge pages even though everyone was within
> quota. Which is arguably better than a kernel crash, but still less
> than ideal. What do you suggest?
My issue with the patch is that it's rather horrible. We have a layer
of separation between core hugetlb pages and hugetlbfs. That layering
has already been mucked up in various places and this patch mucks it up
further, and quite severely.
So I believe we should rethink the patch. Either a) get the layering
correct by not poking into hugetlbfs internals from within hugetlb core
via one of the usual techniques or b) make a deliberate decision to
just give up on that layering: state that hugetlb and hugetlbfs are now
part of the same subsystem. Make the necessaary Kconfig changes,
remove ifdefs, move code around, etc.
If we go ahead with the proposed patch-n-run bugfix, the bad code will
be there permanently - nobody will go in and clean this mess up and the
kernel is permanently worsened.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.
2011-10-14 20:59 ` Andrew Morton
@ 2011-10-17 5:14 ` David Gibson
0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2011-10-17 5:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mackerras, Andrew Barry, linux-mm, Rik van Riel, Minchan Kim,
KOSAKI Motohiro, Mel Gorman, Hugh Dickins,
linux-kernel@vger.kernel.org, Andrew Hastings
On Fri, Oct 14, 2011 at 01:59:48PM -0700, Andrew Morton wrote:
> On Wed, 12 Oct 2011 15:43:17 +1100
> Paul Mackerras <paulus@samba.org> wrote:
>
> > In the meantime we have a user-triggerable kernel crash. As far as I
> > can see, if we did what you suggest, we would end up with a situation
> > where we could run out of huge pages even though everyone was within
> > quota. Which is arguably better than a kernel crash, but still less
> > than ideal. What do you suggest?
>
> My issue with the patch is that it's rather horrible. We have a layer
> of separation between core hugetlb pages and hugetlbfs. That layering
> has already been mucked up in various places and this patch mucks it up
> further, and quite severely.
>
> So I believe we should rethink the patch. Either a) get the layering
> correct by not poking into hugetlbfs internals from within hugetlb core
> via one of the usual techniques or
Which usual techniques did you have in mind?
> b) make a deliberate decision to
> just give up on that layering: state that hugetlb and hugetlbfs are now
> part of the same subsystem. Make the necessaary Kconfig changes,
> remove ifdefs, move code around, etc.
Well, that might have something to be said for it, the distinction has
always been tenuous at best.
> If we go ahead with the proposed patch-n-run bugfix, the bad code will
> be there permanently - nobody will go in and clean this mess up and the
> kernel is permanently worsened.
Hrm, as opposed to leaving the crash bug there until someone has time
to do the requested cleanup.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-10-17 5:15 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-19 19:14 [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
2011-08-19 21:51 ` Andrew Morton
2011-08-22 20:07 ` Andrew Barry
2011-08-23 4:10 ` David Gibson
2011-09-01 5:28 ` David Gibson
2011-10-12 4:43 ` Paul Mackerras
2011-10-14 20:59 ` Andrew Morton
2011-10-17 5:14 ` David Gibson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).