From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by kanga.kvack.org (Postfix) with ESMTP id A60636B0032 for ; Fri, 27 Feb 2015 17:59:05 -0500 (EST) Received: by mail-ob0-f182.google.com with SMTP id nt9so21175618obb.13 for ; Fri, 27 Feb 2015 14:59:05 -0800 (PST) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id t5si2793399oes.86.2015.02.27.14.59.04 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 27 Feb 2015 14:59:04 -0800 (PST) From: Mike Kravetz Subject: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time Date: Fri, 27 Feb 2015 14:58:08 -0800 Message-Id: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz hugetlbfs allocates huge pages from the global pool as needed. Even if the global pool contains a sufficient number of pages for the filesystem size at mount time, those global pages could be grabbed for some other use. As a result, filesystem huge page allocations may fail due to lack of pages. Add a new hugetlbfs mount option 'reserved' to specify that the number of pages associated with the size of the filesystem will be reserved. If there are insufficient pages, the mount will fail. The reservation is maintained for the duration of the filesystem so that as pages are allocated and freed a sufficient number of pages remains reserved. 
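The mount-time behaviour described above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `pool_model`, `pool_reserve`, `pool_unreserve` and `mount_model` are hypothetical names, sizes are counted in huge pages, and the model ignores NUMA, surplus pages and locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the global huge page pool. */
struct pool_model {
    long free_pages; /* pages currently free in the global pool */
    long resv_pages; /* free pages already promised to some user */
};

/* Promise nr more pages; fail if the pool cannot back the promise. */
static bool pool_reserve(struct pool_model *p, long nr)
{
    assert(nr >= 0);
    if (p->free_pages - p->resv_pages < nr)
        return false;
    p->resv_pages += nr;
    return true;
}

/* Drop a promise of nr pages, e.g. when the filesystem goes away. */
static void pool_unreserve(struct pool_model *p, long nr)
{
    assert(nr >= 0 && nr <= p->resv_pages);
    p->resv_pages -= nr;
}

/*
 * Model of mounting a filesystem of fs_pages huge pages.
 * Without 'reserved' the mount always succeeds and pages are taken
 * on demand; with 'reserved' the mount fails unless the whole size
 * can be reserved up front.
 */
static bool mount_model(struct pool_model *p, long fs_pages, bool reserved)
{
    if (!reserved)
        return true;
    return pool_reserve(p, fs_pages);
}
```

In this model a second reserved mount that does not fit is refused without disturbing the first reservation, which is the guarantee the cover letter asks for.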
Mike Kravetz (3): hugetlbfs: add reserved mount fields to subpool structure hugetlbfs: coordinate global and subpool reserve accounting hugetlbfs: accept subpool reserved option and setup accordingly fs/hugetlbfs/inode.c | 15 +++++++++++++-- include/linux/hugetlb.h | 7 +++++++ mm/hugetlb.c | 37 +++++++++++++++++++++++++++++-------- 3 files changed, 49 insertions(+), 10 deletions(-) -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f169.google.com (mail-ob0-f169.google.com [209.85.214.169]) by kanga.kvack.org (Postfix) with ESMTP id 9F1AF6B006C for ; Fri, 27 Feb 2015 17:59:38 -0500 (EST) Received: by mail-ob0-f169.google.com with SMTP id wp4so21559532obc.0 for ; Fri, 27 Feb 2015 14:59:38 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id qv1si1667547oec.96.2015.02.27.14.59.37 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 27 Feb 2015 14:59:38 -0800 (PST) From: Mike Kravetz Subject: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure Date: Fri, 27 Feb 2015 14:58:09 -0800 Message-Id: <1425077893-18366-2-git-send-email-mike.kravetz@oracle.com> In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz Add a boolean to the subpool structure to indicate that the pages for subpool have been reserved. The hstate pointer in the subpool is convenient to have when it comes time to unreserve the pages. 
subpool_reserved() is a handy way to check if reserved and take into account a NULL subpool. Signed-off-by: Mike Kravetz --- include/linux/hugetlb.h | 6 ++++++ mm/hugetlb.c | 2 ++ 2 files changed, 8 insertions(+) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 431b7fc..605c648 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -23,6 +23,8 @@ struct hugepage_subpool { spinlock_t lock; long count; long max_hpages, used_hpages; + struct hstate *hstate; + bool reserved; }; struct resv_map { @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; #define for_each_hstate(h) \ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) +static inline bool subpool_reserved(struct hugepage_subpool *spool) +{ + return spool && spool->reserved; +} struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); void hugepage_put_subpool(struct hugepage_subpool *spool); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 85032de..c6adf65 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -85,6 +85,8 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) spool->count = 1; spool->max_hpages = nr_blocks; spool->used_hpages = 0; + spool->hstate = NULL; + spool->reserved = false; return spool; } -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f179.google.com (mail-ob0-f179.google.com [209.85.214.179]) by kanga.kvack.org (Postfix) with ESMTP id 43DBC6B006E for ; Fri, 27 Feb 2015 17:59:39 -0500 (EST) Received: by mail-ob0-f179.google.com with SMTP id wp4so21510625obc.10 for ; Fri, 27 Feb 2015 14:59:39 -0800 (PST) Received: from userp1040.oracle.com (userp1040.oracle.com. 
[156.151.31.81]) by mx.google.com with ESMTPS id df10si2784977oeb.100.2015.02.27.14.59.38 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 27 Feb 2015 14:59:38 -0800 (PST) From: Mike Kravetz Subject: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting Date: Fri, 27 Feb 2015 14:58:12 -0800 Message-Id: <1425077893-18366-5-git-send-email-mike.kravetz@oracle.com> In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz If the pages for a subpool are reserved, then the reservations have already been accounted for in the global pool. Therefore, when requesting a new reservation (such as for a mapping) for the subpool do not count again in global pool. However, when actually allocating a page for the subpool decrement global reserve count to correspond to with decrement in global free pages. 
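One way to read the accounting rule above is as a pair of counters that move together. The following is a simplified user-space sketch, not the patch itself: `global_model`, `subpool_model` and the four helper functions are hypothetical names, and the model leaves out locking, surplus pages and the per-inode reservation maps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the global pool counters. */
struct global_model {
    long free_pages;
    long resv_pages;
};

/* Hypothetical model of a per-filesystem subpool. */
struct subpool_model {
    long max_pages;
    long used_pages;
    bool reserved; /* whole subpool size already charged globally */
};

/* Mount with 'reserved': charge the full subpool size once. */
static bool subpool_reserve_all(struct global_model *g, struct subpool_model *s)
{
    if (g->free_pages - g->resv_pages < s->max_pages)
        return false;
    g->resv_pages += s->max_pages;
    s->reserved = true;
    return true;
}

/* New reservation of chg pages, such as for a mapping. */
static bool reserve_pages(struct global_model *g, struct subpool_model *s,
                          long chg)
{
    if (s->used_pages + chg > s->max_pages)
        return false; /* subpool limit reached */
    if (!s->reserved) {
        /* Only an unreserved subpool is counted in the global pool here. */
        if (g->free_pages - g->resv_pages < chg)
            return false;
        g->resv_pages += chg;
    }
    s->used_pages += chg;
    return true;
}

/* Allocating a reserved page lowers free and reserve counts together. */
static void alloc_page(struct global_model *g)
{
    assert(g->free_pages > 0 && g->resv_pages > 0);
    g->free_pages--;
    g->resv_pages--;
}

/* Freeing a page to a reserved subpool restores both counts. */
static void free_page(struct global_model *g, struct subpool_model *s)
{
    assert(s->used_pages > 0);
    s->used_pages--;
    g->free_pages++;
    if (s->reserved)
        g->resv_pages++;
}
```

The point of the sketch is that a reserved subpool is charged to the global reserve count exactly once, at mount time, while allocation and free still move the free and reserve counts in step.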
Signed-off-by: Mike Kravetz --- mm/hugetlb.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c6adf65..4ef8379 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -879,7 +879,7 @@ void free_huge_page(struct page *page) spin_lock(&hugetlb_lock); hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), page); - if (restore_reserve) + if (restore_reserve || subpool_reserved(spool)) h->resv_huge_pages++; if (h->surplus_huge_pages_node[nid]) { @@ -2466,7 +2466,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) kref_put(&resv->refs, resv_map_release); if (reserve) { - hugetlb_acct_memory(h, -reserve); + if (!subpool_reserved(spool)) + hugetlb_acct_memory(h, -reserve); hugepage_subpool_put_pages(spool, reserve); } } @@ -3444,10 +3445,14 @@ int hugetlb_reserve_pages(struct inode *inode, * Check enough hugepages are available for the reservation. * Hand the pages back to the subpool if there are not */ - ret = hugetlb_acct_memory(h, chg); - if (ret < 0) { - hugepage_subpool_put_pages(spool, chg); - goto out_err; + if (subpool_reserved(spool)) + ret = 0; + else { + ret = hugetlb_acct_memory(h, chg); + if (ret < 0) { + hugepage_subpool_put_pages(spool, chg); + goto out_err; + } } /* @@ -3483,7 +3488,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) inode->i_blocks -= (blocks_per_huge_page(h) * freed); spin_unlock(&inode->i_lock); - hugepage_subpool_put_pages(spool, (chg - freed)); + if (!subpool_reserved(spool)) + hugepage_subpool_put_pages(spool, (chg - freed)); hugetlb_acct_memory(h, -(chg - freed)); } -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f54.google.com (mail-oi0-f54.google.com [209.85.218.54]) by kanga.kvack.org (Postfix) with ESMTP id 579776B0071 for ; Fri, 27 Feb 2015 18:00:19 -0500 (EST) Received: by mail-oi0-f54.google.com with SMTP id v63so18214519oia.13 for ; Fri, 27 Feb 2015 15:00:19 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id eq2si2823237obb.47.2015.02.27.15.00.18 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 27 Feb 2015 15:00:18 -0800 (PST) From: Mike Kravetz Subject: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly Date: Fri, 27 Feb 2015 14:58:13 -0800 Message-Id: <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz Make reserved be an option when mounting a hugetlbfs. reserved option is only possible if size option is also specified. On mount, reserve size hugepages and note in subpool. Unreserve pages when fs is unmounted. 
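The rule that 'reserved' only makes sense together with 'size' can be illustrated with a small stand-alone option parser. This is a hypothetical sketch, not the hugetlbfs parser: `mount_opts_model` and `parse_opts_model` are invented names, only the two options discussed here are recognised, and the size is assumed to be a plain count of huge pages rather than the byte sizes the real option accepts.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of the two options discussed here. */
struct mount_opts_model {
    long size_pages; /* -1 when no size= option was given */
    bool reserved;
};

/*
 * Parse a comma-separated option string.
 * Returns 0 on success, -1 on an unknown or malformed option,
 * or when "reserved" is given without "size=".
 */
static int parse_opts_model(const char *options, struct mount_opts_model *out)
{
    const char *p = options;

    out->size_pages = -1;
    out->reserved = false;

    while (*p) {
        size_t len = strcspn(p, ","); /* length of the current token */

        if (len == 8 && strncmp(p, "reserved", 8) == 0) {
            out->reserved = true;
        } else if (len > 5 && strncmp(p, "size=", 5) == 0) {
            char *end;
            long v = strtol(p + 5, &end, 10);

            /* The number must fill the rest of the token. */
            if (end != p + len || v < 0)
                return -1;
            out->size_pages = v;
        } else if (len != 0) {
            return -1;
        }
        p += len;
        if (*p == ',')
            p++;
    }

    /* 'reserved' needs a filesystem size to know how much to reserve. */
    if (out->reserved && out->size_pages < 0)
        return -1;
    return 0;
}
```

Rejecting the combination at parse time, with an explicit error, is one possible design; the patch as posted makes the check later, when the superblock is filled in.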
Signed-off-by: Mike Kravetz --- fs/hugetlbfs/inode.c | 15 +++++++++++++-- include/linux/hugetlb.h | 1 + mm/hugetlb.c | 15 ++++++++++++++- 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 5eba47f..99d0cec 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -50,6 +50,7 @@ struct hugetlbfs_config { long nr_blocks; long nr_inodes; struct hstate *hstate; + bool reserved; }; struct hugetlbfs_inode_info { @@ -73,7 +74,7 @@ int sysctl_hugetlb_shm_group; enum { Opt_size, Opt_nr_inodes, Opt_mode, Opt_uid, Opt_gid, - Opt_pagesize, + Opt_pagesize, Opt_reserved, Opt_err, }; @@ -84,6 +85,7 @@ static const match_table_t tokens = { {Opt_uid, "uid=%u"}, {Opt_gid, "gid=%u"}, {Opt_pagesize, "pagesize=%s"}, + {Opt_reserved, "reserved"}, {Opt_err, NULL}, }; @@ -832,6 +834,10 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) break; } + case Opt_reserved: + pconfig->reserved = true; + break; + default: pr_err("Bad mount option: \"%s\"\n", p); return -EINVAL; @@ -872,6 +878,7 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) config.gid = current_fsgid(); config.mode = 0755; config.hstate = &default_hstate; + config.reserved = false; ret = hugetlbfs_parse_options(data, &config); if (ret) return ret; @@ -889,7 +896,11 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sbinfo->spool = hugepage_new_subpool(config.nr_blocks); if (!sbinfo->spool) goto out_free; - } + sbinfo->spool->hstate = config.hstate; + if (config.reserved && !reserve_hugepage_subpool(sbinfo->spool)) + goto out_free; + } else if (config.reserved) + goto out_free; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = huge_page_size(config.hstate); sb->s_blocksize_bits = huge_page_shift(config.hstate); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 605c648..117e1bd 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -45,6 +45,7 @@ static inline 
bool subpool_reserved(struct hugepage_subpool *spool) return spool && spool->reserved; } struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); +bool reserve_hugepage_subpool(struct hugepage_subpool *spool); void hugepage_put_subpool(struct hugepage_subpool *spool); int PageHuge(struct page *page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4ef8379..3ae3596 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -61,6 +61,8 @@ DEFINE_SPINLOCK(hugetlb_lock); static int num_fault_mutexes; static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp; +/* Forward declaration */ +static int hugetlb_acct_memory(struct hstate *h, long delta); static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) { bool free = (spool->count == 0) && (spool->used_hpages == 0); @@ -69,8 +71,11 @@ static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) /* If no pages are used, and no other handles to the subpool * remain, free the subpool the subpool remain */ - if (free) + if (free) { + if (spool->reserved) + hugetlb_acct_memory(spool->hstate, -spool->max_hpages); kfree(spool); + } } struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) @@ -91,6 +96,14 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) return spool; } +bool reserve_hugepage_subpool(struct hugepage_subpool *spool) +{ + if (hugetlb_acct_memory(spool->hstate, spool->max_hpages)) + return false; + spool->reserved = true; + return true; +} + void hugepage_put_subpool(struct hugepage_subpool *spool) { spin_lock(&spool->lock); -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id D57D56B0038 for ; Mon, 2 Mar 2015 18:10:11 -0500 (EST) Received: by pablj1 with SMTP id lj1so12107927pab.8 for ; Mon, 02 Mar 2015 15:10:11 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id bi15si18454610pdb.24.2015.03.02.15.10.10 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Mar 2015 15:10:10 -0800 (PST) Date: Mon, 2 Mar 2015 15:10:09 -0800 From: Andrew Morton Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time Message-Id: <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: > hugetlbfs allocates huge pages from the global pool as needed. Even if > the global pool contains a sufficient number pages for the filesystem > size at mount time, those global pages could be grabbed for some other > use. As a result, filesystem huge page allocations may fail due to lack > of pages. Well OK, but why is this a sufficiently serious problem to justify kernel changes? Please provide enough info for others to be able to understand the value of the change. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id 2ACAA6B006C for ; Mon, 2 Mar 2015 18:10:20 -0500 (EST) Received: by pdno5 with SMTP id o5so43219829pdn.8 for ; Mon, 02 Mar 2015 15:10:19 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id fh4si18271818pdb.133.2015.03.02.15.10.19 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Mar 2015 15:10:19 -0800 (PST) Date: Mon, 2 Mar 2015 15:10:18 -0800 From: Andrew Morton Subject: Re: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure Message-Id: <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> In-Reply-To: <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim On Fri, 27 Feb 2015 14:58:10 -0800 Mike Kravetz wrote: > Add a boolean to the subpool structure to indicate that the pages for > subpool have been reserved. The hstate pointer in the subpool is > convienient to have when it comes time to unreserve the pages. > subool_reserved() is a handy way to check if reserved and take into > account a NULL subpool. > > ... 
> > @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; > #define for_each_hstate(h) \ > for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) > > +static inline bool subpool_reserved(struct hugepage_subpool *spool) > +{ > + return spool && spool->reserved; > +} "subpool_reserved" is not a good identifier. > struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); > void hugepage_put_subpool(struct hugepage_subpool *spool); See what they did? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id C78EA6B006E for ; Mon, 2 Mar 2015 18:10:25 -0500 (EST) Received: by padet14 with SMTP id et14so21732716pad.0 for ; Mon, 02 Mar 2015 15:10:25 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTPS id n4si4519928pdn.170.2015.03.02.15.10.24 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Mar 2015 15:10:24 -0800 (PST) Date: Mon, 2 Mar 2015 15:10:23 -0800 From: Andrew Morton Subject: Re: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting Message-Id: <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> In-Reply-To: <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim On Fri, 27 Feb 2015 14:58:11 -0800 Mike Kravetz wrote: > If the pages for a subpool are reserved, then the reservations have > already been accounted for in the global pool. Therefore, when > requesting a new reservation (such as for a mapping) for the subpool > do not count again in global pool. However, when actually allocating > a page for the subpool decrement global reserve count to correspond to > with decrement in global free pages. The last sentence made my brain hurt. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id 6D8F66B0070 for ; Mon, 2 Mar 2015 18:10:35 -0500 (EST) Received: by pdbfl12 with SMTP id fl12so12059414pdb.9 for ; Mon, 02 Mar 2015 15:10:35 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTPS id cw14si3588053pac.189.2015.03.02.15.10.34 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Mar 2015 15:10:34 -0800 (PST) Date: Mon, 2 Mar 2015 15:10:33 -0800 From: Andrew Morton Subject: Re: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly Message-Id: <20150302151033.562db79cd3da844392461795@linux-foundation.org> In-Reply-To: <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim On Fri, 27 Feb 2015 14:58:13 -0800 Mike Kravetz wrote: > Make reserved be an option when mounting a hugetlbfs. New mount option triggers a user documentation update. hugetlbfs isn't well documented, but Documentation/vm/hugetlbpage.txt looks like the place. > reserved > option is only possible if size option is also specified. The code doesn't appear to check for this (maybe it does). Probably it should do so, and warn when it fails. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f45.google.com (mail-oi0-f45.google.com [209.85.218.45]) by kanga.kvack.org (Postfix) with ESMTP id 6DED66B0038 for ; Mon, 2 Mar 2015 20:19:21 -0500 (EST) Received: by mail-oi0-f45.google.com with SMTP id i138so30488108oig.4 for ; Mon, 02 Mar 2015 17:19:21 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. 
[141.146.126.69]) by mx.google.com with ESMTPS id zv11si1748118obb.39.2015.03.02.17.19.20 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 02 Mar 2015 17:19:20 -0800 (PST) Message-ID: <54F50BD6.1030706@oracle.com> Date: Mon, 02 Mar 2015 17:18:14 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> In-Reply-To: <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: > >> hugetlbfs allocates huge pages from the global pool as needed. Even if >> the global pool contains a sufficient number pages for the filesystem >> size at mount time, those global pages could be grabbed for some other >> use. As a result, filesystem huge page allocations may fail due to lack >> of pages. > > Well OK, but why is this a sufficiently serious problem to justify > kernel changes? Please provide enough info for others to be able > to understand the value of the change. > Thanks for taking a look. Applications such as a database want to use huge pages for performance reasons. hugetlbfs filesystem semantics with ownership and modes work well to manage access to a pool of huge pages. However, the application would like some reasonable assurance that allocations will not fail due to a lack of huge pages. Before starting, the application will ensure that enough huge pages exist on the system in the global pools. What the application wants is exclusive use of a pool of huge pages. 
One could argue that this is a system administration issue. The global huge page pools are only available to users with root privilege. Therefore, exclusive use of a pool of huge pages can be obtained by limiting access. However, many applications are installed to run with elevated privilege to take advantage of resources like huge pages. It is quite possible for one application to interfere with another, especially in the case of something like huge pages where the pool size is mostly fixed. Suggestions for other ways to approach this situation are appreciated. I saw the existing support for "reservations" within hugetlbfs and thought of extending this to cover the size of the filesystem. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f176.google.com (mail-ob0-f176.google.com [209.85.214.176]) by kanga.kvack.org (Postfix) with ESMTP id 4CC346B0038 for ; Mon, 2 Mar 2015 20:21:56 -0500 (EST) Received: by mail-ob0-f176.google.com with SMTP id wo20so35405848obc.7 for ; Mon, 02 Mar 2015 17:21:56 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. 
[141.146.126.69]) by mx.google.com with ESMTPS id jg1si7202153obc.107.2015.03.02.17.21.55 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 02 Mar 2015 17:21:55 -0800 (PST) Message-ID: <54F50C73.9000401@oracle.com> Date: Mon, 02 Mar 2015 17:20:51 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> In-Reply-To: <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:10 -0800 Mike Kravetz wrote: > >> Add a boolean to the subpool structure to indicate that the pages for >> subpool have been reserved. The hstate pointer in the subpool is >> convienient to have when it comes time to unreserve the pages. >> subool_reserved() is a handy way to check if reserved and take into >> account a NULL subpool. >> >> ... >> >> @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; >> #define for_each_hstate(h) \ >> for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) >> >> +static inline bool subpool_reserved(struct hugepage_subpool *spool) >> +{ >> + return spool && spool->reserved; >> +} > > "subpool_reserved" is not a good identifier. > >> struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); >> void hugepage_put_subpool(struct hugepage_subpool *spool); > > See what they did? Got it. Thanks. hugepage_subpool_reserved -- Mike Kravetz
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f44.google.com (mail-oi0-f44.google.com [209.85.218.44]) by kanga.kvack.org (Postfix) with ESMTP id 24E696B0038 for ; Mon, 2 Mar 2015 20:31:31 -0500 (EST) Received: by mail-oi0-f44.google.com with SMTP id a3so30516528oib.3 for ; Mon, 02 Mar 2015 17:31:31 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id v8si7222238oeo.56.2015.03.02.17.31.29 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 02 Mar 2015 17:31:30 -0800 (PST) Message-ID: <54F50EB1.5090102@oracle.com> Date: Mon, 02 Mar 2015 17:30:25 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> In-Reply-To: <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:11 -0800 Mike Kravetz wrote: > >> If the pages for a subpool are reserved, then the reservations have >> already been accounted for in the global pool. Therefore, when >> requesting a new reservation (such as for a mapping) for the subpool >> do not count again in global pool. However, when actually allocating >> a page for the subpool decrement global reserve count to correspond to >> with decrement in global free pages. > > The last sentence made my brain hurt. > Sorry.
I was trying to point out that the global free and reserve accounting is still the same when doing a page allocation, even though the entire size of the subpool was reserved. For example, when allocating a page the global free and reserve counts are both decremented. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f173.google.com (mail-pd0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id 5E3FB6B0038 for ; Mon, 2 Mar 2015 20:37:26 -0500 (EST) Received: by pdbfl12 with SMTP id fl12so12937342pdb.9 for ; Mon, 02 Mar 2015 17:37:26 -0800 (PST) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id oa11si8159137pdb.33.2015.03.02.17.37.24 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 02 Mar 2015 17:37:25 -0800 (PST) Message-ID: <54F5102F.50902@oracle.com> Date: Mon, 02 Mar 2015 17:36:47 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> <20150302151033.562db79cd3da844392461795@linux-foundation.org> In-Reply-To: <20150302151033.562db79cd3da844392461795@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:13 -0800 Mike Kravetz wrote: > >> Make reserved be an option when mounting a hugetlbfs. > > New mount option triggers a user documentation update.
hugetlbfs isn't > well documented, but Documentation/vm/hugetlbpage.txt looks like the > place. > Will do > >> reserved >> option is only possible if size option is also specified. > > The code doesn't appear to check for this (maybe it does). Probably it > should do so, and warn when it fails. > It is hard to see from the diffs, but this case is covered. If size is not specified, it implies the size is "unlimited". The code in the patch actually makes the mount fail in this case. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f53.google.com (mail-wg0-f53.google.com [74.125.82.53]) by kanga.kvack.org (Postfix) with ESMTP id 912506B0038 for ; Fri, 6 Mar 2015 10:10:51 -0500 (EST) Received: by wghk14 with SMTP id k14so9959864wgh.3 for ; Fri, 06 Mar 2015 07:10:50 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id dc6si20836832wib.78.2015.03.06.07.10.47 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 06 Mar 2015 07:10:48 -0800 (PST) Date: Fri, 6 Mar 2015 16:10:45 +0100 From: Michal Hocko Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time Message-ID: <20150306151045.GA23443@dhcp22.suse.cz> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54F50BD6.1030706@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim On Mon 02-03-15 17:18:14, Mike Kravetz wrote: > On 03/02/2015 03:10 PM, Andrew Morton wrote: > >On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: > > > >>hugetlbfs allocates huge pages from the global pool as needed. Even if > >>the global pool contains a sufficient number pages for the filesystem > >>size at mount time, those global pages could be grabbed for some other > >>use. As a result, filesystem huge page allocations may fail due to lack > >>of pages. > > > >Well OK, but why is this a sufficiently serious problem to justify > >kernel changes? Please provide enough info for others to be able > >to understand the value of the change. > > > > Thanks for taking a look. > > Applications such as a database want to use huge pages for performance > reasons. hugetlbfs filesystem semantics with ownership and modes work > well to manage access to a pool of huge pages. However, the application > would like some reasonable assurance that allocations will not fail due > to a lack of huge pages. Before starting, the application will ensure > that enough huge pages exist on the system in the global pools. 
What > the application wants is exclusive use of a pool of huge pages. > > One could argue that this is a system administration issue. The global > huge page pools are only available to users with root privilege. > Therefore, exclusive use of a pool of huge pages can be obtained by > limiting access. However, many applications are installed to run with > elevated privilege to take advantage of resources like huge pages. It > is quite possible for one application to interfere another, especially > in the case of something like huge pages where the pool size is mostly > fixed. > > Suggestions for other ways to approach this situation are appreciated. > I saw the existing support for "reservations" within hugetlbfs and > thought of extending this to cover the size of the filesystem. Maybe I do not understand your usecase properly but wouldn't hugetlb cgroup (CONFIG_CGROUP_HUGETLB) help to guarantee the same? Just configure limits for different users/applications (inside different groups) so that they never overcommit the existing pool. Would that work for you? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f179.google.com (mail-ob0-f179.google.com [209.85.214.179]) by kanga.kvack.org (Postfix) with ESMTP id 964E96B0038 for ; Fri, 6 Mar 2015 13:59:04 -0500 (EST) Received: by obcvb8 with SMTP id vb8so18505915obc.0 for ; Fri, 06 Mar 2015 10:59:04 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com.
[141.146.126.69]) by mx.google.com with ESMTPS id ix8si6484775obc.59.2015.03.06.10.59.03 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 06 Mar 2015 10:59:03 -0800 (PST) Message-ID: <54F9F8F1.4020203@oracle.com> Date: Fri, 06 Mar 2015 10:58:57 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> In-Reply-To: <20150306151045.GA23443@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim , David Rientjes On 03/06/2015 07:10 AM, Michal Hocko wrote: > On Mon 02-03-15 17:18:14, Mike Kravetz wrote: >> On 03/02/2015 03:10 PM, Andrew Morton wrote: >>> On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: >>> >>>> hugetlbfs allocates huge pages from the global pool as needed. Even if >>>> the global pool contains a sufficient number pages for the filesystem >>>> size at mount time, those global pages could be grabbed for some other >>>> use. As a result, filesystem huge page allocations may fail due to lack >>>> of pages. >>> >>> Well OK, but why is this a sufficiently serious problem to justify >>> kernel changes? Please provide enough info for others to be able >>> to understand the value of the change. >>> >> >> Thanks for taking a look. >> >> Applications such as a database want to use huge pages for performance >> reasons. hugetlbfs filesystem semantics with ownership and modes work >> well to manage access to a pool of huge pages. However, the application >> would like some reasonable assurance that allocations will not fail due >> to a lack of huge pages. 
Before starting, the application will ensure >> that enough huge pages exist on the system in the global pools. What >> the application wants is exclusive use of a pool of huge pages. >> >> One could argue that this is a system administration issue. The global >> huge page pools are only available to users with root privilege. >> Therefore, exclusive use of a pool of huge pages can be obtained by >> limiting access. However, many applications are installed to run with >> elevated privilege to take advantage of resources like huge pages. It >> is quite possible for one application to interfere another, especially >> in the case of something like huge pages where the pool size is mostly >> fixed. >> >> Suggestions for other ways to approach this situation are appreciated. >> I saw the existing support for "reservations" within hugetlbfs and >> thought of extending this to cover the size of the filesystem. > > Maybe I do not understand your usecase properly but wouldn't hugetlb > cgroup (CONFIG_CGROUP_HUGETLB) help to guarantee the same? Just > configure limits for different users/applications (inside different > groups) so that they never overcommit the existing pool. Would that work > for you? Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not believe this will be a satisfactory solution for my usecase. As you point out, cgroups could be set up (by a sysadmin) for every hugetlb user/application. In this case, the sysadmin needs to have knowledge of every huge page user/application and configure appropriately. I was approaching this from the point of view of the application. The application wants the guarantee of a minimum number of huge pages, independent of other users/applications. The "reserve" approach allows the application to set aside those pages at initialization time. If it can not get the pages it needs, it can refuse to start, or configure itself to use less, or take other action. 
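To make the startup policy above concrete, here is a minimal sketch written as a toy user-space model. It is not kernel code and not part of this patch set. struct pool, pool_reserve() and startup_reserve() are hypothetical names used only for illustration; in a real application the reserve step would be whatever mechanism actually sets the pages aside (for example, a mount using the 'reserved' option proposed in this RFC).

```c
#include <stdbool.h>

/* Toy model of a huge page pool. Hypothetical, for illustration only. */
struct pool {
	long free_pages;	/* pages currently free in the pool */
	long resv_pages;	/* free pages already promised to someone */
};

/* Set aside n pages. Fails, leaving the pool unchanged, if too few remain. */
static bool pool_reserve(struct pool *p, long n)
{
	if (n < 0 || p->free_pages - p->resv_pages < n)
		return false;
	p->resv_pages += n;
	return true;
}

/*
 * Startup policy: ask for want_pages first, then keep halving the request,
 * never going below min_pages. Returns the number of pages reserved, or -1
 * if even min_pages is unavailable and the application should refuse to
 * start (or take other action).
 */
static long startup_reserve(struct pool *p, long want_pages, long min_pages)
{
	long n = want_pages;

	if (min_pages < 1 || want_pages < min_pages)
		return -1;

	for (;;) {
		if (pool_reserve(p, n))
			return n;
		if (n == min_pages)
			return -1;
		n /= 2;
		if (n < min_pages)
			n = min_pages;
	}
}
```

With 100 free pages of which 40 are already reserved, a request for 128 pages with a floor of 16 ends up holding 32 pages. A later request whose floor is 32 then fails outright, which is the point where the application would refuse to start.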
As you point out, the cgroup approach could also provide guarantees to the application if set up properly. I was trying for an approach that would provide more control to the application independent of the sysadmin and other users/applications. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f174.google.com (mail-ig0-f174.google.com [209.85.213.174]) by kanga.kvack.org (Postfix) with ESMTP id 6D5DF6B0038 for ; Fri, 6 Mar 2015 16:14:45 -0500 (EST) Received: by igbhl2 with SMTP id hl2so7241549igb.5 for ; Fri, 06 Mar 2015 13:14:45 -0800 (PST) Received: from mail-ie0-x22e.google.com (mail-ie0-x22e.google.com. [2607:f8b0:4001:c03::22e]) by mx.google.com with ESMTPS id b19si12909384ioe.59.2015.03.06.13.14.44 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Mar 2015 13:14:45 -0800 (PST) Received: by iebtr6 with SMTP id tr6so15735180ieb.4 for ; Fri, 06 Mar 2015 13:14:44 -0800 (PST) Date: Fri, 6 Mar 2015 13:14:43 -0800 (PST) From: David Rientjes Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time In-Reply-To: <54F9F8F1.4020203@oracle.com> Message-ID: References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> <54F9F8F1.4020203@oracle.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim On Fri, 6 Mar 2015, Mike Kravetz wrote: > Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not > believe this will be a satisfactory solution for my usecase.
As you > point out, cgroups could be set up (by a sysadmin) for every hugetlb > user/application. In this case, the sysadmin needs to have knowledge > of every huge page user/application and configure appropriately. > > I was approaching this from the point of view of the application. The > application wants the guarantee of a minimum number of huge pages, > independent of other users/applications. The "reserve" approach allows > the application to set aside those pages at initialization time. If it > can not get the pages it needs, it can refuse to start, or configure > itself to use less, or take other action. > Would it be too difficult to modify the application to mmap() the hugepages at startup so they are no longer free in the global pool but rather get marked as reserved so other applications cannot map them? That should return MAP_FAILED if there is an insufficient number of hugepages available to be reserved (HugePages_Rsvd in /proc/meminfo). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f170.google.com (mail-ob0-f170.google.com [209.85.214.170]) by kanga.kvack.org (Postfix) with ESMTP id 37E5C6B0038 for ; Fri, 6 Mar 2015 16:32:54 -0500 (EST) Received: by obcwp18 with SMTP id wp18so11943593obc.1 for ; Fri, 06 Mar 2015 13:32:53 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com.
[141.146.126.69]) by mx.google.com with ESMTPS id gg9si6742599obb.30.2015.03.06.13.32.53 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 06 Mar 2015 13:32:53 -0800 (PST) Message-ID: <54FA1CFE.1000500@oracle.com> Date: Fri, 06 Mar 2015 13:32:46 -0800 From: Mike Kravetz MIME-Version: 1.0 Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> <54F9F8F1.4020203@oracle.com> In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Michal Hocko , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim On 03/06/2015 01:14 PM, David Rientjes wrote: > On Fri, 6 Mar 2015, Mike Kravetz wrote: > >> Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not >> believe this will be a satisfactory solution for my usecase. As you >> point out, cgroups could be set up (by a sysadmin) for every hugetlb >> user/application. In this case, the sysadmin needs to have knowledge >> of every huge page user/application and configure appropriately. >> >> I was approaching this from the point of view of the application. The >> application wants the guarantee of a minimum number of huge pages, >> independent of other users/applications. The "reserve" approach allows >> the application to set aside those pages at initialization time. If it >> can not get the pages it needs, it can refuse to start, or configure >> itself to use less, or take other action. >> > > Would it be too difficult to modify the application to mmap() the > hugepages at startup so they are no longer free in the global pool but > rather get marked as reserved so other applications cannot map them? 
That > should return MAP_FAILED if there is an insufficient number of hugepages > available to be reserved (HugePages_Rsvd in /proc/meminfo). The application is a database with multiple processes/tasks that will come and go over time. I thought about having one task do a big mmap() at initialization time, but then the issue is how to coordinate with the other tasks and their requests to allocate/free pages. -- Mike Kravetz
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755562AbbB0W7n (ORCPT ); Fri, 27 Feb 2015 17:59:43 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:43133 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754770AbbB0W7m (ORCPT ); Fri, 27 Feb 2015 17:59:42 -0500 From: Mike Kravetz To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz Subject: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure Date: Fri, 27 Feb 2015 14:58:09 -0800 Message-Id: <1425077893-18366-2-git-send-email-mike.kravetz@oracle.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add a boolean to the subpool structure to indicate that the pages for the subpool have been reserved. The hstate pointer in the subpool is convenient to have when it comes time to unreserve the pages. subpool_reserved() is a handy way to check if reserved and take into account a NULL subpool.
Signed-off-by: Mike Kravetz --- include/linux/hugetlb.h | 6 ++++++ mm/hugetlb.c | 2 ++ 2 files changed, 8 insertions(+) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 431b7fc..605c648 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -23,6 +23,8 @@ struct hugepage_subpool { spinlock_t lock; long count; long max_hpages, used_hpages; + struct hstate *hstate; + bool reserved; }; struct resv_map { @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; #define for_each_hstate(h) \ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) +static inline bool subpool_reserved(struct hugepage_subpool *spool) +{ + return spool && spool->reserved; +} struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); void hugepage_put_subpool(struct hugepage_subpool *spool); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 85032de..c6adf65 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -85,6 +85,8 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) spool->count = 1; spool->max_hpages = nr_blocks; spool->used_hpages = 0; + spool->hstate = NULL; + spool->reserved = false; return spool; } -- 2.1.0
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755666AbbB0XAH (ORCPT ); Fri, 27 Feb 2015 18:00:07 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:43280 "EHLO
aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755650AbbB0XAE (ORCPT ); Fri, 27 Feb 2015 18:00:04 -0500 From: Mike Kravetz To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz Subject: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting Date: Fri, 27 Feb 2015 14:58:11 -0800 Message-Id: <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If the pages for a subpool are reserved, then the reservations have already been accounted for in the global pool. Therefore, when requesting a new reservation (such as for a mapping) for the subpool do not count again in global pool. However, when actually allocating a page for the subpool decrement global reserve count to correspond to with decrement in global free pages. 
Signed-off-by: Mike Kravetz --- mm/hugetlb.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c6adf65..4ef8379 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -879,7 +879,7 @@ void free_huge_page(struct page *page) spin_lock(&hugetlb_lock); hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), page); - if (restore_reserve) + if (restore_reserve || subpool_reserved(spool)) h->resv_huge_pages++; if (h->surplus_huge_pages_node[nid]) { @@ -2466,7 +2466,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) kref_put(&resv->refs, resv_map_release); if (reserve) { - hugetlb_acct_memory(h, -reserve); + if (!subpool_reserved(spool)) + hugetlb_acct_memory(h, -reserve); hugepage_subpool_put_pages(spool, reserve); } } @@ -3444,10 +3445,14 @@ int hugetlb_reserve_pages(struct inode *inode, * Check enough hugepages are available for the reservation. * Hand the pages back to the subpool if there are not */ - ret = hugetlb_acct_memory(h, chg); - if (ret < 0) { - hugepage_subpool_put_pages(spool, chg); - goto out_err; + if (subpool_reserved(spool)) + ret = 0; + else { + ret = hugetlb_acct_memory(h, chg); + if (ret < 0) { + hugepage_subpool_put_pages(spool, chg); + goto out_err; + } } /* @@ -3483,7 +3488,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) inode->i_blocks -= (blocks_per_huge_page(h) * freed); spin_unlock(&inode->i_lock); - hugepage_subpool_put_pages(spool, (chg - freed)); + if (!subpool_reserved(spool)) + hugepage_subpool_put_pages(spool, (chg - freed)); hugetlb_acct_memory(h, -(chg - freed)); } -- 2.1.0
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755723AbbB0XA1 (ORCPT ); Fri, 27 Feb 2015 18:00:27 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:43499 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id
S1755702AbbB0XAX (ORCPT ); Fri, 27 Feb 2015 18:00:23 -0500 From: Mike Kravetz To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Nadia Yvette Chambers , Andrew Morton , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim , Mike Kravetz Subject: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly Date: Fri, 27 Feb 2015 14:58:13 -0800 Message-Id: <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Make reserved be an option when mounting a hugetlbfs. reserved option is only possible if size option is also specified. On mount, reserve size hugepages and note in subpool. Unreserve pages when fs is unmounted. Signed-off-by: Mike Kravetz --- fs/hugetlbfs/inode.c | 15 +++++++++++++-- include/linux/hugetlb.h | 1 + mm/hugetlb.c | 15 ++++++++++++++- 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 5eba47f..99d0cec 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -50,6 +50,7 @@ struct hugetlbfs_config { long nr_blocks; long nr_inodes; struct hstate *hstate; + bool reserved; }; struct hugetlbfs_inode_info { @@ -73,7 +74,7 @@ int sysctl_hugetlb_shm_group; enum { Opt_size, Opt_nr_inodes, Opt_mode, Opt_uid, Opt_gid, - Opt_pagesize, + Opt_pagesize, Opt_reserved, Opt_err, }; @@ -84,6 +85,7 @@ static const match_table_t tokens = { {Opt_uid, "uid=%u"}, {Opt_gid, "gid=%u"}, {Opt_pagesize, "pagesize=%s"}, + {Opt_reserved, "reserved"}, {Opt_err, NULL}, }; @@ -832,6 +834,10 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) break; } + case Opt_reserved: + pconfig->reserved = true; + break; + default: pr_err("Bad mount option: \"%s\"\n", 
p); return -EINVAL; @@ -872,6 +878,7 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) config.gid = current_fsgid(); config.mode = 0755; config.hstate = &default_hstate; + config.reserved = false; ret = hugetlbfs_parse_options(data, &config); if (ret) return ret; @@ -889,7 +896,11 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sbinfo->spool = hugepage_new_subpool(config.nr_blocks); if (!sbinfo->spool) goto out_free; - } + sbinfo->spool->hstate = config.hstate; + if (config.reserved && !reserve_hugepage_subpool(sbinfo->spool)) + goto out_free; + } else if (config.reserved) + goto out_free; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = huge_page_size(config.hstate); sb->s_blocksize_bits = huge_page_shift(config.hstate); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 605c648..117e1bd 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -45,6 +45,7 @@ static inline bool subpool_reserved(struct hugepage_subpool *spool) return spool && spool->reserved; } struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); +bool reserve_hugepage_subpool(struct hugepage_subpool *spool); void hugepage_put_subpool(struct hugepage_subpool *spool); int PageHuge(struct page *page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4ef8379..3ae3596 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -61,6 +61,8 @@ DEFINE_SPINLOCK(hugetlb_lock); static int num_fault_mutexes; static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp; +/* Forward declaration */ +static int hugetlb_acct_memory(struct hstate *h, long delta); static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) { bool free = (spool->count == 0) && (spool->used_hpages == 0); @@ -69,8 +71,11 @@ static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) /* If no pages are used, and no other handles to the subpool * remain, free the subpool the subpool remain */ - if (free) + if (free) { 
+ if (spool->reserved) + hugetlb_acct_memory(spool->hstate, -spool->max_hpages); kfree(spool); + } } struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) @@ -91,6 +96,14 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) return spool; } +bool reserve_hugepage_subpool(struct hugepage_subpool *spool) +{ + if (hugetlb_acct_memory(spool->hstate, spool->max_hpages)) + return false; + spool->reserved = true; + return true; +} + void hugepage_put_subpool(struct hugepage_subpool *spool) { spin_lock(&spool->lock); -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754631AbbCBXKO (ORCPT ); Mon, 2 Mar 2015 18:10:14 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:45250 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754002AbbCBXKL (ORCPT ); Mon, 2 Mar 2015 18:10:11 -0500 Date: Mon, 2 Mar 2015 15:10:09 -0800 From: Andrew Morton To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time Message-Id: <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> In-Reply-To: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: > hugetlbfs allocates huge pages from the global pool as needed. Even if > the global pool contains a sufficient number pages for the filesystem > size at mount time, those global pages could be grabbed for some other > use. 
As a result, filesystem huge page allocations may fail due to lack > of pages. Well OK, but why is this a sufficiently serious problem to justify kernel changes? Please provide enough info for others to be able to understand the value of the change. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754709AbbCBXKV (ORCPT ); Mon, 2 Mar 2015 18:10:21 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:45262 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754276AbbCBXKT (ORCPT ); Mon, 2 Mar 2015 18:10:19 -0500 Date: Mon, 2 Mar 2015 15:10:18 -0800 From: Andrew Morton To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure Message-Id: <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> In-Reply-To: <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Feb 2015 14:58:10 -0800 Mike Kravetz wrote: > Add a boolean to the subpool structure to indicate that the pages for > subpool have been reserved. The hstate pointer in the subpool is > convienient to have when it comes time to unreserve the pages. > subool_reserved() is a handy way to check if reserved and take into > account a NULL subpool. > > ... 
> > @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; > #define for_each_hstate(h) \ > for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) > > +static inline bool subpool_reserved(struct hugepage_subpool *spool) > +{ > + return spool && spool->reserved; > +} "subpool_reserved" is not a good identifier. > struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); > void hugepage_put_subpool(struct hugepage_subpool *spool); See what they did? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754800AbbCBXKa (ORCPT ); Mon, 2 Mar 2015 18:10:30 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:45270 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754322AbbCBXKZ (ORCPT ); Mon, 2 Mar 2015 18:10:25 -0500 Date: Mon, 2 Mar 2015 15:10:23 -0800 From: Andrew Morton To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting Message-Id: <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> In-Reply-To: <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Feb 2015 14:58:11 -0800 Mike Kravetz wrote: > If the pages for a subpool are reserved, then the reservations have > already been accounted for in the global pool. Therefore, when > requesting a new reservation (such as for a mapping) for the subpool > do not count again in global pool. 
However, when actually allocating > a page for the subpool decrement global reserve count to correspond to > with decrement in global free pages. The last sentence made my brain hurt. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754881AbbCBXKj (ORCPT ); Mon, 2 Mar 2015 18:10:39 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:45277 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754322AbbCBXKe (ORCPT ); Mon, 2 Mar 2015 18:10:34 -0500 Date: Mon, 2 Mar 2015 15:10:33 -0800 From: Andrew Morton To: Mike Kravetz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Davidlohr Bueso , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly Message-Id: <20150302151033.562db79cd3da844392461795@linux-foundation.org> In-Reply-To: <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Feb 2015 14:58:13 -0800 Mike Kravetz wrote: > Make reserved be an option when mounting a hugetlbfs. New mount option triggers a user documentation update. hugetlbfs isn't well documented, but Documentation/vm/hugetlbpage.txt looks like the place. > reserved > option is only possible if size option is also specified. The code doesn't appear to check for this (maybe it does). Probably it should do so, and warn when it fails. 
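For reference, a minimal user-space sketch of the check being asked about. In the posted patch, the "else if (config.reserved) goto out_free;" branch in hugetlbfs_fill_super() fails the mount when no subpool was created, which (as far as the surrounding code of that era goes) happens when nr_blocks is still -1 because no size= option was given. The sketch models that decision and adds the kind of warning suggested above. struct mount_config, check_reserved_option(), the message text, and the -EINVAL return value are assumptions for illustration only; none of them appear in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the two hugetlbfs_config fields used here. */
struct mount_config {
	long nr_blocks;	/* -1 models "no size= option given" (unlimited) */
	bool reserved;	/* true if the "reserved" option was given */
};

/*
 * Return 0 if the option combination is acceptable.  If "reserved" is
 * requested without a size, print a warning and return -EINVAL (an
 * assumed error code, not taken from the patch).
 */
int check_reserved_option(const struct mount_config *cfg)
{
	assert(cfg != NULL);

	if (cfg->reserved && cfg->nr_blocks == -1) {
		fprintf(stderr,
			"hugetlbfs: \"reserved\" requires the \"size\" option\n");
		return -EINVAL;
	}
	return 0;
}
```

A caller would run this after option parsing and before creating the subpool, so that the failure is reported with a reason instead of a bare mount error.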
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755517AbbCCBT1 (ORCPT ); Mon, 2 Mar 2015 20:19:27 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:38034 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754928AbbCCBTZ (ORCPT ); Mon, 2 Mar 2015 20:19:25 -0500 Message-ID: <54F50BD6.1030706@oracle.com> Date: Mon, 02 Mar 2015 17:18:14 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Andrew Morton CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> In-Reply-To: <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet22.oracle.com [156.151.31.94] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: > >> hugetlbfs allocates huge pages from the global pool as needed. Even if >> the global pool contains a sufficient number pages for the filesystem >> size at mount time, those global pages could be grabbed for some other >> use. As a result, filesystem huge page allocations may fail due to lack >> of pages. > > Well OK, but why is this a sufficiently serious problem to justify > kernel changes? Please provide enough info for others to be able > to understand the value of the change. > Thanks for taking a look. Applications such as a database want to use huge pages for performance reasons. 
hugetlbfs filesystem semantics with ownership and modes work well to manage access to a pool of huge pages. However, the application would like some reasonable assurance that allocations will not fail due to a lack of huge pages. Before starting, the application will ensure that enough huge pages exist on the system in the global pools. What the application wants is exclusive use of a pool of huge pages. One could argue that this is a system administration issue. The global huge page pools are only available to users with root privilege. Therefore, exclusive use of a pool of huge pages can be obtained by limiting access. However, many applications are installed to run with elevated privilege to take advantage of resources like huge pages. It is quite possible for one application to interfere another, especially in the case of something like huge pages where the pool size is mostly fixed. Suggestions for other ways to approach this situation are appreciated. I saw the existing support for "reservations" within hugetlbfs and thought of extending this to cover the size of the filesystem. 
-- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755404AbbCCBWA (ORCPT ); Mon, 2 Mar 2015 20:22:00 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:38826 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753786AbbCCBV7 (ORCPT ); Mon, 2 Mar 2015 20:21:59 -0500 Message-ID: <54F50C73.9000401@oracle.com> Date: Mon, 02 Mar 2015 17:20:51 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Andrew Morton CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 1/3] hugetlbfs: add reserved mount fields to subpool structure References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-3-git-send-email-mike.kravetz@oracle.com> <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> In-Reply-To: <20150302151018.ce35298f22d04d6d0296e53c@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: aserv0021.oracle.com [141.146.126.233] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:10 -0800 Mike Kravetz wrote: > >> Add a boolean to the subpool structure to indicate that the pages for >> subpool have been reserved. The hstate pointer in the subpool is >> convienient to have when it comes time to unreserve the pages. >> subool_reserved() is a handy way to check if reserved and take into >> account a NULL subpool. >> >> ... 
>> >> @@ -38,6 +40,10 @@ extern int hugetlb_max_hstate __read_mostly; >> #define for_each_hstate(h) \ >> for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) >> >> +static inline bool subpool_reserved(struct hugepage_subpool *spool) >> +{ >> + return spool && spool->reserved; >> +} > > "subpool_reserved" is not a good identifier. > >> struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); >> void hugepage_put_subpool(struct hugepage_subpool *spool); > > See what they did? Got it. Thanks. hugepage_subpool_reserved -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754736AbbCCBbg (ORCPT ); Mon, 2 Mar 2015 20:31:36 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:41694 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753372AbbCCBbf (ORCPT ); Mon, 2 Mar 2015 20:31:35 -0500 Message-ID: <54F50EB1.5090102@oracle.com> Date: Mon, 02 Mar 2015 17:30:25 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Andrew Morton CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 2/3] hugetlbfs: coordinate global and subpool reserve accounting References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-4-git-send-email-mike.kravetz@oracle.com> <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> In-Reply-To: <20150302151023.e40dd1c6a9bf3d29cb6b657c@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:11 -0800 Mike Kravetz wrote: > >> If the pages for a subpool 
are reserved, then the reservations have >> already been accounted for in the global pool. Therefore, when >> requesting a new reservation (such as for a mapping) for the subpool >> do not count again in global pool. However, when actually allocating >> a page for the subpool decrement global reserve count to correspond to >> with decrement in global free pages. > > The last sentence made my brain hurt. > Sorry. I was trying to point out that the global free and reserve accounting is still the same when doing a page allocation, even though the entire size of the subpool was reserved. For example, when allocating a page the global free and reserve counts are both decremented. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755385AbbCCBhd (ORCPT ); Mon, 2 Mar 2015 20:37:33 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:46273 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752896AbbCCBhc (ORCPT ); Mon, 2 Mar 2015 20:37:32 -0500 Message-ID: <54F5102F.50902@oracle.com> Date: Mon, 02 Mar 2015 17:36:47 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Andrew Morton CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 3/3] hugetlbfs: accept subpool reserved option and setup accordingly References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <1425077893-18366-6-git-send-email-mike.kravetz@oracle.com> <20150302151033.562db79cd3da844392461795@linux-foundation.org> In-Reply-To: <20150302151033.562db79cd3da844392461795@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: aserv0021.oracle.com [141.146.126.233] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org On 03/02/2015 03:10 PM, Andrew Morton wrote: > On Fri, 27 Feb 2015 14:58:13 -0800 Mike Kravetz wrote: > >> Make reserved be an option when mounting a hugetlbfs. > > New mount option triggers a user documentation update. hugetlbfs isn't > well documented, but Documentation/vm/hugetlbpage.txt looks like the > place. > Will do > >> reserved >> option is only possible if size option is also specified. > > The code doesn't appear to check for this (maybe it does). Probably it > should do so, and warn when it fails. > It is hard to see from the diffs, but this case is covered. If size is not specified, it implies the size is "unlimited". The code in the patch actually makes the mount fail in this case. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755021AbbCFPKu (ORCPT ); Fri, 6 Mar 2015 10:10:50 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39825 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754006AbbCFPKt (ORCPT ); Fri, 6 Mar 2015 10:10:49 -0500 Date: Fri, 6 Mar 2015 16:10:45 +0100 From: Michal Hocko To: Mike Kravetz Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadia Yvette Chambers , Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time Message-ID: <20150306151045.GA23443@dhcp22.suse.cz> References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54F50BD6.1030706@oracle.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 02-03-15 17:18:14, Mike Kravetz wrote: > On 03/02/2015 03:10 PM, Andrew Morton wrote: > >On Fri, 27 Feb 2015 
14:58:08 -0800 Mike Kravetz wrote: > > > >>hugetlbfs allocates huge pages from the global pool as needed. Even if > >>the global pool contains a sufficient number pages for the filesystem > >>size at mount time, those global pages could be grabbed for some other > >>use. As a result, filesystem huge page allocations may fail due to lack > >>of pages. > > > >Well OK, but why is this a sufficiently serious problem to justify > >kernel changes? Please provide enough info for others to be able > >to understand the value of the change. > > > > Thanks for taking a look. > > Applications such as a database want to use huge pages for performance > reasons. hugetlbfs filesystem semantics with ownership and modes work > well to manage access to a pool of huge pages. However, the application > would like some reasonable assurance that allocations will not fail due > to a lack of huge pages. Before starting, the application will ensure > that enough huge pages exist on the system in the global pools. What > the application wants is exclusive use of a pool of huge pages. > > One could argue that this is a system administration issue. The global > huge page pools are only available to users with root privilege. > Therefore, exclusive use of a pool of huge pages can be obtained by > limiting access. However, many applications are installed to run with > elevated privilege to take advantage of resources like huge pages. It > is quite possible for one application to interfere another, especially > in the case of something like huge pages where the pool size is mostly > fixed. > > Suggestions for other ways to approach this situation are appreciated. > I saw the existing support for "reservations" within hugetlbfs and > thought of extending this to cover the size of the filesystem. Maybe I do not understand your usecase properly but wouldn't hugetlb cgroup (CONFIG_CGROUP_HUGETLB) help to guarantee the same? 
Just configure limits for different users/applications (inside different groups) so that they never overcommit the existing pool. Would that work for you? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756207AbbCFS7O (ORCPT ); Fri, 6 Mar 2015 13:59:14 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:19482 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755165AbbCFS7N (ORCPT ); Fri, 6 Mar 2015 13:59:13 -0500 Message-ID: <54F9F8F1.4020203@oracle.com> Date: Fri, 06 Mar 2015 10:58:57 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Michal Hocko CC: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim , David Rientjes Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> In-Reply-To: <20150306151045.GA23443@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet21.oracle.com [156.151.31.93] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/06/2015 07:10 AM, Michal Hocko wrote: > On Mon 02-03-15 17:18:14, Mike Kravetz wrote: >> On 03/02/2015 03:10 PM, Andrew Morton wrote: >>> On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz wrote: >>> >>>> hugetlbfs allocates huge pages from the global pool as needed. Even if >>>> the global pool contains a sufficient number pages for the filesystem >>>> size at mount time, those global pages could be grabbed for some other >>>> use. 
As a result, filesystem huge page allocations may fail due to lack >>>> of pages. >>> >>> Well OK, but why is this a sufficiently serious problem to justify >>> kernel changes? Please provide enough info for others to be able >>> to understand the value of the change. >>> >> >> Thanks for taking a look. >> >> Applications such as a database want to use huge pages for performance >> reasons. hugetlbfs filesystem semantics with ownership and modes work >> well to manage access to a pool of huge pages. However, the application >> would like some reasonable assurance that allocations will not fail due >> to a lack of huge pages. Before starting, the application will ensure >> that enough huge pages exist on the system in the global pools. What >> the application wants is exclusive use of a pool of huge pages. >> >> One could argue that this is a system administration issue. The global >> huge page pools are only available to users with root privilege. >> Therefore, exclusive use of a pool of huge pages can be obtained by >> limiting access. However, many applications are installed to run with >> elevated privilege to take advantage of resources like huge pages. It >> is quite possible for one application to interfere another, especially >> in the case of something like huge pages where the pool size is mostly >> fixed. >> >> Suggestions for other ways to approach this situation are appreciated. >> I saw the existing support for "reservations" within hugetlbfs and >> thought of extending this to cover the size of the filesystem. > > Maybe I do not understand your usecase properly but wouldn't hugetlb > cgroup (CONFIG_CGROUP_HUGETLB) help to guarantee the same? Just > configure limits for different users/applications (inside different > groups) so that they never overcommit the existing pool. Would that work > for you? Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not believe this will be a satisfactory solution for my usecase. 
As you point out, cgroups could be set up (by a sysadmin) for every hugetlb user/application. In this case, the sysadmin needs to have knowledge of every huge page user/application and configure appropriately. I was approaching this from the point of view of the application. The application wants the guarantee of a minimum number of huge pages, independent of other users/applications. The "reserve" approach allows the application to set aside those pages at initialization time. If it can not get the pages it needs, it can refuse to start, or configure itself to use less, or take other action. As you point out, the cgroup approach could also provide guarantees to the application if set up properly. I was trying for an approach that would provide more control to the application independent of the sysadmin and other users/applications. -- Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754040AbbCFVOt (ORCPT ); Fri, 6 Mar 2015 16:14:49 -0500 Received: from mail-ie0-f181.google.com ([209.85.223.181]:38352 "EHLO mail-ie0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751542AbbCFVOp (ORCPT ); Fri, 6 Mar 2015 16:14:45 -0500 Date: Fri, 6 Mar 2015 13:14:43 -0800 (PST) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Mike Kravetz cc: Michal Hocko , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time In-Reply-To: <54F9F8F1.4020203@oracle.com> Message-ID: References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> <54F9F8F1.4020203@oracle.com> User-Agent: Alpine 2.10 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 6 Mar 2015, Mike Kravetz wrote: > Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not > believe this will be a satisfactory solution for my usecase. As you > point out, cgroups could be set up (by a sysadmin) for every hugetlb > user/application. In this case, the sysadmin needs to have knowledge > of every huge page user/application and configure appropriately. > > I was approaching this from the point of view of the application. The > application wants the guarantee of a minimum number of huge pages, > independent of other users/applications. The "reserve" approach allows > the application to set aside those pages at initialization time. If it > can not get the pages it needs, it can refuse to start, or configure > itself to use less, or take other action. > Would it be too difficult to modify the application to mmap() the hugepages at startup so they are no longer free in the global pool but rather get marked as reserved so other applications cannot map them? That should return MAP_FAILED if there is an insufficient number of hugepages available to be reserved (HugePages_Rsvd in /proc/meminfo). 
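A rough sketch of the mmap() approach suggested above, for anyone who wants to try it. It assumes an anonymous MAP_HUGETLB mapping rather than a file on a hugetlbfs mount, and it relies on hugetlb mappings being reserved at mmap() time unless MAP_NORESERVE is passed. reserve_hugepages(), read_meminfo(), and parse_meminfo_line() are made-up helper names, not existing APIs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Parse one /proc/meminfo line such as "HugePages_Rsvd:       3".
 * Returns 1 and stores the value in *out if the line is for `field`,
 * otherwise returns 0 and leaves *out untouched.
 */
int parse_meminfo_line(const char *line, const char *field, long *out)
{
	size_t len;

	assert(line && field && out);
	len = strlen(field);
	if (strncmp(line, field, len) != 0 || line[len] != ':')
		return 0;
	return sscanf(line + len + 1, "%ld", out) == 1;
}

/* Return the current value of `field` in /proc/meminfo, or -1 on failure. */
long read_meminfo(const char *field)
{
	char line[256];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (parse_meminfo_line(line, field, &val))
			break;
	fclose(f);
	return val;
}

/*
 * Map `bytes` of anonymous huge-page memory without touching it.
 * `bytes` should be a multiple of the huge page size.  Returns the
 * mapping, or NULL if mmap() returned MAP_FAILED, for example because
 * too few huge pages are free to back the reservation.
 */
void *reserve_hugepages(size_t bytes)
{
	void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

Calling read_meminfo("HugePages_Rsvd") before and after reserve_hugepages() is one way to see whether the reservation was taken. The mapping only holds the pages for as long as it stays mapped.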
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754182AbbCFVdG (ORCPT ); Fri, 6 Mar 2015 16:33:06 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:26535 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751276AbbCFVdD (ORCPT ); Fri, 6 Mar 2015 16:33:03 -0500 Message-ID: <54FA1CFE.1000500@oracle.com> Date: Fri, 06 Mar 2015 13:32:46 -0800 From: Mike Kravetz User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: David Rientjes CC: Michal Hocko , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Aneesh Kumar , Joonsoo Kim Subject: Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time References: <1425077893-18366-1-git-send-email-mike.kravetz@oracle.com> <20150302151009.2ae58f4430f9f34b81533821@linux-foundation.org> <54F50BD6.1030706@oracle.com> <20150306151045.GA23443@dhcp22.suse.cz> <54F9F8F1.4020203@oracle.com> In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/06/2015 01:14 PM, David Rientjes wrote: > On Fri, 6 Mar 2015, Mike Kravetz wrote: > >> Thanks for the CONFIG_CGROUP_HUGETLB suggestion, however I do not >> believe this will be a satisfactory solution for my usecase. As you >> point out, cgroups could be set up (by a sysadmin) for every hugetlb >> user/application. In this case, the sysadmin needs to have knowledge >> of every huge page user/application and configure appropriately. >> >> I was approaching this from the point of view of the application. The >> application wants the guarantee of a minimum number of huge pages, >> independent of other users/applications. 
The "reserve" approach allows >> the application to set aside those pages at initialization time. If it >> can not get the pages it needs, it can refuse to start, or configure >> itself to use less, or take other action. >> > > Would it be too difficult to modify the application to mmap() the > hugepages at startup so they are no longer free in the global pool but > rather get marked as reserved so other applications cannot map them? That > should return MAP_FAILED if there is an insufficient number of hugepages > available to be reserved (HugePages_Rsvd in /proc/meminfo). The application is a database with multiple processes/tasks that will come and go over time. I thought about having one task do a big mmap() at initialization time, but then the issue is how to coordinate with the other tasks and their requests to allocate/free pages. -- Mike Kravetz