From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kirill A. Shutemov"
Subject: [PATCH, RFC 00/16] Transparent huge page cache
Date: Mon, 28 Jan 2013 11:24:12 +0200
Message-ID: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm@kvack.org, Andi Kleen, Matthew Wilcox, "Kirill A. Shutemov", linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
To: Andrea Arcangeli, Andrew Morton, Al Viro
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

From: "Kirill A. Shutemov"

Here are the first steps towards huge pages in the page cache.

The intent of the work is to get the code ready to enable transparent huge page cache for the simplest fs -- ramfs.

It's not yet near feature-complete. It only provides basic infrastructure. At the moment we can read, write and truncate a file on ramfs with huge pages in the page cache. The most interesting part, mmap(), is not yet there. For now we split the huge page on an mmap() attempt.

I can't say that I see the whole picture. I'm not sure if I understand the locking model around split_huge_page(). Probably not. Andrea, could you check if it looks correct?

Next steps (not necessarily in this order):
 - mmap();
 - migration (?);
 - collapse;
 - stats, knobs, etc.;
 - tmpfs/shmem enabling;
 - ...

Kirill A.
Shutemov (16): block: implement add_bdi_stat() mm: implement zero_huge_user_segment and friends mm: drop actor argument of do_generic_file_read() radix-tree: implement preload for multiple contiguous elements thp, mm: basic defines for transparent huge page cache thp, mm: rewrite add_to_page_cache_locked() to support huge pages thp, mm: rewrite delete_from_page_cache() to support huge pages thp, mm: locking tail page is a bug thp, mm: handle tail pages in page_cache_get_speculative() thp, mm: implement grab_cache_huge_page_write_begin() thp, mm: naive support of thp in generic read/write routines thp, libfs: initial support of thp in simple_read/write_begin/write_end thp: handle file pages in split_huge_page() thp, mm: truncate support for transparent huge page cache thp, mm: split huge page on mmap file page ramfs: enable transparent huge page cache fs/libfs.c | 54 +++++++++--- fs/ramfs/inode.c | 6 +- include/linux/backing-dev.h | 10 +++ include/linux/huge_mm.h | 8 ++ include/linux/mm.h | 15 ++++ include/linux/pagemap.h | 14 ++- include/linux/radix-tree.h | 3 + lib/radix-tree.c | 32 +++++-- mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- mm/huge_memory.c | 62 +++++++++++-- mm/memory.c | 22 +++++ mm/truncate.c | 12 +++ 12 files changed, 375 insertions(+), 67 deletions(-) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 01/16] block: implement add_bdi_stat() Date: Mon, 28 Jan 2013 11:24:13 +0200 Message-ID: <1359365068-10147-2-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. 
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" It's required for batched stats update. Signed-off-by: Kirill A. Shutemov --- include/linux/backing-dev.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 3504599..b05d961 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi, __add_bdi_stat(bdi, item, -1); } +static inline void add_bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item, s64 amount) +{ + unsigned long flags; + + local_irq_save(flags); + __add_bdi_stat(bdi, item, amount); + local_irq_restore(flags); +} + static inline void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item) { -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 03/16] mm: drop actor argument of do_generic_file_read() Date: Mon, 28 Jan 2013 11:24:15 +0200 Message-ID: <1359365068-10147-4-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: Received: from mga09.intel.com ([134.134.136.24]:39006 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755710Ab3A1JXf (ORCPT ); Mon, 28 Jan 2013 04:23:35 -0500 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: From: "Kirill A. Shutemov" There's only one caller of do_generic_file_read() and the only actor is file_read_actor(). No reason to have a callback parameter. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index c610076..b6a6d7e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1070,7 +1070,6 @@ static void shrink_readahead_size_eio(struct file *filp, * @filp: the file to read * @ppos: current file position * @desc: read_descriptor - * @actor: read method * * This is a generic file read routine, and uses the * mapping->a_ops->readpage() function for the actual low-level stuff. @@ -1079,7 +1078,7 @@ static void shrink_readahead_size_eio(struct file *filp, * of the logic when it comes to error handling etc. */ static void do_generic_file_read(struct file *filp, loff_t *ppos, - read_descriptor_t *desc, read_actor_t actor) + read_descriptor_t *desc) { struct address_space *mapping = filp->f_mapping; struct inode *inode = mapping->host; @@ -1180,13 +1179,14 @@ page_ok: * Ok, we have the page, and it's up-to-date, so * now we can copy it to user space... * - * The actor routine returns how many bytes were actually used.. + * The file_read_actor routine returns how many bytes were + * actually used.. * NOTE! This may not be the same as how much of a user buffer * we filled up (we may be padding etc), so we can only update * "pos" here (the actor routine has to update the user buffer * pointers and the remaining count). 
*/ - ret = actor(desc, page, offset, nr); + ret = file_read_actor(desc, page, offset, nr); offset += ret; index += offset >> PAGE_CACHE_SHIFT; offset &= ~PAGE_CACHE_MASK; @@ -1459,7 +1459,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, if (desc.count == 0) continue; desc.error = 0; - do_generic_file_read(filp, ppos, &desc, file_read_actor); + do_generic_file_read(filp, ppos, &desc); retval += desc.written; if (desc.error) { retval = retval ?: desc.error; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 08/16] thp, mm: locking tail page is a bug Date: Mon, 28 Jan 2013 11:24:20 +0200 Message-ID: <1359365068-10147-9-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" Signed-off-by: Kirill A. 
Shutemov --- mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index a4b4fd5..f59eaa1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -665,6 +665,7 @@ void __lock_page(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + VM_BUG_ON(PageTail(page)); __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page, TASK_UNINTERRUPTIBLE); } @@ -674,6 +675,7 @@ int __lock_page_killable(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + VM_BUG_ON(PageTail(page)); return __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page_killable, TASK_KILLABLE); } -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Mon, 28 Jan 2013 11:24:18 +0200 Message-ID: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the following indexes. Signed-off-by: Kirill A. 
Shutemov --- mm/filemap.c | 75 +++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 53 insertions(+), 22 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index b6a6d7e..fa2fdab 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) { int error; + int nr = 1; VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) - goto out; + return error; - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); - if (error == 0) { - page_cache_get(page); - page->mapping = mapping; - page->index = offset; + if (PageTransHuge(page)) { + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR); + nr = HPAGE_CACHE_NR; + } + error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM); + if (error) { + mem_cgroup_uncharge_cache_page(page); + return error; + } - spin_lock_irq(&mapping->tree_lock); - error = radix_tree_insert(&mapping->page_tree, offset, page); - if (likely(!error)) { - mapping->nrpages++; - __inc_zone_page_state(page, NR_FILE_PAGES); - spin_unlock_irq(&mapping->tree_lock); - } else { - page->mapping = NULL; - /* Leave page->index set: truncation relies upon it */ - spin_unlock_irq(&mapping->tree_lock); - mem_cgroup_uncharge_cache_page(page); - page_cache_release(page); + page_cache_get(page); + spin_lock_irq(&mapping->tree_lock); + page->mapping = mapping; + if (PageTransHuge(page)) { + int i; + for (i = 0; i < HPAGE_CACHE_NR; i++) { + page_cache_get(page + i); + page[i].index = offset + i; + error = radix_tree_insert(&mapping->page_tree, + offset + i, page + i); + if (error) { + page_cache_release(page + i); + break; + } } - radix_tree_preload_end(); - } else - mem_cgroup_uncharge_cache_page(page); -out: + if (error) { + if (i 
> 0 && error == EEXIST) + error = ENOSPC; /* no space for a huge page */ + for (i--; i > 0; i--) { + page_cache_release(page + i); + radix_tree_delete(&mapping->page_tree, + offset + i); + } + goto err; + } + } else { + page->index = offset; + error = radix_tree_insert(&mapping->page_tree, offset, page); + if (unlikely(error)) + goto err; + } + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr); + mapping->nrpages += nr; + spin_unlock_irq(&mapping->tree_lock); + radix_tree_preload_end(); + return 0; +err: + page->mapping = NULL; + /* Leave page->index set: truncation relies upon it */ + spin_unlock_irq(&mapping->tree_lock); + radix_tree_preload_end(); + mem_cgroup_uncharge_cache_page(page); + page_cache_release(page); return error; } EXPORT_SYMBOL(add_to_page_cache_locked); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 09/16] thp, mm: handle tail pages in page_cache_get_speculative() Date: Mon, 28 Jan 2013 11:24:21 +0200 Message-ID: <1359365068-10147-10-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" For tail page we call __get_page_tail(). It has the same semantics, but for tail page. Signed-off-by: Kirill A. 
Shutemov --- include/linux/pagemap.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 0e38e13..1da2043 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -149,6 +149,9 @@ static inline int page_cache_get_speculative(struct page *page) { VM_BUG_ON(in_interrupt()); + if (unlikely(PageTail(page))) + return __get_page_tail(page); + #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) # ifdef CONFIG_PREEMPT_COUNT VM_BUG_ON(!in_atomic()); @@ -175,7 +178,6 @@ static inline int page_cache_get_speculative(struct page *page) return 0; } #endif - VM_BUG_ON(PageTail(page)); return 1; } -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 04/16] radix-tree: implement preload for multiple contiguous elements Date: Mon, 28 Jan 2013 11:24:16 +0200 Message-ID: <1359365068-10147-5-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" Currently radix_tree_preload() only guarantees enough nodes to insert one element. It's a hard limit. You cannot batch a number of inserts under one tree_lock. This patch introduces radix_tree_preload_count(). It allows preallocating enough nodes to insert a number of *contiguous* elements. 
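A rough sketch of the sizing arithmetic behind radix_tree_preload_count(), written as a user-space C model rather than kernel code. The constants (a map shift of 6 and 64-bit indexes) are assumptions for a typical 64-bit build and are not taken from this patch; the shortened names and the size == 0 check are additions of this sketch only.

```c
/* Assumed values for a typical 64-bit build; hypothetical short names. */
#define MAP_SHIFT	6			/* assumed RADIX_TREE_MAP_SHIFT */
#define MAP_SIZE	(1UL << MAP_SHIFT)	/* slots per radix tree node */
#define INDEX_BITS	64			/* assumed width of an index */
#define PRELOAD_NR	512			/* RADIX_TREE_PRELOAD_NR in the patch */

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Tree height needed to cover the whole index space. */
#define MAX_PATH	DIV_ROUND_UP(INDEX_BITS, MAP_SHIFT)
/* Worst case for one item: two new branches sharing only the root. */
#define PRELOAD_MIN	(MAX_PATH * 2 - 1)

/*
 * Number of nodes to preload for `size` contiguous items: the single-item
 * worst case plus one extra node per node boundary the range may cross.
 * Returns -1 where the patch would return -ENOMEM. Rejecting size == 0 is
 * an addition of this sketch.
 */
static int preload_nodes(unsigned size)
{
	if (size == 0 || size > PRELOAD_NR)
		return -1;
	return (int)(PRELOAD_MIN + DIV_ROUND_UP(size - 1, MAP_SIZE));
}
```

Under these assumptions preload_nodes(1) gives 21 nodes, the same as the old single-element preload, and preload_nodes(512) gives 29, so batching a whole huge page costs 8 extra preallocated nodes.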
Signed-off-by: Matthew Wilcox Signed-off-by: Kirill A. Shutemov --- include/linux/radix-tree.h | 3 +++ lib/radix-tree.c | 32 +++++++++++++++++++++++++------- 2 files changed, 28 insertions(+), 7 deletions(-) diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index ffc444c..81318cb 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -83,6 +83,8 @@ do { \ (root)->rnode = NULL; \ } while (0) +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */ + /** * Radix-tree synchronization * @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long radix_tree_prev_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); int radix_tree_preload(gfp_t gfp_mask); +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index e796429..9bef0ac 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep; * The worst case is a zero height tree with just a single item at index 0, * and then inserting an item at index ULONG_MAX. This requires 2 new branches * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. + * + * Worst case for adding N contiguous items is adding entries at indexes + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case + * item plus extra nodes if you cross the boundary from one node to the next. 
+ * * Hence: */ -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MAX \ + (RADIX_TREE_PRELOAD_MIN + \ + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE)) /* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { int nr; - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE]; + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX]; }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node) /* * Load up this CPU's radix_tree_node buffer with sufficient objects to - * ensure that the addition of a single element in the tree cannot fail. On - * success, return zero, with preemption disabled. On error, return -ENOMEM + * ensure that the addition of *contiguous* elements in the tree cannot fail. + * On success, return zero, with preemption disabled. On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_WAIT being passed to INIT_RADIX_TREE(). 
*/ -int radix_tree_preload(gfp_t gfp_mask) +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask) { struct radix_tree_preload *rtp; struct radix_tree_node *node; int ret = -ENOMEM; + int alloc = RADIX_TREE_PRELOAD_MIN + + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE); + + if (size > RADIX_TREE_PRELOAD_NR) + return -ENOMEM; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) { + while (rtp->nr < alloc) { preempt_enable(); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - if (rtp->nr < ARRAY_SIZE(rtp->nodes)) + if (rtp->nr < alloc) rtp->nodes[rtp->nr++] = node; else kmem_cache_free(radix_tree_node_cachep, node); @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask) out: return ret; } + +int radix_tree_preload(gfp_t gfp_mask) +{ + return radix_tree_preload_count(1, gfp_mask); +} EXPORT_SYMBOL(radix_tree_preload); /* -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 12/16] thp, libfs: initial support of thp in simple_read/write_begin/write_end Date: Mon, 28 Jan 2013 11:24:24 +0200 Message-ID: <1359365068-10147-13-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. 
Shutemov" For now we try to grab a huge cache page if gfp_mask has __GFP_COMP. It's probably too weak a condition and needs to be reworked later. Signed-off-by: Kirill A. Shutemov --- fs/libfs.c | 54 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 12 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index 916da8c..a4530d5 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr); int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); + if (PageTransHuge(page)) + zero_huge_user(page, 0, HPAGE_PMD_SIZE); + else + clear_highpage(page); flush_dcache_page(page); SetPageUptodate(page); unlock_page(page); @@ -394,21 +397,43 @@ int simple_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { - struct page *page; + struct page *page = NULL; pgoff_t index; + gfp_t gfp_mask; index = pos >> PAGE_CACHE_SHIFT; - - page = grab_cache_page_write_begin(mapping, index, flags); + gfp_mask = mapping_gfp_mask(mapping); + + /* XXX: too weak condition. 
Good enough for initial testing */ + if (gfp_mask & __GFP_COMP) { + page = grab_cache_huge_page_write_begin(mapping, + index & ~HPAGE_CACHE_INDEX_MASK, flags); + /* fallback to small page */ + if (!page || !PageTransHuge(page)) { + unsigned long offset; + offset = pos & ~PAGE_CACHE_MASK; + len = min_t(unsigned long, + len, PAGE_CACHE_SIZE - offset); + } + } + if (!page) + page = grab_cache_page_write_begin(mapping, index, flags); if (!page) return -ENOMEM; - *pagep = page; - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) { - unsigned from = pos & (PAGE_CACHE_SIZE - 1); - - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE); + if (!PageUptodate(page)) { + unsigned from; + + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) { + from = pos & ~HPAGE_PMD_MASK; + zero_huge_user_segments(page, 0, from, + from + len, HPAGE_PMD_SIZE); + } else if (len != PAGE_CACHE_SIZE) { + from = pos & ~PAGE_CACHE_MASK; + zero_user_segments(page, 0, from, + from + len, PAGE_CACHE_SIZE); + } } return 0; } @@ -443,9 +468,14 @@ int simple_write_end(struct file *file, struct address_space *mapping, /* zero the stale part of the page if we did a short copy */ if (copied < len) { - unsigned from = pos & (PAGE_CACHE_SIZE - 1); - - zero_user(page, from + copied, len - copied); + unsigned from; + if (PageTransHuge(page)) { + from = pos & ~HPAGE_PMD_MASK; + zero_huge_user(page, from + copied, len - copied); + } else { + from = pos & ~PAGE_CACHE_MASK; + zero_user(page, from + copied, len - copied); + } } if (!PageUptodate(page)) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. 
Shutemov" Subject: [PATCH, RFC 05/16] thp, mm: basic defines for transparent huge page cache Date: Mon, 28 Jan 2013 11:24:17 +0200 Message-ID: <1359365068-10147-6-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" Signed-off-by: Kirill A. Shutemov --- include/linux/huge_mm.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ee1c244..a54939c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page, #define HPAGE_PMD_MASK HPAGE_MASK #define HPAGE_PMD_SIZE HPAGE_SIZE +#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT) +#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER) +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1) + extern bool is_vma_temporary_stack(struct vm_area_struct *vma); #define transparent_hugepage_enabled(__vma) \ @@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; }) + #define hpage_nr_pages(x) 1 #define transparent_hugepage_enabled(__vma) 0 -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 11/16] thp, mm: naive support of thp in generic read/write routines Date: Mon, 28 Jan 2013 11:24:23 +0200 Message-ID: <1359365068-10147-12-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" For now we still write/read at most PAGE_CACHE_SIZE bytes a time. This implementation doesn't cover address spaces with backing store. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 35 ++++++++++++++++++++++++++++++----- 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 68e47e4..a7331fb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1161,12 +1161,23 @@ find_page: if (unlikely(page == NULL)) goto no_cached_page; } + if (PageTransTail(page)) { + page_cache_release(page); + page = find_get_page(mapping, + index & ~HPAGE_CACHE_INDEX_MASK); + if (!PageTransHuge(page)) { + page_cache_release(page); + goto find_page; + } + } if (PageReadahead(page)) { + BUG_ON(PageTransHuge(page)); page_cache_async_readahead(mapping, ra, filp, page, index, last_index - index); } if (!PageUptodate(page)) { + BUG_ON(PageTransHuge(page)); if (inode->i_blkbits == PAGE_CACHE_SHIFT || !mapping->a_ops->is_partially_uptodate) goto page_not_up_to_date; @@ -1208,18 +1219,25 @@ page_ok: } nr = nr - offset; + /* Recalculate offset in page if we've got a huge page */ + if (PageTransHuge(page)) { + offset = (((loff_t)index << PAGE_CACHE_SHIFT) + 
offset); + offset &= ~HPAGE_PMD_MASK; + } + /* If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT)); /* * When a sequential read accesses a page several times, * only mark it as accessed the first time. */ - if (prev_index != index || offset != prev_offset) + if (prev_index != index || + (offset & ~PAGE_CACHE_MASK) != prev_offset) mark_page_accessed(page); prev_index = index; @@ -1234,8 +1252,9 @@ page_ok: * "pos" here (the actor routine has to update the user buffer * pointers and the remaining count). */ - ret = file_read_actor(desc, page, offset, nr); - offset += ret; + ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT), + offset & ~PAGE_CACHE_MASK, nr); + offset = (offset & ~PAGE_CACHE_MASK) + ret; index += offset >> PAGE_CACHE_SHIFT; offset &= ~PAGE_CACHE_MASK; prev_offset = offset; @@ -2433,8 +2452,13 @@ again: if (mapping_writably_mapped(mapping)) flush_dcache_page(page); + if (PageTransHuge(page)) + offset = pos & ~HPAGE_PMD_MASK; + pagefault_disable(); - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); + copied = iov_iter_copy_from_user_atomic( + page + (offset >> PAGE_CACHE_SHIFT), + i, offset & ~PAGE_CACHE_MASK, bytes); pagefault_enable(); flush_dcache_page(page); @@ -2457,6 +2481,7 @@ again: * because not all segments in the iov can be copied at * once without a pagefault. */ + offset = pos & ~PAGE_CACHE_MASK; bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, iov_iter_single_seg_count(i)); goto again; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. 
Shutemov" Subject: [PATCH, RFC 02/16] mm: implement zero_huge_user_segment and friends Date: Mon, 28 Jan 2013 11:24:14 +0200 Message-ID: <1359365068-10147-3-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" Let's add helpers to clear huge page segment(s). They provide the same functionality as zero_user_segment{,s} and zero_user, but for huge pages. Signed-off-by: Kirill A. Shutemov --- include/linux/mm.h | 15 +++++++++++++++ mm/memory.c | 22 ++++++++++++++++++++++ 2 files changed, 37 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index e4533a1..c011771 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1728,6 +1728,21 @@ extern void dump_page(struct page *page); extern void clear_huge_page(struct page *page, unsigned long addr, unsigned int pages_per_huge_page); +extern void zero_huge_user_segment(struct page *page, + unsigned start, unsigned end); +static inline void zero_huge_user_segments(struct page *page, + unsigned start1, unsigned end1, + unsigned start2, unsigned end2) +{ + zero_huge_user_segment(page, start1, end1); + zero_huge_user_segment(page, start2, end2); +} +static inline void zero_huge_user(struct page *page, + unsigned start, unsigned len) +{ + zero_huge_user_segment(page, start, start+len); +} + extern void copy_user_huge_page(struct page *dst, struct page *src, unsigned long addr, struct vm_area_struct *vma, unsigned int pages_per_huge_page); diff --git a/mm/memory.c b/mm/memory.c index c04078b..200a74d 
100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4185,6 +4185,28 @@ void clear_huge_page(struct page *page, } } +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end) +{ + int i; + + BUG_ON(end < start); + + might_sleep(); + + /* start and end are on the same small page */ + if ((start & PAGE_MASK) == (end & PAGE_MASK)) + return zero_user_segment(page + (start >> PAGE_SHIFT), + start & ~PAGE_MASK, end & ~PAGE_MASK); + + zero_user_segment(page + (start >> PAGE_SHIFT), + start & ~PAGE_MASK, PAGE_SIZE); + for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) { + cond_resched(); + clear_highpage(page + i); + } + zero_user_segment(page + i, 0, end & ~PAGE_MASK); +} + static void copy_user_gigantic_page(struct page *dst, struct page *src, unsigned long addr, struct vm_area_struct *vma, -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 13/16] thp: handle file pages in split_huge_page() Date: Mon, 28 Jan 2013 11:24:25 +0200 Message-ID: <1359365068-10147-14-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" The base scheme is the same as for anonymous pages, but we walk mapping->i_mmap rather than anon_vma->rb_root. 
__split_huge_page_refcount() has been tuned a bit: we need to transfer PG_swapbacked to tail pages. Splitting of mapped pages hasn't been tested at all, since we cannot mmap() file-backed huge pages yet. Signed-off-by: Kirill A. Shutemov --- mm/huge_memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 53 insertions(+), 9 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c63a21d..008b2c9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1613,7 +1613,8 @@ static void __split_huge_page_refcount(struct page *page) ((1L << PG_referenced) | (1L << PG_swapbacked) | (1L << PG_mlocked) | - (1L << PG_uptodate))); + (1L << PG_uptodate) | + (1L << PG_swapbacked))); page_tail->flags |= (1L << PG_dirty); /* clear PageTail before overwriting first_page */ @@ -1641,10 +1642,8 @@ static void __split_huge_page_refcount(struct page *page) page_tail->index = page->index + i; page_xchg_last_nid(page_tail, page_last_nid(page)); - BUG_ON(!PageAnon(page_tail)); BUG_ON(!PageUptodate(page_tail)); BUG_ON(!PageDirty(page_tail)); - BUG_ON(!PageSwapBacked(page_tail)); lru_add_page_tail(page, page_tail, lruvec); } @@ -1752,7 +1751,7 @@ static int __split_huge_page_map(struct page *page, } /* must be called with anon_vma->root->rwsem held */ -static void __split_huge_page(struct page *page, +static void __split_anon_huge_page(struct page *page, struct anon_vma *anon_vma) { int mapcount, mapcount2; @@ -1799,14 +1798,11 @@ static void __split_huge_page(struct page *page, BUG_ON(mapcount != mapcount2); } -int split_huge_page(struct page *page) +static int split_anon_huge_page(struct page *page) { struct anon_vma *anon_vma; int ret = 1; - BUG_ON(is_huge_zero_pfn(page_to_pfn(page))); - BUG_ON(!PageAnon(page)); - /* * The caller does not necessarily hold an mmap_sem that would prevent * the anon_vma disappearing so we first we take a reference to it @@ -1824,7 +1820,7 @@ int split_huge_page(struct page *page) goto out_unlock;
BUG_ON(!PageSwapBacked(page)); - __split_huge_page(page, anon_vma); + __split_anon_huge_page(page, anon_vma); count_vm_event(THP_SPLIT); BUG_ON(PageCompound(page)); @@ -1835,6 +1831,54 @@ out: return ret; } +static int split_file_huge_page(struct page *page) +{ + struct address_space *mapping = page->mapping; + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + struct vm_area_struct *vma; + int mapcount, mapcount2; + + BUG_ON(!PageHead(page)); + BUG_ON(PageTail(page)); + + mutex_lock(&mapping->i_mmap_mutex); + mapcount = 0; + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + unsigned long addr = vma_address(page, vma); + mapcount += __split_huge_page_splitting(page, vma, addr); + } + + if (mapcount != page_mapcount(page)) + printk(KERN_ERR "mapcount %d page_mapcount %d\n", + mapcount, page_mapcount(page)); + BUG_ON(mapcount != page_mapcount(page)); + + __split_huge_page_refcount(page); + + mapcount2 = 0; + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + unsigned long addr = vma_address(page, vma); + mapcount2 += __split_huge_page_map(page, vma, addr); + } + + if (mapcount != mapcount2) + printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n", + mapcount, mapcount2, page_mapcount(page)); + BUG_ON(mapcount != mapcount2); + mutex_unlock(&mapping->i_mmap_mutex); + return 0; +} + +int split_huge_page(struct page *page) +{ + BUG_ON(is_huge_zero_pfn(page_to_pfn(page))); + + if (PageAnon(page)) + return split_anon_huge_page(page); + else + return split_file_huge_page(page); +} + #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE) int hugepage_madvise(struct vm_area_struct *vma, -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. 
Shutemov" Subject: [PATCH, RFC 14/16] thp, mm: truncate support for transparent huge page cache Date: Mon, 28 Jan 2013 11:24:26 +0200 Message-ID: <1359365068-10147-15-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" If the starting position of truncation is in a tail page, we have to split the huge page first. We also have to split if end is within the huge page. Otherwise we can truncate the whole huge page at once. Signed-off-by: Kirill A. Shutemov --- mm/truncate.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/truncate.c b/mm/truncate.c index c75b736..87c247d 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping, if (index > end) break; + /* split page if we start from tail page */ + if (PageTransTail(page)) + split_huge_page(compound_trans_head(page)); + if (PageTransHuge(page)) { + /* split if end is within huge page */ + if (index == (end & ~HPAGE_CACHE_INDEX_MASK)) + split_huge_page(page); + else + /* skip tail pages */ + i += HPAGE_CACHE_NR - 1; + } if (!trylock_page(page)) continue; WARN_ON(page->index != index); @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping, if (index > end) break; + VM_BUG_ON(PageTransHuge(page)); lock_page(page); WARN_ON(page->index != index); wait_on_page_writeback(page); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 07/16] thp, mm: rewrite delete_from_page_cache() to support huge pages Date: Mon, 28 Jan 2013 11:24:19 +0200 Message-ID: <1359365068-10147-8-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a time. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index fa2fdab..a4b4fd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -112,6 +112,7 @@ void __delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + int nr = 1; /* * if we're uptodate, flush out into the cleancache, otherwise @@ -123,13 +124,23 @@ void __delete_from_page_cache(struct page *page) else cleancache_invalidate_page(mapping, page); - radix_tree_delete(&mapping->page_tree, page->index); + if (PageTransHuge(page)) { + int i; + + for (i = 0; i < HPAGE_CACHE_NR; i++) + radix_tree_delete(&mapping->page_tree, page->index + i); + nr = HPAGE_CACHE_NR; + } else { + radix_tree_delete(&mapping->page_tree, page->index); + } + page->mapping = NULL; /* Leave page->index set: truncation lookup relies upon it */ - mapping->nrpages--; - __dec_zone_page_state(page, NR_FILE_PAGES); + + mapping->nrpages -= nr; +
__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr); if (PageSwapBacked(page)) - __dec_zone_page_state(page, NR_SHMEM); + __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr); BUG_ON(page_mapped(page)); /* @@ -140,8 +151,8 @@ void __delete_from_page_cache(struct page *page) * having removed the page entirely. */ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { - dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr); + add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr); } } @@ -157,6 +168,7 @@ void delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; void (*freepage)(struct page *); + int i; BUG_ON(!PageLocked(page)); @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page) if (freepage) freepage(page); + if (PageTransHuge(page)) + for (i = 1; i < HPAGE_CACHE_NR; i++) + page_cache_release(page); page_cache_release(page); } EXPORT_SYMBOL(delete_from_page_cache); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 15/16] thp, mm: split huge page on mmap file page Date: Mon, 28 Jan 2013 11:24:27 +0200 Message-ID: <1359365068-10147-16-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" We are not ready to mmap file-backed transparent huge pages. Let's split them on mmap() attempt. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index a7331fb..2e08582 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1692,6 +1692,8 @@ retry_find: goto no_cached_page; } + if (PageTransCompound(page)) + split_huge_page(page); if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { page_cache_release(page); return ret | VM_FAULT_RETRY; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 10/16] thp, mm: implement grab_cache_huge_page_write_begin() Date: Mon, 28 Jan 2013 11:24:22 +0200 Message-ID: <1359365068-10147-11-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" The function is a twin of grab_cache_page_write_begin(), but it tries to allocate a huge page at the given position, aligned to HPAGE_CACHE_NR. If, for some reason, it's not possible to allocate a huge page at this position, it returns NULL.
The caller should take care of falling back to small pages. Signed-off-by: Kirill A. Shutemov --- include/linux/pagemap.h | 10 +++++++++ mm/filemap.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 1da2043..5836d0d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -260,6 +260,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags); +#else +static inline struct page *grab_cache_huge_page_write_begin( + struct address_space *mapping, pgoff_t index, unsigned flags) +{ + return NULL; +} +#endif /* * Returns locked page at given index in given cache, creating it if needed. diff --git a/mm/filemap.c b/mm/filemap.c index f59eaa1..68e47e4 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2328,6 +2328,61 @@ found: } EXPORT_SYMBOL(grab_cache_page_write_begin); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * Find or create a huge page at the given pagecache position, aligned to + * HPAGE_CACHE_NR. Return the locked huge page. + * + * If, for some reason, it's not possible to allocate a huge page at this + * position, it returns NULL. Caller should take care of fallback to small + * pages. + * + * This function is specifically for buffered writes.
+ */ +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + int status; + gfp_t gfp_mask; + struct page *page; + gfp_t gfp_notmask = 0; + + BUG_ON(index & HPAGE_CACHE_INDEX_MASK); + gfp_mask = mapping_gfp_mask(mapping); + BUG_ON(!(gfp_mask & __GFP_COMP)); + if (mapping_cap_account_dirty(mapping)) + gfp_mask |= __GFP_WRITE; + if (flags & AOP_FLAG_NOFS) + gfp_notmask = __GFP_FS; +repeat: + page = find_lock_page(mapping, index); + if (page) { + if (!PageTransHuge(page)) { + unlock_page(page); + page_cache_release(page); + return NULL; + } + goto found; + } + + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER); + if (!page) + return NULL; + + status = add_to_page_cache_lru(page, mapping, index, + GFP_KERNEL & ~gfp_notmask); + if (unlikely(status)) { + page_cache_release(page); + if (status == -EEXIST) + goto repeat; + return NULL; + } +found: + wait_on_page_writeback(page); + return page; +} +#endif + static ssize_t generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos) { -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: [PATCH, RFC 16/16] ramfs: enable transparent huge page cache Date: Mon, 28 Jan 2013 11:24:28 +0200 Message-ID: <1359365068-10147-17-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: "Kirill A. Shutemov" ramfs is the simplest fs from the page cache point of view. Let's start enabling transparent huge page cache here. For now we allocate only non-movable huge pages. It's not yet clear whether movable pages are safe here and what needs to be done to make them safe. Signed-off-by: Kirill A. Shutemov --- fs/ramfs/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index eab8c09..591457d 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb, inode_init_owner(inode, dir, mode); inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + /* + * TODO: what should be done to make movable safe? + */ + mapping_set_gfp_mask(inode->i_mapping, + GFP_TRANSHUGE & ~__GFP_MOVABLE); mapping_set_unevictable(inode->i_mapping); inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Mon, 28 Jan 2013 21:03:41 -0800 (PST) Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A.
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: "Kirill A. Shutemov" Return-path: In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: > From: "Kirill A. Shutemov" > > Here's first steps towards huge pages in page cache. > > The intend of the work is get code ready to enable transparent huge page > cache for the most simple fs -- ramfs. > > It's not yet near feature-complete. It only provides basic infrastructure. > At the moment we can read, write and truncate file on ramfs with huge pages in > page cache. The most interesting part, mmap(), is not yet there. For now > we split huge page on mmap() attempt. > > I can't say that I see whole picture. I'm not sure if I understand locking > model around split_huge_page(). Probably, not. > Andrea, could you check if it looks correct? > > Next steps (not necessary in this order): > - mmap(); > - migration (?); > - collapse; > - stats, knobs, etc.; > - tmpfs/shmem enabling; > - ... > > Kirill A. 
Shutemov (16): > block: implement add_bdi_stat() > mm: implement zero_huge_user_segment and friends > mm: drop actor argument of do_generic_file_read() > radix-tree: implement preload for multiple contiguous elements > thp, mm: basic defines for transparent huge page cache > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > thp, mm: rewrite delete_from_page_cache() to support huge pages > thp, mm: locking tail page is a bug > thp, mm: handle tail pages in page_cache_get_speculative() > thp, mm: implement grab_cache_huge_page_write_begin() > thp, mm: naive support of thp in generic read/write routines > thp, libfs: initial support of thp in > simple_read/write_begin/write_end > thp: handle file pages in split_huge_page() > thp, mm: truncate support for transparent huge page cache > thp, mm: split huge page on mmap file page > ramfs: enable transparent huge page cache > > fs/libfs.c | 54 +++++++++--- > fs/ramfs/inode.c | 6 +- > include/linux/backing-dev.h | 10 +++ > include/linux/huge_mm.h | 8 ++ > include/linux/mm.h | 15 ++++ > include/linux/pagemap.h | 14 ++- > include/linux/radix-tree.h | 3 + > lib/radix-tree.c | 32 +++++-- > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 62 +++++++++++-- > mm/memory.c | 22 +++++ > mm/truncate.c | 12 +++ > 12 files changed, 375 insertions(+), 67 deletions(-) Interesting. I was starting to think about Transparent Huge Pagecache a few months ago, but then got washed away by incoming waves as usual. Certainly I don't have a line of code to show for it; but my first impression of your patches is that we have very different ideas of where to start. Perhaps that's good complementarity, or perhaps I'll disagree with your approach. I'll be taking a look at yours in the coming days, and trying to summon back up my own ideas to summarize them for you. 
Perhaps I was naive to imagine it, but I did intend to start out generically, independent of filesystem; but content to narrow down on tmpfs alone where it gets hard to support the others (writeback springs to mind). khugepaged would be migrating little pages into huge pages, where it saw that the mmaps of the file would benefit (and for testing I would hack mmap alignment choice to favour it). I had arrived at a conviction that the first thing to change was the way that tail pages of a THP are refcounted, that it had been a mistake to use the compound page method of holding the THP together. But I'll have to enter a trance now to recall the arguments ;) Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hillf Danton Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Tue, 29 Jan 2013 20:11:01 +0800 Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML To: "Kirill A. Shutemov" Return-path: In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. 
Shutemov wrote: > @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > pgoff_t offset, gfp_t gfp_mask) > { > int error; > + int nr = 1; > > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > error = mem_cgroup_cache_charge(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > - goto out; > + return error; Due to PageCompound check, thp could not be charged effectively. Any change added for charging it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hillf Danton Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Tue, 29 Jan 2013 20:14:04 +0800 Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML To: "Kirill A. Shutemov" Return-path: In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov wrote: > + page_cache_get(page); > + spin_lock_irq(&mapping->tree_lock); > + page->mapping = mapping; > + if (PageTransHuge(page)) { > + int i; > + for (i = 0; i < HPAGE_CACHE_NR; i++) { > + page_cache_get(page + i); Page count is raised twice for head page, why? 
> + page[i].index = offset + i; > + error = radix_tree_insert(&mapping->page_tree, > + offset + i, page + i); > + if (error) { > + page_cache_release(page + i); > + break; > + } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hillf Danton Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Tue, 29 Jan 2013 20:26:09 +0800 Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML To: "Kirill A. Shutemov" Return-path: In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov wrote: > + page_cache_get(page); > + spin_lock_irq(&mapping->tree_lock); > + page->mapping = mapping; > + if (PageTransHuge(page)) { > + int i; > + for (i = 0; i < HPAGE_CACHE_NR; i++) { > + page_cache_get(page + i); > + page[i].index = offset + i; > + error = radix_tree_insert(&mapping->page_tree, > + offset + i, page + i); > + if (error) { > + page_cache_release(page + i); > + break; > + } Is page count balanced with the following? @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page) if (freepage) freepage(page); + if (PageTransHuge(page)) + for (i = 1; i < HPAGE_CACHE_NR; i++) + page_cache_release(page); page_cache_release(page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Tue, 29 Jan 2013 14:48:37 +0200 Message-ID: <5107c525eb8d1_f167d78c8425f9@blue.mail> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML To: Hillf Danton , "Kirill A. Shutemov" Return-path: Received: from mga03.intel.com ([143.182.124.21]:64169 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751457Ab3A2Mso (ORCPT ); Tue, 29 Jan 2013 07:48:44 -0500 In-Reply-To: Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Hillf Danton wrote: > On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov > wrote: > > + page_cache_get(page); > > + spin_lock_irq(&mapping->tree_lock); > > + page->mapping = mapping; > > + if (PageTransHuge(page)) { > > + int i; > > + for (i = 0; i < HPAGE_CACHE_NR; i++) { > > + page_cache_get(page + i); > > + page[i].index = offset + i; > > + error = radix_tree_insert(&mapping->page_tree, > > + offset + i, page + i); > > + if (error) { > > + page_cache_release(page + i); > > + break; > > + } > > Is page count balanced with the following? It's broken. Last-minute changes are evil :( Thanks for catching it. I'll fix it in the next revision. > @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page) > > if (freepage) > freepage(page); > + if (PageTransHuge(page)) > + for (i = 1; i < HPAGE_CACHE_NR; i++) > + page_cache_release(page); > page_cache_release(page); -- Kirill A.
Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Tue, 29 Jan 2013 15:01:27 +0200 Message-ID: <5107c827d855d_f167d78c842627@blue.mail> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML To: Hillf Danton , "Kirill A. Shutemov" Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hillf Danton wrote: > On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov > wrote: > > @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > pgoff_t offset, gfp_t gfp_mask) > > { > > int error; > > + int nr = 1; > > > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > error = mem_cgroup_cache_charge(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > - goto out; > > + return error; > > Due to PageCompound check, thp could not be charged effectively. > Any change added for charging it? I've missed this. Will fix. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. 
Shutemov" Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Tue, 29 Jan 2013 15:14:58 +0200 Message-ID: <5107cb52e07b1_376199eb7059997@blue.mail> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins , "Kirill A. Shutemov" Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hugh Dickins wrote: > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: > > From: "Kirill A. Shutemov" > > > > Here's first steps towards huge pages in page cache. > > > > The intend of the work is get code ready to enable transparent huge page > > cache for the most simple fs -- ramfs. > > > > It's not yet near feature-complete. It only provides basic infrastructure. > > At the moment we can read, write and truncate file on ramfs with huge pages in > > page cache. The most interesting part, mmap(), is not yet there. For now > > we split huge page on mmap() attempt. > > > > I can't say that I see whole picture. I'm not sure if I understand locking > > model around split_huge_page(). Probably, not. > > Andrea, could you check if it looks correct? > > > > Next steps (not necessary in this order): > > - mmap(); > > - migration (?); > > - collapse; > > - stats, knobs, etc.; > > - tmpfs/shmem enabling; > > - ... > > > > Kirill A. 
Shutemov (16): > > block: implement add_bdi_stat() > > mm: implement zero_huge_user_segment and friends > > mm: drop actor argument of do_generic_file_read() > > radix-tree: implement preload for multiple contiguous elements > > thp, mm: basic defines for transparent huge page cache > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > > thp, mm: rewrite delete_from_page_cache() to support huge pages > > thp, mm: locking tail page is a bug > > thp, mm: handle tail pages in page_cache_get_speculative() > > thp, mm: implement grab_cache_huge_page_write_begin() > > thp, mm: naive support of thp in generic read/write routines > > thp, libfs: initial support of thp in > > simple_read/write_begin/write_end > > thp: handle file pages in split_huge_page() > > thp, mm: truncate support for transparent huge page cache > > thp, mm: split huge page on mmap file page > > ramfs: enable transparent huge page cache > > > > fs/libfs.c | 54 +++++++++--- > > fs/ramfs/inode.c | 6 +- > > include/linux/backing-dev.h | 10 +++ > > include/linux/huge_mm.h | 8 ++ > > include/linux/mm.h | 15 ++++ > > include/linux/pagemap.h | 14 ++- > > include/linux/radix-tree.h | 3 + > > lib/radix-tree.c | 32 +++++-- > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > > mm/huge_memory.c | 62 +++++++++++-- > > mm/memory.c | 22 +++++ > > mm/truncate.c | 12 +++ > > 12 files changed, 375 insertions(+), 67 deletions(-) > > Interesting. > > I was starting to think about Transparent Huge Pagecache a few > months ago, but then got washed away by incoming waves as usual. > > Certainly I don't have a line of code to show for it; but my first > impression of your patches is that we have very different ideas of > where to start. > > Perhaps that's good complementarity, or perhaps I'll disagree with > your approach. I'll be taking a look at yours in the coming days, > and trying to summon back up my own ideas to summarize them for you. 

Yeah, it would be nice to see alternative design ideas. Looking forward to it.

> Perhaps I was naive to imagine it, but I did intend to start out
> generically, independent of filesystem; but content to narrow down
> on tmpfs alone where it gets hard to support the others (writeback
> springs to mind). khugepaged would be migrating little pages into
> huge pages, where it saw that the mmaps of the file would benefit
> (and for testing I would hack mmap alignment choice to favour it).

I don't think all fs at once would fly, but it would be wonderful if I'm wrong :)

> I had arrived at a conviction that the first thing to change was
> the way that tail pages of a THP are refcounted, that it had been a
> mistake to use the compound page method of holding the THP together.
> But I'll have to enter a trance now to recall the arguments ;)

THP refcounting looks reasonable to me, if we take split_huge_page() into account.

--
 Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hugh Dickins
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
Date: Wed, 30 Jan 2013 18:12:05 -0800 (PST)
Message-ID:
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: "Kirill A. Shutemov"
Return-path:
In-Reply-To: <5107cb52e07b1_376199eb7059997@blue.mail>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
> Hugh Dickins wrote:
> > On Mon, 28 Jan 2013, Kirill A.
Shutemov wrote: > > > From: "Kirill A. Shutemov" > > > > > > Here's first steps towards huge pages in page cache. > > > > > > The intend of the work is get code ready to enable transparent huge page > > > cache for the most simple fs -- ramfs. > > > > > > It's not yet near feature-complete. It only provides basic infrastructure. > > > At the moment we can read, write and truncate file on ramfs with huge pages in > > > page cache. The most interesting part, mmap(), is not yet there. For now > > > we split huge page on mmap() attempt. > > > > > > I can't say that I see whole picture. I'm not sure if I understand locking > > > model around split_huge_page(). Probably, not. > > > Andrea, could you check if it looks correct? > > > > > > Next steps (not necessary in this order): > > > - mmap(); > > > - migration (?); > > > - collapse; > > > - stats, knobs, etc.; > > > - tmpfs/shmem enabling; > > > - ... > > > > > > Kirill A. Shutemov (16): > > > block: implement add_bdi_stat() > > > mm: implement zero_huge_user_segment and friends > > > mm: drop actor argument of do_generic_file_read() > > > radix-tree: implement preload for multiple contiguous elements > > > thp, mm: basic defines for transparent huge page cache > > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > > > thp, mm: rewrite delete_from_page_cache() to support huge pages > > > thp, mm: locking tail page is a bug > > > thp, mm: handle tail pages in page_cache_get_speculative() > > > thp, mm: implement grab_cache_huge_page_write_begin() > > > thp, mm: naive support of thp in generic read/write routines > > > thp, libfs: initial support of thp in > > > simple_read/write_begin/write_end > > > thp: handle file pages in split_huge_page() > > > thp, mm: truncate support for transparent huge page cache > > > thp, mm: split huge page on mmap file page > > > ramfs: enable transparent huge page cache > > > > > > fs/libfs.c | 54 +++++++++--- > > > fs/ramfs/inode.c | 6 +- > > > 
include/linux/backing-dev.h | 10 +++
> > > include/linux/huge_mm.h | 8 ++
> > > include/linux/mm.h | 15 ++++
> > > include/linux/pagemap.h | 14 ++-
> > > include/linux/radix-tree.h | 3 +
> > > lib/radix-tree.c | 32 +++++--
> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++--------
> > > mm/huge_memory.c | 62 +++++++++++--
> > > mm/memory.c | 22 +++++
> > > mm/truncate.c | 12 +++
> > > 12 files changed, 375 insertions(+), 67 deletions(-)
> >
> > Interesting.
> >
> > I was starting to think about Transparent Huge Pagecache a few
> > months ago, but then got washed away by incoming waves as usual.
> >
> > Certainly I don't have a line of code to show for it; but my first
> > impression of your patches is that we have very different ideas of
> > where to start.

A second impression confirms that we have very different ideas of
where to start. I don't want to be dismissive, and please don't let
me discourage you, but I just don't find what you have very interesting.

I'm sure you'll agree that the interesting part, and the difficult part,
comes with mmap(); and there's no point whatever to THPages without mmap()
(of course, I'm including exec and brk and shm when I say mmap there).

(There may be performance benefits in working with larger page cache
size, which Christoph Lameter explored a few years back, but that's a
different topic: I think 2MB - if I may be x86_64-centric - would not be
the unit of choice for that, unless SSD erase block were to dominate.)

I'm interested to get to the point of prototyping something that does
support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
a lot about my misconceptions, and have to rework for a while (or give
up!); but I don't see much point in posting anything without that.
I don't know if we have 5 or 50 places which "know" that a THPage
must be Anon: some I'll spot in advance, some I sadly won't.

It's not clear to me that the infrastructural changes you make in this
series will be needed or not, if I pursue my approach: some perhaps as
optimizations on top of the poorly performing base that may emerge from
going about it my way. But for me it's too soon to think about those.

Something I notice that we do agree upon: the radix_tree holding the
4k subpages, at least for now. When I first started thinking towards
THPageCache, I was fascinated by how we could manage the hugepages in
the radix_tree, cutting out unnecessary levels etc; but after a while
I realized that although there's probably nice scope for cleverness
there (significantly constrained by RCU expectations), it would only
be about optimization. Let's be simple and stupid about radix_tree
for now, the problems that need to be worked out lie elsewhere.

> >
> > Perhaps that's good complementarity, or perhaps I'll disagree with
> > your approach. I'll be taking a look at yours in the coming days,
> > and trying to summon back up my own ideas to summarize them for you.
>
> Yeah, it would be nice to see alternative design ideas. Looking forward.
>
> > Perhaps I was naive to imagine it, but I did intend to start out
> > generically, independent of filesystem; but content to narrow down
> > on tmpfs alone where it gets hard to support the others (writeback
> > springs to mind). khugepaged would be migrating little pages into
> > huge pages, where it saw that the mmaps of the file would benefit
> > (and for testing I would hack mmap alignment choice to favour it).
>
> I don't think all fs at once would fly, but it's wonderful, if I'm
> wrong :)

You are imagining the filesystem putting huge pages into its cache.
Whereas I'm imagining khugepaged looking around at mmaped file areas,
seeing which would benefit from huge pagecache (let's assume offset 0
belongs on hugepage boundary - maybe one day someone will want to tune
some files or parts differently, but that's low priority), migrating 4k
pages over to 2MB page (wouldn't have to be done all in one pass), then
finally slotting in the pmds for that.

But going this way, I expect we'd have to split at page_mkwrite():
we probably don't want a single touch to dirty 2MB at a time,
unless tmpfs or ramfs.

>
> > I had arrived at a conviction that the first thing to change was
> > the way that tail pages of a THP are refcounted, that it had been a
> > mistake to use the compound page method of holding the THP together.
> > But I'll have to enter a trance now to recall the arguments ;)
>
> THP refcounting looks reasonable for me, if take split_huge_page() in
> account.

I'm not claiming that the THP refcounting is wrong in what it's doing
at present; but that I suspect we'll want to rework it for THPageCache.

Something I take for granted, I think you do too but I'm not certain:
a file with transparent huge pages in its page cache can also have small
pages in other extents of its page cache; and can be mapped hugely (2MB
extents) into one address space at the same time as individual 4k pages
from those extents are mapped into another (or the same) address space.

One can certainly imagine sacrificing that principle, splitting whenever
there's such a "conflict"; but it then becomes uninteresting to me, too
much like hugetlbfs. Splitting an anonymous hugepage in all address
spaces that hold it when one of them needs it split, that has been a
pragmatic strategy: it's not a common case for forks to diverge like
that; but files are expected to be more widely shared.
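
To make the 2MB-extent and 4k-page arithmetic above concrete, here is a
minimal userspace sketch. It is not code from this series or from the
kernel: the helper names and macros are hypothetical, and it assumes
x86_64 sizes (4k base pages, 2MB huge pages) with file offset 0 lying on
a hugepage boundary, as suggested above. Given a 4k-aligned byte range
of a file, it counts how many whole 2MB extents could be mapped hugely
and how many 4k pages are left to be mapped individually.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 sizes; illustrative names, not kernel macros. */
#define SMALL_PAGE_SIZE   ((uint64_t)4096)
#define HUGE_PAGE_SIZE    ((uint64_t)2 * 1024 * 1024)
#define SUBPAGES_PER_HUGE (HUGE_PAGE_SIZE / SMALL_PAGE_SIZE)

/* Round a file offset up to the next 2MB boundary (offset 0 is a boundary). */
static uint64_t huge_align_up(uint64_t off)
{
	return (off + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}

/* Round a file offset down to the previous 2MB boundary. */
static uint64_t huge_align_down(uint64_t off)
{
	return off & ~(HUGE_PAGE_SIZE - 1);
}

/* Number of whole 2MB extents contained in the byte range [start, end). */
static uint64_t huge_extent_count(uint64_t start, uint64_t end)
{
	assert(start <= end);
	uint64_t first = huge_align_up(start);
	uint64_t last = huge_align_down(end);

	return last > first ? (last - first) / HUGE_PAGE_SIZE : 0;
}

/*
 * Number of 4k pages in [start, end) that fall outside every whole 2MB
 * extent and so would stay as small pages. Assumes start and end are
 * 4k-aligned.
 */
static uint64_t small_page_count(uint64_t start, uint64_t end)
{
	uint64_t total = (end - start) / SMALL_PAGE_SIZE;

	return total - huge_extent_count(start, end) * SUBPAGES_PER_HUGE;
}
```

For example, a range starting one 4k page into the file and ending at
4MB contains a single whole 2MB extent (from 2MB to 4MB); the pages
before the 2MB boundary cannot be covered by a huge mapping under these
assumptions.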

At present THP is using compound pages, with mapcount of tail pages
reused to track their contribution to head page count; but I think we
shall want to be able to use the mapcount, and the count, of TH tail
pages for their original purpose if huge mappings can coexist with tiny.
Not fully thought out, but that's my feeling.

The use of compound pages, in particular the redirection of tail page
count to head page count, was important in hugetlbfs: a get_user_pages
reference on a subpage must prevent the containing hugepage from being
freed, because hugetlbfs has its own separate pool of hugepages to
which freeing returns them.

But for transparent huge pages? It should not matter so much if the
subpages are freed independently. So I'd like to devise another glue
to hold them together more loosely (for prototyping I can certainly
pretend we have infinite pageflag and pagefield space if that helps):
I may find in practice that they're forever falling apart, and I run
crying back to compound pages; but at present I'm hoping not.

This mail might suggest that I'm about to start coding: I wish that
were true, but in reality there's always a lot of unrelated things
I have to look at, which dilute my focus. So if I've said anything
that sparks ideas for you, go with them.

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kirill A. Shutemov"
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
Date: Sat, 2 Feb 2013 17:13:37 +0200 (EET)
Message-ID: <20130202151337.5C454E0073@blue.fi.intel.com>
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail>
Content-Transfer-Encoding: 7bit
Cc: "Kirill A.
Shutemov" , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: Hugh Dickins , Andrea Arcangeli
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
> > Hugh Dickins wrote:
> > >
> > > Interesting.
> > >
> > > I was starting to think about Transparent Huge Pagecache a few
> > > months ago, but then got washed away by incoming waves as usual.
> > >
> > > Certainly I don't have a line of code to show for it; but my first
> > > impression of your patches is that we have very different ideas of
> > > where to start.
>
> A second impression confirms that we have very different ideas of
> where to start. I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.

The main reason for publishing the patchset in its current (not-really-useful) state is to start the discussion early. Looks like it works :)

> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
>
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
>
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
>
> It's not clear to me that the infrastructural changes you make in this
> series will be needed or not, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way. But for me it's too soon to think about those.
>
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now. When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.

One more point: you still have to keep memory reserved for these levels anyway, since split_huge_page() must never fail.

> Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
>
> > >
> > > Perhaps that's good complementarity, or perhaps I'll disagree with
> > > your approach. I'll be taking a look at yours in the coming days,
> > > and trying to summon back up my own ideas to summarize them for you.
> >
> > Yeah, it would be nice to see alternative design ideas. Looking forward.
> >
> > > Perhaps I was naive to imagine it, but I did intend to start out
> > > generically, independent of filesystem; but content to narrow down
> > > on tmpfs alone where it gets hard to support the others (writeback
> > > springs to mind). khugepaged would be migrating little pages into
> > > huge pages, where it saw that the mmaps of the file would benefit
> > > (and for testing I would hack mmap alignment choice to favour it).
> >
> > I don't think all fs at once would fly, but it's wonderful, if I'm
> > wrong :)
>
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then
> finally slotting in the pmds for that.

I had file huge page consolidation on my todo list, but for much later. I feel that our approaches are complementary.

> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.

Hm. Splitting is rather expensive. I think it makes sense for a filesystem with a backing device to consolidate only pages which are mapped without PROT_WRITE. This way we can avoid consolidate-split loops.

> > > I had arrived at a conviction that the first thing to change was
> > > the way that tail pages of a THP are refcounted, that it had been a
> > > mistake to use the compound page method of holding the THP together.
> > > But I'll have to enter a trance now to recall the arguments ;)
> >
> > THP refcounting looks reasonable for me, if take split_huge_page() in
> > account.
>
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
>
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
>
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs. Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
>
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
>
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
>
> But for transparent huge pages? It should not matter so much if the
> subpages are freed independently. So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.

Looks interesting. But I'm not sure whether it will work. It would be nice to summon Andrea to the thread.

> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus. So if I've said anything
> that sparks ideas for you, go with them.

I want to get my current approach working first. We'll see.

--
 Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simon Jeons
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
Date: Mon, 18 Mar 2013 17:36:00 +0800
Message-ID: <5146E000.1070306@gmail.com>
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: "Kirill A. Shutemov"
Return-path:
In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov"
>
> Here's first steps towards huge pages in page cache.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
>
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?

Another side question: why doesn't __split_huge_page_refcount() clear the PG_tail flag of the tail pages?

>
> Next steps (not necessary in this order):
> - mmap();
> - migration (?);
> - collapse;
> - stats, knobs, etc.;
> - tmpfs/shmem enabling;
> - ...
>
> Kirill A.
Shutemov (16):
> block: implement add_bdi_stat()
> mm: implement zero_huge_user_segment and friends
> mm: drop actor argument of do_generic_file_read()
> radix-tree: implement preload for multiple contiguous elements
> thp, mm: basic defines for transparent huge page cache
> thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> thp, mm: rewrite delete_from_page_cache() to support huge pages
> thp, mm: locking tail page is a bug
> thp, mm: handle tail pages in page_cache_get_speculative()
> thp, mm: implement grab_cache_huge_page_write_begin()
> thp, mm: naive support of thp in generic read/write routines
> thp, libfs: initial support of thp in
> simple_read/write_begin/write_end
> thp: handle file pages in split_huge_page()
> thp, mm: truncate support for transparent huge page cache
> thp, mm: split huge page on mmap file page
> ramfs: enable transparent huge page cache
>
> fs/libfs.c | 54 +++++++++---
> fs/ramfs/inode.c | 6 +-
> include/linux/backing-dev.h | 10 +++
> include/linux/huge_mm.h | 8 ++
> include/linux/mm.h | 15 ++++
> include/linux/pagemap.h | 14 ++-
> include/linux/radix-tree.h | 3 +
> lib/radix-tree.c | 32 +++++--
> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++--------
> mm/huge_memory.c | 62 +++++++++++--
> mm/memory.c | 22 +++++
> mm/truncate.c | 12 +++
> 12 files changed, 375 insertions(+), 67 deletions(-)
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simon Jeons
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
Date: Thu, 21 Mar 2013 16:00:44 +0800
Message-ID: <514ABE2C.1090901@gmail.com>
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: "Kirill A. Shutemov"
Return-path:
Received: from mail-ie0-f177.google.com ([209.85.223.177]:58833 "EHLO mail-ie0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751028Ab3CUIAx (ORCPT ); Thu, 21 Mar 2013 04:00:53 -0400
In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov"
>
> Here's first steps towards huge pages in page cache.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
>
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?

Is there a THP performance benchmark, for either anonymous pages or file pages?

>
> Next steps (not necessary in this order):
> - mmap();
> - migration (?);
> - collapse;
> - stats, knobs, etc.;
> - tmpfs/shmem enabling;
> - ...
>
> Kirill A.
Shutemov (16): > block: implement add_bdi_stat() > mm: implement zero_huge_user_segment and friends > mm: drop actor argument of do_generic_file_read() > radix-tree: implement preload for multiple contiguous elements > thp, mm: basic defines for transparent huge page cache > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > thp, mm: rewrite delete_from_page_cache() to support huge pages > thp, mm: locking tail page is a bug > thp, mm: handle tail pages in page_cache_get_speculative() > thp, mm: implement grab_cache_huge_page_write_begin() > thp, mm: naive support of thp in generic read/write routines > thp, libfs: initial support of thp in > simple_read/write_begin/write_end > thp: handle file pages in split_huge_page() > thp, mm: truncate support for transparent huge page cache > thp, mm: split huge page on mmap file page > ramfs: enable transparent huge page cache > > fs/libfs.c | 54 +++++++++--- > fs/ramfs/inode.c | 6 +- > include/linux/backing-dev.h | 10 +++ > include/linux/huge_mm.h | 8 ++ > include/linux/mm.h | 15 ++++ > include/linux/pagemap.h | 14 ++- > include/linux/radix-tree.h | 3 + > lib/radix-tree.c | 32 +++++-- > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 62 +++++++++++-- > mm/memory.c | 22 +++++ > mm/truncate.c | 12 +++ > 12 files changed, 375 insertions(+), 67 deletions(-) > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Simon Jeons Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Fri, 05 Apr 2013 08:26:35 +0800 Message-ID: <515E1A3B.70508@gmail.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. 
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Hugh, On 01/31/2013 10:12 AM, Hugh Dickins wrote: > On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >>>> From: "Kirill A. Shutemov" >>>> >>>> Here's first steps towards huge pages in page cache. >>>> >>>> The intend of the work is get code ready to enable transparent huge page >>>> cache for the most simple fs -- ramfs. >>>> >>>> It's not yet near feature-complete. It only provides basic infrastructure. >>>> At the moment we can read, write and truncate file on ramfs with huge pages in >>>> page cache. The most interesting part, mmap(), is not yet there. For now >>>> we split huge page on mmap() attempt. >>>> >>>> I can't say that I see whole picture. I'm not sure if I understand locking >>>> model around split_huge_page(). Probably, not. >>>> Andrea, could you check if it looks correct? >>>> >>>> Next steps (not necessary in this order): >>>> - mmap(); >>>> - migration (?); >>>> - collapse; >>>> - stats, knobs, etc.; >>>> - tmpfs/shmem enabling; >>>> - ... >>>> >>>> Kirill A. 
Shutemov (16): >>>> block: implement add_bdi_stat() >>>> mm: implement zero_huge_user_segment and friends >>>> mm: drop actor argument of do_generic_file_read() >>>> radix-tree: implement preload for multiple contiguous elements >>>> thp, mm: basic defines for transparent huge page cache >>>> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>>> thp, mm: rewrite delete_from_page_cache() to support huge pages >>>> thp, mm: locking tail page is a bug >>>> thp, mm: handle tail pages in page_cache_get_speculative() >>>> thp, mm: implement grab_cache_huge_page_write_begin() >>>> thp, mm: naive support of thp in generic read/write routines >>>> thp, libfs: initial support of thp in >>>> simple_read/write_begin/write_end >>>> thp: handle file pages in split_huge_page() >>>> thp, mm: truncate support for transparent huge page cache >>>> thp, mm: split huge page on mmap file page >>>> ramfs: enable transparent huge page cache >>>> >>>> fs/libfs.c | 54 +++++++++--- >>>> fs/ramfs/inode.c | 6 +- >>>> include/linux/backing-dev.h | 10 +++ >>>> include/linux/huge_mm.h | 8 ++ >>>> include/linux/mm.h | 15 ++++ >>>> include/linux/pagemap.h | 14 ++- >>>> include/linux/radix-tree.h | 3 + >>>> lib/radix-tree.c | 32 +++++-- >>>> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>>> mm/huge_memory.c | 62 +++++++++++-- >>>> mm/memory.c | 22 +++++ >>>> mm/truncate.c | 12 +++ >>>> 12 files changed, 375 insertions(+), 67 deletions(-) >>> Interesting. >>> >>> I was starting to think about Transparent Huge Pagecache a few >>> months ago, but then got washed away by incoming waves as usual. >>> >>> Certainly I don't have a line of code to show for it; but my first >>> impression of your patches is that we have very different ideas of >>> where to start. > A second impression confirms that we have very different ideas of > where to start. I don't want to be dismissive, and please don't let > me discourage you, but I just don't find what you have very interesting. 
> > I'm sure you'll agree that the interesting part, and the difficult part, > comes with mmap(); and there's no point whatever to THPages without mmap() > (of course, I'm including exec and brk and shm when I say mmap there). > > (There may be performance benefits in working with larger page cache > size, which Christoph Lameter explored a few years back, but that's a > different topic: I think 2MB - if I may be x86_64-centric - would not be > the unit of choice for that, unless SSD erase block were to dominate.) > > I'm interested to get to the point of prototyping something that does > support mmap() of THPageCache: I'm pretty sure that I'd then soon learn > a lot about my misconceptions, and have to rework for a while (or give > up!); but I don't see much point in posting anything without that. > I don't know if we have 5 or 50 places which "know" that a THPage > must be Anon: some I'll spot in advance, some I sadly won't. > > It's not clear to me that the infrastructural changes you make in this > series will be needed or not, if I pursue my approach: some perhaps as > optimizations on top of the poorly performing base that may emerge from > going about it my way. But for me it's too soon to think about those. > > Something I notice that we do agree upon: the radix_tree holding the > 4k subpages, at least for now. When I first started thinking towards > THPageCache, I was fascinated by how we could manage the hugepages in > the radix_tree, cutting out unnecessary levels etc; but after a while > I realized that although there's probably nice scope for cleverness > there (significantly constrained by RCU expectations), it would only > be about optimization. Let's be simple and stupid about radix_tree > for now, the problems that need to be worked out lie elsewhere. > >>> Perhaps that's good complementarity, or perhaps I'll disagree with >>> your approach. 
I'll be taking a look at yours in the coming days,
>>> and trying to summon back up my own ideas to summarize them for you.
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>
>>> Perhaps I was naive to imagine it, but I did intend to start out
>>> generically, independent of filesystem; but content to narrow down
>>> on tmpfs alone where it gets hard to support the others (writeback
>>> springs to mind). khugepaged would be migrating little pages into
>>> huge pages, where it saw that the mmaps of the file would benefit
>>> (and for testing I would hack mmap alignment choice to favour it).
>> I don't think all fs at once would fly, but it's wonderful, if I'm
>> wrong :)
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then

There are isolation and migration steps during collapse. But why isn't a migration entry used in the migration step?

> finally slotting in the pmds for that.
>
> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.
>
>>> I had arrived at a conviction that the first thing to change was
>>> the way that tail pages of a THP are refcounted, that it had been a
>>> mistake to use the compound page method of holding the THP together.
>>> But I'll have to enter a trance now to recall the arguments ;)
>> THP refcounting looks reasonable for me, if take split_huge_page() in
>> account.
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
> > Something I take for granted, I think you do too but I'm not certain: > a file with transparent huge pages in its page cache can also have small > pages in other extents of its page cache; and can be mapped hugely (2MB > extents) into one address space at the same time as individual 4k pages > from those extents are mapped into another (or the same) address space. > > One can certainly imagine sacrificing that principle, splitting whenever > there's such a "conflict"; but it then becomes uninteresting to me, too > much like hugetlbfs. Splitting an anonymous hugepage in all address > spaces that hold it when one of them needs it split, that has been a > pragmatic strategy: it's not a common case for forks to diverge like > that; but files are expected to be more widely shared. > > At present THP is using compound pages, with mapcount of tail pages > reused to track their contribution to head page count; but I think we > shall want to be able to use the mapcount, and the count, of TH tail > pages for their original purpose if huge mappings can coexist with tiny. > Not fully thought out, but that's my feeling. > > The use of compound pages, in particular the redirection of tail page > count to head page count, was important in hugetlbfs: a get_user_pages > reference on a subpage must prevent the containing hugepage from being > freed, because hugetlbfs has its own separate pool of hugepages to > which freeing returns them. > > But for transparent huge pages? It should not matter so much if the > subpages are freed independently. So I'd like to devise another glue > to hold them together more loosely (for prototyping I can certainly > pretend we have infinite pageflag and pagefield space if that helps): > I may find in practice that they're forever falling apart, and I run > crying back to compound pages; but at present I'm hoping not. 
> > This mail might suggest that I'm about to start coding: I wish that > were true, but in reality there's always a lot of unrelated things > I have to look at, which dilute my focus. So if I've said anything > that sparks ideas for you, go with them. > > Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Simon Jeons Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Fri, 05 Apr 2013 09:03:26 +0800 Message-ID: <515E22DE.1010207@gmail.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org

Hi Hugh,

On 01/31/2013 10:12 AM, Hugh Dickins wrote: > On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >>>> From: "Kirill A. Shutemov" >>>> >>>> Here's first steps towards huge pages in page cache. >>>> >>>> The intend of the work is get code ready to enable transparent huge page >>>> cache for the most simple fs -- ramfs. >>>> >>>> It's not yet near feature-complete. It only provides basic infrastructure.
>>>> At the moment we can read, write and truncate file on ramfs with huge pages in >>>> page cache. The most interesting part, mmap(), is not yet there. For now >>>> we split huge page on mmap() attempt. >>>> >>>> I can't say that I see whole picture. I'm not sure if I understand locking >>>> model around split_huge_page(). Probably, not. >>>> Andrea, could you check if it looks correct? >>>> >>>> Next steps (not necessary in this order): >>>> - mmap(); >>>> - migration (?); >>>> - collapse; >>>> - stats, knobs, etc.; >>>> - tmpfs/shmem enabling; >>>> - ... >>>> >>>> Kirill A. Shutemov (16): >>>> block: implement add_bdi_stat() >>>> mm: implement zero_huge_user_segment and friends >>>> mm: drop actor argument of do_generic_file_read() >>>> radix-tree: implement preload for multiple contiguous elements >>>> thp, mm: basic defines for transparent huge page cache >>>> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>>> thp, mm: rewrite delete_from_page_cache() to support huge pages >>>> thp, mm: locking tail page is a bug >>>> thp, mm: handle tail pages in page_cache_get_speculative() >>>> thp, mm: implement grab_cache_huge_page_write_begin() >>>> thp, mm: naive support of thp in generic read/write routines >>>> thp, libfs: initial support of thp in >>>> simple_read/write_begin/write_end >>>> thp: handle file pages in split_huge_page() >>>> thp, mm: truncate support for transparent huge page cache >>>> thp, mm: split huge page on mmap file page >>>> ramfs: enable transparent huge page cache >>>> >>>> fs/libfs.c | 54 +++++++++--- >>>> fs/ramfs/inode.c | 6 +- >>>> include/linux/backing-dev.h | 10 +++ >>>> include/linux/huge_mm.h | 8 ++ >>>> include/linux/mm.h | 15 ++++ >>>> include/linux/pagemap.h | 14 ++- >>>> include/linux/radix-tree.h | 3 + >>>> lib/radix-tree.c | 32 +++++-- >>>> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>>> mm/huge_memory.c | 62 +++++++++++-- >>>> mm/memory.c | 22 +++++ >>>> mm/truncate.c | 12 +++ >>>> 12 
files changed, 375 insertions(+), 67 deletions(-) >>> Interesting. >>> >>> I was starting to think about Transparent Huge Pagecache a few >>> months ago, but then got washed away by incoming waves as usual. >>> >>> Certainly I don't have a line of code to show for it; but my first >>> impression of your patches is that we have very different ideas of >>> where to start. > A second impression confirms that we have very different ideas of > where to start. I don't want to be dismissive, and please don't let > me discourage you, but I just don't find what you have very interesting. > > I'm sure you'll agree that the interesting part, and the difficult part, > comes with mmap(); and there's no point whatever to THPages without mmap() > (of course, I'm including exec and brk and shm when I say mmap there). > > (There may be performance benefits in working with larger page cache > size, which Christoph Lameter explored a few years back, but that's a > different topic: I think 2MB - if I may be x86_64-centric - would not be > the unit of choice for that, unless SSD erase block were to dominate.) > > I'm interested to get to the point of prototyping something that does > support mmap() of THPageCache: I'm pretty sure that I'd then soon learn > a lot about my misconceptions, and have to rework for a while (or give > up!); but I don't see much point in posting anything without that. > I don't know if we have 5 or 50 places which "know" that a THPage > must be Anon: some I'll spot in advance, some I sadly won't. > > It's not clear to me that the infrastructural changes you make in this > series will be needed or not, if I pursue my approach: some perhaps as > optimizations on top of the poorly performing base that may emerge from > going about it my way. But for me it's too soon to think about those. > > Something I notice that we do agree upon: the radix_tree holding the > 4k subpages, at least for now. 
When I first started thinking towards > THPageCache, I was fascinated by how we could manage the hugepages in > the radix_tree, cutting out unnecessary levels etc; but after a while > I realized that although there's probably nice scope for cleverness > there (significantly constrained by RCU expectations), it would only > be about optimization. Let's be simple and stupid about radix_tree > for now, the problems that need to be worked out lie elsewhere. > >>> Perhaps that's good complementarity, or perhaps I'll disagree with >>> your approach. I'll be taking a look at yours in the coming days, >>> and trying to summon back up my own ideas to summarize them for you. >> Yeah, it would be nice to see alternative design ideas. Looking forward. >> >>> Perhaps I was naive to imagine it, but I did intend to start out >>> generically, independent of filesystem; but content to narrow down >>> on tmpfs alone where it gets hard to support the others (writeback >>> springs to mind). khugepaged would be migrating little pages into >>> huge pages, where it saw that the mmaps of the file would benefit

Would it make sense to add a heuristic that adjusts khugepaged_max_ptes_none? Reduce its value when memory pressure is high and increase it when memory pressure is low.

>>> (and for testing I would hack mmap alignment choice to favour it). >> I don't think all fs at once would fly, but it's wonderful, if I'm >> wrong :) > You are imagining the filesystem putting huge pages into its cache. > Whereas I'm imagining khugepaged looking around at mmaped file areas, > seeing which would benefit from huge pagecache (let's assume offset 0 > belongs on hugepage boundary - maybe one day someone will want to tune > some files or parts differently, but that's low priority), migrating 4k > pages over to 2MB page (wouldn't have to be done all in one pass), then > finally slotting in the pmds for that.
> > But going this way, I expect we'd have to split at page_mkwrite(): > we probably don't want a single touch to dirty 2MB at a time, > unless tmpfs or ramfs. > >>> I had arrived at a conviction that the first thing to change was >>> the way that tail pages of a THP are refcounted, that it had been a >>> mistake to use the compound page method of holding the THP together. >>> But I'll have to enter a trance now to recall the arguments ;) >> THP refcounting looks reasonable for me, if take split_huge_page() in >> account. > I'm not claiming that the THP refcounting is wrong in what it's doing > at present; but that I suspect we'll want to rework it for THPageCache. > > Something I take for granted, I think you do too but I'm not certain: > a file with transparent huge pages in its page cache can also have small > pages in other extents of its page cache; and can be mapped hugely (2MB > extents) into one address space at the same time as individual 4k pages > from those extents are mapped into another (or the same) address space. > > One can certainly imagine sacrificing that principle, splitting whenever > there's such a "conflict"; but it then becomes uninteresting to me, too > much like hugetlbfs. Splitting an anonymous hugepage in all address > spaces that hold it when one of them needs it split, that has been a > pragmatic strategy: it's not a common case for forks to diverge like > that; but files are expected to be more widely shared. > > At present THP is using compound pages, with mapcount of tail pages > reused to track their contribution to head page count; but I think we > shall want to be able to use the mapcount, and the count, of TH tail > pages for their original purpose if huge mappings can coexist with tiny. > Not fully thought out, but that's my feeling. 
> > The use of compound pages, in particular the redirection of tail page > count to head page count, was important in hugetlbfs: a get_user_pages > reference on a subpage must prevent the containing hugepage from being > freed, because hugetlbfs has its own separate pool of hugepages to > which freeing returns them. > > But for transparent huge pages? It should not matter so much if the > subpages are freed independently. So I'd like to devise another glue > to hold them together more loosely (for prototyping I can certainly > pretend we have infinite pageflag and pagefield space if that helps): > I may find in practice that they're forever falling apart, and I run > crying back to compound pages; but at present I'm hoping not. > > This mail might suggest that I'm about to start coding: I wish that > were true, but in reality there's always a lot of unrelated things > I have to look at, which dilute my focus. So if I've said anything > that sparks ideas for you, go with them. > > Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ric Mason Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Fri, 05 Apr 2013 09:24:32 +0800 Message-ID: <515E27D0.5090105@gmail.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A.
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Hugh, On 01/29/2013 01:03 PM, Hugh Dickins wrote: > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >> From: "Kirill A. Shutemov" >> >> Here's first steps towards huge pages in page cache. >> >> The intend of the work is get code ready to enable transparent huge page >> cache for the most simple fs -- ramfs. >> >> It's not yet near feature-complete. It only provides basic infrastructure. >> At the moment we can read, write and truncate file on ramfs with huge pages in >> page cache. The most interesting part, mmap(), is not yet there. For now >> we split huge page on mmap() attempt. >> >> I can't say that I see whole picture. I'm not sure if I understand locking >> model around split_huge_page(). Probably, not. >> Andrea, could you check if it looks correct? >> >> Next steps (not necessary in this order): >> - mmap(); >> - migration (?); >> - collapse; >> - stats, knobs, etc.; >> - tmpfs/shmem enabling; >> - ... >> >> Kirill A. 
Shutemov (16): >> block: implement add_bdi_stat() >> mm: implement zero_huge_user_segment and friends >> mm: drop actor argument of do_generic_file_read() >> radix-tree: implement preload for multiple contiguous elements >> thp, mm: basic defines for transparent huge page cache >> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >> thp, mm: rewrite delete_from_page_cache() to support huge pages >> thp, mm: locking tail page is a bug >> thp, mm: handle tail pages in page_cache_get_speculative() >> thp, mm: implement grab_cache_huge_page_write_begin() >> thp, mm: naive support of thp in generic read/write routines >> thp, libfs: initial support of thp in >> simple_read/write_begin/write_end >> thp: handle file pages in split_huge_page() >> thp, mm: truncate support for transparent huge page cache >> thp, mm: split huge page on mmap file page >> ramfs: enable transparent huge page cache >> >> fs/libfs.c | 54 +++++++++--- >> fs/ramfs/inode.c | 6 +- >> include/linux/backing-dev.h | 10 +++ >> include/linux/huge_mm.h | 8 ++ >> include/linux/mm.h | 15 ++++ >> include/linux/pagemap.h | 14 ++- >> include/linux/radix-tree.h | 3 + >> lib/radix-tree.c | 32 +++++-- >> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >> mm/huge_memory.c | 62 +++++++++++-- >> mm/memory.c | 22 +++++ >> mm/truncate.c | 12 +++ >> 12 files changed, 375 insertions(+), 67 deletions(-) > Interesting. > > I was starting to think about Transparent Huge Pagecache a few > months ago, but then got washed away by incoming waves as usual. > > Certainly I don't have a line of code to show for it; but my first > impression of your patches is that we have very different ideas of > where to start. > > Perhaps that's good complementarity, or perhaps I'll disagree with > your approach. I'll be taking a look at yours in the coming days, > and trying to summon back up my own ideas to summarize them for you. 
> > Perhaps I was naive to imagine it, but I did intend to start out > generically, independent of filesystem; but content to narrow down > on tmpfs alone where it gets hard to support the others (writeback > springs to mind). khugepaged would be migrating little pages into > huge pages, where it saw that the mmaps of the file would benefit > (and for testing I would hack mmap alignment choice to favour it). > > I had arrived at a conviction that the first thing to change was > the way that tail pages of a THP are refcounted, that it had been a > mistake to use the compound page method of holding the THP together. > But I'll have to enter a trance now to recall the arguments ;)

One off-topic question: do you know whether hugetlbfs pages support swapping?

> > Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Fri, 5 Apr 2013 09:42:08 +0800 Message-ID: <16612.7521697947$1365126175@news.gmane.org> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A.
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote: >On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >> > > From: "Kirill A. Shutemov" >> > > >> > > Here's first steps towards huge pages in page cache. >> > > >> > > The intend of the work is get code ready to enable transparent huge page >> > > cache for the most simple fs -- ramfs. >> > > >> > > It's not yet near feature-complete. It only provides basic infrastructure. >> > > At the moment we can read, write and truncate file on ramfs with huge pages in >> > > page cache. The most interesting part, mmap(), is not yet there. For now >> > > we split huge page on mmap() attempt. >> > > >> > > I can't say that I see whole picture. I'm not sure if I understand locking >> > > model around split_huge_page(). Probably, not. >> > > Andrea, could you check if it looks correct? >> > > >> > > Next steps (not necessary in this order): >> > > - mmap(); >> > > - migration (?); >> > > - collapse; >> > > - stats, knobs, etc.; >> > > - tmpfs/shmem enabling; >> > > - ... >> > > >> > > Kirill A. 
Shutemov (16): >> > > block: implement add_bdi_stat() >> > > mm: implement zero_huge_user_segment and friends >> > > mm: drop actor argument of do_generic_file_read() >> > > radix-tree: implement preload for multiple contiguous elements >> > > thp, mm: basic defines for transparent huge page cache >> > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages >> > > thp, mm: rewrite delete_from_page_cache() to support huge pages >> > > thp, mm: locking tail page is a bug >> > > thp, mm: handle tail pages in page_cache_get_speculative() >> > > thp, mm: implement grab_cache_huge_page_write_begin() >> > > thp, mm: naive support of thp in generic read/write routines >> > > thp, libfs: initial support of thp in >> > > simple_read/write_begin/write_end >> > > thp: handle file pages in split_huge_page() >> > > thp, mm: truncate support for transparent huge page cache >> > > thp, mm: split huge page on mmap file page >> > > ramfs: enable transparent huge page cache >> > > >> > > fs/libfs.c | 54 +++++++++--- >> > > fs/ramfs/inode.c | 6 +- >> > > include/linux/backing-dev.h | 10 +++ >> > > include/linux/huge_mm.h | 8 ++ >> > > include/linux/mm.h | 15 ++++ >> > > include/linux/pagemap.h | 14 ++- >> > > include/linux/radix-tree.h | 3 + >> > > lib/radix-tree.c | 32 +++++-- >> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >> > > mm/huge_memory.c | 62 +++++++++++-- >> > > mm/memory.c | 22 +++++ >> > > mm/truncate.c | 12 +++ >> > > 12 files changed, 375 insertions(+), 67 deletions(-) >> > >> > Interesting. >> > >> > I was starting to think about Transparent Huge Pagecache a few >> > months ago, but then got washed away by incoming waves as usual. >> > >> > Certainly I don't have a line of code to show for it; but my first >> > impression of your patches is that we have very different ideas of >> > where to start. > >A second impression confirms that we have very different ideas of >where to start. 
I don't want to be dismissive, and please don't let >me discourage you, but I just don't find what you have very interesting. > >I'm sure you'll agree that the interesting part, and the difficult part, >comes with mmap(); and there's no point whatever to THPages without mmap() >(of course, I'm including exec and brk and shm when I say mmap there). > >(There may be performance benefits in working with larger page cache >size, which Christoph Lameter explored a few years back, but that's a >different topic: I think 2MB - if I may be x86_64-centric - would not be >the unit of choice for that, unless SSD erase block were to dominate.) > >I'm interested to get to the point of prototyping something that does >support mmap() of THPageCache: I'm pretty sure that I'd then soon learn >a lot about my misconceptions, and have to rework for a while (or give >up!); but I don't see much point in posting anything without that. >I don't know if we have 5 or 50 places which "know" that a THPage >must be Anon: some I'll spot in advance, some I sadly won't. > >It's not clear to me that the infrastructural changes you make in this >series will be needed or not, if I pursue my approach: some perhaps as >optimizations on top of the poorly performing base that may emerge from >going about it my way. But for me it's too soon to think about those. > >Something I notice that we do agree upon: the radix_tree holding the >4k subpages, at least for now. When I first started thinking towards >THPageCache, I was fascinated by how we could manage the hugepages in >the radix_tree, cutting out unnecessary levels etc; but after a while >I realized that although there's probably nice scope for cleverness >there (significantly constrained by RCU expectations), it would only >be about optimization. Let's be simple and stupid about radix_tree >for now, the problems that need to be worked out lie elsewhere. > >> > >> > Perhaps that's good complementarity, or perhaps I'll disagree with >> > your approach. 
I'll be taking a look at yours in the coming days, >> > and trying to summon back up my own ideas to summarize them for you. >> >> Yeah, it would be nice to see alternative design ideas. Looking forward. >> >> > Perhaps I was naive to imagine it, but I did intend to start out >> > generically, independent of filesystem; but content to narrow down >> > on tmpfs alone where it gets hard to support the others (writeback >> > springs to mind). khugepaged would be migrating little pages into >> > huge pages, where it saw that the mmaps of the file would benefit >> > (and for testing I would hack mmap alignment choice to favour it). >> >> I don't think all fs at once would fly, but it's wonderful, if I'm >> wrong :) > >You are imagining the filesystem putting huge pages into its cache. >Whereas I'm imagining khugepaged looking around at mmaped file areas, >seeing which would benefit from huge pagecache (let's assume offset 0 >belongs on hugepage boundary - maybe one day someone will want to tune >some files or parts differently, but that's low priority), migrating 4k >pages over to 2MB page (wouldn't have to be done all in one pass), then >finally slotting in the pmds for that. > >But going this way, I expect we'd have to split at page_mkwrite(): >we probably don't want a single touch to dirty 2MB at a time, >unless tmpfs or ramfs. > >> >> > I had arrived at a conviction that the first thing to change was >> > the way that tail pages of a THP are refcounted, that it had been a >> > mistake to use the compound page method of holding the THP together. >> > But I'll have to enter a trance now to recall the arguments ;) >> >> THP refcounting looks reasonable for me, if take split_huge_page() in >> account. > >I'm not claiming that the THP refcounting is wrong in what it's doing >at present; but that I suspect we'll want to rework it for THPageCache. 
> >Something I take for granted, I think you do too but I'm not certain: >a file with transparent huge pages in its page cache can also have small >pages in other extents of its page cache; and can be mapped hugely (2MB >extents) into one address space at the same time as individual 4k pages >from those extents are mapped into another (or the same) address space. > >One can certainly imagine sacrificing that principle, splitting whenever >there's such a "conflict"; but it then becomes uninteresting to me, too >much like hugetlbfs. Splitting an anonymous hugepage in all address >spaces that hold it when one of them needs it split, that has been a >pragmatic strategy: it's not a common case for forks to diverge like >that; but files are expected to be more widely shared. > >At present THP is using compound pages, with mapcount of tail pages >reused to track their contribution to head page count; but I think we >shall want to be able to use the mapcount, and the count, of TH tail >pages for their original purpose if huge mappings can coexist with tiny. >Not fully thought out, but that's my feeling. > >The use of compound pages, in particular the redirection of tail page >count to head page count, was important in hugetlbfs: a get_user_pages >reference on a subpage must prevent the containing hugepage from being >freed, because hugetlbfs has its own separate pool of hugepages to >which freeing returns them. > >But for transparent huge pages? It should not matter so much if the >subpages are freed independently. So I'd like to devise another glue >to hold them together more loosely (for prototyping I can certainly >pretend we have infinite pageflag and pagefield space if that helps): >I may find in practice that they're forever falling apart, and I run >crying back to compound pages; but at present I'm hoping not. 
> >This mail might suggest that I'm about to start coding: I wish that >were true, but in reality there's always a lot of unrelated things >I have to look at, which dilute my focus. So if I've said anything >that sparks ideas for you, go with them.

That seems like a good idea, Hugh. I will start coding this. ;-)

Regards,
Wanpeng Li

> >Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Date: Sun, 7 Apr 2013 08:26:19 +0800 Message-ID: <24948.0193052024$1365294406@news.gmane.org> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> <20130405014208.GC362@hacker.(null)> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: Content-Disposition: inline In-Reply-To: <20130405014208.GC362@hacker.(null)> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org

On Fri, Apr 05, 2013 at 09:42:08AM +0800, Wanpeng Li wrote: >On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote: >>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >>> Hugh Dickins wrote: >>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >>> > > From: "Kirill A. Shutemov" >>> > > >>> > > Here's first steps towards huge pages in page cache.
>>> > > >>> > > The intend of the work is get code ready to enable transparent huge page >>> > > cache for the most simple fs -- ramfs. >>> > > >>> > > It's not yet near feature-complete. It only provides basic infrastructure. >>> > > At the moment we can read, write and truncate file on ramfs with huge pages in >>> > > page cache. The most interesting part, mmap(), is not yet there. For now >>> > > we split huge page on mmap() attempt. >>> > > >>> > > I can't say that I see whole picture. I'm not sure if I understand locking >>> > > model around split_huge_page(). Probably, not. >>> > > Andrea, could you check if it looks correct? >>> > > >>> > > Next steps (not necessary in this order): >>> > > - mmap(); >>> > > - migration (?); >>> > > - collapse; >>> > > - stats, knobs, etc.; >>> > > - tmpfs/shmem enabling; >>> > > - ... >>> > > >>> > > Kirill A. Shutemov (16): >>> > > block: implement add_bdi_stat() >>> > > mm: implement zero_huge_user_segment and friends >>> > > mm: drop actor argument of do_generic_file_read() >>> > > radix-tree: implement preload for multiple contiguous elements >>> > > thp, mm: basic defines for transparent huge page cache >>> > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>> > > thp, mm: rewrite delete_from_page_cache() to support huge pages >>> > > thp, mm: locking tail page is a bug >>> > > thp, mm: handle tail pages in page_cache_get_speculative() >>> > > thp, mm: implement grab_cache_huge_page_write_begin() >>> > > thp, mm: naive support of thp in generic read/write routines >>> > > thp, libfs: initial support of thp in >>> > > simple_read/write_begin/write_end >>> > > thp: handle file pages in split_huge_page() >>> > > thp, mm: truncate support for transparent huge page cache >>> > > thp, mm: split huge page on mmap file page >>> > > ramfs: enable transparent huge page cache >>> > > >>> > > fs/libfs.c | 54 +++++++++--- >>> > > fs/ramfs/inode.c | 6 +- >>> > > include/linux/backing-dev.h | 10 +++ >>> > > 
include/linux/huge_mm.h | 8 ++ >>> > > include/linux/mm.h | 15 ++++ >>> > > include/linux/pagemap.h | 14 ++- >>> > > include/linux/radix-tree.h | 3 + >>> > > lib/radix-tree.c | 32 +++++-- >>> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>> > > mm/huge_memory.c | 62 +++++++++++-- >>> > > mm/memory.c | 22 +++++ >>> > > mm/truncate.c | 12 +++ >>> > > 12 files changed, 375 insertions(+), 67 deletions(-) >>> > >>> > Interesting. >>> > >>> > I was starting to think about Transparent Huge Pagecache a few >>> > months ago, but then got washed away by incoming waves as usual. >>> > >>> > Certainly I don't have a line of code to show for it; but my first >>> > impression of your patches is that we have very different ideas of >>> > where to start. >> >>A second impression confirms that we have very different ideas of >>where to start. I don't want to be dismissive, and please don't let >>me discourage you, but I just don't find what you have very interesting. >> >>I'm sure you'll agree that the interesting part, and the difficult part, >>comes with mmap(); and there's no point whatever to THPages without mmap() >>(of course, I'm including exec and brk and shm when I say mmap there). >> >>(There may be performance benefits in working with larger page cache >>size, which Christoph Lameter explored a few years back, but that's a >>different topic: I think 2MB - if I may be x86_64-centric - would not be >>the unit of choice for that, unless SSD erase block were to dominate.) >> >>I'm interested to get to the point of prototyping something that does >>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn >>a lot about my misconceptions, and have to rework for a while (or give >>up!); but I don't see much point in posting anything without that. >>I don't know if we have 5 or 50 places which "know" that a THPage >>must be Anon: some I'll spot in advance, some I sadly won't. 
>> >>It's not clear to me that the infrastructural changes you make in this >>series will be needed or not, if I pursue my approach: some perhaps as >>optimizations on top of the poorly performing base that may emerge from >>going about it my way. But for me it's too soon to think about those. >> >>Something I notice that we do agree upon: the radix_tree holding the >>4k subpages, at least for now. When I first started thinking towards >>THPageCache, I was fascinated by how we could manage the hugepages in >>the radix_tree, cutting out unnecessary levels etc; but after a while >>I realized that although there's probably nice scope for cleverness >>there (significantly constrained by RCU expectations), it would only >>be about optimization. Let's be simple and stupid about radix_tree >>for now, the problems that need to be worked out lie elsewhere. >> >>> > >>> > Perhaps that's good complementarity, or perhaps I'll disagree with >>> > your approach. I'll be taking a look at yours in the coming days, >>> > and trying to summon back up my own ideas to summarize them for you. >>> >>> Yeah, it would be nice to see alternative design ideas. Looking forward. >>> >>> > Perhaps I was naive to imagine it, but I did intend to start out >>> > generically, independent of filesystem; but content to narrow down >>> > on tmpfs alone where it gets hard to support the others (writeback >>> > springs to mind). khugepaged would be migrating little pages into >>> > huge pages, where it saw that the mmaps of the file would benefit >>> > (and for testing I would hack mmap alignment choice to favour it). >>> >>> I don't think all fs at once would fly, but it's wonderful, if I'm >>> wrong :) >> >>You are imagining the filesystem putting huge pages into its cache. 
>>Whereas I'm imagining khugepaged looking around at mmaped file areas, >>seeing which would benefit from huge pagecache (let's assume offset 0 >>belongs on hugepage boundary - maybe one day someone will want to tune >>some files or parts differently, but that's low priority), migrating 4k >>pages over to 2MB page (wouldn't have to be done all in one pass), then >>finally slotting in the pmds for that. >> >>But going this way, I expect we'd have to split at page_mkwrite(): >>we probably don't want a single touch to dirty 2MB at a time, >>unless tmpfs or ramfs. >> >>> >>> > I had arrived at a conviction that the first thing to change was >>> > the way that tail pages of a THP are refcounted, that it had been a >>> > mistake to use the compound page method of holding the THP together. >>> > But I'll have to enter a trance now to recall the arguments ;) >>> >>> THP refcounting looks reasonable for me, if take split_huge_page() in >>> account. >> >>I'm not claiming that the THP refcounting is wrong in what it's doing >>at present; but that I suspect we'll want to rework it for THPageCache. >> >>Something I take for granted, I think you do too but I'm not certain: >>a file with transparent huge pages in its page cache can also have small >>pages in other extents of its page cache; and can be mapped hugely (2MB >>extents) into one address space at the same time as individual 4k pages >>from those extents are mapped into another (or the same) address space. >> >>One can certainly imagine sacrificing that principle, splitting whenever >>there's such a "conflict"; but it then becomes uninteresting to me, too >>much like hugetlbfs. Splitting an anonymous hugepage in all address >>spaces that hold it when one of them needs it split, that has been a >>pragmatic strategy: it's not a common case for forks to diverge like >>that; but files are expected to be more widely shared. 
>>
>>At present THP is using compound pages, with mapcount of tail pages
>>reused to track their contribution to head page count; but I think we
>>shall want to be able to use the mapcount, and the count, of TH tail
>>pages for their original purpose if huge mappings can coexist with tiny.
>>Not fully thought out, but that's my feeling.
>>
>>The use of compound pages, in particular the redirection of tail page
>>count to head page count, was important in hugetlbfs: a get_user_pages
>>reference on a subpage must prevent the containing hugepage from being
>>freed, because hugetlbfs has its own separate pool of hugepages to
>>which freeing returns them.
>>
>>But for transparent huge pages? It should not matter so much if the
>>subpages are freed independently. So I'd like to devise another glue
>>to hold them together more loosely (for prototyping I can certainly
>>pretend we have infinite pageflag and pagefield space if that helps):
>>I may find in practice that they're forever falling apart, and I run
>>crying back to compound pages; but at present I'm hoping not.
>>
>>This mail might suggest that I'm about to start coding: I wish that
>>were true, but in reality there's always a lot of unrelated things
>>I have to look at, which dilute my focus. So if I've said anything
>>that sparks ideas for you, go with them.

Hi Hugh,

commit 70b50f94f16 ("mm: thp: tail page refcounting fix") tells us that
accounting the tail page references on tail_page->_count wasn't safe.

Regards,
Wanpeng Li

>
>It seems that it's a good idea, Hugh. I will start coding this. ;-)
>
>Regards,
>Wanpeng Li
>
>>
>>Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185]) by kanga.kvack.org (Postfix) with SMTP id A6FE06B000C for ; Mon, 28 Jan 2013 04:23:34 -0500 (EST)
From: "Kirill A. Shutemov"
Subject: [PATCH, RFC 03/16] mm: drop actor argument of do_generic_file_read()
Date: Mon, 28 Jan 2013 11:24:15 +0200
Message-Id: <1359365068-10147-4-git-send-email-kirill.shutemov@linux.intel.com>
In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrea Arcangeli, Andrew Morton, Al Viro
Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm@kvack.org, Andi Kleen, Matthew Wilcox, "Kirill A. Shutemov", linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"

From: "Kirill A. Shutemov"

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov
---
 mm/filemap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c610076..b6a6d7e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1070,7 +1070,6 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @filp:	the file to read
  * @ppos:	current file position
  * @desc:	read_descriptor
- * @actor:	read method
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1079,7 +1078,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static void do_generic_file_read(struct file *filp, loff_t *ppos,
-		read_descriptor_t *desc, read_actor_t actor)
+		read_descriptor_t *desc)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1180,13 +1179,14 @@ page_ok:
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
 		 *
-		 * The actor routine returns how many bytes were actually used..
+		 * The file_read_actor routine returns how many bytes were
+		 * actually used..
 		 * NOTE! This may not be the same as how much of a user buffer
 		 * we filled up (we may be padding etc), so we can only update
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
+		ret = file_read_actor(desc, page, offset, nr);
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
@@ -1459,7 +1459,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 			if (desc.count == 0)
 				continue;
 			desc.error = 0;
-			do_generic_file_read(filp, ppos, &desc, file_read_actor);
+			do_generic_file_read(filp, ppos, &desc);
 			retval += desc.written;
 			if (desc.error) {
 				retval = retval ?: desc.error;
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185]) by kanga.kvack.org (Postfix) with SMTP id 54DCE6B003D for ; Tue, 29 Jan 2013 07:48:58 -0500 (EST)
Date: Tue, 29 Jan 2013 14:48:37 +0200
From: "Kirill A.
Shutemov"
Message-ID: <5107c525eb8d1_f167d78c8425f9@blue.mail>
In-Reply-To:
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hillf Danton, "Kirill A. Shutemov"
Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm@kvack.org, Andi Kleen, Matthew Wilcox, "Kirill A. Shutemov", linux-fsdevel, LKML

Hillf Danton wrote:
> On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
> wrote:
> > +	page_cache_get(page);
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	page->mapping = mapping;
> > +	if (PageTransHuge(page)) {
> > +		int i;
> > +		for (i = 0; i < HPAGE_CACHE_NR; i++) {
> > +			page_cache_get(page + i);
> > +			page[i].index = offset + i;
> > +			error = radix_tree_insert(&mapping->page_tree,
> > +					offset + i, page + i);
> > +			if (error) {
> > +				page_cache_release(page + i);
> > +				break;
> > +			}
>
> Is page count balanced with the following?

It's broken. Last minute changes are evil :(

Thanks for catching it. I'll fix it in the next revision.

> @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page)
>
>  	if (freepage)
>  		freepage(page);
> +	if (PageTransHuge(page))
> +		for (i = 1; i < HPAGE_CACHE_NR; i++)
> +			page_cache_release(page);
>  	page_cache_release(page);

-- 
 Kirill A. Shutemov
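
The exchange above is about whether gets and releases pair up, and a toy
userspace count makes the question concrete. This is only a sketch, not
kernel code: toy_add() and toy_delete() are hypothetical helpers that mirror
nothing but the page_cache_get()/page_cache_release() calls visible in the
two quoted hunks, TOY_HPAGE_CACHE_NR = 512 assumes 2MB huge pages built from
4k subpages, and all references are assumed to land on one counter for the
whole huge page.

```c
#include <assert.h>

/* Assumption: 2MB huge page made of 4k subpages, as on x86_64. */
#define TOY_HPAGE_CACHE_NR 512

/*
 * References taken by the quoted add_to_page_cache_locked() hunk for a
 * huge page: one page_cache_get() before the lock is taken, then one
 * more per subpage inside the loop (i = 0 .. HPAGE_CACHE_NR - 1).
 */
static long toy_add(long refs)
{
	int i;

	refs++;				/* page_cache_get(page) */
	for (i = 0; i < TOY_HPAGE_CACHE_NR; i++)
		refs++;			/* page_cache_get(page + i) */
	return refs;
}

/*
 * References dropped by the quoted delete_from_page_cache() hunk:
 * HPAGE_CACHE_NR - 1 releases in the loop, plus the final release.
 */
static long toy_delete(long refs)
{
	int i;

	/* Guard against underflow in this toy model. */
	assert(refs >= TOY_HPAGE_CACHE_NR);

	for (i = 1; i < TOY_HPAGE_CACHE_NR; i++)
		refs--;			/* page_cache_release(page) in the loop */
	refs--;				/* trailing page_cache_release(page) */
	return refs;
}

/* Net references left behind by one add followed by one delete. */
static long toy_leaked_refs(void)
{
	return toy_delete(toy_add(0));
}
```

Under those assumptions toy_leaked_refs() returns 1: the add side takes
HPAGE_CACHE_NR + 1 references and the delete side drops HPAGE_CACHE_NR.
The thread itself only records that Kirill agreed the accounting was broken;
which side he changed in the next revision is not shown here.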

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 3CF346B0005 for ; Thu, 21 Mar 2013 04:00:53 -0400 (EDT)
Received: by mail-ia0-f175.google.com with SMTP id y26so2196227iab.34 for ; Thu, 21 Mar 2013 01:00:52 -0700 (PDT)
Message-ID: <514ABE2C.1090901@gmail.com>
Date: Thu, 21 Mar 2013 16:00:44 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Kirill A. Shutemov"
Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm@kvack.org, Andi Kleen, Matthew Wilcox, "Kirill A. Shutemov", linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov"
>
> Here's first steps towards huge pages in page cache.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
>
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?

Is there any THP performance benchmark, for either anonymous pages or
file pages?
> > Next steps (not necessary in this order): > - mmap(); > - migration (?); > - collapse; > - stats, knobs, etc.; > - tmpfs/shmem enabling; > - ... > > Kirill A. Shutemov (16): > block: implement add_bdi_stat() > mm: implement zero_huge_user_segment and friends > mm: drop actor argument of do_generic_file_read() > radix-tree: implement preload for multiple contiguous elements > thp, mm: basic defines for transparent huge page cache > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > thp, mm: rewrite delete_from_page_cache() to support huge pages > thp, mm: locking tail page is a bug > thp, mm: handle tail pages in page_cache_get_speculative() > thp, mm: implement grab_cache_huge_page_write_begin() > thp, mm: naive support of thp in generic read/write routines > thp, libfs: initial support of thp in > simple_read/write_begin/write_end > thp: handle file pages in split_huge_page() > thp, mm: truncate support for transparent huge page cache > thp, mm: split huge page on mmap file page > ramfs: enable transparent huge page cache > > fs/libfs.c | 54 +++++++++--- > fs/ramfs/inode.c | 6 +- > include/linux/backing-dev.h | 10 +++ > include/linux/huge_mm.h | 8 ++ > include/linux/mm.h | 15 ++++ > include/linux/pagemap.h | 14 ++- > include/linux/radix-tree.h | 3 + > lib/radix-tree.c | 32 +++++-- > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 62 +++++++++++-- > mm/memory.c | 22 +++++ > mm/truncate.c | 12 +++ > 12 files changed, 375 insertions(+), 67 deletions(-) > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 1E94D6B0005 for ; Thu, 4 Apr 2013 21:42:20 -0400 (EDT) Received: from /spool/local by e28smtp08.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Fri, 5 Apr 2013 07:07:02 +0530 Received: from d28relay04.in.ibm.com (d28relay04.in.ibm.com [9.184.220.61]) by d28dlp03.in.ibm.com (Postfix) with ESMTP id 919791258023 for ; Fri, 5 Apr 2013 07:13:31 +0530 (IST) Received: from d28av02.in.ibm.com (d28av02.in.ibm.com [9.184.220.64]) by d28relay04.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r351g7bY262456 for ; Fri, 5 Apr 2013 07:12:07 +0530 Received: from d28av02.in.ibm.com (loopback [127.0.0.1]) by d28av02.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r351g9q0030087 for ; Fri, 5 Apr 2013 12:42:10 +1100 Date: Fri, 5 Apr 2013 09:42:08 +0800 From: Wanpeng Li Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Message-ID: <20130405014208.GC362@hacker.(null)> Reply-To: Wanpeng Li References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote: >On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >> > > From: "Kirill A. Shutemov" >> > > >> > > Here's first steps towards huge pages in page cache. 
>> > > >> > > The intend of the work is get code ready to enable transparent huge page >> > > cache for the most simple fs -- ramfs. >> > > >> > > It's not yet near feature-complete. It only provides basic infrastructure. >> > > At the moment we can read, write and truncate file on ramfs with huge pages in >> > > page cache. The most interesting part, mmap(), is not yet there. For now >> > > we split huge page on mmap() attempt. >> > > >> > > I can't say that I see whole picture. I'm not sure if I understand locking >> > > model around split_huge_page(). Probably, not. >> > > Andrea, could you check if it looks correct? >> > > >> > > Next steps (not necessary in this order): >> > > - mmap(); >> > > - migration (?); >> > > - collapse; >> > > - stats, knobs, etc.; >> > > - tmpfs/shmem enabling; >> > > - ... >> > > >> > > Kirill A. Shutemov (16): >> > > block: implement add_bdi_stat() >> > > mm: implement zero_huge_user_segment and friends >> > > mm: drop actor argument of do_generic_file_read() >> > > radix-tree: implement preload for multiple contiguous elements >> > > thp, mm: basic defines for transparent huge page cache >> > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages >> > > thp, mm: rewrite delete_from_page_cache() to support huge pages >> > > thp, mm: locking tail page is a bug >> > > thp, mm: handle tail pages in page_cache_get_speculative() >> > > thp, mm: implement grab_cache_huge_page_write_begin() >> > > thp, mm: naive support of thp in generic read/write routines >> > > thp, libfs: initial support of thp in >> > > simple_read/write_begin/write_end >> > > thp: handle file pages in split_huge_page() >> > > thp, mm: truncate support for transparent huge page cache >> > > thp, mm: split huge page on mmap file page >> > > ramfs: enable transparent huge page cache >> > > >> > > fs/libfs.c | 54 +++++++++--- >> > > fs/ramfs/inode.c | 6 +- >> > > include/linux/backing-dev.h | 10 +++ >> > > include/linux/huge_mm.h | 8 ++ >> > > 
include/linux/mm.h | 15 ++++ >> > > include/linux/pagemap.h | 14 ++- >> > > include/linux/radix-tree.h | 3 + >> > > lib/radix-tree.c | 32 +++++-- >> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >> > > mm/huge_memory.c | 62 +++++++++++-- >> > > mm/memory.c | 22 +++++ >> > > mm/truncate.c | 12 +++ >> > > 12 files changed, 375 insertions(+), 67 deletions(-) >> > >> > Interesting. >> > >> > I was starting to think about Transparent Huge Pagecache a few >> > months ago, but then got washed away by incoming waves as usual. >> > >> > Certainly I don't have a line of code to show for it; but my first >> > impression of your patches is that we have very different ideas of >> > where to start. > >A second impression confirms that we have very different ideas of >where to start. I don't want to be dismissive, and please don't let >me discourage you, but I just don't find what you have very interesting. > >I'm sure you'll agree that the interesting part, and the difficult part, >comes with mmap(); and there's no point whatever to THPages without mmap() >(of course, I'm including exec and brk and shm when I say mmap there). > >(There may be performance benefits in working with larger page cache >size, which Christoph Lameter explored a few years back, but that's a >different topic: I think 2MB - if I may be x86_64-centric - would not be >the unit of choice for that, unless SSD erase block were to dominate.) > >I'm interested to get to the point of prototyping something that does >support mmap() of THPageCache: I'm pretty sure that I'd then soon learn >a lot about my misconceptions, and have to rework for a while (or give >up!); but I don't see much point in posting anything without that. >I don't know if we have 5 or 50 places which "know" that a THPage >must be Anon: some I'll spot in advance, some I sadly won't. 
> >It's not clear to me that the infrastructural changes you make in this >series will be needed or not, if I pursue my approach: some perhaps as >optimizations on top of the poorly performing base that may emerge from >going about it my way. But for me it's too soon to think about those. > >Something I notice that we do agree upon: the radix_tree holding the >4k subpages, at least for now. When I first started thinking towards >THPageCache, I was fascinated by how we could manage the hugepages in >the radix_tree, cutting out unnecessary levels etc; but after a while >I realized that although there's probably nice scope for cleverness >there (significantly constrained by RCU expectations), it would only >be about optimization. Let's be simple and stupid about radix_tree >for now, the problems that need to be worked out lie elsewhere. > >> > >> > Perhaps that's good complementarity, or perhaps I'll disagree with >> > your approach. I'll be taking a look at yours in the coming days, >> > and trying to summon back up my own ideas to summarize them for you. >> >> Yeah, it would be nice to see alternative design ideas. Looking forward. >> >> > Perhaps I was naive to imagine it, but I did intend to start out >> > generically, independent of filesystem; but content to narrow down >> > on tmpfs alone where it gets hard to support the others (writeback >> > springs to mind). khugepaged would be migrating little pages into >> > huge pages, where it saw that the mmaps of the file would benefit >> > (and for testing I would hack mmap alignment choice to favour it). >> >> I don't think all fs at once would fly, but it's wonderful, if I'm >> wrong :) > >You are imagining the filesystem putting huge pages into its cache. 
>Whereas I'm imagining khugepaged looking around at mmaped file areas, >seeing which would benefit from huge pagecache (let's assume offset 0 >belongs on hugepage boundary - maybe one day someone will want to tune >some files or parts differently, but that's low priority), migrating 4k >pages over to 2MB page (wouldn't have to be done all in one pass), then >finally slotting in the pmds for that. > >But going this way, I expect we'd have to split at page_mkwrite(): >we probably don't want a single touch to dirty 2MB at a time, >unless tmpfs or ramfs. > >> >> > I had arrived at a conviction that the first thing to change was >> > the way that tail pages of a THP are refcounted, that it had been a >> > mistake to use the compound page method of holding the THP together. >> > But I'll have to enter a trance now to recall the arguments ;) >> >> THP refcounting looks reasonable for me, if take split_huge_page() in >> account. > >I'm not claiming that the THP refcounting is wrong in what it's doing >at present; but that I suspect we'll want to rework it for THPageCache. > >Something I take for granted, I think you do too but I'm not certain: >a file with transparent huge pages in its page cache can also have small >pages in other extents of its page cache; and can be mapped hugely (2MB >extents) into one address space at the same time as individual 4k pages >from those extents are mapped into another (or the same) address space. > >One can certainly imagine sacrificing that principle, splitting whenever >there's such a "conflict"; but it then becomes uninteresting to me, too >much like hugetlbfs. Splitting an anonymous hugepage in all address >spaces that hold it when one of them needs it split, that has been a >pragmatic strategy: it's not a common case for forks to diverge like >that; but files are expected to be more widely shared. 
> >At present THP is using compound pages, with mapcount of tail pages >reused to track their contribution to head page count; but I think we >shall want to be able to use the mapcount, and the count, of TH tail >pages for their original purpose if huge mappings can coexist with tiny. >Not fully thought out, but that's my feeling. > >The use of compound pages, in particular the redirection of tail page >count to head page count, was important in hugetlbfs: a get_user_pages >reference on a subpage must prevent the containing hugepage from being >freed, because hugetlbfs has its own separate pool of hugepages to >which freeing returns them. > >But for transparent huge pages? It should not matter so much if the >subpages are freed independently. So I'd like to devise another glue >to hold them together more loosely (for prototyping I can certainly >pretend we have infinite pageflag and pagefield space if that helps): >I may find in practice that they're forever falling apart, and I run >crying back to compound pages; but at present I'm hoping not. > >This mail might suggest that I'm about to start coding: I wish that >were true, but in reality there's always a lot of unrelated things >I have to look at, which dilute my focus. So if I've said anything >that sparks ideas for you, go with them. It seems that it's a good idea, Hugh. I will start coding this. ;-) Regards, Wanpeng Li > >Hugh > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
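
Hugh's description of compound pages in the quoted text (references on a
tail page redirected to the head, with a per-tail field tracking each tail's
contribution) can be sketched with a toy userspace model. Everything named
toy_* is hypothetical and exists only for illustration; the real struct page
fields and helpers differ, and TOY_NR_SUBPAGES = 512 assumes 2MB / 4k.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 2MB huge page made of 4k subpages. */
#define TOY_NR_SUBPAGES 512

struct toy_page {
	struct toy_page *head;	/* head of the compound page (self for head) */
	int count;		/* reference count, kept on the head only */
	int tail_refs;		/* this tail's contribution to head->count */
};

/* Glue nr subpages into one compound page owned by a single reference. */
static void toy_compound_init(struct toy_page *pages, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		pages[i].head = &pages[0];
		pages[i].count = 0;
		pages[i].tail_refs = 0;
	}
	pages[0].count = 1;	/* the owner's reference */
}

/* Take a reference on any subpage: it is redirected to the head. */
static void toy_get_page(struct toy_page *page)
{
	struct toy_page *head = page->head;

	head->count++;
	if (page != head)
		page->tail_refs++;	/* remember which tail contributed */
}

/* Drop a reference; returns 1 when the whole huge page may be freed. */
static int toy_put_page(struct toy_page *page)
{
	struct toy_page *head = page->head;

	if (page != head) {
		assert(page->tail_refs > 0);
		page->tail_refs--;
	}
	assert(head->count > 0);
	return --head->count == 0;
}
```

With this model, pinning pages[7] and then dropping the owner's reference
leaves the head count at 1, so the huge page stays allocated until the pin
on the tail is dropped. That is the hugetlbfs property Hugh describes; his
open question is whether transparent huge pages in the page cache need glue
this tight.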
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx102.postini.com [74.125.245.102]) by kanga.kvack.org (Postfix) with SMTP id 9668E6B0005 for ; Sat, 6 Apr 2013 20:26:32 -0400 (EDT) Received: from /spool/local by e23smtp01.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Sun, 7 Apr 2013 10:19:39 +1000 Received: from d23relay03.au.ibm.com (d23relay03.au.ibm.com [9.190.235.21]) by d23dlp02.au.ibm.com (Postfix) with ESMTP id 3CED42BB0052 for ; Sun, 7 Apr 2013 10:26:24 +1000 (EST) Received: from d23av04.au.ibm.com (d23av04.au.ibm.com [9.190.235.139]) by d23relay03.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r370QIEv65208326 for ; Sun, 7 Apr 2013 10:26:19 +1000 Received: from d23av04.au.ibm.com (loopback [127.0.0.1]) by d23av04.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r370QMeH028867 for ; Sun, 7 Apr 2013 10:26:23 +1000 Date: Sun, 7 Apr 2013 08:26:19 +0800 From: Wanpeng Li Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Message-ID: <20130407002619.GA19381@hacker.(null)> Reply-To: Wanpeng Li References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> <20130405014208.GC362@hacker.(null)> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130405014208.GC362@hacker.(null)> Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Apr 05, 2013 at 09:42:08AM +0800, Wanpeng Li wrote: >On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote: >>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >>> Hugh Dickins wrote: >>> > On Mon, 28 Jan 2013, Kirill A. 
Shutemov wrote: >>> > > From: "Kirill A. Shutemov" >>> > > >>> > > Here's first steps towards huge pages in page cache. >>> > > >>> > > The intend of the work is get code ready to enable transparent huge page >>> > > cache for the most simple fs -- ramfs. >>> > > >>> > > It's not yet near feature-complete. It only provides basic infrastructure. >>> > > At the moment we can read, write and truncate file on ramfs with huge pages in >>> > > page cache. The most interesting part, mmap(), is not yet there. For now >>> > > we split huge page on mmap() attempt. >>> > > >>> > > I can't say that I see whole picture. I'm not sure if I understand locking >>> > > model around split_huge_page(). Probably, not. >>> > > Andrea, could you check if it looks correct? >>> > > >>> > > Next steps (not necessary in this order): >>> > > - mmap(); >>> > > - migration (?); >>> > > - collapse; >>> > > - stats, knobs, etc.; >>> > > - tmpfs/shmem enabling; >>> > > - ... >>> > > >>> > > Kirill A. Shutemov (16): >>> > > block: implement add_bdi_stat() >>> > > mm: implement zero_huge_user_segment and friends >>> > > mm: drop actor argument of do_generic_file_read() >>> > > radix-tree: implement preload for multiple contiguous elements >>> > > thp, mm: basic defines for transparent huge page cache >>> > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>> > > thp, mm: rewrite delete_from_page_cache() to support huge pages >>> > > thp, mm: locking tail page is a bug >>> > > thp, mm: handle tail pages in page_cache_get_speculative() >>> > > thp, mm: implement grab_cache_huge_page_write_begin() >>> > > thp, mm: naive support of thp in generic read/write routines >>> > > thp, libfs: initial support of thp in >>> > > simple_read/write_begin/write_end >>> > > thp: handle file pages in split_huge_page() >>> > > thp, mm: truncate support for transparent huge page cache >>> > > thp, mm: split huge page on mmap file page >>> > > ramfs: enable transparent huge page cache >>> > > >>> > > 
fs/libfs.c | 54 +++++++++--- >>> > > fs/ramfs/inode.c | 6 +- >>> > > include/linux/backing-dev.h | 10 +++ >>> > > include/linux/huge_mm.h | 8 ++ >>> > > include/linux/mm.h | 15 ++++ >>> > > include/linux/pagemap.h | 14 ++- >>> > > include/linux/radix-tree.h | 3 + >>> > > lib/radix-tree.c | 32 +++++-- >>> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>> > > mm/huge_memory.c | 62 +++++++++++-- >>> > > mm/memory.c | 22 +++++ >>> > > mm/truncate.c | 12 +++ >>> > > 12 files changed, 375 insertions(+), 67 deletions(-) >>> > >>> > Interesting. >>> > >>> > I was starting to think about Transparent Huge Pagecache a few >>> > months ago, but then got washed away by incoming waves as usual. >>> > >>> > Certainly I don't have a line of code to show for it; but my first >>> > impression of your patches is that we have very different ideas of >>> > where to start. >> >>A second impression confirms that we have very different ideas of >>where to start. I don't want to be dismissive, and please don't let >>me discourage you, but I just don't find what you have very interesting. >> >>I'm sure you'll agree that the interesting part, and the difficult part, >>comes with mmap(); and there's no point whatever to THPages without mmap() >>(of course, I'm including exec and brk and shm when I say mmap there). >> >>(There may be performance benefits in working with larger page cache >>size, which Christoph Lameter explored a few years back, but that's a >>different topic: I think 2MB - if I may be x86_64-centric - would not be >>the unit of choice for that, unless SSD erase block were to dominate.) >> >>I'm interested to get to the point of prototyping something that does >>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn >>a lot about my misconceptions, and have to rework for a while (or give >>up!); but I don't see much point in posting anything without that. 
>>I don't know if we have 5 or 50 places which "know" that a THPage >>must be Anon: some I'll spot in advance, some I sadly won't. >> >>It's not clear to me that the infrastructural changes you make in this >>series will be needed or not, if I pursue my approach: some perhaps as >>optimizations on top of the poorly performing base that may emerge from >>going about it my way. But for me it's too soon to think about those. >> >>Something I notice that we do agree upon: the radix_tree holding the >>4k subpages, at least for now. When I first started thinking towards >>THPageCache, I was fascinated by how we could manage the hugepages in >>the radix_tree, cutting out unnecessary levels etc; but after a while >>I realized that although there's probably nice scope for cleverness >>there (significantly constrained by RCU expectations), it would only >>be about optimization. Let's be simple and stupid about radix_tree >>for now, the problems that need to be worked out lie elsewhere. >> >>> > >>> > Perhaps that's good complementarity, or perhaps I'll disagree with >>> > your approach. I'll be taking a look at yours in the coming days, >>> > and trying to summon back up my own ideas to summarize them for you. >>> >>> Yeah, it would be nice to see alternative design ideas. Looking forward. >>> >>> > Perhaps I was naive to imagine it, but I did intend to start out >>> > generically, independent of filesystem; but content to narrow down >>> > on tmpfs alone where it gets hard to support the others (writeback >>> > springs to mind). khugepaged would be migrating little pages into >>> > huge pages, where it saw that the mmaps of the file would benefit >>> > (and for testing I would hack mmap alignment choice to favour it). >>> >>> I don't think all fs at once would fly, but it's wonderful, if I'm >>> wrong :) >> >>You are imagining the filesystem putting huge pages into its cache. 
>>Whereas I'm imagining khugepaged looking around at mmaped file areas, >>seeing which would benefit from huge pagecache (let's assume offset 0 >>belongs on hugepage boundary - maybe one day someone will want to tune >>some files or parts differently, but that's low priority), migrating 4k >>pages over to 2MB page (wouldn't have to be done all in one pass), then >>finally slotting in the pmds for that. >> >>But going this way, I expect we'd have to split at page_mkwrite(): >>we probably don't want a single touch to dirty 2MB at a time, >>unless tmpfs or ramfs. >> >>> >>> > I had arrived at a conviction that the first thing to change was >>> > the way that tail pages of a THP are refcounted, that it had been a >>> > mistake to use the compound page method of holding the THP together. >>> > But I'll have to enter a trance now to recall the arguments ;) >>> >>> THP refcounting looks reasonable for me, if take split_huge_page() in >>> account. >> >>I'm not claiming that the THP refcounting is wrong in what it's doing >>at present; but that I suspect we'll want to rework it for THPageCache. >> >>Something I take for granted, I think you do too but I'm not certain: >>a file with transparent huge pages in its page cache can also have small >>pages in other extents of its page cache; and can be mapped hugely (2MB >>extents) into one address space at the same time as individual 4k pages >>from those extents are mapped into another (or the same) address space. >> >>One can certainly imagine sacrificing that principle, splitting whenever >>there's such a "conflict"; but it then becomes uninteresting to me, too >>much like hugetlbfs. Splitting an anonymous hugepage in all address >>spaces that hold it when one of them needs it split, that has been a >>pragmatic strategy: it's not a common case for forks to diverge like >>that; but files are expected to be more widely shared. 
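The 2MB/4k relationship assumed in the discussion above (offset 0 of the file on a hugepage boundary, and 512 4k radix-tree slots per huge page on x86_64) can be sketched in userspace C. This is only an illustration under those assumptions: the SKETCH_* constants and sketch_locate() are hypothetical names that exist neither in the kernel nor in this series; they loosely mirror the `index & ~HPAGE_CACHE_INDEX_MASK` arithmetic used by the patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace constants for x86_64: 4k base pages, 2MB huge
 * pages. They are redefined here and are not the kernel's own macros.
 */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_HPAGE_SHIFT  21
#define SKETCH_HPAGE_NR     (UINT64_C(1) << (SKETCH_HPAGE_SHIFT - SKETCH_PAGE_SHIFT))

struct sketch_pos {
	uint64_t index;       /* 4k page cache index that holds the byte */
	uint64_t head_index;  /* index of the first 4k slot of the 2MB extent */
	uint64_t subpage;     /* which of the 512 subpages holds the byte */
	uint64_t off_in_page; /* byte offset inside that 4k subpage */
};

/* Split a file offset into huge-page head index, subpage and byte offset. */
static struct sketch_pos sketch_locate(uint64_t pos)
{
	struct sketch_pos p;

	p.index = pos >> SKETCH_PAGE_SHIFT;
	/* Round the 4k index down to a multiple of 512 to find the head. */
	p.head_index = p.index & ~(SKETCH_HPAGE_NR - 1);
	p.subpage = p.index - p.head_index;
	p.off_in_page = pos & ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1);

	/* A subpage number must always fall inside one huge page. */
	assert(p.subpage < SKETCH_HPAGE_NR);
	return p;
}
```

A 4k mapping would use index and off_in_page directly, while a 2MB mapping of the same extent would start from head_index; both views address the same bytes, which is why they can in principle coexist.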
>> >>At present THP is using compound pages, with mapcount of tail pages >>reused to track their contribution to head page count; but I think we >>shall want to be able to use the mapcount, and the count, of TH tail >>pages for their original purpose if huge mappings can coexist with tiny. >>Not fully thought out, but that's my feeling. >> >>The use of compound pages, in particular the redirection of tail page >>count to head page count, was important in hugetlbfs: a get_user_pages >>reference on a subpage must prevent the containing hugepage from being >>freed, because hugetlbfs has its own separate pool of hugepages to >>which freeing returns them. >> >>But for transparent huge pages? It should not matter so much if the >>subpages are freed independently. So I'd like to devise another glue >>to hold them together more loosely (for prototyping I can certainly >>pretend we have infinite pageflag and pagefield space if that helps): >>I may find in practice that they're forever falling apart, and I run >>crying back to compound pages; but at present I'm hoping not. >> >>This mail might suggest that I'm about to start coding: I wish that >>were true, but in reality there's always a lot of unrelated things >>I have to look at, which dilute my focus. So if I've said anything >>that sparks ideas for you, go with them. Hi Hugh, commit 70b50f94f16 ("mm: thp: tail page refcounting fix") tells us that accounting the tail page references on tail_page->_count wasn't safe. Regards, Wanpeng Li > >It seems that it's a good idea, Hugh. I will start coding this. ;-) > >Regards, >Wanpeng Li > >> >>Hugh
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756741Ab3A1JXt (ORCPT ); Mon, 28 Jan 2013 04:23:49 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755873Ab3A1JXg (ORCPT ); Mon, 28 Jan 2013 04:23:36 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975132" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 08/16] thp, mm: locking tail page is a bug Date: Mon, 28 Jan 2013 11:24:20 +0200 Message-Id: <1359365068-10147-9-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Signed-off-by: Kirill A.
Shutemov --- mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index a4b4fd5..f59eaa1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -665,6 +665,7 @@ void __lock_page(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + VM_BUG_ON(PageTail(page)); __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page, TASK_UNINTERRUPTIBLE); } @@ -674,6 +675,7 @@ int __lock_page_killable(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + VM_BUG_ON(PageTail(page)); return __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page_killable, TASK_KILLABLE); } -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756631Ab3A1JXm (ORCPT ); Mon, 28 Jan 2013 04:23:42 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755276Ab3A1JXd (ORCPT ); Mon, 28 Jan 2013 04:23:33 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975103" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 00/16] Transparent huge page cache Date: Mon, 28 Jan 2013 11:24:12 +0200 Message-Id: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Here's first steps towards huge pages in page cache. The intend of the work is get code ready to enable transparent huge page cache for the most simple fs -- ramfs. It's not yet near feature-complete. It only provides basic infrastructure. 
At the moment we can read, write and truncate file on ramfs with huge pages in page cache. The most interesting part, mmap(), is not yet there. For now we split huge page on mmap() attempt. I can't say that I see whole picture. I'm not sure if I understand locking model around split_huge_page(). Probably, not. Andrea, could you check if it looks correct? Next steps (not necessary in this order): - mmap(); - migration (?); - collapse; - stats, knobs, etc.; - tmpfs/shmem enabling; - ... Kirill A. Shutemov (16): block: implement add_bdi_stat() mm: implement zero_huge_user_segment and friends mm: drop actor argument of do_generic_file_read() radix-tree: implement preload for multiple contiguous elements thp, mm: basic defines for transparent huge page cache thp, mm: rewrite add_to_page_cache_locked() to support huge pages thp, mm: rewrite delete_from_page_cache() to support huge pages thp, mm: locking tail page is a bug thp, mm: handle tail pages in page_cache_get_speculative() thp, mm: implement grab_cache_huge_page_write_begin() thp, mm: naive support of thp in generic read/write routines thp, libfs: initial support of thp in simple_read/write_begin/write_end thp: handle file pages in split_huge_page() thp, mm: truncate support for transparent huge page cache thp, mm: split huge page on mmap file page ramfs: enable transparent huge page cache fs/libfs.c | 54 +++++++++--- fs/ramfs/inode.c | 6 +- include/linux/backing-dev.h | 10 +++ include/linux/huge_mm.h | 8 ++ include/linux/mm.h | 15 ++++ include/linux/pagemap.h | 14 ++- include/linux/radix-tree.h | 3 + lib/radix-tree.c | 32 +++++-- mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- mm/huge_memory.c | 62 +++++++++++-- mm/memory.c | 22 +++++ mm/truncate.c | 12 +++ 12 files changed, 375 insertions(+), 67 deletions(-) -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756805Ab3A1JYm (ORCPT ); Mon, 28 Jan 2013 04:24:42 
-0500 Received: from mga01.intel.com ([192.55.52.88]:26134 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756484Ab3A1JXl (ORCPT ); Mon, 28 Jan 2013 04:23:41 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975167" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 15/16] thp, mm: split huge page on mmap file page Date: Mon, 28 Jan 2013 11:24:27 +0200 Message-Id: <1359365068-10147-16-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" We are not ready to mmap file-backed transparent huge pages. Let's split them on mmap() attempt. Signed-off-by: Kirill A.
Shutemov --- mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index a7331fb..2e08582 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1692,6 +1692,8 @@ retry_find: goto no_cached_page; } + if (PageTransCompound(page)) + split_huge_page(page); if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { page_cache_release(page); return ret | VM_FAULT_RETRY; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756733Ab3A1JYl (ORCPT ); Mon, 28 Jan 2013 04:24:41 -0500 Received: from mga02.intel.com ([134.134.136.20]:56891 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755931Ab3A1JXl (ORCPT ); Mon, 28 Jan 2013 04:23:41 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="253481882" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 07/16] thp, mm: rewrite delete_from_page_cache() to support huge pages Date: Mon, 28 Jan 2013 11:24:19 +0200 Message-Id: <1359365068-10147-8-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a time. Signed-off-by: Kirill A.
Shutemov --- mm/filemap.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index fa2fdab..a4b4fd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -112,6 +112,7 @@ void __delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + int nr = 1; /* * if we're uptodate, flush out into the cleancache, otherwise @@ -123,13 +124,23 @@ void __delete_from_page_cache(struct page *page) else cleancache_invalidate_page(mapping, page); - radix_tree_delete(&mapping->page_tree, page->index); + if (PageTransHuge(page)) { + int i; + + for (i = 0; i < HPAGE_CACHE_NR; i++) + radix_tree_delete(&mapping->page_tree, page->index + i); + nr = HPAGE_CACHE_NR; + } else { + radix_tree_delete(&mapping->page_tree, page->index); + } + page->mapping = NULL; /* Leave page->index set: truncation lookup relies upon it */ - mapping->nrpages--; - __dec_zone_page_state(page, NR_FILE_PAGES); + + mapping->nrpages -= nr; + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr); if (PageSwapBacked(page)) - __dec_zone_page_state(page, NR_SHMEM); + __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr); BUG_ON(page_mapped(page)); /* @@ -140,8 +151,8 @@ void __delete_from_page_cache(struct page *page) * having removed the page entirely. 
*/ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { - dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr); + add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr); } } @@ -157,6 +168,7 @@ void delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; void (*freepage)(struct page *); + int i; BUG_ON(!PageLocked(page)); @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page) if (freepage) freepage(page); + if (PageTransHuge(page)) + for (i = 1; i < HPAGE_CACHE_NR; i++) + page_cache_release(page); page_cache_release(page); } EXPORT_SYMBOL(delete_from_page_cache); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756681Ab3A1JYj (ORCPT ); Mon, 28 Jan 2013 04:24:39 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756501Ab3A1JXl (ORCPT ); Mon, 28 Jan 2013 04:23:41 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="279163227" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 16/16] ramfs: enable transparent huge page cache Date: Mon, 28 Jan 2013 11:24:28 +0200 Message-Id: <1359365068-10147-17-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. 
Shutemov" ramfs is the simplest fs from the page cache point of view. Let's start enabling transparent huge page cache here. For now we allocate only non-movable huge pages. It's not yet clear whether movable pages are safe here or what needs to be done to make them safe. Signed-off-by: Kirill A. Shutemov --- fs/ramfs/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index eab8c09..591457d 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb, inode_init_owner(inode, dir, mode); inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + /* + * TODO: what should be done to make movable safe? + */ + mapping_set_gfp_mask(inode->i_mapping, + GFP_TRANSHUGE & ~__GFP_MOVABLE); mapping_set_unevictable(inode->i_mapping); inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755878Ab3A1JYg (ORCPT ); Mon, 28 Jan 2013 04:24:36 -0500 Received: from mga02.intel.com ([134.134.136.20]:16477 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756516Ab3A1JXl (ORCPT ); Mon, 28 Jan 2013 04:23:41 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="253481905" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A.
Shutemov" Subject: [PATCH, RFC 10/16] thp, mm: implement grab_cache_huge_page_write_begin() Date: Mon, 28 Jan 2013 11:24:22 +0200 Message-Id: <1359365068-10147-11-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" The function is a twin of grab_cache_page_write_begin(), but it tries to allocate a huge page at the given position, aligned to HPAGE_CACHE_NR. If, for some reason, it's not possible to allocate a huge page at this position, it returns NULL. The caller should take care of falling back to small pages. Signed-off-by: Kirill A. Shutemov --- include/linux/pagemap.h | 10 +++++++++ mm/filemap.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 1da2043..5836d0d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -260,6 +260,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags); +#else +static inline struct page *grab_cache_huge_page_write_begin( + struct address_space *mapping, pgoff_t index, unsigned flags) +{ + return NULL; +} +#endif /* * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c index f59eaa1..68e47e4 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2328,6 +2328,61 @@ found: } EXPORT_SYMBOL(grab_cache_page_write_begin); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * Find or create a huge page at the given pagecache position, aligned to + * HPAGE_CACHE_NR. Return the locked huge page. + * + * If, for some reason, it's not possible allocate a huge page at this + * possition, it returns NULL. Caller should take care of fallback to small + * pages. + * + * This function is specifically for buffered writes. + */ +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + int status; + gfp_t gfp_mask; + struct page *page; + gfp_t gfp_notmask = 0; + + BUG_ON(index & HPAGE_CACHE_INDEX_MASK); + gfp_mask = mapping_gfp_mask(mapping); + BUG_ON(!(gfp_mask & __GFP_COMP)); + if (mapping_cap_account_dirty(mapping)) + gfp_mask |= __GFP_WRITE; + if (flags & AOP_FLAG_NOFS) + gfp_notmask = __GFP_FS; +repeat: + page = find_lock_page(mapping, index); + if (page) { + if (!PageTransHuge(page)) { + unlock_page(page); + page_cache_release(page); + return NULL; + } + goto found; + } + + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER); + if (!page) + return NULL; + + status = add_to_page_cache_lru(page, mapping, index, + GFP_KERNEL & ~gfp_notmask); + if (unlikely(status)) { + page_cache_release(page); + if (status == -EEXIST) + goto repeat; + return NULL; + } +found: + wait_on_page_writeback(page); + return page; +} +#endif + static ssize_t generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos) { -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756430Ab3A1JXj (ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755931Ab3A1JXe (ORCPT 
); Mon, 28 Jan 2013 04:23:34 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975105" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 01/16] block: implement add_bdi_stat() Date: Mon, 28 Jan 2013 11:24:13 +0200 Message-Id: <1359365068-10147-2-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" It's required for batched stats update. Signed-off-by: Kirill A. Shutemov --- include/linux/backing-dev.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 3504599..b05d961 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi, __add_bdi_stat(bdi, item, -1); } +static inline void add_bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item, s64 amount) +{ + unsigned long flags; + + local_irq_save(flags); + __add_bdi_stat(bdi, item, amount); + local_irq_restore(flags); +} + static inline void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item) { -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754068Ab3A1JZt (ORCPT ); Mon, 28 Jan 2013 04:25:49 -0500 Received: from mga03.intel.com ([143.182.124.21]:65438 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756345Ab3A1JXj 
(ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="249123714" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 11/16] thp, mm: naive support of thp in generic read/write routines Date: Mon, 28 Jan 2013 11:24:23 +0200 Message-Id: <1359365068-10147-12-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. This implementation doesn't cover address spaces with backing store. Signed-off-by: Kirill A.
Shutemov --- mm/filemap.c | 35 ++++++++++++++++++++++++++++++----- 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 68e47e4..a7331fb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1161,12 +1161,23 @@ find_page: if (unlikely(page == NULL)) goto no_cached_page; } + if (PageTransTail(page)) { + page_cache_release(page); + page = find_get_page(mapping, + index & ~HPAGE_CACHE_INDEX_MASK); + if (!PageTransHuge(page)) { + page_cache_release(page); + goto find_page; + } + } if (PageReadahead(page)) { + BUG_ON(PageTransHuge(page)); page_cache_async_readahead(mapping, ra, filp, page, index, last_index - index); } if (!PageUptodate(page)) { + BUG_ON(PageTransHuge(page)); if (inode->i_blkbits == PAGE_CACHE_SHIFT || !mapping->a_ops->is_partially_uptodate) goto page_not_up_to_date; @@ -1208,18 +1219,25 @@ page_ok: } nr = nr - offset; + /* Recalculate offset in page if we've got a huge page */ + if (PageTransHuge(page)) { + offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset); + offset &= ~HPAGE_PMD_MASK; + } + /* If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT)); /* * When a sequential read accesses a page several times, * only mark it as accessed the first time. */ - if (prev_index != index || offset != prev_offset) + if (prev_index != index || + (offset & ~PAGE_CACHE_MASK) != prev_offset) mark_page_accessed(page); prev_index = index; @@ -1234,8 +1252,9 @@ page_ok: * "pos" here (the actor routine has to update the user buffer * pointers and the remaining count). 
*/ - ret = file_read_actor(desc, page, offset, nr); - offset += ret; + ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT), + offset & ~PAGE_CACHE_MASK, nr); + offset = (offset & ~PAGE_CACHE_MASK) + ret; index += offset >> PAGE_CACHE_SHIFT; offset &= ~PAGE_CACHE_MASK; prev_offset = offset; @@ -2433,8 +2452,13 @@ again: if (mapping_writably_mapped(mapping)) flush_dcache_page(page); + if (PageTransHuge(page)) + offset = pos & ~HPAGE_PMD_MASK; + pagefault_disable(); - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); + copied = iov_iter_copy_from_user_atomic( + page + (offset >> PAGE_CACHE_SHIFT), + i, offset & ~PAGE_CACHE_MASK, bytes); pagefault_enable(); flush_dcache_page(page); @@ -2457,6 +2481,7 @@ again: * because not all segments in the iov can be copied at * once without a pagefault. */ + offset = pos & ~PAGE_CACHE_MASK; bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, iov_iter_single_seg_count(i)); goto again; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756837Ab3A1JZz (ORCPT ); Mon, 28 Jan 2013 04:25:55 -0500 Received: from mga02.intel.com ([134.134.136.20]:16477 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756342Ab3A1JXj (ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="253481857" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCH, RFC 02/16] mm: implement zero_huge_user_segment and friends Date: Mon, 28 Jan 2013 11:24:14 +0200 Message-Id: <1359365068-10147-3-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Let's add helpers to clear huge page segment(s). They provide the same functionality as zero_user_segment{,s} and zero_user, but for huge pages. Signed-off-by: Kirill A. Shutemov --- include/linux/mm.h | 15 +++++++++++++++ mm/memory.c | 22 ++++++++++++++++++++++ 2 files changed, 37 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index e4533a1..c011771 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1728,6 +1728,21 @@ extern void dump_page(struct page *page); extern void clear_huge_page(struct page *page, unsigned long addr, unsigned int pages_per_huge_page); +extern void zero_huge_user_segment(struct page *page, + unsigned start, unsigned end); +static inline void zero_huge_user_segments(struct page *page, + unsigned start1, unsigned end1, + unsigned start2, unsigned end2) +{ + zero_huge_user_segment(page, start1, end1); + zero_huge_user_segment(page, start2, end2); +} +static inline void zero_huge_user(struct page *page, + unsigned start, unsigned len) +{ + zero_huge_user_segment(page, start, start+len); +} + extern void copy_user_huge_page(struct page *dst, struct page *src, unsigned long addr, struct vm_area_struct *vma, unsigned int pages_per_huge_page); diff --git a/mm/memory.c b/mm/memory.c index c04078b..200a74d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4185,6 +4185,28 @@ void clear_huge_page(struct page *page, } } +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end) +{ + int i; + +
BUG_ON(end < start); + + might_sleep(); + + /* start and end are on the same small page */ + if ((start & PAGE_MASK) == (end & PAGE_MASK)) + return zero_user_segment(page + (start >> PAGE_SHIFT), + start & ~PAGE_MASK, end & ~PAGE_MASK); + + zero_user_segment(page + (start >> PAGE_SHIFT), + start & ~PAGE_MASK, PAGE_SIZE); + for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) { + cond_resched(); + clear_highpage(page + i); + } + zero_user_segment(page + i, 0, end & ~PAGE_MASK); +} + static void copy_user_gigantic_page(struct page *dst, struct page *src, unsigned long addr, struct vm_area_struct *vma, -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755729Ab3A1JZ7 (ORCPT ); Mon, 28 Jan 2013 04:25:59 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756017Ab3A1JXj (ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975155" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 12/16] thp, libfs: initial support of thp in simple_read/write_begin/write_end Date: Mon, 28 Jan 2013 11:24:24 +0200 Message-Id: <1359365068-10147-13-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" For now we try to grab a huge cache page if gfp_mask has __GFP_COMP. 
It's probably too weak a condition and needs to be reworked later. Signed-off-by: Kirill A. Shutemov --- fs/libfs.c | 54 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 12 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index 916da8c..a4530d5 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr); int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); + if (PageTransHuge(page)) + zero_huge_user(page, 0, HPAGE_PMD_SIZE); + else + clear_highpage(page); flush_dcache_page(page); SetPageUptodate(page); unlock_page(page); @@ -394,21 +397,43 @@ int simple_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { - struct page *page; + struct page *page = NULL; pgoff_t index; + gfp_t gfp_mask; index = pos >> PAGE_CACHE_SHIFT; - - page = grab_cache_page_write_begin(mapping, index, flags); + gfp_mask = mapping_gfp_mask(mapping); + + /* XXX: too weak condition.
Good enough for initial testing */ + if (gfp_mask & __GFP_COMP) { + page = grab_cache_huge_page_write_begin(mapping, + index & ~HPAGE_CACHE_INDEX_MASK, flags); + /* fallback to small page */ + if (!page || !PageTransHuge(page)) { + unsigned long offset; + offset = pos & ~PAGE_CACHE_MASK; + len = min_t(unsigned long, + len, PAGE_CACHE_SIZE - offset); + } + } + if (!page) + page = grab_cache_page_write_begin(mapping, index, flags); if (!page) return -ENOMEM; - *pagep = page; - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) { - unsigned from = pos & (PAGE_CACHE_SIZE - 1); - - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE); + if (!PageUptodate(page)) { + unsigned from; + + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) { + from = pos & ~HPAGE_PMD_MASK; + zero_huge_user_segments(page, 0, from, + from + len, HPAGE_PMD_SIZE); + } else if (len != PAGE_CACHE_SIZE) { + from = pos & ~PAGE_CACHE_MASK; + zero_user_segments(page, 0, from, + from + len, PAGE_CACHE_SIZE); + } } return 0; } @@ -443,9 +468,14 @@ int simple_write_end(struct file *file, struct address_space *mapping, /* zero the stale part of the page if we did a short copy */ if (copied < len) { - unsigned from = pos & (PAGE_CACHE_SIZE - 1); - - zero_user(page, from + copied, len - copied); + unsigned from; + if (PageTransHuge(page)) { + from = pos & ~HPAGE_PMD_MASK; + zero_huge_user(page, from + copied, len - copied); + } else { + from = pos & ~PAGE_CACHE_MASK; + zero_user(page, from + copied, len - copied); + } } if (!PageUptodate(page)) -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756744Ab3A1JZw (ORCPT ); Mon, 28 Jan 2013 04:25:52 -0500 Received: from mga01.intel.com ([192.55.52.88]:5042 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756381Ab3A1JXj (ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; 
d="scan'208";a="282975162" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 13/16] thp: handle file pages in split_huge_page() Date: Mon, 28 Jan 2013 11:24:25 +0200 Message-Id: <1359365068-10147-14-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" The basic scheme is the same as for anonymous pages, but we walk mapping->i_mmap rather than anon_vma->rb_root. __split_huge_page_refcount() has been tuned a bit: we need to transfer PG_swapbacked to tail pages. Splitting mapped pages hasn't been tested at all, since we cannot mmap() file-backed huge pages yet. Signed-off-by: Kirill A. 
Shutemov --- mm/huge_memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 53 insertions(+), 9 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c63a21d..008b2c9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1613,7 +1613,8 @@ static void __split_huge_page_refcount(struct page *page) ((1L << PG_referenced) | (1L << PG_swapbacked) | (1L << PG_mlocked) | - (1L << PG_uptodate))); + (1L << PG_uptodate) | + (1L << PG_swapbacked))); page_tail->flags |= (1L << PG_dirty); /* clear PageTail before overwriting first_page */ @@ -1641,10 +1642,8 @@ static void __split_huge_page_refcount(struct page *page) page_tail->index = page->index + i; page_xchg_last_nid(page_tail, page_last_nid(page)); - BUG_ON(!PageAnon(page_tail)); BUG_ON(!PageUptodate(page_tail)); BUG_ON(!PageDirty(page_tail)); - BUG_ON(!PageSwapBacked(page_tail)); lru_add_page_tail(page, page_tail, lruvec); } @@ -1752,7 +1751,7 @@ static int __split_huge_page_map(struct page *page, } /* must be called with anon_vma->root->rwsem held */ -static void __split_huge_page(struct page *page, +static void __split_anon_huge_page(struct page *page, struct anon_vma *anon_vma) { int mapcount, mapcount2; @@ -1799,14 +1798,11 @@ static void __split_huge_page(struct page *page, BUG_ON(mapcount != mapcount2); } -int split_huge_page(struct page *page) +static int split_anon_huge_page(struct page *page) { struct anon_vma *anon_vma; int ret = 1; - BUG_ON(is_huge_zero_pfn(page_to_pfn(page))); - BUG_ON(!PageAnon(page)); - /* * The caller does not necessarily hold an mmap_sem that would prevent * the anon_vma disappearing so we first we take a reference to it @@ -1824,7 +1820,7 @@ int split_huge_page(struct page *page) goto out_unlock; BUG_ON(!PageSwapBacked(page)); - __split_huge_page(page, anon_vma); + __split_anon_huge_page(page, anon_vma); count_vm_event(THP_SPLIT); BUG_ON(PageCompound(page)); @@ -1835,6 +1831,54 @@ out: return ret; } +static int split_file_huge_page(struct 
page *page) +{ + struct address_space *mapping = page->mapping; + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + struct vm_area_struct *vma; + int mapcount, mapcount2; + + BUG_ON(!PageHead(page)); + BUG_ON(PageTail(page)); + + mutex_lock(&mapping->i_mmap_mutex); + mapcount = 0; + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + unsigned long addr = vma_address(page, vma); + mapcount += __split_huge_page_splitting(page, vma, addr); + } + + if (mapcount != page_mapcount(page)) + printk(KERN_ERR "mapcount %d page_mapcount %d\n", + mapcount, page_mapcount(page)); + BUG_ON(mapcount != page_mapcount(page)); + + __split_huge_page_refcount(page); + + mapcount2 = 0; + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + unsigned long addr = vma_address(page, vma); + mapcount2 += __split_huge_page_map(page, vma, addr); + } + + if (mapcount != mapcount2) + printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n", + mapcount, mapcount2, page_mapcount(page)); + BUG_ON(mapcount != mapcount2); + mutex_unlock(&mapping->i_mmap_mutex); + return 0; +} + +int split_huge_page(struct page *page) +{ + BUG_ON(is_huge_zero_pfn(page_to_pfn(page))); + + if (PageAnon(page)) + return split_anon_huge_page(page); + else + return split_file_huge_page(page); +} + #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE) int hugepage_madvise(struct vm_area_struct *vma, -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756569Ab3A1JZo (ORCPT ); Mon, 28 Jan 2013 04:25:44 -0500 Received: from mga03.intel.com ([143.182.124.21]:49106 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756423Ab3A1JXj (ORCPT ); Mon, 28 Jan 2013 04:23:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="249123726" From: "Kirill A. 
Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 14/16] thp, mm: truncate support for transparent huge page cache Date: Mon, 28 Jan 2013 11:24:26 +0200 Message-Id: <1359365068-10147-15-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" If the starting position of the truncation is in a tail page, we have to split the huge page first. We also have to split if the end is within the huge page. Otherwise we can truncate the whole huge page at once. Signed-off-by: Kirill A. 
Shutemov --- mm/truncate.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/truncate.c b/mm/truncate.c index c75b736..87c247d 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping, if (index > end) break; + /* split page if we start from tail page */ + if (PageTransTail(page)) + split_huge_page(compound_trans_head(page)); + if (PageTransHuge(page)) { + /* split if end is within huge page */ + if (index == (end & ~HPAGE_CACHE_INDEX_MASK)) + split_huge_page(page); + else + /* skip tail pages */ + i += HPAGE_CACHE_NR - 1; + } if (!trylock_page(page)) continue; WARN_ON(page->index != index); @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping, if (index > end) break; + VM_BUG_ON(PageTransHuge(page)); lock_page(page); WARN_ON(page->index != index); wait_on_page_writeback(page); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756815Ab3A1J1j (ORCPT ); Mon, 28 Jan 2013 04:27:39 -0500 Received: from mga01.intel.com ([192.55.52.88]:60169 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756263Ab3A1JXi (ORCPT ); Mon, 28 Jan 2013 04:23:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="282975151" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCH, RFC 09/16] thp, mm: handle tail pages in page_cache_get_speculative() Date: Mon, 28 Jan 2013 11:24:21 +0200 Message-Id: <1359365068-10147-10-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" For tail page we call __get_page_tail(). It has the same semantics, but for tail page. Signed-off-by: Kirill A. Shutemov --- include/linux/pagemap.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 0e38e13..1da2043 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -149,6 +149,9 @@ static inline int page_cache_get_speculative(struct page *page) { VM_BUG_ON(in_interrupt()); + if (unlikely(PageTail(page))) + return __get_page_tail(page); + #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) # ifdef CONFIG_PREEMPT_COUNT VM_BUG_ON(!in_atomic()); @@ -175,7 +178,6 @@ static inline int page_cache_get_speculative(struct page *page) return 0; } #endif - VM_BUG_ON(PageTail(page)); return 1; } -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756714Ab3A1J1g (ORCPT ); Mon, 28 Jan 2013 04:27:36 -0500 Received: from mga03.intel.com ([143.182.124.21]:49106 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756321Ab3A1JXi (ORCPT ); Mon, 28 Jan 2013 04:23:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="249123707" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. 
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 05/16] thp, mm: basic defines for transparent huge page cache Date: Mon, 28 Jan 2013 11:24:17 +0200 Message-Id: <1359365068-10147-6-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Signed-off-by: Kirill A. Shutemov --- include/linux/huge_mm.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ee1c244..a54939c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page, #define HPAGE_PMD_MASK HPAGE_MASK #define HPAGE_PMD_SIZE HPAGE_SIZE +#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT) +#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER) +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1) + extern bool is_vma_temporary_stack(struct vm_area_struct *vma); #define transparent_hugepage_enabled(__vma) \ @@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; }) +#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; }) + #define hpage_nr_pages(x) 1 #define transparent_hugepage_enabled(__vma) 0 -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756451Ab3A1J2M (ORCPT ); Mon, 28 Jan 2013 04:28:12 -0500 Received: from mga03.intel.com ([143.182.124.21]:49106 "EHLO mga03.intel.com" 
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755981Ab3A1JXh (ORCPT ); Mon, 28 Jan 2013 04:23:37 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="195771584" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Date: Mon, 28 Jan 2013 11:24:18 +0200 Message-Id: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head page for the specified index and HPAGE_CACHE_NR-1 tail pages for following indexes. Signed-off-by: Kirill A. 
Shutemov --- mm/filemap.c | 75 +++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 53 insertions(+), 22 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index b6a6d7e..fa2fdab 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) { int error; + int nr = 1; VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) - goto out; + return error; - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); - if (error == 0) { - page_cache_get(page); - page->mapping = mapping; - page->index = offset; + if (PageTransHuge(page)) { + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR); + nr = HPAGE_CACHE_NR; + } + error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM); + if (error) { + mem_cgroup_uncharge_cache_page(page); + return error; + } - spin_lock_irq(&mapping->tree_lock); - error = radix_tree_insert(&mapping->page_tree, offset, page); - if (likely(!error)) { - mapping->nrpages++; - __inc_zone_page_state(page, NR_FILE_PAGES); - spin_unlock_irq(&mapping->tree_lock); - } else { - page->mapping = NULL; - /* Leave page->index set: truncation relies upon it */ - spin_unlock_irq(&mapping->tree_lock); - mem_cgroup_uncharge_cache_page(page); - page_cache_release(page); + page_cache_get(page); + spin_lock_irq(&mapping->tree_lock); + page->mapping = mapping; + if (PageTransHuge(page)) { + int i; + for (i = 0; i < HPAGE_CACHE_NR; i++) { + page_cache_get(page + i); + page[i].index = offset + i; + error = radix_tree_insert(&mapping->page_tree, + offset + i, page + i); + if (error) { + page_cache_release(page + i); + break; + } } - radix_tree_preload_end(); - } else - mem_cgroup_uncharge_cache_page(page); -out: + if (error) { + if (i 
> 0 && error == EEXIST) + error = ENOSPC; /* no space for a huge page */ + for (i--; i > 0; i--) { + page_cache_release(page + i); + radix_tree_delete(&mapping->page_tree, + offset + i); + } + goto err; + } + } else { + page->index = offset; + error = radix_tree_insert(&mapping->page_tree, offset, page); + if (unlikely(error)) + goto err; + } + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr); + mapping->nrpages += nr; + spin_unlock_irq(&mapping->tree_lock); + radix_tree_preload_end(); + return 0; +err: + page->mapping = NULL; + /* Leave page->index set: truncation relies upon it */ + spin_unlock_irq(&mapping->tree_lock); + radix_tree_preload_end(); + mem_cgroup_uncharge_cache_page(page); + page_cache_release(page); return error; } EXPORT_SYMBOL(add_to_page_cache_locked); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756690Ab3A1J2e (ORCPT ); Mon, 28 Jan 2013 04:28:34 -0500 Received: from mga03.intel.com ([143.182.124.21]:49106 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755935Ab3A1JXf (ORCPT ); Mon, 28 Jan 2013 04:23:35 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,550,1355126400"; d="scan'208";a="249123641" From: "Kirill A. Shutemov" To: Andrea Arcangeli , Andrew Morton , Al Viro Cc: Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCH, RFC 04/16] radix-tree: implement preload for multiple contiguous elements Date: Mon, 28 Jan 2013 11:24:16 +0200 Message-Id: <1359365068-10147-5-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Currently radix_tree_preload() only guarantees enough nodes to insert one element. It's a hard limit. You cannot batch a number of inserts under one tree_lock. This patch introduces radix_tree_preload_count(). It allows preallocating enough nodes to insert a number of *contiguous* elements. Signed-off-by: Matthew Wilcox Signed-off-by: Kirill A. Shutemov --- include/linux/radix-tree.h | 3 +++ lib/radix-tree.c | 32 +++++++++++++++++++++++++------- 2 files changed, 28 insertions(+), 7 deletions(-) diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index ffc444c..81318cb 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -83,6 +83,8 @@ do { \ (root)->rnode = NULL; \ } while (0) +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */ + /** * Radix-tree synchronization * @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long radix_tree_prev_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); int radix_tree_preload(gfp_t gfp_mask); +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index e796429..9bef0ac 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep; * The worst case is a zero 
height tree with just a single item at index 0, * and then inserting an item at index ULONG_MAX. This requires 2 new branches * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. + * + * Worst case for adding N contiguous items is adding entries at indexes + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case + * item plus extra nodes if you cross the boundary from one node to the next. + * * Hence: */ -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MAX \ + (RADIX_TREE_PRELOAD_MIN + \ + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE)) /* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { int nr; - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE]; + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX]; }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node) /* * Load up this CPU's radix_tree_node buffer with sufficient objects to - * ensure that the addition of a single element in the tree cannot fail. On - * success, return zero, with preemption disabled. On error, return -ENOMEM + * ensure that the addition of *contiguous* elements in the tree cannot fail. + * On success, return zero, with preemption disabled. On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_WAIT being passed to INIT_RADIX_TREE(). 
*/ -int radix_tree_preload(gfp_t gfp_mask) +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask) { struct radix_tree_preload *rtp; struct radix_tree_node *node; int ret = -ENOMEM; + int alloc = RADIX_TREE_PRELOAD_MIN + + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE); + + if (size > RADIX_TREE_PRELOAD_NR) + return -ENOMEM; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) { + while (rtp->nr < alloc) { preempt_enable(); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - if (rtp->nr < ARRAY_SIZE(rtp->nodes)) + if (rtp->nr < alloc) rtp->nodes[rtp->nr++] = node; else kmem_cache_free(radix_tree_node_cachep, node); @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask) out: return ret; } + +int radix_tree_preload(gfp_t gfp_mask) +{ + return radix_tree_preload_count(1, gfp_mask); +} EXPORT_SYMBOL(radix_tree_preload); /* -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752260Ab3A2FDr (ORCPT ); Tue, 29 Jan 2013 00:03:47 -0500 Received: from mail-pb0-f50.google.com ([209.85.160.50]:57757 "EHLO mail-pb0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751073Ab3A2FDo (ORCPT ); Tue, 29 Jan 2013 00:03:44 -0500 Date: Mon, 28 Jan 2013 21:03:41 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: "Kirill A. Shutemov" cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. 
Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: > From: "Kirill A. Shutemov" > > Here's first steps towards huge pages in page cache. > > The intend of the work is get code ready to enable transparent huge page > cache for the most simple fs -- ramfs. > > It's not yet near feature-complete. It only provides basic infrastructure. > At the moment we can read, write and truncate file on ramfs with huge pages in > page cache. The most interesting part, mmap(), is not yet there. For now > we split huge page on mmap() attempt. > > I can't say that I see whole picture. I'm not sure if I understand locking > model around split_huge_page(). Probably, not. > Andrea, could you check if it looks correct? > > Next steps (not necessary in this order): > - mmap(); > - migration (?); > - collapse; > - stats, knobs, etc.; > - tmpfs/shmem enabling; > - ... > > Kirill A. 
Shutemov (16): > block: implement add_bdi_stat() > mm: implement zero_huge_user_segment and friends > mm: drop actor argument of do_generic_file_read() > radix-tree: implement preload for multiple contiguous elements > thp, mm: basic defines for transparent huge page cache > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > thp, mm: rewrite delete_from_page_cache() to support huge pages > thp, mm: locking tail page is a bug > thp, mm: handle tail pages in page_cache_get_speculative() > thp, mm: implement grab_cache_huge_page_write_begin() > thp, mm: naive support of thp in generic read/write routines > thp, libfs: initial support of thp in > simple_read/write_begin/write_end > thp: handle file pages in split_huge_page() > thp, mm: truncate support for transparent huge page cache > thp, mm: split huge page on mmap file page > ramfs: enable transparent huge page cache > > fs/libfs.c | 54 +++++++++--- > fs/ramfs/inode.c | 6 +- > include/linux/backing-dev.h | 10 +++ > include/linux/huge_mm.h | 8 ++ > include/linux/mm.h | 15 ++++ > include/linux/pagemap.h | 14 ++- > include/linux/radix-tree.h | 3 + > lib/radix-tree.c | 32 +++++-- > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 62 +++++++++++-- > mm/memory.c | 22 +++++ > mm/truncate.c | 12 +++ > 12 files changed, 375 insertions(+), 67 deletions(-) Interesting. I was starting to think about Transparent Huge Pagecache a few months ago, but then got washed away by incoming waves as usual. Certainly I don't have a line of code to show for it; but my first impression of your patches is that we have very different ideas of where to start. Perhaps that's good complementarity, or perhaps I'll disagree with your approach. I'll be taking a look at yours in the coming days, and trying to summon back up my own ideas to summarize them for you. 
Perhaps I was naive to imagine it, but I did intend to start out generically, independent of filesystem; but content to narrow down on tmpfs alone where it gets hard to support the others (writeback springs to mind). khugepaged would be migrating little pages into huge pages, where it saw that the mmaps of the file would benefit (and for testing I would hack mmap alignment choice to favour it). I had arrived at a conviction that the first thing to change was the way that tail pages of a THP are refcounted, that it had been a mistake to use the compound page method of holding the THP together. But I'll have to enter a trance now to recall the arguments ;) Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756655Ab3A2MLF (ORCPT ); Tue, 29 Jan 2013 07:11:05 -0500 Received: from mail-oa0-f46.google.com ([209.85.219.46]:36124 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756080Ab3A2MLD (ORCPT ); Tue, 29 Jan 2013 07:11:03 -0500 MIME-Version: 1.0 In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Date: Tue, 29 Jan 2013 20:11:01 +0800 Message-ID: Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages From: Hillf Danton To: "Kirill A. Shutemov" Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. 
Shutemov wrote: > @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > pgoff_t offset, gfp_t gfp_mask) > { > int error; > + int nr = 1; > > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > error = mem_cgroup_cache_charge(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > - goto out; > + return error; Due to PageCompound check, thp could not be charged effectively. Any change added for charging it? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756561Ab3A2MOH (ORCPT ); Tue, 29 Jan 2013 07:14:07 -0500 Received: from mail-oa0-f52.google.com ([209.85.219.52]:59884 "EHLO mail-oa0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751624Ab3A2MOG (ORCPT ); Tue, 29 Jan 2013 07:14:06 -0500 MIME-Version: 1.0 In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Date: Tue, 29 Jan 2013 20:14:04 +0800 Message-ID: Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages From: Hillf Danton To: "Kirill A. Shutemov" Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. 
Shutemov wrote: > + page_cache_get(page); > + spin_lock_irq(&mapping->tree_lock); > + page->mapping = mapping; > + if (PageTransHuge(page)) { > + int i; > + for (i = 0; i < HPAGE_CACHE_NR; i++) { > + page_cache_get(page + i); Page count is raised twice for head page, why? > + page[i].index = offset + i; > + error = radix_tree_insert(&mapping->page_tree, > + offset + i, page + i); > + if (error) { > + page_cache_release(page + i); > + break; > + } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756024Ab3A2M0N (ORCPT ); Tue, 29 Jan 2013 07:26:13 -0500 Received: from mail-ob0-f177.google.com ([209.85.214.177]:57772 "EHLO mail-ob0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752165Ab3A2M0K (ORCPT ); Tue, 29 Jan 2013 07:26:10 -0500 MIME-Version: 1.0 In-Reply-To: <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Date: Tue, 29 Jan 2013 20:26:09 +0800 Message-ID: Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages From: Hillf Danton To: "Kirill A. Shutemov" Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. 
Shutemov wrote: > + page_cache_get(page); > + spin_lock_irq(&mapping->tree_lock); > + page->mapping = mapping; > + if (PageTransHuge(page)) { > + int i; > + for (i = 0; i < HPAGE_CACHE_NR; i++) { > + page_cache_get(page + i); > + page[i].index = offset + i; > + error = radix_tree_insert(&mapping->page_tree, > + offset + i, page + i); > + if (error) { > + page_cache_release(page + i); > + break; > + } Is page count balanced with the following? @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page) if (freepage) freepage(page); + if (PageTransHuge(page)) + for (i = 1; i < HPAGE_CACHE_NR; i++) + page_cache_release(page); page_cache_release(page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756772Ab3A2NA4 (ORCPT ); Tue, 29 Jan 2013 08:00:56 -0500 Received: from mga11.intel.com ([192.55.52.93]:1154 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755802Ab3A2NAy (ORCPT ); Tue, 29 Jan 2013 08:00:54 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,559,1355126400"; d="scan'208";a="279764473" Date: Tue, 29 Jan 2013 15:01:27 +0200 From: "Kirill A. Shutemov" To: Hillf Danton , "Kirill A. Shutemov" Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel , LKML Message-ID: <5107c827d855d_f167d78c842627@blue.mail> In-Reply-To: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <1359365068-10147-7-git-send-email-kirill.shutemov@linux.intel.com> Subject: Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hillf Danton wrote: > On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. 
Shutemov > wrote: > > @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > pgoff_t offset, gfp_t gfp_mask) > > { > > int error; > > + int nr = 1; > > > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > error = mem_cgroup_cache_charge(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > - goto out; > > + return error; > > Due to PageCompound check, thp could not be charged effectively. > Any change added for charging it? I've missed this. Will fix. -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756833Ab3A2N2q (ORCPT ); Tue, 29 Jan 2013 08:28:46 -0500 Received: from mga14.intel.com ([143.182.124.37]:24920 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751681Ab3A2N2m (ORCPT ); Tue, 29 Jan 2013 08:28:42 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,559,1355126400"; d="scan'208";a="196248744" Date: Tue, 29 Jan 2013 15:14:58 +0200 From: "Kirill A. Shutemov" To: Hugh Dickins , "Kirill A. Shutemov" Cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Message-ID: <5107cb52e07b1_376199eb7059997@blue.mail> In-Reply-To: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hugh Dickins wrote: > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: > > From: "Kirill A. 
Shutemov" > > > > Here's first steps towards huge pages in page cache. > > > > The intend of the work is get code ready to enable transparent huge page > > cache for the most simple fs -- ramfs. > > > > It's not yet near feature-complete. It only provides basic infrastructure. > > At the moment we can read, write and truncate file on ramfs with huge pages in > > page cache. The most interesting part, mmap(), is not yet there. For now > > we split huge page on mmap() attempt. > > > > I can't say that I see whole picture. I'm not sure if I understand locking > > model around split_huge_page(). Probably, not. > > Andrea, could you check if it looks correct? > > > > Next steps (not necessary in this order): > > - mmap(); > > - migration (?); > > - collapse; > > - stats, knobs, etc.; > > - tmpfs/shmem enabling; > > - ... > > > > Kirill A. Shutemov (16): > > block: implement add_bdi_stat() > > mm: implement zero_huge_user_segment and friends > > mm: drop actor argument of do_generic_file_read() > > radix-tree: implement preload for multiple contiguous elements > > thp, mm: basic defines for transparent huge page cache > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > > thp, mm: rewrite delete_from_page_cache() to support huge pages > > thp, mm: locking tail page is a bug > > thp, mm: handle tail pages in page_cache_get_speculative() > > thp, mm: implement grab_cache_huge_page_write_begin() > > thp, mm: naive support of thp in generic read/write routines > > thp, libfs: initial support of thp in > > simple_read/write_begin/write_end > > thp: handle file pages in split_huge_page() > > thp, mm: truncate support for transparent huge page cache > > thp, mm: split huge page on mmap file page > > ramfs: enable transparent huge page cache > > > > fs/libfs.c | 54 +++++++++--- > > fs/ramfs/inode.c | 6 +- > > include/linux/backing-dev.h | 10 +++ > > include/linux/huge_mm.h | 8 ++ > > include/linux/mm.h | 15 ++++ > > include/linux/pagemap.h | 14 ++- > > 
include/linux/radix-tree.h | 3 + > > lib/radix-tree.c | 32 +++++-- > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > > mm/huge_memory.c | 62 +++++++++++-- > > mm/memory.c | 22 +++++ > > mm/truncate.c | 12 +++ > > 12 files changed, 375 insertions(+), 67 deletions(-) > > Interesting. > > I was starting to think about Transparent Huge Pagecache a few > months ago, but then got washed away by incoming waves as usual. > > Certainly I don't have a line of code to show for it; but my first > impression of your patches is that we have very different ideas of > where to start. > > Perhaps that's good complementarity, or perhaps I'll disagree with > your approach. I'll be taking a look at yours in the coming days, > and trying to summon back up my own ideas to summarize them for you. Yeah, it would be nice to see alternative design ideas. Looking forward to it. > Perhaps I was naive to imagine it, but I did intend to start out > generically, independent of filesystem; but content to narrow down > on tmpfs alone where it gets hard to support the others (writeback > springs to mind). khugepaged would be migrating little pages into > huge pages, where it saw that the mmaps of the file would benefit > (and for testing I would hack mmap alignment choice to favour it). I don't think all fs at once would fly, but it would be wonderful if I'm wrong :) > I had arrived at a conviction that the first thing to change was > the way that tail pages of a THP are refcounted, that it had been a > mistake to use the compound page method of holding the THP together. > But I'll have to enter a trance now to recall the arguments ;) THP refcounting looks reasonable to me, if you take split_huge_page() into account. -- Kirill A.
Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755938Ab3AaCMK (ORCPT ); Wed, 30 Jan 2013 21:12:10 -0500 Received: from mail-da0-f48.google.com ([209.85.210.48]:59951 "EHLO mail-da0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753485Ab3AaCMG (ORCPT ); Wed, 30 Jan 2013 21:12:06 -0500 Date: Wed, 30 Jan 2013 18:12:05 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: "Kirill A. Shutemov" cc: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache In-Reply-To: <5107cb52e07b1_376199eb7059997@blue.mail> Message-ID: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: > Hugh Dickins wrote: > > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: > > > From: "Kirill A. Shutemov" > > > > > > Here's first steps towards huge pages in page cache. > > > > > > The intend of the work is get code ready to enable transparent huge page > > > cache for the most simple fs -- ramfs. > > > > > > It's not yet near feature-complete. It only provides basic infrastructure. > > > At the moment we can read, write and truncate file on ramfs with huge pages in > > > page cache. The most interesting part, mmap(), is not yet there. For now > > > we split huge page on mmap() attempt. > > > > > > I can't say that I see whole picture. I'm not sure if I understand locking > > > model around split_huge_page(). Probably, not. 
> > > Andrea, could you check if it looks correct? > > > > > > Next steps (not necessary in this order): > > > - mmap(); > > > - migration (?); > > > - collapse; > > > - stats, knobs, etc.; > > > - tmpfs/shmem enabling; > > > - ... > > > > > > Kirill A. Shutemov (16): > > > block: implement add_bdi_stat() > > > mm: implement zero_huge_user_segment and friends > > > mm: drop actor argument of do_generic_file_read() > > > radix-tree: implement preload for multiple contiguous elements > > > thp, mm: basic defines for transparent huge page cache > > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > > > thp, mm: rewrite delete_from_page_cache() to support huge pages > > > thp, mm: locking tail page is a bug > > > thp, mm: handle tail pages in page_cache_get_speculative() > > > thp, mm: implement grab_cache_huge_page_write_begin() > > > thp, mm: naive support of thp in generic read/write routines > > > thp, libfs: initial support of thp in > > > simple_read/write_begin/write_end > > > thp: handle file pages in split_huge_page() > > > thp, mm: truncate support for transparent huge page cache > > > thp, mm: split huge page on mmap file page > > > ramfs: enable transparent huge page cache > > > > > > fs/libfs.c | 54 +++++++++--- > > > fs/ramfs/inode.c | 6 +- > > > include/linux/backing-dev.h | 10 +++ > > > include/linux/huge_mm.h | 8 ++ > > > include/linux/mm.h | 15 ++++ > > > include/linux/pagemap.h | 14 ++- > > > include/linux/radix-tree.h | 3 + > > > lib/radix-tree.c | 32 +++++-- > > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > > > mm/huge_memory.c | 62 +++++++++++-- > > > mm/memory.c | 22 +++++ > > > mm/truncate.c | 12 +++ > > > 12 files changed, 375 insertions(+), 67 deletions(-) > > > > Interesting. > > > > I was starting to think about Transparent Huge Pagecache a few > > months ago, but then got washed away by incoming waves as usual. 
> > > > Certainly I don't have a line of code to show for it; but my first > > impression of your patches is that we have very different ideas of > > where to start. A second impression confirms that we have very different ideas of where to start. I don't want to be dismissive, and please don't let me discourage you, but I just don't find what you have very interesting. I'm sure you'll agree that the interesting part, and the difficult part, comes with mmap(); and there's no point whatever to THPages without mmap() (of course, I'm including exec and brk and shm when I say mmap there). (There may be performance benefits in working with larger page cache size, which Christoph Lameter explored a few years back, but that's a different topic: I think 2MB - if I may be x86_64-centric - would not be the unit of choice for that, unless SSD erase block were to dominate.) I'm interested to get to the point of prototyping something that does support mmap() of THPageCache: I'm pretty sure that I'd then soon learn a lot about my misconceptions, and have to rework for a while (or give up!); but I don't see much point in posting anything without that. I don't know if we have 5 or 50 places which "know" that a THPage must be Anon: some I'll spot in advance, some I sadly won't. It's not clear to me that the infrastructural changes you make in this series will be needed or not, if I pursue my approach: some perhaps as optimizations on top of the poorly performing base that may emerge from going about it my way. But for me it's too soon to think about those. Something I notice that we do agree upon: the radix_tree holding the 4k subpages, at least for now. 
When I first started thinking towards THPageCache, I was fascinated by how we could manage the hugepages in the radix_tree, cutting out unnecessary levels etc; but after a while I realized that although there's probably nice scope for cleverness there (significantly constrained by RCU expectations), it would only be about optimization. Let's be simple and stupid about radix_tree for now, the problems that need to be worked out lie elsewhere. > > > > Perhaps that's good complementarity, or perhaps I'll disagree with > > your approach. I'll be taking a look at yours in the coming days, > > and trying to summon back up my own ideas to summarize them for you. > > Yeah, it would be nice to see alternative design ideas. Looking forward. > > > Perhaps I was naive to imagine it, but I did intend to start out > > generically, independent of filesystem; but content to narrow down > > on tmpfs alone where it gets hard to support the others (writeback > > springs to mind). khugepaged would be migrating little pages into > > huge pages, where it saw that the mmaps of the file would benefit > > (and for testing I would hack mmap alignment choice to favour it). > > I don't think all fs at once would fly, but it's wonderful, if I'm > wrong :) You are imagining the filesystem putting huge pages into its cache. Whereas I'm imagining khugepaged looking around at mmaped file areas, seeing which would benefit from huge pagecache (let's assume offset 0 belongs on hugepage boundary - maybe one day someone will want to tune some files or parts differently, but that's low priority), migrating 4k pages over to 2MB page (wouldn't have to be done all in one pass), then finally slotting in the pmds for that. But going this way, I expect we'd have to split at page_mkwrite(): we probably don't want a single touch to dirty 2MB at a time, unless tmpfs or ramfs. 
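The alignment assumption in the paragraph above (offset 0 of the file sits on a hugepage boundary) can be made concrete with a small userspace sketch. This is an illustration only, not code from this series or from any kernel tree: struct file_mapping and huge_extents() are hypothetical names, and the 4k/2MB sizes are the x86_64 values used in the discussion. Under those assumptions, a 2MB extent of a mapping could be covered by one pmd only if the virtual address and the file offset are congruent modulo 2MB and the extent lies wholly inside the mapping.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12                     /* 4k base pages */
#define HPAGE_SHIFT  21                     /* 2MB huge pages, x86_64-centric */
#define HPAGE_SIZE   (1ULL << HPAGE_SHIFT)
#define HPAGE_MASK   (~(HPAGE_SIZE - 1))

/* Hypothetical stand-in for the few vma fields that matter here. */
struct file_mapping {
	uint64_t vm_start;  /* first mapped virtual address */
	uint64_t vm_end;    /* one past the last mapped virtual address */
	uint64_t vm_pgoff;  /* file offset of vm_start, in 4k pages */
};

/*
 * Count the 2MB extents of the mapping that one pmd each could cover,
 * assuming file offset 0 belongs on a hugepage boundary.
 * On a non-zero result, *first is set to the virtual address of the
 * first such extent.
 */
static unsigned long huge_extents(const struct file_mapping *m, uint64_t *first)
{
	uint64_t file_off = m->vm_pgoff << PAGE_SHIFT;
	uint64_t start, end;

	assert(m->vm_start <= m->vm_end);

	/* Address and file offset must agree modulo 2MB. */
	if ((m->vm_start - file_off) & (HPAGE_SIZE - 1))
		return 0;

	/* Shrink the mapping to whole, aligned 2MB extents. */
	start = (m->vm_start + HPAGE_SIZE - 1) & HPAGE_MASK;
	end = m->vm_end & HPAGE_MASK;
	if (end <= start)
		return 0;

	*first = start;
	return (unsigned long)((end - start) >> HPAGE_SHIFT);
}
```

A mapping whose start address is off by one 4k page relative to its file offset yields no candidate extents at all, which is why hacking the mmap alignment choice would help testing.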
> > > I had arrived at a conviction that the first thing to change was > > the way that tail pages of a THP are refcounted, that it had been a > > mistake to use the compound page method of holding the THP together. > > But I'll have to enter a trance now to recall the arguments ;) > > THP refcounting looks reasonable for me, if take split_huge_page() in > account. I'm not claiming that the THP refcounting is wrong in what it's doing at present; but that I suspect we'll want to rework it for THPageCache. Something I take for granted, I think you do too but I'm not certain: a file with transparent huge pages in its page cache can also have small pages in other extents of its page cache; and can be mapped hugely (2MB extents) into one address space at the same time as individual 4k pages from those extents are mapped into another (or the same) address space. One can certainly imagine sacrificing that principle, splitting whenever there's such a "conflict"; but it then becomes uninteresting to me, too much like hugetlbfs. Splitting an anonymous hugepage in all address spaces that hold it when one of them needs it split, that has been a pragmatic strategy: it's not a common case for forks to diverge like that; but files are expected to be more widely shared. At present THP is using compound pages, with mapcount of tail pages reused to track their contribution to head page count; but I think we shall want to be able to use the mapcount, and the count, of TH tail pages for their original purpose if huge mappings can coexist with tiny. Not fully thought out, but that's my feeling. The use of compound pages, in particular the redirection of tail page count to head page count, was important in hugetlbfs: a get_user_pages reference on a subpage must prevent the containing hugepage from being freed, because hugetlbfs has its own separate pool of hugepages to which freeing returns them. But for transparent huge pages? 
It should not matter so much if the subpages are freed independently. So I'd like to devise another glue to hold them together more loosely (for prototyping I can certainly pretend we have infinite pageflag and pagefield space if that helps): I may find in practice that they're forever falling apart, and I run crying back to compound pages; but at present I'm hoping not. This mail might suggest that I'm about to start coding: I wish that were true, but in reality there's always a lot of unrelated things I have to look at, which dilute my focus. So if I've said anything that sparks ideas for you, go with them. Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757891Ab3BBPMa (ORCPT ); Sat, 2 Feb 2013 10:12:30 -0500 Received: from mga03.intel.com ([143.182.124.21]:33167 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757722Ab3BBPM1 (ORCPT ); Sat, 2 Feb 2013 10:12:27 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.84,591,1355126400"; d="scan'208";a="251835427" From: "Kirill A. Shutemov" To: Hugh Dickins , Andrea Arcangeli Cc: "Kirill A. Shutemov" , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org In-Reply-To: References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache Content-Transfer-Encoding: 7bit Message-Id: <20130202151337.5C454E0073@blue.fi.intel.com> Date: Sat, 2 Feb 2013 17:13:37 +0200 (EET) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hugh Dickins wrote: > On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: > > Hugh Dickins wrote: > > > > > > Interesting. 
> > > > > > I was starting to think about Transparent Huge Pagecache a few > > > months ago, but then got washed away by incoming waves as usual. > > > > > > Certainly I don't have a line of code to show for it; but my first > > > impression of your patches is that we have very different ideas of > > > where to start. > > A second impression confirms that we have very different ideas of > where to start. I don't want to be dismissive, and please don't let > me discourage you, but I just don't find what you have very interesting. The main reason for publishing the patchset in its current (not-really-useful) state is to start the discussion early. Looks like it works :) > I'm sure you'll agree that the interesting part, and the difficult part, > comes with mmap(); and there's no point whatever to THPages without mmap() > (of course, I'm including exec and brk and shm when I say mmap there). > > (There may be performance benefits in working with larger page cache > size, which Christoph Lameter explored a few years back, but that's a > different topic: I think 2MB - if I may be x86_64-centric - would not be > the unit of choice for that, unless SSD erase block were to dominate.) > > I'm interested to get to the point of prototyping something that does > support mmap() of THPageCache: I'm pretty sure that I'd then soon learn > a lot about my misconceptions, and have to rework for a while (or give > up!); but I don't see much point in posting anything without that. > I don't know if we have 5 or 50 places which "know" that a THPage > must be Anon: some I'll spot in advance, some I sadly won't. > > It's not clear to me that the infrastructural changes you make in this > series will be needed or not, if I pursue my approach: some perhaps as > optimizations on top of the poorly performing base that may emerge from > going about it my way. But for me it's too soon to think about those.
> > Something I notice that we do agree upon: the radix_tree holding the > 4k subpages, at least for now. When I first started thinking towards > THPageCache, I was fascinated by how we could manage the hugepages in > the radix_tree, cutting out unnecessary levels etc; but after a while > I realized that although there's probably nice scope for cleverness > there (significantly constrained by RCU expectations), it would only > be about optimization. One more point: you still have to reserve memory for these levels anyway, since split_huge_page() must never fail. > Let's be simple and stupid about radix_tree > for now, the problems that need to be worked out lie elsewhere. > > > > > > > Perhaps that's good complementarity, or perhaps I'll disagree with > > > your approach. I'll be taking a look at yours in the coming days, > > > and trying to summon back up my own ideas to summarize them for you. > > > > Yeah, it would be nice to see alternative design ideas. Looking forward. > > > > > Perhaps I was naive to imagine it, but I did intend to start out > > > generically, independent of filesystem; but content to narrow down > > > on tmpfs alone where it gets hard to support the others (writeback > > > springs to mind). khugepaged would be migrating little pages into > > > huge pages, where it saw that the mmaps of the file would benefit > > > (and for testing I would hack mmap alignment choice to favour it). > > > > I don't think all fs at once would fly, but it's wonderful, if I'm > > wrong :) > > You are imagining the filesystem putting huge pages into its cache. > Whereas I'm imagining khugepaged looking around at mmaped file areas, > seeing which would benefit from huge pagecache (let's assume offset 0 > belongs on hugepage boundary - maybe one day someone will want to tune > some files or parts differently, but that's low priority), migrating 4k > pages over to 2MB page (wouldn't have to be done all in one pass), then > finally slotting in the pmds for that.
I had file huge page consolidation on my todo list, but for much later. I feel that our approaches are complementary. > But going this way, I expect we'd have to split at page_mkwrite(): > we probably don't want a single touch to dirty 2MB at a time, > unless tmpfs or ramfs. Hm... Splitting is rather expensive. I think it makes sense for a fs with a backing device to consolidate only pages which are mapped without PROT_WRITE. This way we can avoid consolidate-split loops. > > > I had arrived at a conviction that the first thing to change was > > > the way that tail pages of a THP are refcounted, that it had been a > > > mistake to use the compound page method of holding the THP together. > > > But I'll have to enter a trance now to recall the arguments ;) > > > > THP refcounting looks reasonable for me, if take split_huge_page() in > > account. > > I'm not claiming that the THP refcounting is wrong in what it's doing > at present; but that I suspect we'll want to rework it for THPageCache. > > Something I take for granted, I think you do too but I'm not certain: > a file with transparent huge pages in its page cache can also have small > pages in other extents of its page cache; and can be mapped hugely (2MB > extents) into one address space at the same time as individual 4k pages > from those extents are mapped into another (or the same) address space. > > One can certainly imagine sacrificing that principle, splitting whenever > there's such a "conflict"; but it then becomes uninteresting to me, too > much like hugetlbfs. Splitting an anonymous hugepage in all address > spaces that hold it when one of them needs it split, that has been a > pragmatic strategy: it's not a common case for forks to diverge like > that; but files are expected to be more widely shared.
> > At present THP is using compound pages, with mapcount of tail pages > reused to track their contribution to head page count; but I think we > shall want to be able to use the mapcount, and the count, of TH tail > pages for their original purpose if huge mappings can coexist with tiny. > Not fully thought out, but that's my feeling. > > The use of compound pages, in particular the redirection of tail page > count to head page count, was important in hugetlbfs: a get_user_pages > reference on a subpage must prevent the containing hugepage from being > freed, because hugetlbfs has its own separate pool of hugepages to > which freeing returns them. > > But for transparent huge pages? It should not matter so much if the > subpages are freed independently. So I'd like to devise another glue > to hold them together more loosely (for prototyping I can certainly > pretend we have infinite pageflag and pagefield space if that helps): > I may find in practice that they're forever falling apart, and I run > crying back to compound pages; but at present I'm hoping not. Looks interesting. But I'm not sure whether it will work. It would be nice to summon Andrea to the thread. > This mail might suggest that I'm about to start coding: I wish that > were true, but in reality there's always a lot of unrelated things > I have to look at, which dilute my focus. So if I've said anything > that sparks ideas for you, go with them. I want to get my current approach working first. We'll see. -- Kirill A.
Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752935Ab3CRJgM (ORCPT ); Mon, 18 Mar 2013 05:36:12 -0400 Received: from mail-qc0-f181.google.com ([209.85.216.181]:56405 "EHLO mail-qc0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751677Ab3CRJgJ (ORCPT ); Mon, 18 Mar 2013 05:36:09 -0400 Message-ID: <5146E000.1070306@gmail.com> Date: Mon, 18 Mar 2013 17:36:00 +0800 From: Simon Jeons User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: "Kirill A. Shutemov" CC: Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> In-Reply-To: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote: > From: "Kirill A. Shutemov" > > Here's first steps towards huge pages in page cache. > > The intend of the work is get code ready to enable transparent huge page > cache for the most simple fs -- ramfs. > > It's not yet near feature-complete. It only provides basic infrastructure. > At the moment we can read, write and truncate file on ramfs with huge pages in > page cache. The most interesting part, mmap(), is not yet there. For now > we split huge page on mmap() attempt. > > I can't say that I see whole picture. I'm not sure if I understand locking > model around split_huge_page(). Probably, not. > Andrea, could you check if it looks correct? 
Another offline question: why isn't the PG_tail flag of the tail pages cleared in __split_huge_page_refcount()? > > Next steps (not necessary in this order): > - mmap(); > - migration (?); > - collapse; > - stats, knobs, etc.; > - tmpfs/shmem enabling; > - ... > > Kirill A. Shutemov (16): > block: implement add_bdi_stat() > mm: implement zero_huge_user_segment and friends > mm: drop actor argument of do_generic_file_read() > radix-tree: implement preload for multiple contiguous elements > thp, mm: basic defines for transparent huge page cache > thp, mm: rewrite add_to_page_cache_locked() to support huge pages > thp, mm: rewrite delete_from_page_cache() to support huge pages > thp, mm: locking tail page is a bug > thp, mm: handle tail pages in page_cache_get_speculative() > thp, mm: implement grab_cache_huge_page_write_begin() > thp, mm: naive support of thp in generic read/write routines > thp, libfs: initial support of thp in > simple_read/write_begin/write_end > thp: handle file pages in split_huge_page() > thp, mm: truncate support for transparent huge page cache > thp, mm: split huge page on mmap file page > ramfs: enable transparent huge page cache > > fs/libfs.c | 54 +++++++++--- > fs/ramfs/inode.c | 6 +- > include/linux/backing-dev.h | 10 +++ > include/linux/huge_mm.h | 8 ++ > include/linux/mm.h | 15 ++++ > include/linux/pagemap.h | 14 ++- > include/linux/radix-tree.h | 3 + > lib/radix-tree.c | 32 +++++-- > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 62 +++++++++++-- > mm/memory.c | 22 +++++ > mm/truncate.c | 12 +++ > 12 files changed, 375 insertions(+), 67 deletions(-) > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1765696Ab3DEA0p (ORCPT ); Thu, 4 Apr 2013 20:26:45 -0400 Received: from mail-oa0-f48.google.com ([209.85.219.48]:39881 "EHLO mail-oa0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1764941Ab3DEA0n (ORCPT
); Thu, 4 Apr 2013 20:26:43 -0400 Message-ID: <515E1A3B.70508@gmail.com> Date: Fri, 05 Apr 2013 08:26:35 +0800 From: Simon Jeons User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Hugh Dickins CC: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On 01/31/2013 10:12 AM, Hugh Dickins wrote: > On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >>>> From: "Kirill A. Shutemov" >>>> >>>> Here's first steps towards huge pages in page cache. >>>> >>>> The intend of the work is get code ready to enable transparent huge page >>>> cache for the most simple fs -- ramfs. >>>> >>>> It's not yet near feature-complete. It only provides basic infrastructure. >>>> At the moment we can read, write and truncate file on ramfs with huge pages in >>>> page cache. The most interesting part, mmap(), is not yet there. For now >>>> we split huge page on mmap() attempt. >>>> >>>> I can't say that I see whole picture. I'm not sure if I understand locking >>>> model around split_huge_page(). Probably, not. >>>> Andrea, could you check if it looks correct? >>>> >>>> Next steps (not necessary in this order): >>>> - mmap(); >>>> - migration (?); >>>> - collapse; >>>> - stats, knobs, etc.; >>>> - tmpfs/shmem enabling; >>>> - ... >>>> >>>> Kirill A. 
Shutemov (16): >>>> block: implement add_bdi_stat() >>>> mm: implement zero_huge_user_segment and friends >>>> mm: drop actor argument of do_generic_file_read() >>>> radix-tree: implement preload for multiple contiguous elements >>>> thp, mm: basic defines for transparent huge page cache >>>> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>>> thp, mm: rewrite delete_from_page_cache() to support huge pages >>>> thp, mm: locking tail page is a bug >>>> thp, mm: handle tail pages in page_cache_get_speculative() >>>> thp, mm: implement grab_cache_huge_page_write_begin() >>>> thp, mm: naive support of thp in generic read/write routines >>>> thp, libfs: initial support of thp in >>>> simple_read/write_begin/write_end >>>> thp: handle file pages in split_huge_page() >>>> thp, mm: truncate support for transparent huge page cache >>>> thp, mm: split huge page on mmap file page >>>> ramfs: enable transparent huge page cache >>>> >>>> fs/libfs.c | 54 +++++++++--- >>>> fs/ramfs/inode.c | 6 +- >>>> include/linux/backing-dev.h | 10 +++ >>>> include/linux/huge_mm.h | 8 ++ >>>> include/linux/mm.h | 15 ++++ >>>> include/linux/pagemap.h | 14 ++- >>>> include/linux/radix-tree.h | 3 + >>>> lib/radix-tree.c | 32 +++++-- >>>> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>>> mm/huge_memory.c | 62 +++++++++++-- >>>> mm/memory.c | 22 +++++ >>>> mm/truncate.c | 12 +++ >>>> 12 files changed, 375 insertions(+), 67 deletions(-) >>> Interesting. >>> >>> I was starting to think about Transparent Huge Pagecache a few >>> months ago, but then got washed away by incoming waves as usual. >>> >>> Certainly I don't have a line of code to show for it; but my first >>> impression of your patches is that we have very different ideas of >>> where to start. > A second impression confirms that we have very different ideas of > where to start. I don't want to be dismissive, and please don't let > me discourage you, but I just don't find what you have very interesting. 
> > I'm sure you'll agree that the interesting part, and the difficult part, > comes with mmap(); and there's no point whatever to THPages without mmap() > (of course, I'm including exec and brk and shm when I say mmap there). > > (There may be performance benefits in working with larger page cache > size, which Christoph Lameter explored a few years back, but that's a > different topic: I think 2MB - if I may be x86_64-centric - would not be > the unit of choice for that, unless SSD erase block were to dominate.) > > I'm interested to get to the point of prototyping something that does > support mmap() of THPageCache: I'm pretty sure that I'd then soon learn > a lot about my misconceptions, and have to rework for a while (or give > up!); but I don't see much point in posting anything without that. > I don't know if we have 5 or 50 places which "know" that a THPage > must be Anon: some I'll spot in advance, some I sadly won't. > > It's not clear to me that the infrastructural changes you make in this > series will be needed or not, if I pursue my approach: some perhaps as > optimizations on top of the poorly performing base that may emerge from > going about it my way. But for me it's too soon to think about those. > > Something I notice that we do agree upon: the radix_tree holding the > 4k subpages, at least for now. When I first started thinking towards > THPageCache, I was fascinated by how we could manage the hugepages in > the radix_tree, cutting out unnecessary levels etc; but after a while > I realized that although there's probably nice scope for cleverness > there (significantly constrained by RCU expectations), it would only > be about optimization. Let's be simple and stupid about radix_tree > for now, the problems that need to be worked out lie elsewhere. > >>> Perhaps that's good complementarity, or perhaps I'll disagree with >>> your approach. 
I'll be taking a look at yours in the coming days, >>> and trying to summon back up my own ideas to summarize them for you. >> Yeah, it would be nice to see alternative design ideas. Looking forward. >> >>> Perhaps I was naive to imagine it, but I did intend to start out >>> generically, independent of filesystem; but content to narrow down >>> on tmpfs alone where it gets hard to support the others (writeback >>> springs to mind). khugepaged would be migrating little pages into >>> huge pages, where it saw that the mmaps of the file would benefit >>> (and for testing I would hack mmap alignment choice to favour it). >> I don't think all fs at once would fly, but it's wonderful, if I'm >> wrong :) > You are imagining the filesystem putting huge pages into its cache. > Whereas I'm imagining khugepaged looking around at mmaped file areas, > seeing which would benefit from huge pagecache (let's assume offset 0 > belongs on hugepage boundary - maybe one day someone will want to tune > some files or parts differently, but that's low priority), migrating 4k > pages over to 2MB page (wouldn't have to be done all in one pass), then There are isolation and migration steps during collapse. But why aren't migration entries used in that migration step? > finally slotting in the pmds for that. > > But going this way, I expect we'd have to split at page_mkwrite(): > we probably don't want a single touch to dirty 2MB at a time, > unless tmpfs or ramfs. > >>> I had arrived at a conviction that the first thing to change was >>> the way that tail pages of a THP are refcounted, that it had been a >>> mistake to use the compound page method of holding the THP together. >>> But I'll have to enter a trance now to recall the arguments ;) >> THP refcounting looks reasonable for me, if take split_huge_page() in >> account. > I'm not claiming that the THP refcounting is wrong in what it's doing > at present; but that I suspect we'll want to rework it for THPageCache.
> > Something I take for granted, I think you do too but I'm not certain: > a file with transparent huge pages in its page cache can also have small > pages in other extents of its page cache; and can be mapped hugely (2MB > extents) into one address space at the same time as individual 4k pages > from those extents are mapped into another (or the same) address space. > > One can certainly imagine sacrificing that principle, splitting whenever > there's such a "conflict"; but it then becomes uninteresting to me, too > much like hugetlbfs. Splitting an anonymous hugepage in all address > spaces that hold it when one of them needs it split, that has been a > pragmatic strategy: it's not a common case for forks to diverge like > that; but files are expected to be more widely shared. > > At present THP is using compound pages, with mapcount of tail pages > reused to track their contribution to head page count; but I think we > shall want to be able to use the mapcount, and the count, of TH tail > pages for their original purpose if huge mappings can coexist with tiny. > Not fully thought out, but that's my feeling. > > The use of compound pages, in particular the redirection of tail page > count to head page count, was important in hugetlbfs: a get_user_pages > reference on a subpage must prevent the containing hugepage from being > freed, because hugetlbfs has its own separate pool of hugepages to > which freeing returns them. > > But for transparent huge pages? It should not matter so much if the > subpages are freed independently. So I'd like to devise another glue > to hold them together more loosely (for prototyping I can certainly > pretend we have infinite pageflag and pagefield space if that helps): > I may find in practice that they're forever falling apart, and I run > crying back to compound pages; but at present I'm hoping not. 
> > This mail might suggest that I'm about to start coding: I wish that > were true, but in reality there's always a lot of unrelated things > I have to look at, which dilute my focus. So if I've said anything > that sparks ideas for you, go with them. > > Hugh > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1765868Ab3DEBDi (ORCPT ); Thu, 4 Apr 2013 21:03:38 -0400 Received: from mail-ob0-f170.google.com ([209.85.214.170]:39442 "EHLO mail-ob0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1765836Ab3DEBDg (ORCPT ); Thu, 4 Apr 2013 21:03:36 -0400 Message-ID: <515E22DE.1010207@gmail.com> Date: Fri, 05 Apr 2013 09:03:26 +0800 From: Simon Jeons User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Hugh Dickins CC: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> <5107cb52e07b1_376199eb7059997@blue.mail> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On 01/31/2013 10:12 AM, Hugh Dickins wrote: > On Tue, 29 Jan 2013, Kirill A. Shutemov wrote: >> Hugh Dickins wrote: >>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >>>> From: "Kirill A. Shutemov" >>>> >>>> Here's first steps towards huge pages in page cache. 
>>>> >>>> The intend of the work is get code ready to enable transparent huge page >>>> cache for the most simple fs -- ramfs. >>>> >>>> It's not yet near feature-complete. It only provides basic infrastructure. >>>> At the moment we can read, write and truncate file on ramfs with huge pages in >>>> page cache. The most interesting part, mmap(), is not yet there. For now >>>> we split huge page on mmap() attempt. >>>> >>>> I can't say that I see whole picture. I'm not sure if I understand locking >>>> model around split_huge_page(). Probably, not. >>>> Andrea, could you check if it looks correct? >>>> >>>> Next steps (not necessary in this order): >>>> - mmap(); >>>> - migration (?); >>>> - collapse; >>>> - stats, knobs, etc.; >>>> - tmpfs/shmem enabling; >>>> - ... >>>> >>>> Kirill A. Shutemov (16): >>>> block: implement add_bdi_stat() >>>> mm: implement zero_huge_user_segment and friends >>>> mm: drop actor argument of do_generic_file_read() >>>> radix-tree: implement preload for multiple contiguous elements >>>> thp, mm: basic defines for transparent huge page cache >>>> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >>>> thp, mm: rewrite delete_from_page_cache() to support huge pages >>>> thp, mm: locking tail page is a bug >>>> thp, mm: handle tail pages in page_cache_get_speculative() >>>> thp, mm: implement grab_cache_huge_page_write_begin() >>>> thp, mm: naive support of thp in generic read/write routines >>>> thp, libfs: initial support of thp in >>>> simple_read/write_begin/write_end >>>> thp: handle file pages in split_huge_page() >>>> thp, mm: truncate support for transparent huge page cache >>>> thp, mm: split huge page on mmap file page >>>> ramfs: enable transparent huge page cache >>>> >>>> fs/libfs.c | 54 +++++++++--- >>>> fs/ramfs/inode.c | 6 +- >>>> include/linux/backing-dev.h | 10 +++ >>>> include/linux/huge_mm.h | 8 ++ >>>> include/linux/mm.h | 15 ++++ >>>> include/linux/pagemap.h | 14 ++- >>>> include/linux/radix-tree.h | 3 
+ >>>> lib/radix-tree.c | 32 +++++-- >>>> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >>>> mm/huge_memory.c | 62 +++++++++++-- >>>> mm/memory.c | 22 +++++ >>>> mm/truncate.c | 12 +++ >>>> 12 files changed, 375 insertions(+), 67 deletions(-) >>> Interesting. >>> >>> I was starting to think about Transparent Huge Pagecache a few >>> months ago, but then got washed away by incoming waves as usual. >>> >>> Certainly I don't have a line of code to show for it; but my first >>> impression of your patches is that we have very different ideas of >>> where to start. > A second impression confirms that we have very different ideas of > where to start. I don't want to be dismissive, and please don't let > me discourage you, but I just don't find what you have very interesting. > > I'm sure you'll agree that the interesting part, and the difficult part, > comes with mmap(); and there's no point whatever to THPages without mmap() > (of course, I'm including exec and brk and shm when I say mmap there). > > (There may be performance benefits in working with larger page cache > size, which Christoph Lameter explored a few years back, but that's a > different topic: I think 2MB - if I may be x86_64-centric - would not be > the unit of choice for that, unless SSD erase block were to dominate.) > > I'm interested to get to the point of prototyping something that does > support mmap() of THPageCache: I'm pretty sure that I'd then soon learn > a lot about my misconceptions, and have to rework for a while (or give > up!); but I don't see much point in posting anything without that. > I don't know if we have 5 or 50 places which "know" that a THPage > must be Anon: some I'll spot in advance, some I sadly won't. > > It's not clear to me that the infrastructural changes you make in this > series will be needed or not, if I pursue my approach: some perhaps as > optimizations on top of the poorly performing base that may emerge from > going about it my way. 
But for me it's too soon to think about those. > > Something I notice that we do agree upon: the radix_tree holding the > 4k subpages, at least for now. When I first started thinking towards > THPageCache, I was fascinated by how we could manage the hugepages in > the radix_tree, cutting out unnecessary levels etc; but after a while > I realized that although there's probably nice scope for cleverness > there (significantly constrained by RCU expectations), it would only > be about optimization. Let's be simple and stupid about radix_tree > for now, the problems that need to be worked out lie elsewhere. > >>> Perhaps that's good complementarity, or perhaps I'll disagree with >>> your approach. I'll be taking a look at yours in the coming days, >>> and trying to summon back up my own ideas to summarize them for you. >> Yeah, it would be nice to see alternative design ideas. Looking forward. >> >>> Perhaps I was naive to imagine it, but I did intend to start out >>> generically, independent of filesystem; but content to narrow down >>> on tmpfs alone where it gets hard to support the others (writeback >>> springs to mind). khugepaged would be migrating little pages into >>> huge pages, where it saw that the mmaps of the file would benefit Would it make sense to add a heuristic that adjusts khugepaged_max_ptes_none? Reduce its value when memory pressure is high and increase it when memory pressure is low. >>> (and for testing I would hack mmap alignment choice to favour it). >> I don't think all fs at once would fly, but it's wonderful, if I'm >> wrong :) > You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas, > seeing which would benefit from huge pagecache (let's assume offset 0 > belongs on hugepage boundary - maybe one day someone will want to tune > some files or parts differently, but that's low priority), migrating 4k > pages over to 2MB page (wouldn't have to be done all in one pass), then > finally slotting in the pmds for that. > > But going this way, I expect we'd have to split at page_mkwrite(): > we probably don't want a single touch to dirty 2MB at a time, > unless tmpfs or ramfs. > >>> I had arrived at a conviction that the first thing to change was >>> the way that tail pages of a THP are refcounted, that it had been a >>> mistake to use the compound page method of holding the THP together. >>> But I'll have to enter a trance now to recall the arguments ;) >> THP refcounting looks reasonable for me, if take split_huge_page() in >> account. > I'm not claiming that the THP refcounting is wrong in what it's doing > at present; but that I suspect we'll want to rework it for THPageCache. > > Something I take for granted, I think you do too but I'm not certain: > a file with transparent huge pages in its page cache can also have small > pages in other extents of its page cache; and can be mapped hugely (2MB > extents) into one address space at the same time as individual 4k pages > from those extents are mapped into another (or the same) address space. > > One can certainly imagine sacrificing that principle, splitting whenever > there's such a "conflict"; but it then becomes uninteresting to me, too > much like hugetlbfs. Splitting an anonymous hugepage in all address > spaces that hold it when one of them needs it split, that has been a > pragmatic strategy: it's not a common case for forks to diverge like > that; but files are expected to be more widely shared. 
> > At present THP is using compound pages, with mapcount of tail pages > reused to track their contribution to head page count; but I think we > shall want to be able to use the mapcount, and the count, of TH tail > pages for their original purpose if huge mappings can coexist with tiny. > Not fully thought out, but that's my feeling. > > The use of compound pages, in particular the redirection of tail page > count to head page count, was important in hugetlbfs: a get_user_pages > reference on a subpage must prevent the containing hugepage from being > freed, because hugetlbfs has its own separate pool of hugepages to > which freeing returns them. > > But for transparent huge pages? It should not matter so much if the > subpages are freed independently. So I'd like to devise another glue > to hold them together more loosely (for prototyping I can certainly > pretend we have infinite pageflag and pagefield space if that helps): > I may find in practice that they're forever falling apart, and I run > crying back to compound pages; but at present I'm hoping not. > > This mail might suggest that I'm about to start coding: I wish that > were true, but in reality there's always a lot of unrelated things > I have to look at, which dilute my focus. So if I've said anything > that sparks ideas for you, go with them. > > Hugh > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1765999Ab3DEBYn (ORCPT ); Thu, 4 Apr 2013 21:24:43 -0400 Received: from mail-da0-f44.google.com ([209.85.210.44]:55548 "EHLO mail-da0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1765984Ab3DEBYl (ORCPT ); Thu, 4 Apr 2013 21:24:41 -0400 Message-ID: <515E27D0.5090105@gmail.com> Date: Fri, 05 Apr 2013 09:24:32 +0800 From: Ric Mason User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Hugh Dickins CC: "Kirill A. Shutemov" , Andrea Arcangeli , Andrew Morton , Al Viro , Wu Fengguang , Jan Kara , Mel Gorman , linux-mm@kvack.org, Andi Kleen , Matthew Wilcox , "Kirill A. Shutemov" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache References: <1359365068-10147-1-git-send-email-kirill.shutemov@linux.intel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On 01/29/2013 01:03 PM, Hugh Dickins wrote: > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: >> From: "Kirill A. Shutemov" >> >> Here's first steps towards huge pages in page cache. >> >> The intend of the work is get code ready to enable transparent huge page >> cache for the most simple fs -- ramfs. >> >> It's not yet near feature-complete. It only provides basic infrastructure. >> At the moment we can read, write and truncate file on ramfs with huge pages in >> page cache. The most interesting part, mmap(), is not yet there. For now >> we split huge page on mmap() attempt. >> >> I can't say that I see whole picture. I'm not sure if I understand locking >> model around split_huge_page(). Probably, not. >> Andrea, could you check if it looks correct? 
>> >> Next steps (not necessary in this order): >> - mmap(); >> - migration (?); >> - collapse; >> - stats, knobs, etc.; >> - tmpfs/shmem enabling; >> - ... >> >> Kirill A. Shutemov (16): >> block: implement add_bdi_stat() >> mm: implement zero_huge_user_segment and friends >> mm: drop actor argument of do_generic_file_read() >> radix-tree: implement preload for multiple contiguous elements >> thp, mm: basic defines for transparent huge page cache >> thp, mm: rewrite add_to_page_cache_locked() to support huge pages >> thp, mm: rewrite delete_from_page_cache() to support huge pages >> thp, mm: locking tail page is a bug >> thp, mm: handle tail pages in page_cache_get_speculative() >> thp, mm: implement grab_cache_huge_page_write_begin() >> thp, mm: naive support of thp in generic read/write routines >> thp, libfs: initial support of thp in >> simple_read/write_begin/write_end >> thp: handle file pages in split_huge_page() >> thp, mm: truncate support for transparent huge page cache >> thp, mm: split huge page on mmap file page >> ramfs: enable transparent huge page cache >> >> fs/libfs.c | 54 +++++++++--- >> fs/ramfs/inode.c | 6 +- >> include/linux/backing-dev.h | 10 +++ >> include/linux/huge_mm.h | 8 ++ >> include/linux/mm.h | 15 ++++ >> include/linux/pagemap.h | 14 ++- >> include/linux/radix-tree.h | 3 + >> lib/radix-tree.c | 32 +++++-- >> mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++-------- >> mm/huge_memory.c | 62 +++++++++++-- >> mm/memory.c | 22 +++++ >> mm/truncate.c | 12 +++ >> 12 files changed, 375 insertions(+), 67 deletions(-) > Interesting. > > I was starting to think about Transparent Huge Pagecache a few > months ago, but then got washed away by incoming waves as usual. > > Certainly I don't have a line of code to show for it; but my first > impression of your patches is that we have very different ideas of > where to start. > > Perhaps that's good complementarity, or perhaps I'll disagree with > your approach. 
I'll be taking a look at yours in the coming days, > and trying to summon back up my own ideas to summarize them for you. > > Perhaps I was naive to imagine it, but I did intend to start out > generically, independent of filesystem; but content to narrow down > on tmpfs alone where it gets hard to support the others (writeback > springs to mind). khugepaged would be migrating little pages into > huge pages, where it saw that the mmaps of the file would benefit > (and for testing I would hack mmap alignment choice to favour it). > > I had arrived at a conviction that the first thing to change was > the way that tail pages of a THP are refcounted, that it had been a > mistake to use the compound page method of holding the THP together. > But I'll have to enter a trance now to recall the arguments ;) One offline question: do you know whether hugetlbfs pages support swapping? > > Hugh > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org