From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 0/10] shmem/tmpfs: misc and fallocate Date: Sat, 12 May 2012 04:52:51 -0700 (PDT) Message-ID: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Here's a bunch of shmem/tmpfs updates: mostly completed in January, then put aside while I attended to other stuff. But the more recent 1/10 has some urgency, so I'm expediting the descriptions, and shipping them off to you now for v3.5. They're diffed against 3.4.0-rc5-next-20120504, but apply and build and work on most v3.4-rc and v3.4-rc-next. The fallocate ones were prompted by posts from Cong Wang in November: I've attributed four of those with Based-on-patch-by, but could not put From or Signed-off-by, since the originals were somewhat flawed, and I needed to start again and reorder it all. Whether 10/10 should go any further than exposure in -next is in doubt: we shall have to see if it's useful to anyone. 1/10 shmem: replace page if mapping excludes its zone 2/10 tmpfs: enable NOSEC optimization 3/10 tmpfs: optimize clearing when writing 4/10 tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE 5/10 mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE 6/10 mm/fs: remove truncate_range 7/10 tmpfs: support fallocate preallocation 8/10 tmpfs: undo fallocation on failure 9/10 tmpfs: quit when fallocate fills memory 10/10 tmpfs: support SEEK_DATA and SEEK_HOLE Documentation/filesystems/Locking | 2 Documentation/filesystems/vfs.txt | 13 drivers/staging/android/ashmem.c | 8 fs/bad_inode.c | 1 include/linux/fs.h | 1 include/linux/mm.h | 4 include/linux/swap.h | 6 mm/madvise.c | 15 mm/memcontrol.c | 17 mm/shmem.c | 513 +++++++++++++++++++++++++--- mm/swapfile.c | 2 mm/truncate.c | 25 - 12 files changed, 500 insertions(+), 107 deletions(-) Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Sat, 12 May 2012 04:59:56 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We'd like to continue to support GMA500, so now add a new shmem_should_replace_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_replace_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_replace_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins --- I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest in the i915 Sandybridge issue; but reiterate that this patch does nothing for that case. include/linux/swap.h | 6 + mm/memcontrol.c | 17 +++- mm/shmem.c | 141 +++++++++++++++++++++++++++++++++++------ mm/swapfile.c | 2 4 files changed, 142 insertions(+), 24 deletions(-) --- 3045N.orig/include/linux/swap.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/swap.h 2012-05-05 10:45:17.884060895 -0700 @@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t, extern unsigned int count_swap_pages(int, int); extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); +extern int page_swapcount(struct page *); extern int reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; @@ -448,6 +449,11 @@ static inline void delete_from_swap_cach { } +static inline int page_swapcount(struct page *page) +{ + return 0; +} + #define reuse_swap_page(page) (page_mapcount(page) == 1) static inline int try_to_free_swap(struct page *page) --- 3045N.orig/mm/memcontrol.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/memcontrol.c 2012-05-05 10:45:17.888060878 -0700 @@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; struct page_cgroup *pc; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc pc = lookup_page_cgroup(oldpage); /* fix accounting on old pages */ lock_page_cgroup(pc); - memcg = pc->mem_cgroup; - mem_cgroup_charge_statistics(memcg, false, -1); - ClearPageCgroupUsed(pc); + if (PageCgroupUsed(pc)) { + memcg = pc->mem_cgroup; + mem_cgroup_charge_statistics(memcg, false, -1); + ClearPageCgroupUsed(pc); + } unlock_page_cgroup(pc); + /* + * When called from shmem_replace_page(), in some cases the + * oldpage has already been charged, and in some cases not. + */ + if (!memcg) + return; + if (PageSwapBacked(oldpage)) type = MEM_CGROUP_CHARGE_TYPE_SHMEM; --- 3045N.orig/mm/shmem.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 @@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i } #endif +static bool shmem_should_replace_page(struct page *page, gfp_t gfp); +static int shmem_replace_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index); static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); @@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino * If swap found in inode, free it and move page from swapcache to filecache. */ static int shmem_unuse_inode(struct shmem_inode_info *info, - swp_entry_t swap, struct page *page) + swp_entry_t swap, struct page **pagep) { struct address_space *mapping = info->vfs_inode.i_mapping; void *radswap; pgoff_t index; - int error; + gfp_t gfp; + int error = 0; radswap = swp_to_radix_entry(swap); index = radix_tree_locate_item(&mapping->page_tree, radswap); @@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme if (shmem_swaplist.next != &info->swaplist) list_move_tail(&shmem_swaplist, &info->swaplist); + gfp = mapping_gfp_mask(mapping); + if (shmem_should_replace_page(*pagep, gfp)) { + mutex_unlock(&shmem_swaplist_mutex); + error = shmem_replace_page(pagep, gfp, info, index); + mutex_lock(&shmem_swaplist_mutex); + /* + * We needed to drop mutex to make that restrictive page + * allocation; but the inode might already be freed by now, + * and we cannot refer to inode or mapping or info to check. + * However, we do hold page lock on the PageSwapCache page, + * so can check if that still has our reference remaining. + */ + if (!page_swapcount(*pagep)) + error = -ENOENT; + } + /* * We rely on shmem_swaplist_mutex, not only to protect the swaplist, * but also to hold up shmem_evict_inode(): so inode cannot be freed * beneath us (pagelock doesn't help until the page is in pagecache). */ - error = shmem_add_to_page_cache(page, mapping, index, + if (!error) + error = shmem_add_to_page_cache(*pagep, mapping, index, GFP_NOWAIT, radswap); - /* which does mem_cgroup_uncharge_cache_page on error */ - if (error != -ENOMEM) { /* * Truncation and eviction use free_swap_and_cache(), which * only does trylock page: if we raced, best clean up here. */ - delete_from_swap_cache(page); - set_page_dirty(page); + delete_from_swap_cache(*pagep); + set_page_dirty(*pagep); if (!error) { spin_lock(&info->lock); info->swapped--; @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct struct list_head *this, *next; struct shmem_inode_info *info; int found = 0; - int error; + int error = 0; + + /* + * There's a faint possibility that swap page was replaced before + * caller locked it: it will come back later with the right page. + */ + if (unlikely(!PageSwapCache(page))) + goto out; /* * Charge page using GFP_KERNEL while we can wait, before taking @@ -676,7 +702,7 @@ int shmem_unuse(swp_entry_t swap, struct list_for_each_safe(this, next, &shmem_swaplist) { info = list_entry(this, struct shmem_inode_info, swaplist); if (info->swapped) - found = shmem_unuse_inode(info, swap, page); + found = shmem_unuse_inode(info, swap, &page); else list_del_init(&info->swaplist); cond_resched(); @@ -685,8 +711,6 @@ int shmem_unuse(swp_entry_t swap, struct } mutex_unlock(&shmem_swaplist_mutex); - if (!found) - mem_cgroup_uncharge_cache_page(page); if (found < 0) error = found; out: @@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge #endif /* + * When a page is moved from swapcache to shmem filecache (either by the + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of + * shmem_unuse_inode()), it may have been read in earlier from swap, in + * ignorance of the mapping it belongs to. If that mapping has special + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), + * we may need to copy to a suitable page before moving to filecache. + * + * In a future release, this may well be extended to respect cpuset and + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); + * but for now it is a simple matter of zone. + */ +static bool shmem_should_replace_page(struct page *page, gfp_t gfp) +{ + return page_zonenum(page) > gfp_zone(gfp); +} + +static int shmem_replace_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index) +{ + struct page *oldpage, *newpage; + struct address_space *swap_mapping; + pgoff_t swap_index; + int error; + + oldpage = *pagep; + swap_index = page_private(oldpage); + swap_mapping = page_mapping(oldpage); + + /* + * We have arrived here because our zones are constrained, so don't + * limit chance of success by further cpuset and node constraints. + */ + gfp &= ~GFP_CONSTRAINT_MASK; + newpage = shmem_alloc_page(gfp, info, index); + if (!newpage) + return -ENOMEM; + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); + + *pagep = newpage; + page_cache_get(newpage); + copy_highpage(newpage, oldpage); + + VM_BUG_ON(!PageLocked(oldpage)); + __set_page_locked(newpage); + VM_BUG_ON(!PageUptodate(oldpage)); + SetPageUptodate(newpage); + VM_BUG_ON(!PageSwapBacked(oldpage)); + SetPageSwapBacked(newpage); + VM_BUG_ON(!swap_index); + set_page_private(newpage, swap_index); + VM_BUG_ON(!PageSwapCache(oldpage)); + SetPageSwapCache(newpage); + + /* + * Our caller will very soon move newpage out of swapcache, but it's + * a nice clean interface for us to replace oldpage by newpage there. + */ + spin_lock_irq(&swap_mapping->tree_lock); + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, + newpage); + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + spin_unlock_irq(&swap_mapping->tree_lock); + BUG_ON(error); + + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + + ClearPageSwapCache(oldpage); + set_page_private(oldpage, 0); + + unlock_page(oldpage); + page_cache_release(oldpage); + page_cache_release(oldpage); + return 0; +} + +/* * shmem_getpage_gfp - find page in cache, or get from swap, or allocate * * If we allocate a new one we do not mark it dirty. That's up to the @@ -923,19 +1025,20 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); + if (!PageSwapCache(page) || page->mapping) { + error = -EEXIST; /* try again */ + goto failed; + } if (!PageUptodate(page)) { error = -EIO; goto failed; } wait_on_page_writeback(page); - /* Someone may have already done it for us */ - if (page->mapping) { - if (page->mapping == mapping && - page->index == index) - goto done; - error = -EEXIST; - goto failed; + if (shmem_should_replace_page(page, gfp)) { + error = shmem_replace_page(&page, gfp, info, index); + if (error) + goto failed; } error = mem_cgroup_cache_charge(page, current->mm, @@ -998,7 +1101,7 @@ repeat: if (sgp == SGP_DIRTY) set_page_dirty(page); } -done: + /* Perhaps the file has been truncated since we checked */ if (sgp != SGP_WRITE && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { --- 3045N.orig/mm/swapfile.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/swapfile.c 2012-05-05 10:45:17.888060878 -0700 @@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s * This does not give an exact answer when swap count is continued, * but does include the high COUNT_CONTINUED flag to allow for that. */ -static inline int page_swapcount(struct page *page) +int page_swapcount(struct page *page) { int count = 0; struct swap_info_struct *p; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 2/10] tmpfs: enable NOSEC optimization Date: Sat, 12 May 2012 05:02:29 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() overhead on most common writes): set MS_NOSEC on its superblocks. Signed-off-by: Hugh Dickins --- mm/shmem.c | 1 + 1 file changed, 1 insertion(+) --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block } } sb->s_export_op = &shmem_export_ops; + sb->s_flags |= MS_NOSEC; #else sb->s_flags |= MS_NOUSER; #endif -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 3/10] tmpfs: optimize clearing when writing Date: Sat, 12 May 2012 05:04:12 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Nick Piggin , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Nick proposed years ago that tmpfs should avoid clearing its pages where write will overwrite them with new data, as ramfs has long done. But I messed it up and just got bad data. Tried again recently, it works fine. Here's time output for writing 4GiB 16 times on this Core i5 laptop: before: real 0m21.169s user 0m0.028s sys 0m21.057s real 0m21.382s user 0m0.016s sys 0m21.289s real 0m21.311s user 0m0.020s sys 0m21.217s after: real 0m18.273s user 0m0.032s sys 0m18.165s real 0m18.354s user 0m0.020s sys 0m18.265s real 0m18.440s user 0m0.032s sys 0m18.337s ramfs: real 0m16.860s user 0m0.028s sys 0m16.765s real 0m17.382s user 0m0.040s sys 0m17.273s real 0m17.133s user 0m0.044s sys 0m17.021s Yes, I have done perf reports, but they need more explanation than they deserve: in summary, clear_page vanishes, its cache loading shifts into copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly mark_page_accessed goes way up - I think because they are respectively where the cache gets to be reloaded after being purged by clear or copy. Suggested-by: Nick Piggin Signed-off-by: Hugh Dickins --- mm/shmem.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700 @@ -1095,9 +1095,14 @@ repeat: shmem_recalc_inode(inode); spin_unlock(&info->lock); - clear_highpage(page); - flush_dcache_page(page); - SetPageUptodate(page); + /* + * Let SGP_WRITE caller clear ends if write does not fill page + */ + if (sgp != SGP_WRITE) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } if (sgp == SGP_DIRTY) set_page_dirty(page); } @@ -1307,6 +1312,14 @@ shmem_write_end(struct file *file, struc if (pos + copied > inode->i_size) i_size_write(inode, pos + copied); + if (!PageUptodate(page)) { + if (copied < PAGE_CACHE_SIZE) { + unsigned from = pos & (PAGE_CACHE_SIZE - 1); + zero_user_segments(page, 0, from, + from + copied, PAGE_CACHE_SIZE); + } + SetPageUptodate(page); + } set_page_dirty(page); unlock_page(page); page_cache_release(page); @@ -1768,6 +1781,7 @@ static int shmem_symlink(struct inode *d kaddr = kmap_atomic(page); memcpy(kaddr, symname, len); kunmap_atomic(kaddr); + SetPageUptodate(page); set_page_dirty(page); unlock_page(page); page_cache_release(page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 4/10] tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE Date: Sat, 12 May 2012 05:05:51 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org tmpfs has supported hole-punching since 2.6.16, via madvise(,,MADV_REMOVE). But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is the agreed way to punch holes. So add shmem_fallocate() to support that, and tweak shmem_truncate_range() to support partial pages at both the beginning and end of range (never needed for madvise, which demands rounded addr and rounds up length). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- mm/shmem.c | 68 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 57 insertions(+), 11 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700 @@ -53,6 +53,7 @@ static struct vfsmount *shm_mnt; #include #include #include +#include #include #include #include @@ -432,21 +433,23 @@ void shmem_truncate_range(struct inode * struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; - unsigned partial = lstart & (PAGE_CACHE_SIZE - 1); - pgoff_t end = (lend >> PAGE_CACHE_SHIFT); + pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT; + unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1); + unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1); struct pagevec pvec; pgoff_t indices[PAGEVEC_SIZE]; long nr_swaps_freed = 0; pgoff_t index; int i; - BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); + if (lend == -1) + end = -1; /* unsigned, so actually very big */ pagevec_init(&pvec, 0); index = start; - while (index <= end) { + while (index < end) { pvec.nr = shmem_find_get_pages_and_swap(mapping, index, - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, + min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) break; @@ -455,7 +458,7 @@ void shmem_truncate_range(struct inode * struct page *page = pvec.pages[i]; index = indices[i]; - if (index > end) + if (index >= end) break; if (radix_tree_exceptional_entry(page)) { @@ -479,22 +482,39 @@ void shmem_truncate_range(struct inode * index++; } - if (partial) { + if (partial_start) { struct page *page = NULL; shmem_getpage(inode, start - 1, &page, SGP_READ, NULL); if (page) { - zero_user_segment(page, partial, PAGE_CACHE_SIZE); + unsigned int top = PAGE_CACHE_SIZE; + if (start > end) { + top = partial_end; + partial_end = 0; + } + zero_user_segment(page, partial_start, top); set_page_dirty(page); unlock_page(page); page_cache_release(page); } } + if (partial_end) { + struct page *page = NULL; + shmem_getpage(inode, end, &page, SGP_READ, NULL); + if (page) { + zero_user_segment(page, 0, partial_end); + set_page_dirty(page); + unlock_page(page); + page_cache_release(page); + } + } + if (start >= end) + return; index = start; for ( ; ; ) { cond_resched(); pvec.nr = shmem_find_get_pages_and_swap(mapping, index, - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, + min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) { if (index == start) @@ -502,7 +522,7 @@ void shmem_truncate_range(struct inode * index = start; continue; } - if (index == start && indices[0] > end) { + if (index == start && indices[0] >= end) { shmem_deswap_pagevec(&pvec); pagevec_release(&pvec); break; @@ -512,7 +532,7 @@ void shmem_truncate_range(struct inode * struct page *page = pvec.pages[i]; index = indices[i]; - if (index > end) + if (index >= end) break; if (radix_tree_exceptional_entry(page)) { @@ -1578,6 +1598,31 @@ static ssize_t shmem_file_splice_read(st return error; } +static long shmem_fallocate(struct file *file, int mode, loff_t offset, + loff_t len) +{ + struct inode *inode = file->f_path.dentry->d_inode; + int error = -EOPNOTSUPP; + + mutex_lock(&inode->i_mutex); + + if (mode & FALLOC_FL_PUNCH_HOLE) { + struct address_space *mapping = file->f_mapping; + loff_t unmap_start = round_up(offset, PAGE_SIZE); + loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; + + if ((u64)unmap_end > (u64)unmap_start) + unmap_mapping_range(mapping, unmap_start, + 1 + unmap_end - unmap_start, 0); + shmem_truncate_range(inode, offset, offset + len - 1); + /* No need to unmap again: hole-punching leaves COWed pages */ + error = 0; + } + + mutex_unlock(&inode->i_mutex); + return error; +} + static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) { struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); @@ -2478,6 +2523,7 @@ static const struct file_operations shme .fsync = noop_fsync, .splice_read = shmem_file_splice_read, .splice_write = generic_file_splice_write, + .fallocate = shmem_fallocate, #endif }; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE Date: Sat, 12 May 2012 05:13:18 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Al Viro , Colin Cross , John Stulz , Greg Kroah-Hartman , Theodore Ts'o , Andreas Dilger , Mark Fasheh , Joel Becker , Dave Chinner , Ben Myers , Michael Kerrisk , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() to use do_fallocate() instead of vmtruncate_range(): which extends madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. There is one more user of vmtruncate_range() in our tree, staging/android's ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned areas are already unmapped - I don't know - then it would do better to use shmem_truncate_range() directly). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- drivers/staging/android/ashmem.c | 8 +++++--- mm/madvise.c | 15 +++++++-------- 2 files changed, 12 insertions(+), 11 deletions(-) --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564056626 -0700 +++ 3045N/drivers/staging/android/ashmem.c 2012-05-05 10:46:25.692062478 -0700 @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker mutex_lock(&ashmem_mutex); list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) { - struct inode *inode = range->asma->file->f_dentry->d_inode; loff_t start = range->pgstart * PAGE_SIZE; - loff_t end = (range->pgend + 1) * PAGE_SIZE - 1; + loff_t end = (range->pgend + 1) * PAGE_SIZE; - vmtruncate_range(inode, start, end); + do_fallocate(range->asma->file, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + start, end - start); range->purged = ASHMEM_WAS_PURGED; lru_del(range); --- 3045N.orig/mm/madvise.c 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/mm/madvise.c 2012-05-05 10:46:25.692062478 -0700 @@ -11,8 +11,10 @@ #include #include #include +#include #include #include +#include /* * Any behaviour which results in changes to the vma->vm_flags needs to @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are struct vm_area_struct **prev, unsigned long start, unsigned long end) { - struct address_space *mapping; - loff_t offset, endoff; + loff_t offset; int error; *prev = NULL; /* tell sys_madvise we drop mmap_sem */ @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE)) return -EACCES; - mapping = vma->vm_file->f_mapping; - offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - endoff = (loff_t)(end - vma->vm_start - 1) - + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - /* vmtruncate_range needs to take i_mutex */ + /* filesystem's fallocate may need to take i_mutex */ up_read(¤t->mm->mmap_sem); - error = vmtruncate_range(mapping->host, offset, endoff); + error = do_fallocate(vma->vm_file, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + offset, end - start); down_read(¤t->mm->mmap_sem); return error; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 6/10] mm/fs: remove truncate_range Date: Sat, 12 May 2012 05:15:03 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Remove vmtruncate_range(), and remove the truncate_range method from struct inode_operations: only tmpfs ever supported it, and tmpfs has now converted over to using the fallocate method of file_operations. Update Documentation accordingly, adding (setlease and) fallocate lines. And while we're in mm.h, remove duplicate declarations of shmem_lock() and shmem_file_setup(): everyone is now using the ones in shmem_fs.h. Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- Documentation/filesystems/Locking | 2 -- Documentation/filesystems/vfs.txt | 13 ++++++++----- fs/bad_inode.c | 1 - include/linux/fs.h | 1 - include/linux/mm.h | 4 ---- mm/shmem.c | 1 - mm/truncate.c | 25 ------------------------- 7 files changed, 8 insertions(+), 39 deletions(-) --- 3045N.orig/Documentation/filesystems/Locking 2012-05-05 10:42:33.560056589 -0700 +++ 3045N/Documentation/filesystems/Locking 2012-05-05 10:46:35.220062708 -0700 @@ -60,7 +60,6 @@ ata *); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); locking rules: @@ -87,7 +86,6 @@ setxattr: yes getxattr: no listxattr: no removexattr: yes -truncate_range: yes fiemap: no Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. --- 3045N.orig/Documentation/filesystems/vfs.txt 2012-05-05 10:42:33.560056589 -0700 +++ 3045N/Documentation/filesystems/vfs.txt 2012-05-05 10:46:35.220062708 -0700 @@ -363,7 +363,6 @@ struct inode_operations { ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); }; Again, all methods are called without any locks being held, unless @@ -472,9 +471,6 @@ otherwise noted. removexattr: called by the VFS to remove an extended attribute from a file. This method is called by removexattr(2) system call. - truncate_range: a method provided by the underlying filesystem to truncate a - range of blocks , i.e. punch a hole somewhere in a file. - The Address Space Object ======================== @@ -760,7 +756,7 @@ struct file_operations ---------------------- This describes how the VFS can manipulate an open file. As of kernel -2.6.22, the following members are defined: +3.5, the following members are defined: struct file_operations { struct module *owner; @@ -790,6 +786,8 @@ struct file_operations { int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long arg, struct file_lock **); + long (*fallocate)(struct file *, int mode, loff_t offset, loff_t len); }; Again, all methods are called without any locks being held, unless @@ -858,6 +856,11 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + setlease: called by the VFS to set or release a file lock lease. + setlease has the file_lock_lock held and must not sleep. + + fallocate: called by the VFS to preallocate blocks or punch a hole. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special --- 3045N.orig/fs/bad_inode.c 2012-05-05 10:42:33.564056626 -0700 +++ 3045N/fs/bad_inode.c 2012-05-05 10:46:35.220062708 -0700 @@ -292,7 +292,6 @@ static const struct inode_operations bad .getxattr = bad_inode_getxattr, .listxattr = bad_inode_listxattr, .removexattr = bad_inode_removexattr, - /* truncate_range returns void */ }; --- 3045N.orig/include/linux/fs.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/fs.h 2012-05-05 10:46:35.220062708 -0700 @@ -1673,7 +1673,6 @@ struct inode_operations { ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); } ____cacheline_aligned; --- 3045N.orig/include/linux/mm.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/mm.h 2012-05-05 10:46:35.220062708 -0700 @@ -871,8 +871,6 @@ extern void pagefault_out_of_memory(void extern void show_free_areas(unsigned int flags); extern bool skip_free_areas_node(unsigned int flags, int nid); -int shmem_lock(struct file *file, int lock, struct user_struct *user); -struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags); int shmem_zero_setup(struct vm_area_struct *); extern int can_do_mlock(void); @@ -953,11 +951,9 @@ static inline void unmap_shared_mapping_ extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); extern int vmtruncate(struct inode *inode, loff_t offset); -extern int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end); void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); - int invalidate_inode_page(struct page *page); #ifdef CONFIG_MMU --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700 @@ -2529,7 +2529,6 @@ static const struct file_operations shme static const struct inode_operations shmem_inode_operations = { .setattr = shmem_setattr, - .truncate_range = shmem_truncate_range, #ifdef CONFIG_TMPFS_XATTR .setxattr = shmem_setxattr, .getxattr = shmem_getxattr, --- 3045N.orig/mm/truncate.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/truncate.c 2012-05-05 10:46:35.224062700 -0700 @@ -602,31 +602,6 @@ int vmtruncate(struct inode *inode, loff } EXPORT_SYMBOL(vmtruncate); -int vmtruncate_range(struct inode *inode, loff_t lstart, loff_t lend) -{ - struct address_space *mapping = inode->i_mapping; - loff_t holebegin = round_up(lstart, PAGE_SIZE); - loff_t holelen = 1 + lend - holebegin; - - /* - * If the underlying filesystem is not going to provide - * a way to truncate a range of blocks (punch a hole) - - * we should return failure right now. - */ - if (!inode->i_op->truncate_range) - return -ENOSYS; - - mutex_lock(&inode->i_mutex); - inode_dio_wait(inode); - unmap_mapping_range(mapping, holebegin, holelen, 1); - inode->i_op->truncate_range(inode, lstart, lend); - /* unmap again to remove racily COWed private pages */ - unmap_mapping_range(mapping, holebegin, holelen, 1); - mutex_unlock(&inode->i_mutex); - - return 0; -} - /** * truncate_pagecache_range - unmap and remove pagecache that is hole-punched * @inode: inode -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 7/10] tmpfs: support fallocate preallocation Date: Sat, 12 May 2012 05:17:24 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: Received: from mail-pb0-f46.google.com ([209.85.160.46]:51098 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751421Ab2ELMRk (ORCPT ); Sat, 12 May 2012 08:17:40 -0400 Received: by pbbrp8 with SMTP id rp8so4328193pbb.19 for ; Sat, 12 May 2012 05:17:39 -0700 (PDT) In-Reply-To: Sender: linux-fsdevel-owner@vger.kernel.org List-ID: The systemd plumbers expressed a wish that tmpfs support preallocation. Cong Wang wrote a patch, but several kernel guys expressed scepticism: https://lkml.org/lkml/2011/11/18/137 Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly. Hugh Dickins: If tmpfs is going to support fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the deallocation but fail the allocation. Christoph Hellwig: Agreed. Now that we do have shmem_fallocate() for hole-punching, plumb in basic support for preallocation mode too. It's fairly straightforward (though quite a few details needed attention), except for when it fails part way through. What a pity that fallocate(2) was not specified to return the length allocated, permitting short fallocations! As it is, when it fails part way through, we ought to free what has just been allocated by this system call; but must be very sure not to free any allocated earlier, or any allocated by racing accesses (not all excluded by i_mutex). But we cannot distinguish them: so in this patch simply leak allocations on partial failure (they will be freed later if the file is removed). An attractive alternative approach would have been for fallocate() not to allocate pages at all, but note reservations by entries in the radix-tree. But that would give less assurance, and, critically, would be hard to fit with mem cgroups (who owns the reservations?): allocating pages lets fallocate() behave in just the same way as write(). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- mm/shmem.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 60 insertions(+), 1 deletion(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700 @@ -1602,7 +1602,9 @@ static long shmem_fallocate(struct file loff_t len) { struct inode *inode = file->f_path.dentry->d_inode; - int error = -EOPNOTSUPP; + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + pgoff_t start, index, end; + int error; mutex_lock(&inode->i_mutex); @@ -1617,8 +1619,65 @@ static long shmem_fallocate(struct file shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ error = 0; + goto out; } + /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ + error = inode_newsize_ok(inode, offset + len); + if (error) + goto out; + + start = offset >> PAGE_CACHE_SHIFT; + end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + /* Try to avoid a swapstorm if len is impossible to satisfy */ + if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { + error = -ENOSPC; + goto out; + } + + for (index = start; index < end; index++) { + struct page *page; + + /* + * Good, the fallocate(2) manpage permits EINTR: we may have + * been interrupted because we are using up too much memory. + */ + if (signal_pending(current)) + error = -EINTR; + else + error = shmem_getpage(inode, index, &page, SGP_WRITE, + NULL); + if (error) { + /* + * We really ought to free what we allocated so far, + * but it would be wrong to free pages allocated + * earlier, or already now in use: i_mutex does not + * exclude all cases. We do not know what to free. + */ + goto ctime; + } + + if (!PageUptodate(page)) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + /* + * set_page_dirty so that memory pressure will swap rather + * than free the pages we are allocating (and SGP_CACHE pages + * might still be clean: we now need to mark those dirty too). + */ + set_page_dirty(page); + unlock_page(page); + page_cache_release(page); + cond_resched(); + } + + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) + i_size_write(inode, offset + len); +ctime: + inode->i_ctime = CURRENT_TIME; +out: mutex_unlock(&inode->i_mutex); return error; } From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 8/10] tmpfs: undo fallocation on failure Date: Sat, 12 May 2012 05:19:18 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org In the previous episode, we left the already-fallocated pages attached to the file when shmem_fallocate() fails part way through. Now try to do better, by extending the earlier optimization of !Uptodate pages (then always under page lock) to !Uptodate pages (outside of page lock), representing fallocated pages. And don't waste time clearing them at the time of fallocate(), leave that until later if necessary. Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing fallocate can recognize and remove precisely those !Uptodate allocations which it added (and were not independently allocated by racing tasks). But unless we start playing with swapfile.c and memcontrol.c too, once one of our fallocated pages reaches shmem_writepage(), we do then have to instantiate it as an ordinarily allocated page, before swapping out. This is unsatisfactory, but improved in the next episode. Signed-off-by: Hugh Dickins --- mm/shmem.c | 105 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 72 insertions(+), 33 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700 @@ -89,7 +89,8 @@ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ SGP_CACHE, /* don't exceed i_size, may allocate page */ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ - SGP_WRITE, /* may exceed i_size, may allocate page */ + SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ + SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ }; #ifdef CONFIG_TMPFS @@ -427,8 +428,10 @@ void shmem_unlock_mapping(struct address /* * Remove range of pages and swap entries from radix tree, and free them. + * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. */ -void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) +static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, + bool unfalloc) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); @@ -462,6 +465,8 @@ void shmem_truncate_range(struct inode * break; if (radix_tree_exceptional_entry(page)) { + if (unfalloc) + continue; nr_swaps_freed += !shmem_free_swap(mapping, index, page); continue; @@ -469,9 +474,11 @@ void shmem_truncate_range(struct inode * if (!trylock_page(page)) continue; - if (page->mapping == mapping) { - VM_BUG_ON(PageWriteback(page)); - truncate_inode_page(mapping, page); + if (!unfalloc || !PageUptodate(page)) { + if (page->mapping == mapping) { + VM_BUG_ON(PageWriteback(page)); + truncate_inode_page(mapping, page); + } } unlock_page(page); } @@ -517,12 +524,12 @@ void shmem_truncate_range(struct inode * min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) { - if (index == start) + if (index == start || unfalloc) break; index = start; continue; } - if (index == start && indices[0] >= end) { + if ((index == start || unfalloc) && indices[0] >= end) { shmem_deswap_pagevec(&pvec); pagevec_release(&pvec); break; @@ -536,15 +543,19 @@ void shmem_truncate_range(struct inode * break; if (radix_tree_exceptional_entry(page)) { + if (unfalloc) + continue; nr_swaps_freed += !shmem_free_swap(mapping, index, page); continue; } lock_page(page); - if (page->mapping == mapping) { - VM_BUG_ON(PageWriteback(page)); - truncate_inode_page(mapping, page); + if (!unfalloc || !PageUptodate(page)) { + if (page->mapping == mapping) { + VM_BUG_ON(PageWriteback(page)); + truncate_inode_page(mapping, page); + } } unlock_page(page); } @@ -558,7 +569,11 @@ void shmem_truncate_range(struct inode * info->swapped -= nr_swaps_freed; shmem_recalc_inode(inode); spin_unlock(&info->lock); +} +void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) +{ + shmem_undo_range(inode, lstart, lend, false); inode->i_ctime = inode->i_mtime = CURRENT_TIME; } EXPORT_SYMBOL_GPL(shmem_truncate_range); @@ -771,6 +786,18 @@ static int shmem_writepage(struct page * WARN_ON_ONCE(1); /* Still happens? Tell us about it! */ goto redirty; } + + /* + * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC + * value into swapfile.c, the only way we can correctly account for a + * fallocated page arriving here is now to initialize it and write it. + */ + if (!PageUptodate(page)) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + swap = get_swap_page(); if (!swap.val) goto redirty; @@ -994,6 +1021,7 @@ static int shmem_getpage_gfp(struct inod swp_entry_t swap; int error; int once = 0; + int alloced = 0; if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT)) return -EFBIG; @@ -1005,19 +1033,21 @@ repeat: page = NULL; } - if (sgp != SGP_WRITE && + if (sgp != SGP_WRITE && sgp != SGP_FALLOC && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { error = -EINVAL; goto failed; } + /* fallocated page? */ + if (page && !PageUptodate(page)) { + if (sgp != SGP_READ) + goto clear; + unlock_page(page); + page_cache_release(page); + page = NULL; + } if (page || (sgp == SGP_READ && !swap.val)) { - /* - * Once we can get the page lock, it must be uptodate: - * if there were an error in reading back from swap, - * the page would not be inserted into the filecache. - */ - BUG_ON(page && !PageUptodate(page)); *pagep = page; return 0; } @@ -1114,9 +1144,18 @@ repeat: inode->i_blocks += BLOCKS_PER_PAGE; shmem_recalc_inode(inode); spin_unlock(&info->lock); + alloced = true; /* - * Let SGP_WRITE caller clear ends if write does not fill page + * Let SGP_FALLOC use the SGP_WRITE optimization on a new page. + */ + if (sgp == SGP_FALLOC) + sgp = SGP_WRITE; +clear: + /* + * Let SGP_WRITE caller clear ends if write does not fill page; + * but SGP_FALLOC on a page fallocated earlier must initialize + * it now, lest undo on failure cancel our earlier guarantee. */ if (sgp != SGP_WRITE) { clear_highpage(page); @@ -1128,10 +1167,13 @@ repeat: } /* Perhaps the file has been truncated since we checked */ - if (sgp != SGP_WRITE && + if (sgp != SGP_WRITE && sgp != SGP_FALLOC && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { error = -EINVAL; - goto trunc; + if (alloced) + goto trunc; + else + goto failed; } *pagep = page; return 0; @@ -1140,6 +1182,7 @@ repeat: * Error recovery. */ trunc: + info = SHMEM_I(inode); ClearPageDirty(page); delete_from_page_cache(page); spin_lock(&info->lock); @@ -1147,6 +1190,7 @@ trunc: inode->i_blocks -= BLOCKS_PER_PAGE; spin_unlock(&info->lock); decused: + sbinfo = SHMEM_SB(inode->i_sb); if (sbinfo->max_blocks) percpu_counter_add(&sbinfo->used_blocks, -1); unacct: @@ -1645,25 +1689,20 @@ static long shmem_fallocate(struct file if (signal_pending(current)) error = -EINTR; else - error = shmem_getpage(inode, index, &page, SGP_WRITE, + error = shmem_getpage(inode, index, &page, SGP_FALLOC, NULL); if (error) { - /* - * We really ought to free what we allocated so far, - * but it would be wrong to free pages allocated - * earlier, or already now in use: i_mutex does not - * exclude all cases. We do not know what to free. - */ + /* Remove the !PageUptodate pages we added */ + shmem_undo_range(inode, + (loff_t)start << PAGE_CACHE_SHIFT, + (loff_t)index << PAGE_CACHE_SHIFT, true); goto ctime; } - if (!PageUptodate(page)) { - clear_highpage(page); - flush_dcache_page(page); - SetPageUptodate(page); - } /* - * set_page_dirty so that memory pressure will swap rather + * If !PageUptodate, leave it that way so that freeable pages + * can be recognized if we need to rollback on error later. + * But set_page_dirty so that memory pressure will swap rather * than free the pages we are allocating (and SGP_CACHE pages * might still be clean: we now need to mark those dirty too). */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 9/10] tmpfs: quit when fallocate fills memory Date: Sat, 12 May 2012 05:21:22 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org As it stands, a large fallocate() on tmpfs is liable to fill memory with pages, freed on failure except when they run into swap, at which point they become fixed into the file despite the failure. That feels quite wrong, to be consuming resources precisely when they're in short supply. Go the other way instead: shmem_fallocate() indicate the range it has fallocated to shmem_writepage(), keeping count of pages it's allocating; shmem_writepage() reactivate instead of swapping out pages fallocated by this syscall (but happily swap out those from earlier occasions), keeping count; shmem_fallocate() compare counts and give up once the reactivated pages have started to coming back to writepage (approximately: some zones would in fact recycle faster than others). This is a little unusual, but works well: although we could consider the failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling added in swapfile.c and memcontrol.c, I doubt that we shall ever want to. (If there's no swap, an over-large fallocate() on tmpfs is limited in the same way as writing: stopped by rlimit, or by tmpfs mount size if that was set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or OVERCOMMIT_NEVER. If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill others as writing would, but stops and frees if interrupted.) Now that everything is freed on failure, we can then skip updating ctime. Signed-off-by: Hugh Dickins --- mm/shmem.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 56 insertions(+), 2 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700 @@ -84,6 +84,18 @@ struct shmem_xattr { char value[0]; }; +/* + * shmem_fallocate and shmem_writepage communicate via inode->i_private + * (with i_mutex making sure that it has only one user at a time): + * we would prefer not to enlarge the shmem inode just for that. + */ +struct shmem_falloc { + pgoff_t start; /* start of range currently being fallocated */ + pgoff_t next; /* the next page offset to be fallocated */ + pgoff_t nr_falloced; /* how many new pages have been fallocated */ + pgoff_t nr_unswapped; /* how often writepage refused to swap out */ +}; + /* Flag allocation requirements to shmem_getpage */ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ @@ -791,8 +803,28 @@ static int shmem_writepage(struct page * * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC * value into swapfile.c, the only way we can correctly account for a * fallocated page arriving here is now to initialize it and write it. + * + * That's okay for a page already fallocated earlier, but if we have + * not yet completed the fallocation, then (a) we want to keep track + * of this page in case we have to undo it, and (b) it may not be a + * good idea to continue anyway, once we're pushing into swap. So + * reactivate the page, and let shmem_fallocate() quit when too many. */ if (!PageUptodate(page)) { + if (inode->i_private) { + struct shmem_falloc *shmem_falloc; + spin_lock(&inode->i_lock); + shmem_falloc = inode->i_private; + if (shmem_falloc && + index >= shmem_falloc->start && + index < shmem_falloc->next) + shmem_falloc->nr_unswapped++; + else + shmem_falloc = NULL; + spin_unlock(&inode->i_lock); + if (shmem_falloc) + goto redirty; + } clear_highpage(page); flush_dcache_page(page); SetPageUptodate(page); @@ -1647,6 +1679,7 @@ static long shmem_fallocate(struct file { struct inode *inode = file->f_path.dentry->d_inode; struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + struct shmem_falloc shmem_falloc; pgoff_t start, index, end; int error; @@ -1679,6 +1712,14 @@ static long shmem_fallocate(struct file goto out; } + shmem_falloc.start = start; + shmem_falloc.next = start; + shmem_falloc.nr_falloced = 0; + shmem_falloc.nr_unswapped = 0; + spin_lock(&inode->i_lock); + inode->i_private = &shmem_falloc; + spin_unlock(&inode->i_lock); + for (index = start; index < end; index++) { struct page *page; @@ -1688,6 +1729,8 @@ static long shmem_fallocate(struct file */ if (signal_pending(current)) error = -EINTR; + else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) + error = -ENOMEM; else error = shmem_getpage(inode, index, &page, SGP_FALLOC, NULL); @@ -1696,10 +1739,18 @@ static long shmem_fallocate(struct file shmem_undo_range(inode, (loff_t)start << PAGE_CACHE_SHIFT, (loff_t)index << PAGE_CACHE_SHIFT, true); - goto ctime; + goto undone; } /* + * Inform shmem_writepage() how far we have reached. + * No need for lock or barrier: we have the page lock. + */ + shmem_falloc.next++; + if (!PageUptodate(page)) + shmem_falloc.nr_falloced++; + + /* * If !PageUptodate, leave it that way so that freeable pages * can be recognized if we need to rollback on error later. * But set_page_dirty so that memory pressure will swap rather @@ -1714,8 +1765,11 @@ static long shmem_fallocate(struct file if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); -ctime: inode->i_ctime = CURRENT_TIME; +undone: + spin_lock(&inode->i_lock); + inode->i_private = NULL; + spin_unlock(&inode->i_lock); out: mutex_unlock(&inode->i_mutex); return error; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH 10/10] tmpfs: support SEEK_DATA and SEEK_HOLE Date: Sat, 12 May 2012 05:27:31 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , Josef Bacik , Andi Kleen , Andreas Dilger , Dave Chinner , Marco Stornelli , Jeff liu , Chris Mason , Sunil Mushran , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org It's quite easy for tmpfs to scan the radix_tree to support llseek's new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still on my mind (in particular, the !PageUptodate-ness of pages fallocated but still unwritten). But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile? Signed-off-by: Hugh Dickins --- mm/shmem.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 93 insertions(+), 1 deletion(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:47:09.724063528 -0700 @@ -439,6 +439,56 @@ void shmem_unlock_mapping(struct address } /* + * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. + */ +static pgoff_t shmem_seek_hole_data(struct address_space *mapping, + pgoff_t index, pgoff_t end, int origin) +{ + struct page *page; + struct pagevec pvec; + pgoff_t indices[PAGEVEC_SIZE]; + bool done = false; + int i; + + pagevec_init(&pvec, 0); + pvec.nr = 1; /* start small: we may be there already */ + while (!done) { + pvec.nr = shmem_find_get_pages_and_swap(mapping, index, + pvec.nr, pvec.pages, indices); + if (!pvec.nr) { + if (origin == SEEK_DATA) + index = end; + break; + } + for (i = 0; i < pvec.nr; i++, index++) { + if (index < indices[i]) { + if (origin == SEEK_HOLE) { + done = true; + break; + } + index = indices[i]; + } + page = pvec.pages[i]; + if (page && !radix_tree_exceptional_entry(page)) { + if (!PageUptodate(page)) + page = NULL; + } + if (index >= end || + (page && origin == SEEK_DATA) || + (!page && origin == SEEK_HOLE)) { + done = true; + break; + } + } + shmem_deswap_pagevec(&pvec); + pagevec_release(&pvec); + pvec.nr = PAGEVEC_SIZE; + cond_resched(); + } + return index; +} + +/* * Remove range of pages and swap entries from radix tree, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. */ @@ -1674,6 +1724,48 @@ static ssize_t shmem_file_splice_read(st return error; } +static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin) +{ + struct address_space *mapping; + struct inode *inode; + pgoff_t start, end; + loff_t new_offset; + + if (origin != SEEK_DATA && origin != SEEK_HOLE) + return generic_file_llseek_size(file, offset, origin, + MAX_LFS_FILESIZE); + mapping = file->f_mapping; + inode = mapping->host; + mutex_lock(&inode->i_mutex); + /* We're holding i_mutex so we can access i_size directly */ + + if (offset < 0) + offset = -EINVAL; + else if (offset >= inode->i_size) + offset = -ENXIO; + else { + start = offset >> PAGE_CACHE_SHIFT; + end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + new_offset = shmem_seek_hole_data(mapping, start, end, origin); + new_offset <<= PAGE_CACHE_SHIFT; + if (new_offset > offset) { + if (new_offset < inode->i_size) + offset = new_offset; + else if (origin == SEEK_DATA) + offset = -ENXIO; + else + offset = inode->i_size; + } + } + + if (offset >= 0 && offset != file->f_pos) { + file->f_pos = offset; + file->f_version = 0; + } + mutex_unlock(&inode->i_mutex); + return offset; +} + static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { @@ -2667,7 +2759,7 @@ static const struct address_space_operat static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, #ifdef CONFIG_TMPFS - .llseek = generic_file_llseek, + .llseek = shmem_file_llseek, .read = do_sync_read, .write = do_sync_write, .aio_read = shmem_file_aio_read, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Cong Wang Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Mon, 14 May 2012 16:55:36 +0800 Message-ID: <4FB0C888.8070805@gmail.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 05/12/2012 07:59 PM, Hugh Dickins wrote: > + VM_BUG_ON(!PageLocked(oldpage)); > + __set_page_locked(newpage); > + VM_BUG_ON(!PageUptodate(oldpage)); > + SetPageUptodate(newpage); > + VM_BUG_ON(!PageSwapBacked(oldpage)); > + SetPageSwapBacked(newpage); > + VM_BUG_ON(!swap_index); > + set_page_private(newpage, swap_index); > + VM_BUG_ON(!PageSwapCache(oldpage)); > + SetPageSwapCache(newpage); > + Are all of these VM_BUG_ON's necessary? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Mon, 14 May 2012 18:06:07 +0900 Message-ID: <4FB0CAFF.6060705@jp.fujitsu.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Andrew Morton , Christoph Hellwig , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org (2012/05/12 20:59), Hugh Dickins wrote: > The GMA500 GPU driver uses GEM shmem objects, but with a new twist: > the backing RAM has to be below 4GB. Not a problem while the boards > supported only 4GB: but now Intel's D2700MUD boards support 8GB, and > their GMA3600 is managed by the GMA500 driver. > > shmem/tmpfs has never pretended to support hardware restrictions on > the backing memory, but it might have appeared to do so before v3.1, > and even now it works fine until a page is swapped out then back in. > When read_cache_page_gfp() supplied a freshly allocated page for copy, > that compensated for whatever choice might have been made by earlier > swapin readahead; but swapoff was likely to destroy the illusion. > > We'd like to continue to support GMA500, so now add a new > shmem_should_replace_page() check on the zone when about to move > a page from swapcache to filecache (in swapin and swapoff cases), > with shmem_replace_page() to allocate and substitute a suitable page > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > This does involve a minor extension to mem_cgroup_replace_page_cache() > (the page may or may not have already been charged); and I've removed > a comment and call to mem_cgroup_uncharge_cache_page(), which in fact > is always a no-op while PageSwapCache. > > Also removed optimization of an unlikely path in shmem_getpage_gfp(), > now that we need to check PageSwapCache more carefully (a racing caller > might already have made the copy). And at one point shmem_unuse_inode() > needs to use the hitherto private page_swapcount(), to guard against > racing with inode eviction. > > It would make sense to extend shmem_should_replace_page(), to cover > cpuset and NUMA mempolicy restrictions too, but set that aside for > now: needs a cleanup of shmem mempolicy handling, and more testing, > and ought to handle swap faults in do_swap_page() as well as shmem. > > Signed-off-by: Hugh Dickins > --- Acked-by: KAMEZAWA Hiroyuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Cong Wang Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization Date: Mon, 14 May 2012 17:35:02 +0800 Message-ID: <4FB0D1C6.7090801@gmail.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Andrew Morton , Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 05/12/2012 08:02 PM, Hugh Dickins wrote: > Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() > overhead on most common writes): set MS_NOSEC on its superblocks. > > Signed-off-by: Hugh Dickins > --- > mm/shmem.c | 1 + > 1 file changed, 1 insertion(+) > > --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 > +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 > @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block > } > } > sb->s_export_op =&shmem_export_ops; > + sb->s_flags |= MS_NOSEC; Isn't setting the flag on inode better? Something like: diff --git a/mm/shmem.c b/mm/shmem.c index f99ff3e..7d98fb5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo) { struct shmem_inode_info *info = foo; inode_init_once(&info->vfs_inode); + info->vfs_inode.i_flags |= S_NOSEC; } static int shmem_init_inodecache(void) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Mon, 14 May 2012 12:42:21 -0700 (PDT) Message-ID: References: <4FB0C888.8070805@gmail.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Cong Wang Return-path: Received: from mail-pb0-f46.google.com ([209.85.160.46]:41404 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756803Ab2ENTmo (ORCPT ); Mon, 14 May 2012 15:42:44 -0400 Received: by pbbrp8 with SMTP id rp8so6527950pbb.19 for ; Mon, 14 May 2012 12:42:44 -0700 (PDT) In-Reply-To: <4FB0C888.8070805@gmail.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Mon, 14 May 2012, Cong Wang wrote: > On 05/12/2012 07:59 PM, Hugh Dickins wrote: > > + VM_BUG_ON(!PageLocked(oldpage)); > > + __set_page_locked(newpage); > > + VM_BUG_ON(!PageUptodate(oldpage)); > > + SetPageUptodate(newpage); > > + VM_BUG_ON(!PageSwapBacked(oldpage)); > > + SetPageSwapBacked(newpage); > > + VM_BUG_ON(!swap_index); > > + set_page_private(newpage, swap_index); > > + VM_BUG_ON(!PageSwapCache(oldpage)); > > + SetPageSwapCache(newpage); > > + > > Are all of these VM_BUG_ON's necessary? I'm really glad you asked that - thank you. At first I was just going to brush you off with a standard reply of something like "well, no BUG_ON should ever be necessary, but we do find them helpful in practice". But (a) these ones have probably outlived their usefulness: they were certainly reassuring to me when I was testing, but perhaps now are just cluttering up the flow. I did make them "VM_" BUG_ONs in the hope that distros wouldn't waste space and time switching them on, but now I'm inclined to agree with you that they should just be removed. Most of them are doing no more than confirm what's been checked before calling the function (and confirming that status cannot racily change). And (b) whereas they didn't actually catch anything for me, they have been giving false assurance: because (I believe) there really is a bug lurking there that they have not yet met and caught. And I would have missed it if you hadn't directed me back to think about these. It is an exceedingly unlikely bug (and need not delay use of the patch), but what I'm re-remembering is just how slippery swap is: the problem is that a swapcache page can get freed and reused before getting the page lock on it; and it might even get reused for swapcache. Perhaps I need also to be checking page->private, or perhaps I need to check for error instead of BUG_ON(error) just before the lines you picked out, or both. I'm not going to rush the incremental patch to fix this: need to think about it quietly first. If you're wondering what I'm talking about (sorry, I don't have time to explain more right now), take a look at comment and git history of line 2956 (in 3.4-rc7) of mm/memory.c: if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) I don't suppose anyone ever actually hit the bug in the years before we added that protection, but we still ought to guard against it, there and here in shmem_replace_page(). Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization Date: Mon, 14 May 2012 12:48:16 -0700 (PDT) Message-ID: References: <4FB0D1C6.7090801@gmail.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Andrew Morton , Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Cong Wang Return-path: In-Reply-To: <4FB0D1C6.7090801@gmail.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, 14 May 2012, Cong Wang wrote: > On 05/12/2012 08:02 PM, Hugh Dickins wrote: > > Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() > > overhead on most common writes): set MS_NOSEC on its superblocks. > > > > Signed-off-by: Hugh Dickins > > --- > > mm/shmem.c | 1 + > > 1 file changed, 1 insertion(+) > > > > --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 > > +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 > > @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block > > } > > } > > sb->s_export_op =&shmem_export_ops; > > + sb->s_flags |= MS_NOSEC; > > Isn't setting the flag on inode better? Something like: I don't think so. The MS_NOSEC S_NOSEC business is fairly subtle, and easy to miss if it's gone wrong, so I would much rather follow the established pattern in local block filesystems: which is to set MS_NOSEC in superblock flags, and leave S_NOSEC to file_remove_suid(). Hugh > > diff --git a/mm/shmem.c b/mm/shmem.c > index f99ff3e..7d98fb5 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo) > { > struct shmem_inode_info *info = foo; > inode_init_once(&info->vfs_inode); > + info->vfs_inode.i_flags |= S_NOSEC; > } > > static int shmem_init_inodecache(void) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Mon, 14 May 2012 16:13:30 -0700 Message-ID: <20120514161330.def0ac52.akpm@linux-foundation.org> References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sat, 12 May 2012 04:59:56 -0700 (PDT) Hugh Dickins wrote: > The GMA500 GPU driver uses GEM shmem objects, but with a new twist: > the backing RAM has to be below 4GB. Not a problem while the boards > supported only 4GB: but now Intel's D2700MUD boards support 8GB, and > their GMA3600 is managed by the GMA500 driver. > > shmem/tmpfs has never pretended to support hardware restrictions on > the backing memory, but it might have appeared to do so before v3.1, > and even now it works fine until a page is swapped out then back in. > When read_cache_page_gfp() supplied a freshly allocated page for copy, > that compensated for whatever choice might have been made by earlier > swapin readahead; but swapoff was likely to destroy the illusion. > > We'd like to continue to support GMA500, so now add a new > shmem_should_replace_page() check on the zone when about to move > a page from swapcache to filecache (in swapin and swapoff cases), > with shmem_replace_page() to allocate and substitute a suitable page > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > This does involve a minor extension to mem_cgroup_replace_page_cache() > (the page may or may not have already been charged); and I've removed > a comment and call to mem_cgroup_uncharge_cache_page(), which in fact > is always a no-op while PageSwapCache. > > Also removed optimization of an unlikely path in shmem_getpage_gfp(), > now that we need to check PageSwapCache more carefully (a racing caller > might already have made the copy). And at one point shmem_unuse_inode() > needs to use the hitherto private page_swapcount(), to guard against > racing with inode eviction. > > It would make sense to extend shmem_should_replace_page(), to cover > cpuset and NUMA mempolicy restrictions too, but set that aside for > now: needs a cleanup of shmem mempolicy handling, and more testing, > and ought to handle swap faults in do_swap_page() as well as shmem. > > ... > > static int shmem_unuse_inode(struct shmem_inode_info *info, > - swp_entry_t swap, struct page *page) > + swp_entry_t swap, struct page **pagep) > { > struct address_space *mapping = info->vfs_inode.i_mapping; > void *radswap; > pgoff_t index; > - int error; > + gfp_t gfp; > + int error = 0; > > radswap = swp_to_radix_entry(swap); > index = radix_tree_locate_item(&mapping->page_tree, radswap); > @@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme > if (shmem_swaplist.next != &info->swaplist) > list_move_tail(&shmem_swaplist, &info->swaplist); > > + gfp = mapping_gfp_mask(mapping); > + if (shmem_should_replace_page(*pagep, gfp)) { > + mutex_unlock(&shmem_swaplist_mutex); > + error = shmem_replace_page(pagep, gfp, info, index); > + mutex_lock(&shmem_swaplist_mutex); > + /* > + * We needed to drop mutex to make that restrictive page > + * allocation; but the inode might already be freed by now, > + * and we cannot refer to inode or mapping or info to check. > + * However, we do hold page lock on the PageSwapCache page, > + * so can check if that still has our reference remaining. > + */ > + if (!page_swapcount(*pagep)) > + error = -ENOENT; This has my head spinning a bit. What is "our reference"? I'd expect that to mean a temporary reference which was taken by this thread of control. But such a thing has no relevance when trying to determine the state of the page and/or data structures which refer to it. Also, what are we trying to determine here with this test? Whether the page was removed from swapcache under our feet? Presumably not, as it is locked. So perhaps you could spell out in more detail what we're trying to do here, and what contributes to page_swapcount() here? > + } > + > /* > * We rely on shmem_swaplist_mutex, not only to protect the swaplist, > * but also to hold up shmem_evict_inode(): so inode cannot be freed > * beneath us (pagelock doesn't help until the page is in pagecache). > */ > - error = shmem_add_to_page_cache(page, mapping, index, > + if (!error) > + error = shmem_add_to_page_cache(*pagep, mapping, index, > GFP_NOWAIT, radswap); > - /* which does mem_cgroup_uncharge_cache_page on error */ > - > if (error != -ENOMEM) { > /* > * Truncation and eviction use free_swap_and_cache(), which > * only does trylock page: if we raced, best clean up here. > */ > - delete_from_swap_cache(page); > - set_page_dirty(page); > + delete_from_swap_cache(*pagep); > + set_page_dirty(*pagep); > if (!error) { > spin_lock(&info->lock); > info->swapped--; > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > struct list_head *this, *next; > struct shmem_inode_info *info; > int found = 0; > - int error; > + int error = 0; > + > + /* > + * There's a faint possibility that swap page was replaced before > + * caller locked it: it will come back later with the right page. So a caller locked the page then failed to check that it's still the right sort of page? Shouldn't the caller locally clean up its own mess rather than requiring a callee to know about the caller's intricate shortcomings? > + */ > + if (unlikely(!PageSwapCache(page))) > + goto out; > > /* > * Charge page using GFP_KERNEL while we can wait, before taking > > ... > > @@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge > #endif > > /* > + * When a page is moved from swapcache to shmem filecache (either by the > + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of > + * shmem_unuse_inode()), it may have been read in earlier from swap, in > + * ignorance of the mapping it belongs to. If that mapping has special > + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), > + * we may need to copy to a suitable page before moving to filecache. > + * > + * In a future release, this may well be extended to respect cpuset and > + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); > + * but for now it is a simple matter of zone. > + */ > +static bool shmem_should_replace_page(struct page *page, gfp_t gfp) > +{ > + return page_zonenum(page) > gfp_zone(gfp); > +} > + > +static int shmem_replace_page(struct page **pagep, gfp_t gfp, > + struct shmem_inode_info *info, pgoff_t index) > +{ > + struct page *oldpage, *newpage; > + struct address_space *swap_mapping; > + pgoff_t swap_index; > + int error; > + > + oldpage = *pagep; > + swap_index = page_private(oldpage); > + swap_mapping = page_mapping(oldpage); > + > + /* > + * We have arrived here because our zones are constrained, so don't > + * limit chance of success by further cpuset and node constraints. > + */ > + gfp &= ~GFP_CONSTRAINT_MASK; > + newpage = shmem_alloc_page(gfp, info, index); > + if (!newpage) > + return -ENOMEM; > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > + > + *pagep = newpage; > + page_cache_get(newpage); > + copy_highpage(newpage, oldpage); copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? > + VM_BUG_ON(!PageLocked(oldpage)); > + __set_page_locked(newpage); > + VM_BUG_ON(!PageUptodate(oldpage)); > + SetPageUptodate(newpage); > + VM_BUG_ON(!PageSwapBacked(oldpage)); > + SetPageSwapBacked(newpage); > + VM_BUG_ON(!swap_index); > + set_page_private(newpage, swap_index); > + VM_BUG_ON(!PageSwapCache(oldpage)); > + SetPageSwapCache(newpage); > + > + /* > + * Our caller will very soon move newpage out of swapcache, but it's > + * a nice clean interface for us to replace oldpage by newpage there. > + */ > + spin_lock_irq(&swap_mapping->tree_lock); > + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, > + newpage); > + __inc_zone_page_state(newpage, NR_FILE_PAGES); > + __dec_zone_page_state(oldpage, NR_FILE_PAGES); > + spin_unlock_irq(&swap_mapping->tree_lock); > + BUG_ON(error); > + > + mem_cgroup_replace_page_cache(oldpage, newpage); > + lru_cache_add_anon(newpage); > + > + ClearPageSwapCache(oldpage); > + set_page_private(oldpage, 0); > + > + unlock_page(oldpage); > + page_cache_release(oldpage); > + page_cache_release(oldpage); > + return 0; > +} shmem_replace_page() is a fairly generic and unexceptional sounding thing. Methinks shmem_substitute_page() would be a better name. > +/* > * shmem_getpage_gfp - find page in cache, or get from swap, or allocate > * > * If we allocate a new one we do not mark it dirty. That's up to the > > ... > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Mon, 14 May 2012 21:07:41 -0700 (PDT) Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: <20120514161330.def0ac52.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, 14 May 2012, Andrew Morton wrote: > On Sat, 12 May 2012 04:59:56 -0700 (PDT) > Hugh Dickins wrote: > > > > We'd like to continue to support GMA500, so now add a new > > shmem_should_replace_page() check on the zone when about to move > > a page from swapcache to filecache (in swapin and swapoff cases), > > with shmem_replace_page() to allocate and substitute a suitable page > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > ... > > + gfp = mapping_gfp_mask(mapping); > > + if (shmem_should_replace_page(*pagep, gfp)) { > > + mutex_unlock(&shmem_swaplist_mutex); > > + error = shmem_replace_page(pagep, gfp, info, index); > > + mutex_lock(&shmem_swaplist_mutex); > > + /* > > + * We needed to drop mutex to make that restrictive page > > + * allocation; but the inode might already be freed by now, > > + * and we cannot refer to inode or mapping or info to check. > > + * However, we do hold page lock on the PageSwapCache page, > > + * so can check if that still has our reference remaining. > > + */ > > + if (!page_swapcount(*pagep)) > > + error = -ENOENT; > > This has my head spinning a bit. What is "our reference"? I'd expect > that to mean a temporary reference which was taken by this thread of > control. (I'm sure you'll prefer a reworking of that comment in an incremental fixes patch, but let me try to explain better here too.) No, I didn't mean a temporary reference taken by this (swapoff) thread, but the reference (swap entry) which has just been located in the inode's radix_tree, just before this hunk: which would be tracked by page_swapcount 1 (there's also a page swapcache bit in the swap_map along with the count, corresponding to the reference from the swapcache page itself, but that's not included in page_swapcount). > But such a thing has no relevance when trying to determine > the state of the page and/or data structures which refer to it. I don't understand you there, but maybe it won't matter. > > Also, what are we trying to determine here with this test? Whether the > page was removed from swapcache under our feet? Presumably not, as it > is locked. > > So perhaps you could spell out in more detail what we're trying to do > here, and what contributes to page_swapcount() here? The danger here is that the inode we're dealing with has gone through shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was certainly in use before, and shmem_swaplist_mutex (together with inode being on shmem_swaplist) holds it up from being evicted and freed; but once we drop the mutex, it could go away at any moment. We cannot determine that by looking at struct inode or struct address_space or struct shmem_inode_info, they're all part of what would be freed; but we cannot proceed to shmem_add_to_page_cache() once they're freed. How to tell whether it's been freed? Once upon a time I "solved" it with igrab() and iput(), but Konstantin demonstrated how that gives no safety against unmounting, and I remain reluctant to go back to relying upon filesystem semantics to solve this. It occurred to me that the inode cannot be freed until that radix_tree entry has been removed (by shmem_evict_inode's shmem_truncate_range), and the act or removing that entry (free_swap_and_cache) brings page_swapcount down from 1 to 0. You're thinking that the page cannot be removed from swapcache while we hold page lock: correct, but... free_swap_and_cache() only does a trylock_page(), and happily leaves the swapcache page to be garbage collected later if it cannot get the page lock. (And I certainly would not want to change it to wait for page lock.) So, the inode can get evicted while the page is still in swapcache: the page lock gives no protection against that, until the page itself gets into the radix_tree. I doubt that writing this essay into a comment there will be the right thing to do (and I may still be losing you); but I shall try to rewrite it, and if there's one missing fact that needs highlighting, it probably is that last, that free_swap_and_cache() only does a trylock, so our page lock does not protect the inode from eviction. (At this moment, I can't think what is the relevance of my comment "we do hold page lock on the PageSwapCache page": in other contexts it would be important, but here in swapoff we know that that swap cannot get reused, or not before we're done.) > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > > struct list_head *this, *next; > > struct shmem_inode_info *info; > > int found = 0; > > - int error; > > + int error = 0; > > + > > + /* > > + * There's a faint possibility that swap page was replaced before > > + * caller locked it: it will come back later with the right page. > > So a caller locked the page then failed to check that it's still the > right sort of page? Shouldn't the caller locally clean up its own mess > rather than requiring a callee to know about the caller's intricate > shortcomings? The caller being try_to_unuse(). You're certainly not the first to argue that way. Perhaps I'm a bit perverse, in letting code which works even in the surprising cases, remain as it is without weeding out those surprising cases. And on this occasion didn't want to add an additional dependence on a slight subtle change in mm/swapfile.c functionality. Hmm, yes, I do still prefer to have the check here in shmem.c: particularly since it is this "shmem_replace_page" business which is increasing the likelihood of such a race, and making further demands on it (if we're going to make the copied page PageSwapCache, then we need to be sure that the page it's replacing was PageSwapCache - though that's something I need to think through again in the light of the race which I thought of in responding to Cong). > > + newpage = shmem_alloc_page(gfp, info, index); > > + if (!newpage) > > + return -ENOMEM; > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > > + > > + *pagep = newpage; > > + page_cache_get(newpage); > > + copy_highpage(newpage, oldpage); > > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? Ooh, I'm pretty sure you're right that we do need flush_dcache_page() there: good catch, thank you. We can't use copy_user_highpage() because in general we don't know any address and vma; but should be following the shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate. > > shmem_replace_page() is a fairly generic and unexceptional sounding > thing. Methinks shmem_substitute_page() would be a better name. Okay, shmem_replace_page() seemed appropriate to me (especially thinking of it as "re-place"), but I don't mind changing to shmem_substitute_page(). The flush_dcache_page() addition is important, but until people are using GMA500 on ARM or something (I doubt that combination) with more than 4GB, this code is not coming into play - so I'm not breaking anyone's system if it sneaks into linux-next before I fix that. The main thing I need to think through quietly is the slippery swap race: I'll send you an incremental patch to fix all these up once I'm satisfied on that. Thanks, Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [PATCH 3/10] tmpfs: optimize clearing when writing Date: Tue, 15 May 2012 18:51:16 +1000 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: base64 Cc: Andrew Morton , Christoph Hellwig , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org T24gMTIgTWF5IDIwMTIgMjI6MDQsIEh1Z2ggRGlja2lucyA8aHVnaGRAZ29vZ2xlLmNvbT4gd3Jv dGU6Cj4gTmljayBwcm9wb3NlZCB5ZWFycyBhZ28gdGhhdCB0bXBmcyBzaG91bGQgYXZvaWQgY2xl YXJpbmcgaXRzIHBhZ2VzIHdoZXJlCj4gd3JpdGUgd2lsbCBvdmVyd3JpdGUgdGhlbSB3aXRoIG5l dyBkYXRhLCBhcyByYW1mcyBoYXMgbG9uZyBkb25lLiDCoEJ1dCBJCj4gbWVzc2VkIGl0IHVwIGFu ZCBqdXN0IGdvdCBiYWQgZGF0YS4gwqBUcmllZCBhZ2FpbiByZWNlbnRseSwgaXQgd29ya3MgZmlu ZS4KPgo+IEhlcmUncyB0aW1lIG91dHB1dCBmb3Igd3JpdGluZyA0R2lCIDE2IHRpbWVzIG9uIHRo aXMgQ29yZSBpNSBsYXB0b3A6Cj4KPiBiZWZvcmU6IHJlYWwgwqAgwqAwbTIxLjE2OXMgdXNlciDC oDBtMC4wMjhzIHN5cyDCoCDCoDBtMjEuMDU3cwo+IMKgIMKgIMKgIMKgcmVhbCDCoCDCoDBtMjEu MzgycyB1c2VyIMKgMG0wLjAxNnMgc3lzIMKgIMKgMG0yMS4yODlzCj4gwqAgwqAgwqAgwqByZWFs IMKgIMKgMG0yMS4zMTFzIHVzZXIgwqAwbTAuMDIwcyBzeXMgwqAgwqAwbTIxLjIxN3MKPgo+IGFm dGVyOiDCoHJlYWwgwqAgwqAwbTE4LjI3M3MgdXNlciDCoDBtMC4wMzJzIHN5cyDCoCDCoDBtMTgu MTY1cwo+IMKgIMKgIMKgIMKgcmVhbCDCoCDCoDBtMTguMzU0cyB1c2VyIMKgMG0wLjAyMHMgc3lz IMKgIMKgMG0xOC4yNjVzCj4gwqAgwqAgwqAgwqByZWFsIMKgIMKgMG0xOC40NDBzIHVzZXIgwqAw bTAuMDMycyBzeXMgwqAgwqAwbTE4LjMzN3MKPgo+IHJhbWZzOiDCoHJlYWwgwqAgwqAwbTE2Ljg2 MHMgdXNlciDCoDBtMC4wMjhzIHN5cyDCoCDCoDBtMTYuNzY1cwo+IMKgIMKgIMKgIMKgcmVhbCDC oCDCoDBtMTcuMzgycyB1c2VyIMKgMG0wLjA0MHMgc3lzIMKgIMKgMG0xNy4yNzNzCj4gwqAgwqAg wqAgwqByZWFsIMKgIMKgMG0xNy4xMzNzIHVzZXIgwqAwbTAuMDQ0cyBzeXMgwqAgwqAwbTE3LjAy MXMKCkNvb2wsIHRoYW5rcyBIdWdoISBWZXJ5IGJpZyBzcGVlZHVwLgoKCj4KPiBZZXMsIEkgaGF2 ZSBkb25lIHBlcmYgcmVwb3J0cywgYnV0IHRoZXkgbmVlZCBtb3JlIGV4cGxhbmF0aW9uIHRoYW4g dGhleQo+IGRlc2VydmU6IGluIHN1bW1hcnksIGNsZWFyX3BhZ2UgdmFuaXNoZXMsIGl0cyBjYWNo ZSBsb2FkaW5nIHNoaWZ0cyBpbnRvCj4gY29weV91c2VyX2dlbmVyaWNfdW5yb2xsZWQ7IHNobWVt X2dldHBhZ2VfZ2ZwIGdvZXMgZG93biwgYW5kIHN1cnByaXNpbmdseQo+IG1hcmtfcGFnZV9hY2Nl c3NlZCBnb2VzIHdheSB1cCAtIEkgdGhpbmsgYmVjYXVzZSB0aGV5IGFyZSByZXNwZWN0aXZlbHkK PiB3aGVyZSB0aGUgY2FjaGUgZ2V0cyB0byBiZSByZWxvYWRlZCBhZnRlciBiZWluZyBwdXJnZWQg YnkgY2xlYXIgb3IgY29weS4KPgo+IFN1Z2dlc3RlZC1ieTogTmljayBQaWdnaW4gPG5waWdnaW5A Z21haWwuY29tPgo+IFNpZ25lZC1vZmYtYnk6IEh1Z2ggRGlja2lucyA8aHVnaGRAZ29vZ2xlLmNv bT4KPiAtLS0KPiDCoG1tL3NobWVtLmMgfCDCoCAyMCArKysrKysrKysrKysrKysrKy0tLQo+IMKg MSBmaWxlIGNoYW5nZWQsIDE3IGluc2VydGlvbnMoKyksIDMgZGVsZXRpb25zKC0pCj4KPiAtLS0g MzA0NU4ub3JpZy9tbS9zaG1lbS5jIMKgIMKgIMKgIDIwMTItMDUtMDUgMTA6NDY6MDUuNzMyMDYy MDA2IC0wNzAwCj4gKysrIDMwNDVOL21tL3NobWVtLmMgwqAgwqAyMDEyLTA1LTA1IDEwOjQ2OjEy LjMxNjA2MjE3MiAtMDcwMAo+IEBAIC0xMDk1LDkgKzEwOTUsMTQgQEAgcmVwZWF0Ogo+IMKgIMKg IMKgIMKgIMKgIMKgIMKgIMKgc2htZW1fcmVjYWxjX2lub2RlKGlub2RlKTsKPiDCoCDCoCDCoCDC oCDCoCDCoCDCoCDCoHNwaW5fdW5sb2NrKCZpbmZvLT5sb2NrKTsKPgo+IC0gwqAgwqAgwqAgwqAg wqAgwqAgwqAgY2xlYXJfaGlnaHBhZ2UocGFnZSk7Cj4gLSDCoCDCoCDCoCDCoCDCoCDCoCDCoCBm bHVzaF9kY2FjaGVfcGFnZShwYWdlKTsKPiAtIMKgIMKgIMKgIMKgIMKgIMKgIMKgIFNldFBhZ2VV cHRvZGF0ZShwYWdlKTsKPiArIMKgIMKgIMKgIMKgIMKgIMKgIMKgIC8qCj4gKyDCoCDCoCDCoCDC oCDCoCDCoCDCoCDCoCogTGV0IFNHUF9XUklURSBjYWxsZXIgY2xlYXIgZW5kcyBpZiB3cml0ZSBk b2VzIG5vdCBmaWxsIHBhZ2UKPiArIMKgIMKgIMKgIMKgIMKgIMKgIMKgIMKgKi8KPiArIMKgIMKg IMKgIMKgIMKgIMKgIMKgIGlmIChzZ3AgIT0gU0dQX1dSSVRFKSB7Cj4gKyDCoCDCoCDCoCDCoCDC oCDCoCDCoCDCoCDCoCDCoCDCoCBjbGVhcl9oaWdocGFnZShwYWdlKTsKPiArIMKgIMKgIMKgIMKg IMKgIMKgIMKgIMKgIMKgIMKgIMKgIGZsdXNoX2RjYWNoZV9wYWdlKHBhZ2UpOwo+ICsgwqAgwqAg wqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgU2V0UGFnZVVwdG9kYXRlKHBhZ2UpOwo+ICsgwqAg wqAgwqAgwqAgwqAgwqAgwqAgfQo+IMKgIMKgIMKgIMKgIMKgIMKgIMKgIMKgaWYgKHNncCA9PSBT R1BfRElSVFkpCj4gwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqBzZXRfcGFnZV9k aXJ0eShwYWdlKTsKPiDCoCDCoCDCoCDCoH0KPiBAQCAtMTMwNyw2ICsxMzEyLDE0IEBAIHNobWVt X3dyaXRlX2VuZChzdHJ1Y3QgZmlsZSAqZmlsZSwgc3RydWMKPiDCoCDCoCDCoCDCoGlmIChwb3Mg KyBjb3BpZWQgPiBpbm9kZS0+aV9zaXplKQo+IMKgIMKgIMKgIMKgIMKgIMKgIMKgIMKgaV9zaXpl X3dyaXRlKGlub2RlLCBwb3MgKyBjb3BpZWQpOwo+Cj4gKyDCoCDCoCDCoCBpZiAoIVBhZ2VVcHRv ZGF0ZShwYWdlKSkgewo+ICsgwqAgwqAgwqAgwqAgwqAgwqAgwqAgaWYgKGNvcGllZCA8IFBBR0Vf Q0FDSEVfU0laRSkgewo+ICsgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgdW5zaWdu ZWQgZnJvbSA9IHBvcyAmIChQQUdFX0NBQ0hFX1NJWkUgLSAxKTsKPiArIMKgIMKgIMKgIMKgIMKg IMKgIMKgIMKgIMKgIMKgIMKgIHplcm9fdXNlcl9zZWdtZW50cyhwYWdlLCAwLCBmcm9tLAo+ICsg wqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqAg ZnJvbSArIGNvcGllZCwgUEFHRV9DQUNIRV9TSVpFKTsKPiArIMKgIMKgIMKgIMKgIMKgIMKgIMKg IH0KPiArIMKgIMKgIMKgIMKgIMKgIMKgIMKgIFNldFBhZ2VVcHRvZGF0ZShwYWdlKTsKPiArIMKg IMKgIMKgIH0KPiDCoCDCoCDCoCDCoHNldF9wYWdlX2RpcnR5KHBhZ2UpOwo+IMKgIMKgIMKgIMKg dW5sb2NrX3BhZ2UocGFnZSk7Cj4gwqAgwqAgwqAgwqBwYWdlX2NhY2hlX3JlbGVhc2UocGFnZSk7 Cj4gQEAgLTE3NjgsNiArMTc4MSw3IEBAIHN0YXRpYyBpbnQgc2htZW1fc3ltbGluayhzdHJ1Y3Qg aW5vZGUgKmQKPiDCoCDCoCDCoCDCoCDCoCDCoCDCoCDCoGthZGRyID0ga21hcF9hdG9taWMocGFn ZSk7Cj4gwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqBtZW1jcHkoa2FkZHIsIHN5bW5hbWUsIGxlbik7 Cj4gwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqBrdW5tYXBfYXRvbWljKGthZGRyKTsKPiArIMKgIMKg IMKgIMKgIMKgIMKgIMKgIFNldFBhZ2VVcHRvZGF0ZShwYWdlKTsKPiDCoCDCoCDCoCDCoCDCoCDC oCDCoCDCoHNldF9wYWdlX2RpcnR5KHBhZ2UpOwo+IMKgIMKgIMKgIMKgIMKgIMKgIMKgIMKgdW5s b2NrX3BhZ2UocGFnZSk7Cj4gwqAgwqAgwqAgwqAgwqAgwqAgwqAgwqBwYWdlX2NhY2hlX3JlbGVh c2UocGFnZSk7Cg== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Cong Wang Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Tue, 15 May 2012 17:44:41 +0800 Message-ID: <4FB22589.9050406@gmail.com> References: <4FB0C888.8070805@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 05/15/2012 03:42 AM, Hugh Dickins wrote: > I'm not going to rush the incremental patch to fix this: need to think > about it quietly first. > > If you're wondering what I'm talking about (sorry, I don't have time > to explain more right now), take a look at comment and git history of > line 2956 (in 3.4-rc7) of mm/memory.c: > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > I don't suppose anyone ever actually hit the bug in the years before > we added that protection, but we still ought to guard against it, > there and here in shmem_replace_page(). > Ok, I have no objections. Thanks for your patches anyway! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Date: Fri, 18 May 2012 18:41:38 -0700 (PDT) Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, 14 May 2012, Hugh Dickins wrote: > On Mon, 14 May 2012, Andrew Morton wrote: > > On Sat, 12 May 2012 04:59:56 -0700 (PDT) > > Hugh Dickins wrote: > > > > > > We'd like to continue to support GMA500, so now add a new > > > shmem_should_replace_page() check on the zone when about to move > > > a page from swapcache to filecache (in swapin and swapoff cases), > > > with shmem_replace_page() to allocate and substitute a suitable page > > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > > > ... > > > + gfp = mapping_gfp_mask(mapping); > > > + if (shmem_should_replace_page(*pagep, gfp)) { > > > + mutex_unlock(&shmem_swaplist_mutex); > > > + error = shmem_replace_page(pagep, gfp, info, index); > > > + mutex_lock(&shmem_swaplist_mutex); > > > + /* > > > + * We needed to drop mutex to make that restrictive page > > > + * allocation; but the inode might already be freed by now, > > > + * and we cannot refer to inode or mapping or info to check. > > > + * However, we do hold page lock on the PageSwapCache page, > > > + * so can check if that still has our reference remaining. > > > + */ > > > + if (!page_swapcount(*pagep)) > > > + error = -ENOENT; > > > > This has my head spinning a bit. What is "our reference"? I'd expect > > that to mean a temporary reference which was taken by this thread of > > control. > > (I'm sure you'll prefer a reworking of that comment in an incremental > fixes patch, but let me try to explain better here too.) > > No, I didn't mean a temporary reference taken by this (swapoff) thread, > but the reference (swap entry) which has just been located in the inode's > radix_tree, just before this hunk: which would be tracked by page_swapcount > 1 (there's also a page swapcache bit in the swap_map along with the count, > corresponding to the reference from the swapcache page itself, but that's > not included in page_swapcount). > > > But such a thing has no relevance when trying to determine > > the state of the page and/or data structures which refer to it. > > I don't understand you there, but maybe it won't matter. > > > > > Also, what are we trying to determine here with this test? Whether the > > page was removed from swapcache under our feet? Presumably not, as it > > is locked. > > > > So perhaps you could spell out in more detail what we're trying to do > > here, and what contributes to page_swapcount() here? > > The danger here is that the inode we're dealing with has gone through > shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was > certainly in use before, and shmem_swaplist_mutex (together with inode > being on shmem_swaplist) holds it up from being evicted and freed; but > once we drop the mutex, it could go away at any moment. We cannot > determine that by looking at struct inode or struct address_space or > struct shmem_inode_info, they're all part of what would be freed; > but we cannot proceed to shmem_add_to_page_cache() once they're freed. > How to tell whether it's been freed? > > Once upon a time I "solved" it with igrab() and iput(), but Konstantin > demonstrated how that gives no safety against unmounting, and I remain > reluctant to go back to relying upon filesystem semantics to solve this. > > It occurred to me that the inode cannot be freed until that radix_tree > entry has been removed (by shmem_evict_inode's shmem_truncate_range), > and the act or removing that entry (free_swap_and_cache) brings > page_swapcount down from 1 to 0. > > You're thinking that the page cannot be removed from swapcache while > we hold page lock: correct, but... free_swap_and_cache() only does a > trylock_page(), and happily leaves the swapcache page to be garbage > collected later if it cannot get the page lock. (And I certainly > would not want to change it to wait for page lock.) So, the inode > can get evicted while the page is still in swapcache: the page lock > gives no protection against that, until the page itself gets into > the radix_tree. > > I doubt that writing this essay into a comment there will be the > right thing to do (and I may still be losing you); but I shall try > to rewrite it, and if there's one missing fact that needs highlighting, > it probably is that last, that free_swap_and_cache() only does a trylock, > so our page lock does not protect the inode from eviction. > > (At this moment, I can't think what is the relevance of my comment > "we do hold page lock on the PageSwapCache page": in other contexts it > would be important, but here in swapoff we know that that swap cannot > get reused, or not before we're done.) > > > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > > > struct list_head *this, *next; > > > struct shmem_inode_info *info; > > > int found = 0; > > > - int error; > > > + int error = 0; > > > + > > > + /* > > > + * There's a faint possibility that swap page was replaced before > > > + * caller locked it: it will come back later with the right page. > > > > So a caller locked the page then failed to check that it's still the > > right sort of page? Shouldn't the caller locally clean up its own mess > > rather than requiring a callee to know about the caller's intricate > > shortcomings? > > The caller being try_to_unuse(). You're certainly not the first to argue > that way. Perhaps I'm a bit perverse, in letting code which works even > in the surprising cases, remain as it is without weeding out those > surprising cases. And on this occasion didn't want to add an additional > dependence on a slight subtle change in mm/swapfile.c functionality. > > Hmm, yes, I do still prefer to have the check here in shmem.c: > particularly since it is this "shmem_replace_page" business which is > increasing the likelihood of such a race, and making further demands > on it (if we're going to make the copied page PageSwapCache, then we > need to be sure that the page it's replacing was PageSwapCache - though > that's something I need to think through again in the light of the race > which I thought of in responding to Cong). > > > > + newpage = shmem_alloc_page(gfp, info, index); > > > + if (!newpage) > > > + return -ENOMEM; > > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > > > + > > > + *pagep = newpage; > > > + page_cache_get(newpage); > > > + copy_highpage(newpage, oldpage); > > > > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? > > Ooh, I'm pretty sure you're right that we do need flush_dcache_page() > there: good catch, thank you. We can't use copy_user_highpage() because > in general we don't know any address and vma; but should be following the > shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate. > > > > > shmem_replace_page() is a fairly generic and unexceptional sounding > > thing. Methinks shmem_substitute_page() would be a better name. > > Okay, shmem_replace_page() seemed appropriate to me (especially thinking > of it as "re-place"), but I don't mind changing to shmem_substitute_page(). > > The flush_dcache_page() addition is important, but until people are > using GMA500 on ARM or something (I doubt that combination) with more > than 4GB, this code is not coming into play - so I'm not breaking anyone's > system if it sneaks into linux-next before I fix that. > > The main thing I need to think through quietly is the slippery swap race: > I'll send you an incremental patch to fix all these up once I'm satisfied > on that. I promised you an incremental, but that's not really possible because of the name changes from "replace" to "substitute". So I'll be sending you a v2 patch in a moment, to replace (or substitute for) the original 1/10. It responds to feedback comment: 1. "substitute" instead of "replace" [akpm] 2. more explanation of page_swapcount test [akpm] 3. flush_dcache_page after copy_highpage [akpm] 4. removal of excessive VM_BUG_ONs [wangcong] 5. check page_private before and error path within substitute_page [hughd] See below for a diff from v1 for review, omitting replace->substitute mods. Please don't be disappointed if I send you a further patch to shmem_substitute_page() in the weeks ahead: although the page_private checks I've added in this one make it very very very unlikely, and its consequence very probably benign, there is still a surprising (never observed) race by which shmem_getpage_gfp() could get hold of someone else's swap. It's correctly resolved by shmem_add_to_page_cache(), but by that time we have already done a mem_cgroup charge, and now also this substitution. It would be better to rearrange a little here, to eliminate all chance of that surprise: I hoped to complete that earlier, but now think I'd better get the safer intermediate version to you first. Thanks, Hugh --- mm/shmem.c | 57 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 37 insertions(+), 20 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-17 16:28:43.278076430 -0700 +++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700 @@ -636,10 +636,21 @@ static int shmem_unuse_inode(struct shme mutex_lock(&shmem_swaplist_mutex); /* * We needed to drop mutex to make that restrictive page - * allocation; but the inode might already be freed by now, - * and we cannot refer to inode or mapping or info to check. - * However, we do hold page lock on the PageSwapCache page, - * so can check if that still has our reference remaining. + * allocation, but the inode might have been freed while we + * dropped it: although a racing shmem_evict_inode() cannot + * complete without emptying the radix_tree, our page lock + * on this swapcache page is not enough to prevent that - + * free_swap_and_cache() of our swap entry will only + * trylock_page(), removing swap from radix_tree whatever. + * + * We must not proceed to shmem_add_to_page_cache() if the + * inode has been freed, but of course we cannot rely on + * inode or mapping or info to check that. However, we can + * safely check if our swap entry is still in use (and here + * it can't have got reused for another page): if it's still + * in use, then the inode cannot have been freed yet, and we + * can safely proceed (if it's no longer in use, that tells + * nothing about the inode, but we don't need to unuse swap). */ if (!page_swapcount(*pagep)) error = -ENOENT; @@ -683,9 +694,9 @@ int shmem_unuse(swp_entry_t swap, struct /* * There's a faint possibility that swap page was substituted before - * caller locked it: it will come back later with the right page. + * caller locked it: caller will come back later with the right page. */ - if (unlikely(!PageSwapCache(page))) + if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) goto out; /* @@ -916,21 +927,15 @@ static int shmem_substitute_page(struct newpage = shmem_alloc_page(gfp, info, index); if (!newpage) return -ENOMEM; - VM_BUG_ON(shmem_should_substitute_page(newpage, gfp)); - *pagep = newpage; page_cache_get(newpage); copy_highpage(newpage, oldpage); + flush_dcache_page(newpage); - VM_BUG_ON(!PageLocked(oldpage)); __set_page_locked(newpage); - VM_BUG_ON(!PageUptodate(oldpage)); SetPageUptodate(newpage); - VM_BUG_ON(!PageSwapBacked(oldpage)); SetPageSwapBacked(newpage); - VM_BUG_ON(!swap_index); set_page_private(newpage, swap_index); - VM_BUG_ON(!PageSwapCache(oldpage)); SetPageSwapCache(newpage); /* @@ -940,13 +945,24 @@ static int shmem_substitute_page(struct spin_lock_irq(&swap_mapping->tree_lock); error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, newpage); - __inc_zone_page_state(newpage, NR_FILE_PAGES); - __dec_zone_page_state(oldpage, NR_FILE_PAGES); + if (!error) { + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + } spin_unlock_irq(&swap_mapping->tree_lock); - BUG_ON(error); - mem_cgroup_replace_page_cache(oldpage, newpage); - lru_cache_add_anon(newpage); + if (unlikely(error)) { + /* + * Is this possible? I think not, now that our callers check + * both PageSwapCache and page_private after getting page lock; + * but be defensive. Reverse old to newpage for clear and free. + */ + oldpage = newpage; + } else { + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + *pagep = newpage; + } ClearPageSwapCache(oldpage); set_page_private(oldpage, 0); @@ -954,7 +970,7 @@ static int shmem_substitute_page(struct unlock_page(oldpage); page_cache_release(oldpage); page_cache_release(oldpage); - return 0; + return error; } /* @@ -1025,7 +1041,8 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); - if (!PageSwapCache(page) || page->mapping) { + if (!PageSwapCache(page) || page_private(page) != swap.val || + page->mapping) { error = -EEXIST; /* try again */ goto failed; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: [PATCH v2 1/10] shmem: substitute page if mapping excludes its zone Date: Fri, 18 May 2012 18:44:17 -0700 (PDT) Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We would like to continue to support GMA500, so now add a new shmem_should_substitute_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_substitute_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_substitute_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins Acked-by: KAMEZAWA Hiroyuki --- I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest in the i915 Sandybridge issue; but reiterate that this patch does nothing for that case. include/linux/swap.h | 6 + mm/memcontrol.c | 17 +++- mm/shmem.c | 158 ++++++++++++++++++++++++++++++++++++----- mm/swapfile.c | 2 4 files changed, 159 insertions(+), 24 deletions(-) --- 3045N.orig/include/linux/swap.h 2012-05-17 16:30:20.222078994 -0700 +++ 3045N/include/linux/swap.h 2012-05-17 16:30:29.786078930 -0700 @@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t, extern unsigned int count_swap_pages(int, int); extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); +extern int page_swapcount(struct page *); extern int reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; @@ -448,6 +449,11 @@ static inline void delete_from_swap_cach { } +static inline int page_swapcount(struct page *page) +{ + return 0; +} + #define reuse_swap_page(page) (page_mapcount(page) == 1) static inline int try_to_free_swap(struct page *page) --- 3045N.orig/mm/memcontrol.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/memcontrol.c 2012-05-17 16:30:29.786078930 -0700 @@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; struct page_cgroup *pc; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc pc = lookup_page_cgroup(oldpage); /* fix accounting on old pages */ lock_page_cgroup(pc); - memcg = pc->mem_cgroup; - mem_cgroup_charge_statistics(memcg, false, -1); - ClearPageCgroupUsed(pc); + if (PageCgroupUsed(pc)) { + memcg = pc->mem_cgroup; + mem_cgroup_charge_statistics(memcg, false, -1); + ClearPageCgroupUsed(pc); + } unlock_page_cgroup(pc); + /* + * When called from shmem_substitute_page(), in some cases the + * oldpage has already been charged, and in some cases not. + */ + if (!memcg) + return; + if (PageSwapBacked(oldpage)) type = MEM_CGROUP_CHARGE_TYPE_SHMEM; --- 3045N.orig/mm/shmem.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700 @@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i } #endif +static bool shmem_should_substitute_page(struct page *page, gfp_t gfp); +static int shmem_substitute_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index); static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); @@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino * If swap found in inode, free it and move page from swapcache to filecache. */ static int shmem_unuse_inode(struct shmem_inode_info *info, - swp_entry_t swap, struct page *page) + swp_entry_t swap, struct page **pagep) { struct address_space *mapping = info->vfs_inode.i_mapping; void *radswap; pgoff_t index; - int error; + gfp_t gfp; + int error = 0; radswap = swp_to_radix_entry(swap); index = radix_tree_locate_item(&mapping->page_tree, radswap); @@ -625,22 +629,48 @@ static int shmem_unuse_inode(struct shme if (shmem_swaplist.next != &info->swaplist) list_move_tail(&shmem_swaplist, &info->swaplist); + gfp = mapping_gfp_mask(mapping); + if (shmem_should_substitute_page(*pagep, gfp)) { + mutex_unlock(&shmem_swaplist_mutex); + error = shmem_substitute_page(pagep, gfp, info, index); + mutex_lock(&shmem_swaplist_mutex); + /* + * We needed to drop mutex to make that restrictive page + * allocation, but the inode might have been freed while we + * dropped it: although a racing shmem_evict_inode() cannot + * complete without emptying the radix_tree, our page lock + * on this swapcache page is not enough to prevent that - + * free_swap_and_cache() of our swap entry will only + * trylock_page(), removing swap from radix_tree whatever. + * + * We must not proceed to shmem_add_to_page_cache() if the + * inode has been freed, but of course we cannot rely on + * inode or mapping or info to check that. However, we can + * safely check if our swap entry is still in use (and here + * it can't have got reused for another page): if it's still + * in use, then the inode cannot have been freed yet, and we + * can safely proceed (if it's no longer in use, that tells + * nothing about the inode, but we don't need to unuse swap). + */ + if (!page_swapcount(*pagep)) + error = -ENOENT; + } + /* * We rely on shmem_swaplist_mutex, not only to protect the swaplist, * but also to hold up shmem_evict_inode(): so inode cannot be freed * beneath us (pagelock doesn't help until the page is in pagecache). */ - error = shmem_add_to_page_cache(page, mapping, index, + if (!error) + error = shmem_add_to_page_cache(*pagep, mapping, index, GFP_NOWAIT, radswap); - /* which does mem_cgroup_uncharge_cache_page on error */ - if (error != -ENOMEM) { /* * Truncation and eviction use free_swap_and_cache(), which * only does trylock page: if we raced, best clean up here. */ - delete_from_swap_cache(page); - set_page_dirty(page); + delete_from_swap_cache(*pagep); + set_page_dirty(*pagep); if (!error) { spin_lock(&info->lock); info->swapped--; @@ -660,7 +690,14 @@ int shmem_unuse(swp_entry_t swap, struct struct list_head *this, *next; struct shmem_inode_info *info; int found = 0; - int error; + int error = 0; + + /* + * There's a faint possibility that swap page was substituted before + * caller locked it: caller will come back later with the right page. + */ + if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) + goto out; /* * Charge page using GFP_KERNEL while we can wait, before taking @@ -676,7 +713,7 @@ int shmem_unuse(swp_entry_t swap, struct list_for_each_safe(this, next, &shmem_swaplist) { info = list_entry(this, struct shmem_inode_info, swaplist); if (info->swapped) - found = shmem_unuse_inode(info, swap, page); + found = shmem_unuse_inode(info, swap, &page); else list_del_init(&info->swaplist); cond_resched(); @@ -685,8 +722,6 @@ int shmem_unuse(swp_entry_t swap, struct } mutex_unlock(&shmem_swaplist_mutex); - if (!found) - mem_cgroup_uncharge_cache_page(page); if (found < 0) error = found; out: @@ -856,6 +891,89 @@ static inline struct mempolicy *shmem_ge #endif /* + * When a page is moved from swapcache to shmem filecache (either by the + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of + * shmem_unuse_inode()), it may have been read in earlier from swap, in + * ignorance of the mapping it belongs to. If that mapping has special + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), + * we may need to copy to a suitable page before moving to filecache. + * + * In a future release, this may well be extended to respect cpuset and + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); + * but for now it is a simple matter of zone. + */ +static bool shmem_should_substitute_page(struct page *page, gfp_t gfp) +{ + return page_zonenum(page) > gfp_zone(gfp); +} + +static int shmem_substitute_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index) +{ + struct page *oldpage, *newpage; + struct address_space *swap_mapping; + pgoff_t swap_index; + int error; + + oldpage = *pagep; + swap_index = page_private(oldpage); + swap_mapping = page_mapping(oldpage); + + /* + * We have arrived here because our zones are constrained, so don't + * limit chance of success by further cpuset and node constraints. + */ + gfp &= ~GFP_CONSTRAINT_MASK; + newpage = shmem_alloc_page(gfp, info, index); + if (!newpage) + return -ENOMEM; + + page_cache_get(newpage); + copy_highpage(newpage, oldpage); + flush_dcache_page(newpage); + + __set_page_locked(newpage); + SetPageUptodate(newpage); + SetPageSwapBacked(newpage); + set_page_private(newpage, swap_index); + SetPageSwapCache(newpage); + + /* + * Our caller will very soon move newpage out of swapcache, but it's a + * nice clean interface for us to substitute newpage for oldpage there. + */ + spin_lock_irq(&swap_mapping->tree_lock); + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, + newpage); + if (!error) { + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + } + spin_unlock_irq(&swap_mapping->tree_lock); + + if (unlikely(error)) { + /* + * Is this possible? I think not, now that our callers check + * both PageSwapCache and page_private after getting page lock; + * but be defensive. Reverse old to newpage for clear and free. + */ + oldpage = newpage; + } else { + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + *pagep = newpage; + } + + ClearPageSwapCache(oldpage); + set_page_private(oldpage, 0); + + unlock_page(oldpage); + page_cache_release(oldpage); + page_cache_release(oldpage); + return error; +} + +/* * shmem_getpage_gfp - find page in cache, or get from swap, or allocate * * If we allocate a new one we do not mark it dirty. That's up to the @@ -923,19 +1041,21 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); + if (!PageSwapCache(page) || page_private(page) != swap.val || + page->mapping) { + error = -EEXIST; /* try again */ + goto failed; + } if (!PageUptodate(page)) { error = -EIO; goto failed; } wait_on_page_writeback(page); - /* Someone may have already done it for us */ - if (page->mapping) { - if (page->mapping == mapping && - page->index == index) - goto done; - error = -EEXIST; - goto failed; + if (shmem_should_substitute_page(page, gfp)) { + error = shmem_substitute_page(&page, gfp, info, index); + if (error) + goto failed; } error = mem_cgroup_cache_charge(page, current->mm, @@ -998,7 +1118,7 @@ repeat: if (sgp == SGP_DIRTY) set_page_dirty(page); } -done: + /* Perhaps the file has been truncated since we checked */ if (sgp != SGP_WRITE && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { --- 3045N.orig/mm/swapfile.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/swapfile.c 2012-05-17 16:30:29.790078925 -0700 @@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s * This does not give an exact answer when swap count is continued, * but does include the high COUNT_CONTINUED flag to allow for that. */ -static inline int page_swapcount(struct page *page) +int page_swapcount(struct page *page) { int count = 0; struct swap_info_struct *p; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: john stultz Subject: Re: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE Date: Mon, 21 May 2012 17:08:29 -0700 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Andrew Morton , Christoph Hellwig , Cong Wang , Al Viro , Colin Cross , John Stulz , Greg Kroah-Hartman , "Theodore Ts'o" , Andreas Dilger , Mark Fasheh , Joel Becker , Dave Chinner , Ben Myers , Michael Kerrisk , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Hugh Dickins , =?ISO-8859-1?Q?Arve_Hj=F8nnev=E5g?= Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sat, May 12, 2012 at 5:13 AM, Hugh Dickins wrote: > Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() > to use do_fallocate() instead of vmtruncate_range(): which extends > madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. > > There is one more user of vmtruncate_range() in our tree, staging/android= 's > ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinne= d > areas are already unmapped - I don't know - then it would do better to us= e > shmem_truncate_range() directly). I suspect shmem_truncate_range directly would be the right approach, but am not totally sure. Arve: Any thoughts? Hugh: Do you have a git tree with this set available somewhere? I was working on my own tmpfs support for FALLOC_FL_PUNCH_HOLE, along with my volatile range work, so I'd like to rebase on top of your work here. thanks -john > > Based-on-patch-by: Cong Wang > Signed-off-by: Hugh Dickins > --- > =A0drivers/staging/android/ashmem.c | =A0 =A08 +++++--- > =A0mm/madvise.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 15 +++++++-= ------- > =A02 files changed, 12 insertions(+), 11 deletions(-) > > --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.56405= 6626 -0700 > +++ 3045N/drivers/staging/android/ashmem.c =A0 =A0 =A02012-05-05 10:46:25= .692062478 -0700 > @@ -19,6 +19,7 @@ > =A0#include > =A0#include > =A0#include > +#include > =A0#include > =A0#include > =A0#include > @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker > > =A0 =A0 =A0 =A0mutex_lock(&ashmem_mutex); > =A0 =A0 =A0 =A0list_for_each_entry_safe(range, next, &ashmem_lru_list, lr= u) { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 struct inode *inode =3D range->asma->file->= f_dentry->d_inode; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0loff_t start =3D range->pgstart * PAGE_SIZ= E; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 loff_t end =3D (range->pgend + 1) * PAGE_SI= ZE - 1; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 loff_t end =3D (range->pgend + 1) * PAGE_SI= ZE; > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 vmtruncate_range(inode, start, end); > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 do_fallocate(range->asma->file, > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 FALLOC_FL_P= UNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 start, end = - start); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0range->purged =3D ASHMEM_WAS_PURGED; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0lru_del(range); > > --- 3045N.orig/mm/madvise.c =A0 =A0 2012-05-05 10:42:33.572056784 -0700 > +++ 3045N/mm/madvise.c =A02012-05-05 10:46:25.692062478 -0700 > @@ -11,8 +11,10 @@ > =A0#include > =A0#include > =A0#include > +#include > =A0#include > =A0#include > +#include > > =A0/* > =A0* Any behaviour which results in changes to the vma->vm_flags needs to > @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct vm_= area_struct **prev, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0unsigned l= ong start, unsigned long end) > =A0{ > - =A0 =A0 =A0 struct address_space *mapping; > - =A0 =A0 =A0 loff_t offset, endoff; > + =A0 =A0 =A0 loff_t offset; > =A0 =A0 =A0 =A0int error; > > =A0 =A0 =A0 =A0*prev =3D NULL; =A0 /* tell sys_madvise we drop mmap_sem *= / > @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are > =A0 =A0 =A0 =A0if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) !=3D (VM_SHARED= |VM_WRITE)) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return -EACCES; > > - =A0 =A0 =A0 mapping =3D vma->vm_file->f_mapping; > - > =A0 =A0 =A0 =A0offset =3D (loff_t)(start - vma->vm_start) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0+ ((loff_t)vma->vm_pgoff <= < PAGE_SHIFT); > - =A0 =A0 =A0 endoff =3D (loff_t)(end - vma->vm_start - 1) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 + ((loff_t)vma->vm_pgoff <<= PAGE_SHIFT); > > - =A0 =A0 =A0 /* vmtruncate_range needs to take i_mutex */ > + =A0 =A0 =A0 /* filesystem's fallocate may need to take i_mutex */ > =A0 =A0 =A0 =A0up_read(¤t->mm->mmap_sem); > - =A0 =A0 =A0 error =3D vmtruncate_range(mapping->host, offset, endoff); > + =A0 =A0 =A0 error =3D do_fallocate(vma->vm_file, > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 FALLOC_FL_P= UNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 offset, end= - start); > =A0 =A0 =A0 =A0down_read(¤t->mm->mmap_sem); > =A0 =A0 =A0 =A0return error; > =A0} > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" i= n > the body of a message to majordomo@vger.kernel.org > More majordomo info at =A0http://vger.kernel.org/majordomo-info.html > Please read the FAQ at =A0http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE Date: Tue, 22 May 2012 08:11:49 -0700 (PDT) Message-ID: References: Mime-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="8323584-1437716011-1337699517=:2513" Cc: Arve Hjonnevag , Andrew Morton , Christoph Hellwig , Cong Wang , Al Viro , Colin Cross , John Stultz , Greg Kroah-Hartman , Theodore Ts'o , Andreas Dilger , Mark Fasheh , Joel Becker , Dave Chinner , Ben Myers , Michael Kerrisk , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: john stultz Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --8323584-1437716011-1337699517=:2513 Content-Type: TEXT/PLAIN; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE On Mon, 21 May 2012, john stultz wrote: > On Sat, May 12, 2012 at 5:13 AM, Hugh Dickins wrote: > > Now tmpfs supports hole-punching via fallocate(), switch madvise_remove= () > > to use do_fallocate() instead of vmtruncate_range(): which extends > > madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. > > > > There is one more user of vmtruncate_range() in our tree, staging/andro= id's > > ashmem_shrink(): convert it to use do_fallocate() too (but if its unpin= ned > > areas are already unmapped - I don't know - then it would do better to = use > > shmem_truncate_range() directly). >=20 > I suspect shmem_truncate_range directly would be the right approach, > but am not totally sure. > Arve: Any thoughts? >=20 > Hugh: Do you have a git tree with this set available somewhere? I was > working on my own tmpfs support for FALLOC_FL_PUNCH_HOLE, along with > my volatile range work, so I'd like to rebase on top of your work > here. I don't, no, just the patch series posted. I had hoped by now to say that it's in linux-next (though it would be at the daily rebased end, which probably doesn't help you), but not yet. If shmem_truncate_range() is all you need, then that doesn't depend on these patches at all - but I expect you are aiming to be more general. Hugh >=20 > thanks > -john >=20 >=20 > > > > Based-on-patch-by: Cong Wang > > Signed-off-by: Hugh Dickins > > --- > > =A0drivers/staging/android/ashmem.c | =A0 =A08 +++++--- > > =A0mm/madvise.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 15 ++++++= +-------- > > =A02 files changed, 12 insertions(+), 11 deletions(-) > > > > --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564= 056626 -0700 > > +++ 3045N/drivers/staging/android/ashmem.c =A0 =A0 =A02012-05-05 10:46:= 25.692062478 -0700 > > @@ -19,6 +19,7 @@ > > =A0#include > > =A0#include > > =A0#include > > +#include > > =A0#include > > =A0#include > > =A0#include > > @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker > > > > =A0 =A0 =A0 =A0mutex_lock(&ashmem_mutex); > > =A0 =A0 =A0 =A0list_for_each_entry_safe(range, next, &ashmem_lru_list, = lru) { > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 struct inode *inode =3D range->asma->file= ->f_dentry->d_inode; > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0loff_t start =3D range->pgstart * PAGE_S= IZE; > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 loff_t end =3D (range->pgend + 1) * PAGE_= SIZE - 1; > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 loff_t end =3D (range->pgend + 1) * PAGE_= SIZE; > > > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 vmtruncate_range(inode, start, end); > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 do_fallocate(range->asma->file, > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 FALLOC_FL= _PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 start, en= d - start); > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0range->purged =3D ASHMEM_WAS_PURGED; > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0lru_del(range); > > > > --- 3045N.orig/mm/madvise.c =A0 =A0 2012-05-05 10:42:33.572056784 -0700 > > +++ 3045N/mm/madvise.c =A02012-05-05 10:46:25.692062478 -0700 > > @@ -11,8 +11,10 @@ > > =A0#include > > =A0#include > > =A0#include > > +#include > > =A0#include > > =A0#include > > +#include > > > > =A0/* > > =A0* Any behaviour which results in changes to the vma->vm_flags needs = to > > @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct v= m_area_struct **prev, > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0unsigned= long start, unsigned long end) > > =A0{ > > - =A0 =A0 =A0 struct address_space *mapping; > > - =A0 =A0 =A0 loff_t offset, endoff; > > + =A0 =A0 =A0 loff_t offset; > > =A0 =A0 =A0 =A0int error; > > > > =A0 =A0 =A0 =A0*prev =3D NULL; =A0 /* tell sys_madvise we drop mmap_sem= */ > > @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are > > =A0 =A0 =A0 =A0if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) !=3D (VM_SHAR= ED|VM_WRITE)) > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return -EACCES; > > > > - =A0 =A0 =A0 mapping =3D vma->vm_file->f_mapping; > > - > > =A0 =A0 =A0 =A0offset =3D (loff_t)(start - vma->vm_start) > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0+ ((loff_t)vma->vm_pgoff= << PAGE_SHIFT); > > - =A0 =A0 =A0 endoff =3D (loff_t)(end - vma->vm_start - 1) > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 + ((loff_t)vma->vm_pgoff = << PAGE_SHIFT); > > > > - =A0 =A0 =A0 /* vmtruncate_range needs to take i_mutex */ > > + =A0 =A0 =A0 /* filesystem's fallocate may need to take i_mutex */ > > =A0 =A0 =A0 =A0up_read(¤t->mm->mmap_sem); > > - =A0 =A0 =A0 error =3D vmtruncate_range(mapping->host, offset, endoff)= ; > > + =A0 =A0 =A0 error =3D do_fallocate(vma->vm_file, > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 FALLOC_FL= _PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 offset, e= nd - start); > > =A0 =A0 =A0 =A0down_read(¤t->mm->mmap_sem); > > =A0 =A0 =A0 =A0return error; > > =A0} > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel"= in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at =A0http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at =A0http://www.tux.org/lkml/ >=20 --8323584-1437716011-1337699517=:2513-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167]) by kanga.kvack.org (Postfix) with SMTP id 95CB26B004D for ; Sat, 12 May 2012 08:17:40 -0400 (EDT) Received: by dakp5 with SMTP id p5so5899717dak.14 for ; Sat, 12 May 2012 05:17:39 -0700 (PDT) Date: Sat, 12 May 2012 05:17:24 -0700 (PDT) From: Hugh Dickins Subject: [PATCH 7/10] tmpfs: support fallocate preallocation In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org The systemd plumbers expressed a wish that tmpfs support preallocation. Cong Wang wrote a patch, but several kernel guys expressed scepticism: https://lkml.org/lkml/2011/11/18/137 Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly. Hugh Dickins: If tmpfs is going to support fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the deallocation but fail the allocation. Christoph Hellwig: Agreed. Now that we do have shmem_fallocate() for hole-punching, plumb in basic support for preallocation mode too. It's fairly straightforward (though quite a few details needed attention), except for when it fails part way through. What a pity that fallocate(2) was not specified to return the length allocated, permitting short fallocations! As it is, when it fails part way through, we ought to free what has just been allocated by this system call; but must be very sure not to free any allocated earlier, or any allocated by racing accesses (not all excluded by i_mutex). But we cannot distinguish them: so in this patch simply leak allocations on partial failure (they will be freed later if the file is removed). An attractive alternative approach would have been for fallocate() not to allocate pages at all, but note reservations by entries in the radix-tree. But that would give less assurance, and, critically, would be hard to fit with mem cgroups (who owns the reservations?): allocating pages lets fallocate() behave in just the same way as write(). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- mm/shmem.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 60 insertions(+), 1 deletion(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700 @@ -1602,7 +1602,9 @@ static long shmem_fallocate(struct file loff_t len) { struct inode *inode = file->f_path.dentry->d_inode; - int error = -EOPNOTSUPP; + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + pgoff_t start, index, end; + int error; mutex_lock(&inode->i_mutex); @@ -1617,8 +1619,65 @@ static long shmem_fallocate(struct file shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ error = 0; + goto out; } + /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ + error = inode_newsize_ok(inode, offset + len); + if (error) + goto out; + + start = offset >> PAGE_CACHE_SHIFT; + end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + /* Try to avoid a swapstorm if len is impossible to satisfy */ + if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { + error = -ENOSPC; + goto out; + } + + for (index = start; index < end; index++) { + struct page *page; + + /* + * Good, the fallocate(2) manpage permits EINTR: we may have + * been interrupted because we are using up too much memory. + */ + if (signal_pending(current)) + error = -EINTR; + else + error = shmem_getpage(inode, index, &page, SGP_WRITE, + NULL); + if (error) { + /* + * We really ought to free what we allocated so far, + * but it would be wrong to free pages allocated + * earlier, or already now in use: i_mutex does not + * exclude all cases. We do not know what to free. + */ + goto ctime; + } + + if (!PageUptodate(page)) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + /* + * set_page_dirty so that memory pressure will swap rather + * than free the pages we are allocating (and SGP_CACHE pages + * might still be clean: we now need to mark those dirty too). + */ + set_page_dirty(page); + unlock_page(page); + page_cache_release(page); + cond_resched(); + } + + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) + i_size_write(inode, offset + len); +ctime: + inode->i_ctime = CURRENT_TIME; +out: mutex_unlock(&inode->i_mutex); return error; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135]) by kanga.kvack.org (Postfix) with SMTP id C7EFD6B0081 for ; Mon, 14 May 2012 15:42:44 -0400 (EDT) Received: by pbbrp2 with SMTP id rp2so9413865pbb.14 for ; Mon, 14 May 2012 12:42:44 -0700 (PDT) Date: Mon, 14 May 2012 12:42:21 -0700 (PDT) From: Hugh Dickins Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone In-Reply-To: <4FB0C888.8070805@gmail.com> Message-ID: References: <4FB0C888.8070805@gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Cong Wang Cc: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, 14 May 2012, Cong Wang wrote: > On 05/12/2012 07:59 PM, Hugh Dickins wrote: > > + VM_BUG_ON(!PageLocked(oldpage)); > > + __set_page_locked(newpage); > > + VM_BUG_ON(!PageUptodate(oldpage)); > > + SetPageUptodate(newpage); > > + VM_BUG_ON(!PageSwapBacked(oldpage)); > > + SetPageSwapBacked(newpage); > > + VM_BUG_ON(!swap_index); > > + set_page_private(newpage, swap_index); > > + VM_BUG_ON(!PageSwapCache(oldpage)); > > + SetPageSwapCache(newpage); > > + > > Are all of these VM_BUG_ON's necessary? I'm really glad you asked that - thank you. At first I was just going to brush you off with a standard reply of something like "well, no BUG_ON should ever be necessary, but we do find them helpful in practice". But (a) these ones have probably outlived their usefulness: they were certainly reassuring to me when I was testing, but perhaps now are just cluttering up the flow. I did make them "VM_" BUG_ONs in the hope that distros wouldn't waste space and time switching them on, but now I'm inclined to agree with you that they should just be removed. Most of them are doing no more than confirm what's been checked before calling the function (and confirming that status cannot racily change). And (b) whereas they didn't actually catch anything for me, they have been giving false assurance: because (I believe) there really is a bug lurking there that they have not yet met and caught. And I would have missed it if you hadn't directed me back to think about these. It is an exceedingly unlikely bug (and need not delay use of the patch), but what I'm re-remembering is just how slippery swap is: the problem is that a swapcache page can get freed and reused before getting the page lock on it; and it might even get reused for swapcache. Perhaps I need also to be checking page->private, or perhaps I need to check for error instead of BUG_ON(error) just before the lines you picked out, or both. I'm not going to rush the incremental patch to fix this: need to think about it quietly first. If you're wondering what I'm talking about (sorry, I don't have time to explain more right now), take a look at comment and git history of line 2956 (in 3.4-rc7) of mm/memory.c: if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) I don't suppose anyone ever actually hit the bug in the years before we added that protection, but we still ought to guard against it, there and here in shmem_replace_page(). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755137Ab2ELLxQ (ORCPT ); Sat, 12 May 2012 07:53:16 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:51202 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754980Ab2ELLxO (ORCPT ); Sat, 12 May 2012 07:53:14 -0400 Date: Sat, 12 May 2012 04:52:51 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 0/10] shmem/tmpfs: misc and fallocate Message-ID: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Here's a bunch of shmem/tmpfs updates: mostly completed in January, then put aside while I attended to other stuff. But the more recent 1/10 has some urgency, so I'm expediting the descriptions, and shipping them off to you now for v3.5. They're diffed against 3.4.0-rc5-next-20120504, but apply and build and work on most v3.4-rc and v3.4-rc-next. The fallocate ones were prompted by posts from Cong Wang in November: I've attributed four of those with Based-on-patch-by, but could not put From or Signed-off-by, since the originals were somewhat flawed, and I needed to start again and reorder it all. Whether 10/10 should go any further than exposure in -next is in doubt: we shall have to see if it's useful to anyone. 1/10 shmem: replace page if mapping excludes its zone 2/10 tmpfs: enable NOSEC optimization 3/10 tmpfs: optimize clearing when writing 4/10 tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE 5/10 mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE 6/10 mm/fs: remove truncate_range 7/10 tmpfs: support fallocate preallocation 8/10 tmpfs: undo fallocation on failure 9/10 tmpfs: quit when fallocate fills memory 10/10 tmpfs: support SEEK_DATA and SEEK_HOLE Documentation/filesystems/Locking | 2 Documentation/filesystems/vfs.txt | 13 drivers/staging/android/ashmem.c | 8 fs/bad_inode.c | 1 include/linux/fs.h | 1 include/linux/mm.h | 4 include/linux/swap.h | 6 mm/madvise.c | 15 mm/memcontrol.c | 17 mm/shmem.c | 513 +++++++++++++++++++++++++--- mm/swapfile.c | 2 mm/truncate.c | 25 - 12 files changed, 500 insertions(+), 107 deletions(-) Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755240Ab2ELMAQ (ORCPT ); Sat, 12 May 2012 08:00:16 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:55200 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754807Ab2ELMAM (ORCPT ); Sat, 12 May 2012 08:00:12 -0400 Date: Sat, 12 May 2012 04:59:56 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 1/10] shmem: replace page if mapping excludes its zone In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We'd like to continue to support GMA500, so now add a new shmem_should_replace_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_replace_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_replace_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins --- I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest in the i915 Sandybridge issue; but reiterate that this patch does nothing for that case. include/linux/swap.h | 6 + mm/memcontrol.c | 17 +++- mm/shmem.c | 141 +++++++++++++++++++++++++++++++++++------ mm/swapfile.c | 2 4 files changed, 142 insertions(+), 24 deletions(-) --- 3045N.orig/include/linux/swap.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/swap.h 2012-05-05 10:45:17.884060895 -0700 @@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t, extern unsigned int count_swap_pages(int, int); extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); +extern int page_swapcount(struct page *); extern int reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; @@ -448,6 +449,11 @@ static inline void delete_from_swap_cach { } +static inline int page_swapcount(struct page *page) +{ + return 0; +} + #define reuse_swap_page(page) (page_mapcount(page) == 1) static inline int try_to_free_swap(struct page *page) --- 3045N.orig/mm/memcontrol.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/memcontrol.c 2012-05-05 10:45:17.888060878 -0700 @@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; struct page_cgroup *pc; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc pc = lookup_page_cgroup(oldpage); /* fix accounting on old pages */ lock_page_cgroup(pc); - memcg = pc->mem_cgroup; - mem_cgroup_charge_statistics(memcg, false, -1); - ClearPageCgroupUsed(pc); + if (PageCgroupUsed(pc)) { + memcg = pc->mem_cgroup; + mem_cgroup_charge_statistics(memcg, false, -1); + ClearPageCgroupUsed(pc); + } unlock_page_cgroup(pc); + /* + * When called from shmem_replace_page(), in some cases the + * oldpage has already been charged, and in some cases not. + */ + if (!memcg) + return; + if (PageSwapBacked(oldpage)) type = MEM_CGROUP_CHARGE_TYPE_SHMEM; --- 3045N.orig/mm/shmem.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 @@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i } #endif +static bool shmem_should_replace_page(struct page *page, gfp_t gfp); +static int shmem_replace_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index); static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); @@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino * If swap found in inode, free it and move page from swapcache to filecache. */ static int shmem_unuse_inode(struct shmem_inode_info *info, - swp_entry_t swap, struct page *page) + swp_entry_t swap, struct page **pagep) { struct address_space *mapping = info->vfs_inode.i_mapping; void *radswap; pgoff_t index; - int error; + gfp_t gfp; + int error = 0; radswap = swp_to_radix_entry(swap); index = radix_tree_locate_item(&mapping->page_tree, radswap); @@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme if (shmem_swaplist.next != &info->swaplist) list_move_tail(&shmem_swaplist, &info->swaplist); + gfp = mapping_gfp_mask(mapping); + if (shmem_should_replace_page(*pagep, gfp)) { + mutex_unlock(&shmem_swaplist_mutex); + error = shmem_replace_page(pagep, gfp, info, index); + mutex_lock(&shmem_swaplist_mutex); + /* + * We needed to drop mutex to make that restrictive page + * allocation; but the inode might already be freed by now, + * and we cannot refer to inode or mapping or info to check. + * However, we do hold page lock on the PageSwapCache page, + * so can check if that still has our reference remaining. + */ + if (!page_swapcount(*pagep)) + error = -ENOENT; + } + /* * We rely on shmem_swaplist_mutex, not only to protect the swaplist, * but also to hold up shmem_evict_inode(): so inode cannot be freed * beneath us (pagelock doesn't help until the page is in pagecache). */ - error = shmem_add_to_page_cache(page, mapping, index, + if (!error) + error = shmem_add_to_page_cache(*pagep, mapping, index, GFP_NOWAIT, radswap); - /* which does mem_cgroup_uncharge_cache_page on error */ - if (error != -ENOMEM) { /* * Truncation and eviction use free_swap_and_cache(), which * only does trylock page: if we raced, best clean up here. */ - delete_from_swap_cache(page); - set_page_dirty(page); + delete_from_swap_cache(*pagep); + set_page_dirty(*pagep); if (!error) { spin_lock(&info->lock); info->swapped--; @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct struct list_head *this, *next; struct shmem_inode_info *info; int found = 0; - int error; + int error = 0; + + /* + * There's a faint possibility that swap page was replaced before + * caller locked it: it will come back later with the right page. + */ + if (unlikely(!PageSwapCache(page))) + goto out; /* * Charge page using GFP_KERNEL while we can wait, before taking @@ -676,7 +702,7 @@ int shmem_unuse(swp_entry_t swap, struct list_for_each_safe(this, next, &shmem_swaplist) { info = list_entry(this, struct shmem_inode_info, swaplist); if (info->swapped) - found = shmem_unuse_inode(info, swap, page); + found = shmem_unuse_inode(info, swap, &page); else list_del_init(&info->swaplist); cond_resched(); @@ -685,8 +711,6 @@ int shmem_unuse(swp_entry_t swap, struct } mutex_unlock(&shmem_swaplist_mutex); - if (!found) - mem_cgroup_uncharge_cache_page(page); if (found < 0) error = found; out: @@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge #endif /* + * When a page is moved from swapcache to shmem filecache (either by the + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of + * shmem_unuse_inode()), it may have been read in earlier from swap, in + * ignorance of the mapping it belongs to. If that mapping has special + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), + * we may need to copy to a suitable page before moving to filecache. + * + * In a future release, this may well be extended to respect cpuset and + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); + * but for now it is a simple matter of zone. + */ +static bool shmem_should_replace_page(struct page *page, gfp_t gfp) +{ + return page_zonenum(page) > gfp_zone(gfp); +} + +static int shmem_replace_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index) +{ + struct page *oldpage, *newpage; + struct address_space *swap_mapping; + pgoff_t swap_index; + int error; + + oldpage = *pagep; + swap_index = page_private(oldpage); + swap_mapping = page_mapping(oldpage); + + /* + * We have arrived here because our zones are constrained, so don't + * limit chance of success by further cpuset and node constraints. + */ + gfp &= ~GFP_CONSTRAINT_MASK; + newpage = shmem_alloc_page(gfp, info, index); + if (!newpage) + return -ENOMEM; + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); + + *pagep = newpage; + page_cache_get(newpage); + copy_highpage(newpage, oldpage); + + VM_BUG_ON(!PageLocked(oldpage)); + __set_page_locked(newpage); + VM_BUG_ON(!PageUptodate(oldpage)); + SetPageUptodate(newpage); + VM_BUG_ON(!PageSwapBacked(oldpage)); + SetPageSwapBacked(newpage); + VM_BUG_ON(!swap_index); + set_page_private(newpage, swap_index); + VM_BUG_ON(!PageSwapCache(oldpage)); + SetPageSwapCache(newpage); + + /* + * Our caller will very soon move newpage out of swapcache, but it's + * a nice clean interface for us to replace oldpage by newpage there. + */ + spin_lock_irq(&swap_mapping->tree_lock); + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, + newpage); + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + spin_unlock_irq(&swap_mapping->tree_lock); + BUG_ON(error); + + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + + ClearPageSwapCache(oldpage); + set_page_private(oldpage, 0); + + unlock_page(oldpage); + page_cache_release(oldpage); + page_cache_release(oldpage); + return 0; +} + +/* * shmem_getpage_gfp - find page in cache, or get from swap, or allocate * * If we allocate a new one we do not mark it dirty. That's up to the @@ -923,19 +1025,20 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); + if (!PageSwapCache(page) || page->mapping) { + error = -EEXIST; /* try again */ + goto failed; + } if (!PageUptodate(page)) { error = -EIO; goto failed; } wait_on_page_writeback(page); - /* Someone may have already done it for us */ - if (page->mapping) { - if (page->mapping == mapping && - page->index == index) - goto done; - error = -EEXIST; - goto failed; + if (shmem_should_replace_page(page, gfp)) { + error = shmem_replace_page(&page, gfp, info, index); + if (error) + goto failed; } error = mem_cgroup_cache_charge(page, current->mm, @@ -998,7 +1101,7 @@ repeat: if (sgp == SGP_DIRTY) set_page_dirty(page); } -done: + /* Perhaps the file has been truncated since we checked */ if (sgp != SGP_WRITE && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { --- 3045N.orig/mm/swapfile.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/swapfile.c 2012-05-05 10:45:17.888060878 -0700 @@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s * This does not give an exact answer when swap count is continued, * but does include the high COUNT_CONTINUED flag to allow for that. */ -static inline int page_swapcount(struct page *page) +int page_swapcount(struct page *page) { int count = 0; struct swap_info_struct *p; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755273Ab2ELMCr (ORCPT ); Sat, 12 May 2012 08:02:47 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:51175 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754909Ab2ELMCp (ORCPT ); Sat, 12 May 2012 08:02:45 -0400 Date: Sat, 12 May 2012 05:02:29 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 2/10] tmpfs: enable NOSEC optimization In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() overhead on most common writes): set MS_NOSEC on its superblocks. Signed-off-by: Hugh Dickins --- mm/shmem.c | 1 + 1 file changed, 1 insertion(+) --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block } } sb->s_export_op = &shmem_export_ops; + sb->s_flags |= MS_NOSEC; #else sb->s_flags |= MS_NOUSER; #endif From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755326Ab2ELMEa (ORCPT ); Sat, 12 May 2012 08:04:30 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:38347 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754807Ab2ELME2 (ORCPT ); Sat, 12 May 2012 08:04:28 -0400 Date: Sat, 12 May 2012 05:04:12 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Nick Piggin , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 3/10] tmpfs: optimize clearing when writing In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Nick proposed years ago that tmpfs should avoid clearing its pages where write will overwrite them with new data, as ramfs has long done. But I messed it up and just got bad data. Tried again recently, it works fine. Here's time output for writing 4GiB 16 times on this Core i5 laptop: before: real 0m21.169s user 0m0.028s sys 0m21.057s real 0m21.382s user 0m0.016s sys 0m21.289s real 0m21.311s user 0m0.020s sys 0m21.217s after: real 0m18.273s user 0m0.032s sys 0m18.165s real 0m18.354s user 0m0.020s sys 0m18.265s real 0m18.440s user 0m0.032s sys 0m18.337s ramfs: real 0m16.860s user 0m0.028s sys 0m16.765s real 0m17.382s user 0m0.040s sys 0m17.273s real 0m17.133s user 0m0.044s sys 0m17.021s Yes, I have done perf reports, but they need more explanation than they deserve: in summary, clear_page vanishes, its cache loading shifts into copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly mark_page_accessed goes way up - I think because they are respectively where the cache gets to be reloaded after being purged by clear or copy. Suggested-by: Nick Piggin Signed-off-by: Hugh Dickins --- mm/shmem.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700 @@ -1095,9 +1095,14 @@ repeat: shmem_recalc_inode(inode); spin_unlock(&info->lock); - clear_highpage(page); - flush_dcache_page(page); - SetPageUptodate(page); + /* + * Let SGP_WRITE caller clear ends if write does not fill page + */ + if (sgp != SGP_WRITE) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } if (sgp == SGP_DIRTY) set_page_dirty(page); } @@ -1307,6 +1312,14 @@ shmem_write_end(struct file *file, struc if (pos + copied > inode->i_size) i_size_write(inode, pos + copied); + if (!PageUptodate(page)) { + if (copied < PAGE_CACHE_SIZE) { + unsigned from = pos & (PAGE_CACHE_SIZE - 1); + zero_user_segments(page, 0, from, + from + copied, PAGE_CACHE_SIZE); + } + SetPageUptodate(page); + } set_page_dirty(page); unlock_page(page); page_cache_release(page); @@ -1768,6 +1781,7 @@ static int shmem_symlink(struct inode *d kaddr = kmap_atomic(page); memcpy(kaddr, symname, len); kunmap_atomic(kaddr); + SetPageUptodate(page); set_page_dirty(page); unlock_page(page); page_cache_release(page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755357Ab2ELMGK (ORCPT ); Sat, 12 May 2012 08:06:10 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:44181 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754807Ab2ELMGH (ORCPT ); Sat, 12 May 2012 08:06:07 -0400 Date: Sat, 12 May 2012 05:05:51 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 4/10] tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org tmpfs has supported hole-punching since 2.6.16, via madvise(,,MADV_REMOVE). But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is the agreed way to punch holes. So add shmem_fallocate() to support that, and tweak shmem_truncate_range() to support partial pages at both the beginning and end of range (never needed for madvise, which demands rounded addr and rounds up length). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- mm/shmem.c | 68 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 57 insertions(+), 11 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700 @@ -53,6 +53,7 @@ static struct vfsmount *shm_mnt; #include #include #include +#include #include #include #include @@ -432,21 +433,23 @@ void shmem_truncate_range(struct inode * struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; - unsigned partial = lstart & (PAGE_CACHE_SIZE - 1); - pgoff_t end = (lend >> PAGE_CACHE_SHIFT); + pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT; + unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1); + unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1); struct pagevec pvec; pgoff_t indices[PAGEVEC_SIZE]; long nr_swaps_freed = 0; pgoff_t index; int i; - BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); + if (lend == -1) + end = -1; /* unsigned, so actually very big */ pagevec_init(&pvec, 0); index = start; - while (index <= end) { + while (index < end) { pvec.nr = shmem_find_get_pages_and_swap(mapping, index, - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, + min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) break; @@ -455,7 +458,7 @@ void shmem_truncate_range(struct inode * struct page *page = pvec.pages[i]; index = indices[i]; - if (index > end) + if (index >= end) break; if (radix_tree_exceptional_entry(page)) { @@ -479,22 +482,39 @@ void shmem_truncate_range(struct inode * index++; } - if (partial) { + if (partial_start) { struct page *page = NULL; shmem_getpage(inode, start - 1, &page, SGP_READ, NULL); if (page) { - zero_user_segment(page, partial, PAGE_CACHE_SIZE); + unsigned int top = PAGE_CACHE_SIZE; + if (start > end) { + top = partial_end; + partial_end = 0; + } + zero_user_segment(page, partial_start, top); set_page_dirty(page); unlock_page(page); page_cache_release(page); } } + if (partial_end) { + struct page *page = NULL; + shmem_getpage(inode, end, &page, SGP_READ, NULL); + if (page) { + zero_user_segment(page, 0, partial_end); + set_page_dirty(page); + unlock_page(page); + page_cache_release(page); + } + } + if (start >= end) + return; index = start; for ( ; ; ) { cond_resched(); pvec.nr = shmem_find_get_pages_and_swap(mapping, index, - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, + min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) { if (index == start) @@ -502,7 +522,7 @@ void shmem_truncate_range(struct inode * index = start; continue; } - if (index == start && indices[0] > end) { + if (index == start && indices[0] >= end) { shmem_deswap_pagevec(&pvec); pagevec_release(&pvec); break; @@ -512,7 +532,7 @@ void shmem_truncate_range(struct inode * struct page *page = pvec.pages[i]; index = indices[i]; - if (index > end) + if (index >= end) break; if (radix_tree_exceptional_entry(page)) { @@ -1578,6 +1598,31 @@ static ssize_t shmem_file_splice_read(st return error; } +static long shmem_fallocate(struct file *file, int mode, loff_t offset, + loff_t len) +{ + struct inode *inode = file->f_path.dentry->d_inode; + int error = -EOPNOTSUPP; + + mutex_lock(&inode->i_mutex); + + if (mode & FALLOC_FL_PUNCH_HOLE) { + struct address_space *mapping = file->f_mapping; + loff_t unmap_start = round_up(offset, PAGE_SIZE); + loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; + + if ((u64)unmap_end > (u64)unmap_start) + unmap_mapping_range(mapping, unmap_start, + 1 + unmap_end - unmap_start, 0); + shmem_truncate_range(inode, offset, offset + len - 1); + /* No need to unmap again: hole-punching leaves COWed pages */ + error = 0; + } + + mutex_unlock(&inode->i_mutex); + return error; +} + static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) { struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); @@ -2478,6 +2523,7 @@ static const struct file_operations shme .fsync = noop_fsync, .splice_read = shmem_file_splice_read, .splice_write = generic_file_splice_write, + .fallocate = shmem_fallocate, #endif }; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755480Ab2ELMNh (ORCPT ); Sat, 12 May 2012 08:13:37 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:44770 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752247Ab2ELMNe (ORCPT ); Sat, 12 May 2012 08:13:34 -0400 Date: Sat, 12 May 2012 05:13:18 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , Al Viro , Colin Cross , John Stulz , Greg Kroah-Hartman , "Theodore Ts'o" , Andreas Dilger , Mark Fasheh , Joel Becker , Dave Chinner , Ben Myers , Michael Kerrisk , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() to use do_fallocate() instead of vmtruncate_range(): which extends madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. There is one more user of vmtruncate_range() in our tree, staging/android's ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned areas are already unmapped - I don't know - then it would do better to use shmem_truncate_range() directly). Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- drivers/staging/android/ashmem.c | 8 +++++--- mm/madvise.c | 15 +++++++-------- 2 files changed, 12 insertions(+), 11 deletions(-) --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564056626 -0700 +++ 3045N/drivers/staging/android/ashmem.c 2012-05-05 10:46:25.692062478 -0700 @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker mutex_lock(&ashmem_mutex); list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) { - struct inode *inode = range->asma->file->f_dentry->d_inode; loff_t start = range->pgstart * PAGE_SIZE; - loff_t end = (range->pgend + 1) * PAGE_SIZE - 1; + loff_t end = (range->pgend + 1) * PAGE_SIZE; - vmtruncate_range(inode, start, end); + do_fallocate(range->asma->file, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + start, end - start); range->purged = ASHMEM_WAS_PURGED; lru_del(range); --- 3045N.orig/mm/madvise.c 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/mm/madvise.c 2012-05-05 10:46:25.692062478 -0700 @@ -11,8 +11,10 @@ #include #include #include +#include #include #include +#include /* * Any behaviour which results in changes to the vma->vm_flags needs to @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are struct vm_area_struct **prev, unsigned long start, unsigned long end) { - struct address_space *mapping; - loff_t offset, endoff; + loff_t offset; int error; *prev = NULL; /* tell sys_madvise we drop mmap_sem */ @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE)) return -EACCES; - mapping = vma->vm_file->f_mapping; - offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - endoff = (loff_t)(end - vma->vm_start - 1) - + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - /* vmtruncate_range needs to take i_mutex */ + /* filesystem's fallocate may need to take i_mutex */ up_read(¤t->mm->mmap_sem); - error = vmtruncate_range(mapping->host, offset, endoff); + error = do_fallocate(vma->vm_file, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + offset, end - start); down_read(¤t->mm->mmap_sem); return error; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755502Ab2ELMPY (ORCPT ); Sat, 12 May 2012 08:15:24 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:53555 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753131Ab2ELMPW (ORCPT ); Sat, 12 May 2012 08:15:22 -0400 Date: Sat, 12 May 2012 05:15:03 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 6/10] mm/fs: remove truncate_range In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Remove vmtruncate_range(), and remove the truncate_range method from struct inode_operations: only tmpfs ever supported it, and tmpfs has now converted over to using the fallocate method of file_operations. Update Documentation accordingly, adding (setlease and) fallocate lines. And while we're in mm.h, remove duplicate declarations of shmem_lock() and shmem_file_setup(): everyone is now using the ones in shmem_fs.h. Based-on-patch-by: Cong Wang Signed-off-by: Hugh Dickins --- Documentation/filesystems/Locking | 2 -- Documentation/filesystems/vfs.txt | 13 ++++++++----- fs/bad_inode.c | 1 - include/linux/fs.h | 1 - include/linux/mm.h | 4 ---- mm/shmem.c | 1 - mm/truncate.c | 25 ------------------------- 7 files changed, 8 insertions(+), 39 deletions(-) --- 3045N.orig/Documentation/filesystems/Locking 2012-05-05 10:42:33.560056589 -0700 +++ 3045N/Documentation/filesystems/Locking 2012-05-05 10:46:35.220062708 -0700 @@ -60,7 +60,6 @@ ata *); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); locking rules: @@ -87,7 +86,6 @@ setxattr: yes getxattr: no listxattr: no removexattr: yes -truncate_range: yes fiemap: no Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. --- 3045N.orig/Documentation/filesystems/vfs.txt 2012-05-05 10:42:33.560056589 -0700 +++ 3045N/Documentation/filesystems/vfs.txt 2012-05-05 10:46:35.220062708 -0700 @@ -363,7 +363,6 @@ struct inode_operations { ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); }; Again, all methods are called without any locks being held, unless @@ -472,9 +471,6 @@ otherwise noted. removexattr: called by the VFS to remove an extended attribute from a file. This method is called by removexattr(2) system call. - truncate_range: a method provided by the underlying filesystem to truncate a - range of blocks , i.e. punch a hole somewhere in a file. - The Address Space Object ======================== @@ -760,7 +756,7 @@ struct file_operations ---------------------- This describes how the VFS can manipulate an open file. As of kernel -2.6.22, the following members are defined: +3.5, the following members are defined: struct file_operations { struct module *owner; @@ -790,6 +786,8 @@ struct file_operations { int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long arg, struct file_lock **); + long (*fallocate)(struct file *, int mode, loff_t offset, loff_t len); }; Again, all methods are called without any locks being held, unless @@ -858,6 +856,11 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + setlease: called by the VFS to set or release a file lock lease. + setlease has the file_lock_lock held and must not sleep. + + fallocate: called by the VFS to preallocate blocks or punch a hole. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special --- 3045N.orig/fs/bad_inode.c 2012-05-05 10:42:33.564056626 -0700 +++ 3045N/fs/bad_inode.c 2012-05-05 10:46:35.220062708 -0700 @@ -292,7 +292,6 @@ static const struct inode_operations bad .getxattr = bad_inode_getxattr, .listxattr = bad_inode_listxattr, .removexattr = bad_inode_removexattr, - /* truncate_range returns void */ }; --- 3045N.orig/include/linux/fs.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/fs.h 2012-05-05 10:46:35.220062708 -0700 @@ -1673,7 +1673,6 @@ struct inode_operations { ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); } ____cacheline_aligned; --- 3045N.orig/include/linux/mm.h 2012-05-05 10:42:33.572056784 -0700 +++ 3045N/include/linux/mm.h 2012-05-05 10:46:35.220062708 -0700 @@ -871,8 +871,6 @@ extern void pagefault_out_of_memory(void extern void show_free_areas(unsigned int flags); extern bool skip_free_areas_node(unsigned int flags, int nid); -int shmem_lock(struct file *file, int lock, struct user_struct *user); -struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags); int shmem_zero_setup(struct vm_area_struct *); extern int can_do_mlock(void); @@ -953,11 +951,9 @@ static inline void unmap_shared_mapping_ extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); extern int vmtruncate(struct inode *inode, loff_t offset); -extern int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end); void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); - int invalidate_inode_page(struct page *page); #ifdef CONFIG_MMU --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700 @@ -2529,7 +2529,6 @@ static const struct file_operations shme static const struct inode_operations shmem_inode_operations = { .setattr = shmem_setattr, - .truncate_range = shmem_truncate_range, #ifdef CONFIG_TMPFS_XATTR .setxattr = shmem_setxattr, .getxattr = shmem_getxattr, --- 3045N.orig/mm/truncate.c 2012-05-05 10:42:33.576056912 -0700 +++ 3045N/mm/truncate.c 2012-05-05 10:46:35.224062700 -0700 @@ -602,31 +602,6 @@ int vmtruncate(struct inode *inode, loff } EXPORT_SYMBOL(vmtruncate); -int vmtruncate_range(struct inode *inode, loff_t lstart, loff_t lend) -{ - struct address_space *mapping = inode->i_mapping; - loff_t holebegin = round_up(lstart, PAGE_SIZE); - loff_t holelen = 1 + lend - holebegin; - - /* - * If the underlying filesystem is not going to provide - * a way to truncate a range of blocks (punch a hole) - - * we should return failure right now. - */ - if (!inode->i_op->truncate_range) - return -ENOSYS; - - mutex_lock(&inode->i_mutex); - inode_dio_wait(inode); - unmap_mapping_range(mapping, holebegin, holelen, 1); - inode->i_op->truncate_range(inode, lstart, lend); - /* unmap again to remove racily COWed private pages */ - unmap_mapping_range(mapping, holebegin, holelen, 1); - mutex_unlock(&inode->i_mutex); - - return 0; -} - /** * truncate_pagecache_range - unmap and remove pagecache that is hole-punched * @inode: inode From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755594Ab2ELMTg (ORCPT ); Sat, 12 May 2012 08:19:36 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:56984 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751421Ab2ELMTe (ORCPT ); Sat, 12 May 2012 08:19:34 -0400 Date: Sat, 12 May 2012 05:19:18 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 8/10] tmpfs: undo fallocation on failure In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In the previous episode, we left the already-fallocated pages attached to the file when shmem_fallocate() fails part way through. Now try to do better, by extending the earlier optimization of !Uptodate pages (then always under page lock) to !Uptodate pages (outside of page lock), representing fallocated pages. And don't waste time clearing them at the time of fallocate(), leave that until later if necessary. Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing fallocate can recognize and remove precisely those !Uptodate allocations which it added (and were not independently allocated by racing tasks). But unless we start playing with swapfile.c and memcontrol.c too, once one of our fallocated pages reaches shmem_writepage(), we do then have to instantiate it as an ordinarily allocated page, before swapping out. This is unsatisfactory, but improved in the next episode. Signed-off-by: Hugh Dickins --- mm/shmem.c | 105 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 72 insertions(+), 33 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700 @@ -89,7 +89,8 @@ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ SGP_CACHE, /* don't exceed i_size, may allocate page */ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ - SGP_WRITE, /* may exceed i_size, may allocate page */ + SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ + SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ }; #ifdef CONFIG_TMPFS @@ -427,8 +428,10 @@ void shmem_unlock_mapping(struct address /* * Remove range of pages and swap entries from radix tree, and free them. + * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. */ -void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) +static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, + bool unfalloc) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); @@ -462,6 +465,8 @@ void shmem_truncate_range(struct inode * break; if (radix_tree_exceptional_entry(page)) { + if (unfalloc) + continue; nr_swaps_freed += !shmem_free_swap(mapping, index, page); continue; @@ -469,9 +474,11 @@ void shmem_truncate_range(struct inode * if (!trylock_page(page)) continue; - if (page->mapping == mapping) { - VM_BUG_ON(PageWriteback(page)); - truncate_inode_page(mapping, page); + if (!unfalloc || !PageUptodate(page)) { + if (page->mapping == mapping) { + VM_BUG_ON(PageWriteback(page)); + truncate_inode_page(mapping, page); + } } unlock_page(page); } @@ -517,12 +524,12 @@ void shmem_truncate_range(struct inode * min(end - index, (pgoff_t)PAGEVEC_SIZE), pvec.pages, indices); if (!pvec.nr) { - if (index == start) + if (index == start || unfalloc) break; index = start; continue; } - if (index == start && indices[0] >= end) { + if ((index == start || unfalloc) && indices[0] >= end) { shmem_deswap_pagevec(&pvec); pagevec_release(&pvec); break; @@ -536,15 +543,19 @@ void shmem_truncate_range(struct inode * break; if (radix_tree_exceptional_entry(page)) { + if (unfalloc) + continue; nr_swaps_freed += !shmem_free_swap(mapping, index, page); continue; } lock_page(page); - if (page->mapping == mapping) { - VM_BUG_ON(PageWriteback(page)); - truncate_inode_page(mapping, page); + if (!unfalloc || !PageUptodate(page)) { + if (page->mapping == mapping) { + VM_BUG_ON(PageWriteback(page)); + truncate_inode_page(mapping, page); + } } unlock_page(page); } @@ -558,7 +569,11 @@ void shmem_truncate_range(struct inode * info->swapped -= nr_swaps_freed; shmem_recalc_inode(inode); spin_unlock(&info->lock); +} +void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) +{ + shmem_undo_range(inode, lstart, lend, false); inode->i_ctime = inode->i_mtime = CURRENT_TIME; } EXPORT_SYMBOL_GPL(shmem_truncate_range); @@ -771,6 +786,18 @@ static int shmem_writepage(struct page * WARN_ON_ONCE(1); /* Still happens? Tell us about it! */ goto redirty; } + + /* + * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC + * value into swapfile.c, the only way we can correctly account for a + * fallocated page arriving here is now to initialize it and write it. + */ + if (!PageUptodate(page)) { + clear_highpage(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + swap = get_swap_page(); if (!swap.val) goto redirty; @@ -994,6 +1021,7 @@ static int shmem_getpage_gfp(struct inod swp_entry_t swap; int error; int once = 0; + int alloced = 0; if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT)) return -EFBIG; @@ -1005,19 +1033,21 @@ repeat: page = NULL; } - if (sgp != SGP_WRITE && + if (sgp != SGP_WRITE && sgp != SGP_FALLOC && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { error = -EINVAL; goto failed; } + /* fallocated page? */ + if (page && !PageUptodate(page)) { + if (sgp != SGP_READ) + goto clear; + unlock_page(page); + page_cache_release(page); + page = NULL; + } if (page || (sgp == SGP_READ && !swap.val)) { - /* - * Once we can get the page lock, it must be uptodate: - * if there were an error in reading back from swap, - * the page would not be inserted into the filecache. - */ - BUG_ON(page && !PageUptodate(page)); *pagep = page; return 0; } @@ -1114,9 +1144,18 @@ repeat: inode->i_blocks += BLOCKS_PER_PAGE; shmem_recalc_inode(inode); spin_unlock(&info->lock); + alloced = true; /* - * Let SGP_WRITE caller clear ends if write does not fill page + * Let SGP_FALLOC use the SGP_WRITE optimization on a new page. + */ + if (sgp == SGP_FALLOC) + sgp = SGP_WRITE; +clear: + /* + * Let SGP_WRITE caller clear ends if write does not fill page; + * but SGP_FALLOC on a page fallocated earlier must initialize + * it now, lest undo on failure cancel our earlier guarantee. */ if (sgp != SGP_WRITE) { clear_highpage(page); @@ -1128,10 +1167,13 @@ repeat: } /* Perhaps the file has been truncated since we checked */ - if (sgp != SGP_WRITE && + if (sgp != SGP_WRITE && sgp != SGP_FALLOC && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { error = -EINVAL; - goto trunc; + if (alloced) + goto trunc; + else + goto failed; } *pagep = page; return 0; @@ -1140,6 +1182,7 @@ repeat: * Error recovery. */ trunc: + info = SHMEM_I(inode); ClearPageDirty(page); delete_from_page_cache(page); spin_lock(&info->lock); @@ -1147,6 +1190,7 @@ trunc: inode->i_blocks -= BLOCKS_PER_PAGE; spin_unlock(&info->lock); decused: + sbinfo = SHMEM_SB(inode->i_sb); if (sbinfo->max_blocks) percpu_counter_add(&sbinfo->used_blocks, -1); unacct: @@ -1645,25 +1689,20 @@ static long shmem_fallocate(struct file if (signal_pending(current)) error = -EINTR; else - error = shmem_getpage(inode, index, &page, SGP_WRITE, + error = shmem_getpage(inode, index, &page, SGP_FALLOC, NULL); if (error) { - /* - * We really ought to free what we allocated so far, - * but it would be wrong to free pages allocated - * earlier, or already now in use: i_mutex does not - * exclude all cases. We do not know what to free. - */ + /* Remove the !PageUptodate pages we added */ + shmem_undo_range(inode, + (loff_t)start << PAGE_CACHE_SHIFT, + (loff_t)index << PAGE_CACHE_SHIFT, true); goto ctime; } - if (!PageUptodate(page)) { - clear_highpage(page); - flush_dcache_page(page); - SetPageUptodate(page); - } /* - * set_page_dirty so that memory pressure will swap rather + * If !PageUptodate, leave it that way so that freeable pages + * can be recognized if we need to rollback on error later. + * But set_page_dirty so that memory pressure will swap rather * than free the pages we are allocating (and SGP_CACHE pages * might still be clean: we now need to mark those dirty too). */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755720Ab2ELMVl (ORCPT ); Sat, 12 May 2012 08:21:41 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:54806 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752247Ab2ELMVi (ORCPT ); Sat, 12 May 2012 08:21:38 -0400 Date: Sat, 12 May 2012 05:21:22 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Cong Wang , Kay Sievers , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 9/10] tmpfs: quit when fallocate fills memory In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org As it stands, a large fallocate() on tmpfs is liable to fill memory with pages, freed on failure except when they run into swap, at which point they become fixed into the file despite the failure. That feels quite wrong, to be consuming resources precisely when they're in short supply. Go the other way instead: shmem_fallocate() indicate the range it has fallocated to shmem_writepage(), keeping count of pages it's allocating; shmem_writepage() reactivate instead of swapping out pages fallocated by this syscall (but happily swap out those from earlier occasions), keeping count; shmem_fallocate() compare counts and give up once the reactivated pages have started to coming back to writepage (approximately: some zones would in fact recycle faster than others). This is a little unusual, but works well: although we could consider the failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling added in swapfile.c and memcontrol.c, I doubt that we shall ever want to. (If there's no swap, an over-large fallocate() on tmpfs is limited in the same way as writing: stopped by rlimit, or by tmpfs mount size if that was set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or OVERCOMMIT_NEVER. If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill others as writing would, but stops and frees if interrupted.) Now that everything is freed on failure, we can then skip updating ctime. Signed-off-by: Hugh Dickins --- mm/shmem.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 56 insertions(+), 2 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700 @@ -84,6 +84,18 @@ struct shmem_xattr { char value[0]; }; +/* + * shmem_fallocate and shmem_writepage communicate via inode->i_private + * (with i_mutex making sure that it has only one user at a time): + * we would prefer not to enlarge the shmem inode just for that. + */ +struct shmem_falloc { + pgoff_t start; /* start of range currently being fallocated */ + pgoff_t next; /* the next page offset to be fallocated */ + pgoff_t nr_falloced; /* how many new pages have been fallocated */ + pgoff_t nr_unswapped; /* how often writepage refused to swap out */ +}; + /* Flag allocation requirements to shmem_getpage */ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ @@ -791,8 +803,28 @@ static int shmem_writepage(struct page * * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC * value into swapfile.c, the only way we can correctly account for a * fallocated page arriving here is now to initialize it and write it. + * + * That's okay for a page already fallocated earlier, but if we have + * not yet completed the fallocation, then (a) we want to keep track + * of this page in case we have to undo it, and (b) it may not be a + * good idea to continue anyway, once we're pushing into swap. So + * reactivate the page, and let shmem_fallocate() quit when too many. */ if (!PageUptodate(page)) { + if (inode->i_private) { + struct shmem_falloc *shmem_falloc; + spin_lock(&inode->i_lock); + shmem_falloc = inode->i_private; + if (shmem_falloc && + index >= shmem_falloc->start && + index < shmem_falloc->next) + shmem_falloc->nr_unswapped++; + else + shmem_falloc = NULL; + spin_unlock(&inode->i_lock); + if (shmem_falloc) + goto redirty; + } clear_highpage(page); flush_dcache_page(page); SetPageUptodate(page); @@ -1647,6 +1679,7 @@ static long shmem_fallocate(struct file { struct inode *inode = file->f_path.dentry->d_inode; struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + struct shmem_falloc shmem_falloc; pgoff_t start, index, end; int error; @@ -1679,6 +1712,14 @@ static long shmem_fallocate(struct file goto out; } + shmem_falloc.start = start; + shmem_falloc.next = start; + shmem_falloc.nr_falloced = 0; + shmem_falloc.nr_unswapped = 0; + spin_lock(&inode->i_lock); + inode->i_private = &shmem_falloc; + spin_unlock(&inode->i_lock); + for (index = start; index < end; index++) { struct page *page; @@ -1688,6 +1729,8 @@ static long shmem_fallocate(struct file */ if (signal_pending(current)) error = -EINTR; + else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) + error = -ENOMEM; else error = shmem_getpage(inode, index, &page, SGP_FALLOC, NULL); @@ -1696,10 +1739,18 @@ static long shmem_fallocate(struct file shmem_undo_range(inode, (loff_t)start << PAGE_CACHE_SHIFT, (loff_t)index << PAGE_CACHE_SHIFT, true); - goto ctime; + goto undone; } /* + * Inform shmem_writepage() how far we have reached. + * No need for lock or barrier: we have the page lock. + */ + shmem_falloc.next++; + if (!PageUptodate(page)) + shmem_falloc.nr_falloced++; + + /* * If !PageUptodate, leave it that way so that freeable pages * can be recognized if we need to rollback on error later. * But set_page_dirty so that memory pressure will swap rather @@ -1714,8 +1765,11 @@ static long shmem_fallocate(struct file if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); -ctime: inode->i_ctime = CURRENT_TIME; +undone: + spin_lock(&inode->i_lock); + inode->i_private = NULL; + spin_unlock(&inode->i_lock); out: mutex_unlock(&inode->i_mutex); return error; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755739Ab2ELM1v (ORCPT ); Sat, 12 May 2012 08:27:51 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:44878 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752247Ab2ELM1r (ORCPT ); Sat, 12 May 2012 08:27:47 -0400 Date: Sat, 12 May 2012 05:27:31 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , Josef Bacik , Andi Kleen , Andreas Dilger , Dave Chinner , Marco Stornelli , Jeff liu , Chris Mason , Sunil Mushran , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 10/10] tmpfs: support SEEK_DATA and SEEK_HOLE In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It's quite easy for tmpfs to scan the radix_tree to support llseek's new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still on my mind (in particular, the !PageUptodate-ness of pages fallocated but still unwritten). But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile? Signed-off-by: Hugh Dickins --- mm/shmem.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 93 insertions(+), 1 deletion(-) --- 3045N.orig/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700 +++ 3045N/mm/shmem.c 2012-05-05 10:47:09.724063528 -0700 @@ -439,6 +439,56 @@ void shmem_unlock_mapping(struct address } /* + * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. + */ +static pgoff_t shmem_seek_hole_data(struct address_space *mapping, + pgoff_t index, pgoff_t end, int origin) +{ + struct page *page; + struct pagevec pvec; + pgoff_t indices[PAGEVEC_SIZE]; + bool done = false; + int i; + + pagevec_init(&pvec, 0); + pvec.nr = 1; /* start small: we may be there already */ + while (!done) { + pvec.nr = shmem_find_get_pages_and_swap(mapping, index, + pvec.nr, pvec.pages, indices); + if (!pvec.nr) { + if (origin == SEEK_DATA) + index = end; + break; + } + for (i = 0; i < pvec.nr; i++, index++) { + if (index < indices[i]) { + if (origin == SEEK_HOLE) { + done = true; + break; + } + index = indices[i]; + } + page = pvec.pages[i]; + if (page && !radix_tree_exceptional_entry(page)) { + if (!PageUptodate(page)) + page = NULL; + } + if (index >= end || + (page && origin == SEEK_DATA) || + (!page && origin == SEEK_HOLE)) { + done = true; + break; + } + } + shmem_deswap_pagevec(&pvec); + pagevec_release(&pvec); + pvec.nr = PAGEVEC_SIZE; + cond_resched(); + } + return index; +} + +/* * Remove range of pages and swap entries from radix tree, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. */ @@ -1674,6 +1724,48 @@ static ssize_t shmem_file_splice_read(st return error; } +static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin) +{ + struct address_space *mapping; + struct inode *inode; + pgoff_t start, end; + loff_t new_offset; + + if (origin != SEEK_DATA && origin != SEEK_HOLE) + return generic_file_llseek_size(file, offset, origin, + MAX_LFS_FILESIZE); + mapping = file->f_mapping; + inode = mapping->host; + mutex_lock(&inode->i_mutex); + /* We're holding i_mutex so we can access i_size directly */ + + if (offset < 0) + offset = -EINVAL; + else if (offset >= inode->i_size) + offset = -ENXIO; + else { + start = offset >> PAGE_CACHE_SHIFT; + end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + new_offset = shmem_seek_hole_data(mapping, start, end, origin); + new_offset <<= PAGE_CACHE_SHIFT; + if (new_offset > offset) { + if (new_offset < inode->i_size) + offset = new_offset; + else if (origin == SEEK_DATA) + offset = -ENXIO; + else + offset = inode->i_size; + } + } + + if (offset >= 0 && offset != file->f_pos) { + file->f_pos = offset; + file->f_version = 0; + } + mutex_unlock(&inode->i_mutex); + return offset; +} + static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { @@ -2667,7 +2759,7 @@ static const struct address_space_operat static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, #ifdef CONFIG_TMPFS - .llseek = generic_file_llseek, + .llseek = shmem_file_llseek, .read = do_sync_read, .write = do_sync_write, .aio_read = shmem_file_aio_read, From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755552Ab2ENIzu (ORCPT ); Mon, 14 May 2012 04:55:50 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:45972 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754863Ab2ENIzt (ORCPT ); Mon, 14 May 2012 04:55:49 -0400 Message-ID: <4FB0C888.8070805@gmail.com> Date: Mon, 14 May 2012 16:55:36 +0800 From: Cong Wang User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Hugh Dickins CC: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 05/12/2012 07:59 PM, Hugh Dickins wrote: > + VM_BUG_ON(!PageLocked(oldpage)); > + __set_page_locked(newpage); > + VM_BUG_ON(!PageUptodate(oldpage)); > + SetPageUptodate(newpage); > + VM_BUG_ON(!PageSwapBacked(oldpage)); > + SetPageSwapBacked(newpage); > + VM_BUG_ON(!swap_index); > + set_page_private(newpage, swap_index); > + VM_BUG_ON(!PageSwapCache(oldpage)); > + SetPageSwapCache(newpage); > + Are all of these VM_BUG_ON's necessary? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755339Ab2ENJIP (ORCPT ); Mon, 14 May 2012 05:08:15 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:38172 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754894Ab2ENJIN (ORCPT ); Mon, 14 May 2012 05:08:13 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.7.4 Message-ID: <4FB0CAFF.6060705@jp.fujitsu.com> Date: Mon, 14 May 2012 18:06:07 +0900 From: KAMEZAWA Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Hugh Dickins CC: Andrew Morton , Christoph Hellwig , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/05/12 20:59), Hugh Dickins wrote: > The GMA500 GPU driver uses GEM shmem objects, but with a new twist: > the backing RAM has to be below 4GB. Not a problem while the boards > supported only 4GB: but now Intel's D2700MUD boards support 8GB, and > their GMA3600 is managed by the GMA500 driver. > > shmem/tmpfs has never pretended to support hardware restrictions on > the backing memory, but it might have appeared to do so before v3.1, > and even now it works fine until a page is swapped out then back in. > When read_cache_page_gfp() supplied a freshly allocated page for copy, > that compensated for whatever choice might have been made by earlier > swapin readahead; but swapoff was likely to destroy the illusion. > > We'd like to continue to support GMA500, so now add a new > shmem_should_replace_page() check on the zone when about to move > a page from swapcache to filecache (in swapin and swapoff cases), > with shmem_replace_page() to allocate and substitute a suitable page > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > This does involve a minor extension to mem_cgroup_replace_page_cache() > (the page may or may not have already been charged); and I've removed > a comment and call to mem_cgroup_uncharge_cache_page(), which in fact > is always a no-op while PageSwapCache. > > Also removed optimization of an unlikely path in shmem_getpage_gfp(), > now that we need to check PageSwapCache more carefully (a racing caller > might already have made the copy). And at one point shmem_unuse_inode() > needs to use the hitherto private page_swapcount(), to guard against > racing with inode eviction. > > It would make sense to extend shmem_should_replace_page(), to cover > cpuset and NUMA mempolicy restrictions too, but set that aside for > now: needs a cleanup of shmem mempolicy handling, and more testing, > and ought to handle swap faults in do_swap_page() as well as shmem. > > Signed-off-by: Hugh Dickins > --- Acked-by: KAMEZAWA Hiroyuki From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755371Ab2ENJfK (ORCPT ); Mon, 14 May 2012 05:35:10 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:44351 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754984Ab2ENJfI (ORCPT ); Mon, 14 May 2012 05:35:08 -0400 Message-ID: <4FB0D1C6.7090801@gmail.com> Date: Mon, 14 May 2012 17:35:02 +0800 From: Cong Wang User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Hugh Dickins CC: Andrew Morton , Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 05/12/2012 08:02 PM, Hugh Dickins wrote: > Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() > overhead on most common writes): set MS_NOSEC on its superblocks. > > Signed-off-by: Hugh Dickins > --- > mm/shmem.c | 1 + > 1 file changed, 1 insertion(+) > > --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 > +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 > @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block > } > } > sb->s_export_op =&shmem_export_ops; > + sb->s_flags |= MS_NOSEC; Isn't setting the flag on inode better? Something like: diff --git a/mm/shmem.c b/mm/shmem.c index f99ff3e..7d98fb5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo) { struct shmem_inode_info *info = foo; inode_init_once(&info->vfs_inode); + info->vfs_inode.i_flags |= S_NOSEC; } static int shmem_init_inodecache(void) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932251Ab2ENTsf (ORCPT ); Mon, 14 May 2012 15:48:35 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:49428 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757713Ab2ENTse (ORCPT ); Mon, 14 May 2012 15:48:34 -0400 Date: Mon, 14 May 2012 12:48:16 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Cong Wang cc: Andrew Morton , Christoph Hellwig , Andi Kleen , Al Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization In-Reply-To: <4FB0D1C6.7090801@gmail.com> Message-ID: References: <4FB0D1C6.7090801@gmail.com> User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 14 May 2012, Cong Wang wrote: > On 05/12/2012 08:02 PM, Hugh Dickins wrote: > > Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() > > overhead on most common writes): set MS_NOSEC on its superblocks. > > > > Signed-off-by: Hugh Dickins > > --- > > mm/shmem.c | 1 + > > 1 file changed, 1 insertion(+) > > > > --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700 > > +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700 > > @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block > > } > > } > > sb->s_export_op =&shmem_export_ops; > > + sb->s_flags |= MS_NOSEC; > > Isn't setting the flag on inode better? Something like: I don't think so. The MS_NOSEC S_NOSEC business is fairly subtle, and easy to miss if it's gone wrong, so I would much rather follow the established pattern in local block filesystems: which is to set MS_NOSEC in superblock flags, and leave S_NOSEC to file_remove_suid(). Hugh > > diff --git a/mm/shmem.c b/mm/shmem.c > index f99ff3e..7d98fb5 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo) > { > struct shmem_inode_info *info = foo; > inode_init_once(&info->vfs_inode); > + info->vfs_inode.i_flags |= S_NOSEC; > } > > static int shmem_init_inodecache(void) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758256Ab2ENXNe (ORCPT ); Mon, 14 May 2012 19:13:34 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:56330 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757353Ab2ENXNc (ORCPT ); Mon, 14 May 2012 19:13:32 -0400 Date: Mon, 14 May 2012 16:13:30 -0700 From: Andrew Morton To: Hugh Dickins Cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone Message-Id: <20120514161330.def0ac52.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 12 May 2012 04:59:56 -0700 (PDT) Hugh Dickins wrote: > The GMA500 GPU driver uses GEM shmem objects, but with a new twist: > the backing RAM has to be below 4GB. Not a problem while the boards > supported only 4GB: but now Intel's D2700MUD boards support 8GB, and > their GMA3600 is managed by the GMA500 driver. > > shmem/tmpfs has never pretended to support hardware restrictions on > the backing memory, but it might have appeared to do so before v3.1, > and even now it works fine until a page is swapped out then back in. > When read_cache_page_gfp() supplied a freshly allocated page for copy, > that compensated for whatever choice might have been made by earlier > swapin readahead; but swapoff was likely to destroy the illusion. > > We'd like to continue to support GMA500, so now add a new > shmem_should_replace_page() check on the zone when about to move > a page from swapcache to filecache (in swapin and swapoff cases), > with shmem_replace_page() to allocate and substitute a suitable page > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > This does involve a minor extension to mem_cgroup_replace_page_cache() > (the page may or may not have already been charged); and I've removed > a comment and call to mem_cgroup_uncharge_cache_page(), which in fact > is always a no-op while PageSwapCache. > > Also removed optimization of an unlikely path in shmem_getpage_gfp(), > now that we need to check PageSwapCache more carefully (a racing caller > might already have made the copy). And at one point shmem_unuse_inode() > needs to use the hitherto private page_swapcount(), to guard against > racing with inode eviction. > > It would make sense to extend shmem_should_replace_page(), to cover > cpuset and NUMA mempolicy restrictions too, but set that aside for > now: needs a cleanup of shmem mempolicy handling, and more testing, > and ought to handle swap faults in do_swap_page() as well as shmem. > > ... > > static int shmem_unuse_inode(struct shmem_inode_info *info, > - swp_entry_t swap, struct page *page) > + swp_entry_t swap, struct page **pagep) > { > struct address_space *mapping = info->vfs_inode.i_mapping; > void *radswap; > pgoff_t index; > - int error; > + gfp_t gfp; > + int error = 0; > > radswap = swp_to_radix_entry(swap); > index = radix_tree_locate_item(&mapping->page_tree, radswap); > @@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme > if (shmem_swaplist.next != &info->swaplist) > list_move_tail(&shmem_swaplist, &info->swaplist); > > + gfp = mapping_gfp_mask(mapping); > + if (shmem_should_replace_page(*pagep, gfp)) { > + mutex_unlock(&shmem_swaplist_mutex); > + error = shmem_replace_page(pagep, gfp, info, index); > + mutex_lock(&shmem_swaplist_mutex); > + /* > + * We needed to drop mutex to make that restrictive page > + * allocation; but the inode might already be freed by now, > + * and we cannot refer to inode or mapping or info to check. > + * However, we do hold page lock on the PageSwapCache page, > + * so can check if that still has our reference remaining. > + */ > + if (!page_swapcount(*pagep)) > + error = -ENOENT; This has my head spinning a bit. What is "our reference"? I'd expect that to mean a temporary reference which was taken by this thread of control. But such a thing has no relevance when trying to determine the state of the page and/or data structures which refer to it. Also, what are we trying to determine here with this test? Whether the page was removed from swapcache under our feet? Presumably not, as it is locked. So perhaps you could spell out in more detail what we're trying to do here, and what contributes to page_swapcount() here? > + } > + > /* > * We rely on shmem_swaplist_mutex, not only to protect the swaplist, > * but also to hold up shmem_evict_inode(): so inode cannot be freed > * beneath us (pagelock doesn't help until the page is in pagecache). > */ > - error = shmem_add_to_page_cache(page, mapping, index, > + if (!error) > + error = shmem_add_to_page_cache(*pagep, mapping, index, > GFP_NOWAIT, radswap); > - /* which does mem_cgroup_uncharge_cache_page on error */ > - > if (error != -ENOMEM) { > /* > * Truncation and eviction use free_swap_and_cache(), which > * only does trylock page: if we raced, best clean up here. > */ > - delete_from_swap_cache(page); > - set_page_dirty(page); > + delete_from_swap_cache(*pagep); > + set_page_dirty(*pagep); > if (!error) { > spin_lock(&info->lock); > info->swapped--; > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > struct list_head *this, *next; > struct shmem_inode_info *info; > int found = 0; > - int error; > + int error = 0; > + > + /* > + * There's a faint possibility that swap page was replaced before > + * caller locked it: it will come back later with the right page. So a caller locked the page then failed to check that it's still the right sort of page? Shouldn't the caller locally clean up its own mess rather than requiring a callee to know about the caller's intricate shortcomings? > + */ > + if (unlikely(!PageSwapCache(page))) > + goto out; > > /* > * Charge page using GFP_KERNEL while we can wait, before taking > > ... > > @@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge > #endif > > /* > + * When a page is moved from swapcache to shmem filecache (either by the > + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of > + * shmem_unuse_inode()), it may have been read in earlier from swap, in > + * ignorance of the mapping it belongs to. If that mapping has special > + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), > + * we may need to copy to a suitable page before moving to filecache. > + * > + * In a future release, this may well be extended to respect cpuset and > + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); > + * but for now it is a simple matter of zone. > + */ > +static bool shmem_should_replace_page(struct page *page, gfp_t gfp) > +{ > + return page_zonenum(page) > gfp_zone(gfp); > +} > + > +static int shmem_replace_page(struct page **pagep, gfp_t gfp, > + struct shmem_inode_info *info, pgoff_t index) > +{ > + struct page *oldpage, *newpage; > + struct address_space *swap_mapping; > + pgoff_t swap_index; > + int error; > + > + oldpage = *pagep; > + swap_index = page_private(oldpage); > + swap_mapping = page_mapping(oldpage); > + > + /* > + * We have arrived here because our zones are constrained, so don't > + * limit chance of success by further cpuset and node constraints. > + */ > + gfp &= ~GFP_CONSTRAINT_MASK; > + newpage = shmem_alloc_page(gfp, info, index); > + if (!newpage) > + return -ENOMEM; > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > + > + *pagep = newpage; > + page_cache_get(newpage); > + copy_highpage(newpage, oldpage); copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? > + VM_BUG_ON(!PageLocked(oldpage)); > + __set_page_locked(newpage); > + VM_BUG_ON(!PageUptodate(oldpage)); > + SetPageUptodate(newpage); > + VM_BUG_ON(!PageSwapBacked(oldpage)); > + SetPageSwapBacked(newpage); > + VM_BUG_ON(!swap_index); > + set_page_private(newpage, swap_index); > + VM_BUG_ON(!PageSwapCache(oldpage)); > + SetPageSwapCache(newpage); > + > + /* > + * Our caller will very soon move newpage out of swapcache, but it's > + * a nice clean interface for us to replace oldpage by newpage there. > + */ > + spin_lock_irq(&swap_mapping->tree_lock); > + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, > + newpage); > + __inc_zone_page_state(newpage, NR_FILE_PAGES); > + __dec_zone_page_state(oldpage, NR_FILE_PAGES); > + spin_unlock_irq(&swap_mapping->tree_lock); > + BUG_ON(error); > + > + mem_cgroup_replace_page_cache(oldpage, newpage); > + lru_cache_add_anon(newpage); > + > + ClearPageSwapCache(oldpage); > + set_page_private(oldpage, 0); > + > + unlock_page(oldpage); > + page_cache_release(oldpage); > + page_cache_release(oldpage); > + return 0; > +} shmem_replace_page() is a fairly generic and unexceptional sounding thing. Methinks shmem_substitute_page() would be a better name. > +/* > * shmem_getpage_gfp - find page in cache, or get from swap, or allocate > * > * If we allocate a new one we do not mark it dirty. That's up to the > > ... > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932310Ab2EOEIL (ORCPT ); Tue, 15 May 2012 00:08:11 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:45715 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751069Ab2EOEIF (ORCPT ); Tue, 15 May 2012 00:08:05 -0400 Date: Mon, 14 May 2012 21:07:41 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone In-Reply-To: <20120514161330.def0ac52.akpm@linux-foundation.org> Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 14 May 2012, Andrew Morton wrote: > On Sat, 12 May 2012 04:59:56 -0700 (PDT) > Hugh Dickins wrote: > > > > We'd like to continue to support GMA500, so now add a new > > shmem_should_replace_page() check on the zone when about to move > > a page from swapcache to filecache (in swapin and swapoff cases), > > with shmem_replace_page() to allocate and substitute a suitable page > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > ... > > + gfp = mapping_gfp_mask(mapping); > > + if (shmem_should_replace_page(*pagep, gfp)) { > > + mutex_unlock(&shmem_swaplist_mutex); > > + error = shmem_replace_page(pagep, gfp, info, index); > > + mutex_lock(&shmem_swaplist_mutex); > > + /* > > + * We needed to drop mutex to make that restrictive page > > + * allocation; but the inode might already be freed by now, > > + * and we cannot refer to inode or mapping or info to check. > > + * However, we do hold page lock on the PageSwapCache page, > > + * so can check if that still has our reference remaining. > > + */ > > + if (!page_swapcount(*pagep)) > > + error = -ENOENT; > > This has my head spinning a bit. What is "our reference"? I'd expect > that to mean a temporary reference which was taken by this thread of > control. (I'm sure you'll prefer a reworking of that comment in an incremental fixes patch, but let me try to explain better here too.) No, I didn't mean a temporary reference taken by this (swapoff) thread, but the reference (swap entry) which has just been located in the inode's radix_tree, just before this hunk: which would be tracked by page_swapcount 1 (there's also a page swapcache bit in the swap_map along with the count, corresponding to the reference from the swapcache page itself, but that's not included in page_swapcount). > But such a thing has no relevance when trying to determine > the state of the page and/or data structures which refer to it. I don't understand you there, but maybe it won't matter. > > Also, what are we trying to determine here with this test? Whether the > page was removed from swapcache under our feet? Presumably not, as it > is locked. > > So perhaps you could spell out in more detail what we're trying to do > here, and what contributes to page_swapcount() here? The danger here is that the inode we're dealing with has gone through shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was certainly in use before, and shmem_swaplist_mutex (together with inode being on shmem_swaplist) holds it up from being evicted and freed; but once we drop the mutex, it could go away at any moment. We cannot determine that by looking at struct inode or struct address_space or struct shmem_inode_info, they're all part of what would be freed; but we cannot proceed to shmem_add_to_page_cache() once they're freed. How to tell whether it's been freed? Once upon a time I "solved" it with igrab() and iput(), but Konstantin demonstrated how that gives no safety against unmounting, and I remain reluctant to go back to relying upon filesystem semantics to solve this. It occurred to me that the inode cannot be freed until that radix_tree entry has been removed (by shmem_evict_inode's shmem_truncate_range), and the act or removing that entry (free_swap_and_cache) brings page_swapcount down from 1 to 0. You're thinking that the page cannot be removed from swapcache while we hold page lock: correct, but... free_swap_and_cache() only does a trylock_page(), and happily leaves the swapcache page to be garbage collected later if it cannot get the page lock. (And I certainly would not want to change it to wait for page lock.) So, the inode can get evicted while the page is still in swapcache: the page lock gives no protection against that, until the page itself gets into the radix_tree. I doubt that writing this essay into a comment there will be the right thing to do (and I may still be losing you); but I shall try to rewrite it, and if there's one missing fact that needs highlighting, it probably is that last, that free_swap_and_cache() only does a trylock, so our page lock does not protect the inode from eviction. (At this moment, I can't think what is the relevance of my comment "we do hold page lock on the PageSwapCache page": in other contexts it would be important, but here in swapoff we know that that swap cannot get reused, or not before we're done.) > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > > struct list_head *this, *next; > > struct shmem_inode_info *info; > > int found = 0; > > - int error; > > + int error = 0; > > + > > + /* > > + * There's a faint possibility that swap page was replaced before > > + * caller locked it: it will come back later with the right page. > > So a caller locked the page then failed to check that it's still the > right sort of page? Shouldn't the caller locally clean up its own mess > rather than requiring a callee to know about the caller's intricate > shortcomings? The caller being try_to_unuse(). You're certainly not the first to argue that way. Perhaps I'm a bit perverse, in letting code which works even in the surprising cases, remain as it is without weeding out those surprising cases. And on this occasion didn't want to add an additional dependence on a slight subtle change in mm/swapfile.c functionality. Hmm, yes, I do still prefer to have the check here in shmem.c: particularly since it is this "shmem_replace_page" business which is increasing the likelihood of such a race, and making further demands on it (if we're going to make the copied page PageSwapCache, then we need to be sure that the page it's replacing was PageSwapCache - though that's something I need to think through again in the light of the race which I thought of in responding to Cong). > > + newpage = shmem_alloc_page(gfp, info, index); > > + if (!newpage) > > + return -ENOMEM; > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > > + > > + *pagep = newpage; > > + page_cache_get(newpage); > > + copy_highpage(newpage, oldpage); > > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? Ooh, I'm pretty sure you're right that we do need flush_dcache_page() there: good catch, thank you. We can't use copy_user_highpage() because in general we don't know any address and vma; but should be following the shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate. > > shmem_replace_page() is a fairly generic and unexceptional sounding > thing. Methinks shmem_substitute_page() would be a better name. Okay, shmem_replace_page() seemed appropriate to me (especially thinking of it as "re-place"), but I don't mind changing to shmem_substitute_page(). The flush_dcache_page() addition is important, but until people are using GMA500 on ARM or something (I doubt that combination) with more than 4GB, this code is not coming into play - so I'm not breaking anyone's system if it sneaks into linux-next before I fix that. The main thing I need to think through quietly is the slippery swap race: I'll send you an incremental patch to fix all these up once I'm satisfied on that. Thanks, Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758483Ab2EOIvX (ORCPT ); Tue, 15 May 2012 04:51:23 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:39874 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758434Ab2EOIvR (ORCPT ); Tue, 15 May 2012 04:51:17 -0400 MIME-Version: 1.0 In-Reply-To: References: Date: Tue, 15 May 2012 18:51:16 +1000 Message-ID: Subject: Re: [PATCH 3/10] tmpfs: optimize clearing when writing From: Nick Piggin To: Hugh Dickins Cc: Andrew Morton , Christoph Hellwig , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by nfs id q4F8pUPT009941 On 12 May 2012 22:04, Hugh Dickins wrote: > Nick proposed years ago that tmpfs should avoid clearing its pages where > write will overwrite them with new data, as ramfs has long done.  But I > messed it up and just got bad data.  Tried again recently, it works fine. > > Here's time output for writing 4GiB 16 times on this Core i5 laptop: > > before: real    0m21.169s user  0m0.028s sys    0m21.057s >        real    0m21.382s user  0m0.016s sys    0m21.289s >        real    0m21.311s user  0m0.020s sys    0m21.217s > > after:  real    0m18.273s user  0m0.032s sys    0m18.165s >        real    0m18.354s user  0m0.020s sys    0m18.265s >        real    0m18.440s user  0m0.032s sys    0m18.337s > > ramfs:  real    0m16.860s user  0m0.028s sys    0m16.765s >        real    0m17.382s user  0m0.040s sys    0m17.273s >        real    0m17.133s user  0m0.044s sys    0m17.021s Cool, thanks Hugh! Very big speedup. > > Yes, I have done perf reports, but they need more explanation than they > deserve: in summary, clear_page vanishes, its cache loading shifts into > copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly > mark_page_accessed goes way up - I think because they are respectively > where the cache gets to be reloaded after being purged by clear or copy. > > Suggested-by: Nick Piggin > Signed-off-by: Hugh Dickins > --- >  mm/shmem.c |   20 +++++++++++++++++--- >  1 file changed, 17 insertions(+), 3 deletions(-) > > --- 3045N.orig/mm/shmem.c       2012-05-05 10:46:05.732062006 -0700 > +++ 3045N/mm/shmem.c    2012-05-05 10:46:12.316062172 -0700 > @@ -1095,9 +1095,14 @@ repeat: >                shmem_recalc_inode(inode); >                spin_unlock(&info->lock); > > -               clear_highpage(page); > -               flush_dcache_page(page); > -               SetPageUptodate(page); > +               /* > +                * Let SGP_WRITE caller clear ends if write does not fill page > +                */ > +               if (sgp != SGP_WRITE) { > +                       clear_highpage(page); > +                       flush_dcache_page(page); > +                       SetPageUptodate(page); > +               } >                if (sgp == SGP_DIRTY) >                        set_page_dirty(page); >        } > @@ -1307,6 +1312,14 @@ shmem_write_end(struct file *file, struc >        if (pos + copied > inode->i_size) >                i_size_write(inode, pos + copied); > > +       if (!PageUptodate(page)) { > +               if (copied < PAGE_CACHE_SIZE) { > +                       unsigned from = pos & (PAGE_CACHE_SIZE - 1); > +                       zero_user_segments(page, 0, from, > +                                       from + copied, PAGE_CACHE_SIZE); > +               } > +               SetPageUptodate(page); > +       } >        set_page_dirty(page); >        unlock_page(page); >        page_cache_release(page); > @@ -1768,6 +1781,7 @@ static int shmem_symlink(struct inode *d >                kaddr = kmap_atomic(page); >                memcpy(kaddr, symname, len); >                kunmap_atomic(kaddr); > +               SetPageUptodate(page); >                set_page_dirty(page); >                unlock_page(page); >                page_cache_release(page); {.n++%ݶw{.n+{G{ayʇڙ,jfhz_(階ݢj"mG?&~iOzv^m ?I From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756731Ab2EOJox (ORCPT ); Tue, 15 May 2012 05:44:53 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:41480 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752272Ab2EOJou (ORCPT ); Tue, 15 May 2012 05:44:50 -0400 Message-ID: <4FB22589.9050406@gmail.com> Date: Tue, 15 May 2012 17:44:41 +0800 From: Cong Wang User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Hugh Dickins CC: Andrew Morton , Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone References: <4FB0C888.8070805@gmail.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 05/15/2012 03:42 AM, Hugh Dickins wrote: > I'm not going to rush the incremental patch to fix this: need to think > about it quietly first. > > If you're wondering what I'm talking about (sorry, I don't have time > to explain more right now), take a look at comment and git history of > line 2956 (in 3.4-rc7) of mm/memory.c: > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > I don't suppose anyone ever actually hit the bug in the years before > we added that protection, but we still ought to guard against it, > there and here in shmem_replace_page(). > Ok, I have no objections. Thanks for your patches anyway! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759857Ab2ESBmF (ORCPT ); Fri, 18 May 2012 21:42:05 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:39171 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757089Ab2ESBmA (ORCPT ); Fri, 18 May 2012 21:42:00 -0400 Date: Fri, 18 May 2012 18:41:38 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone In-Reply-To: Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 14 May 2012, Hugh Dickins wrote: > On Mon, 14 May 2012, Andrew Morton wrote: > > On Sat, 12 May 2012 04:59:56 -0700 (PDT) > > Hugh Dickins wrote: > > > > > > We'd like to continue to support GMA500, so now add a new > > > shmem_should_replace_page() check on the zone when about to move > > > a page from swapcache to filecache (in swapin and swapoff cases), > > > with shmem_replace_page() to allocate and substitute a suitable page > > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > > > ... > > > + gfp = mapping_gfp_mask(mapping); > > > + if (shmem_should_replace_page(*pagep, gfp)) { > > > + mutex_unlock(&shmem_swaplist_mutex); > > > + error = shmem_replace_page(pagep, gfp, info, index); > > > + mutex_lock(&shmem_swaplist_mutex); > > > + /* > > > + * We needed to drop mutex to make that restrictive page > > > + * allocation; but the inode might already be freed by now, > > > + * and we cannot refer to inode or mapping or info to check. > > > + * However, we do hold page lock on the PageSwapCache page, > > > + * so can check if that still has our reference remaining. > > > + */ > > > + if (!page_swapcount(*pagep)) > > > + error = -ENOENT; > > > > This has my head spinning a bit. What is "our reference"? I'd expect > > that to mean a temporary reference which was taken by this thread of > > control. > > (I'm sure you'll prefer a reworking of that comment in an incremental > fixes patch, but let me try to explain better here too.) > > No, I didn't mean a temporary reference taken by this (swapoff) thread, > but the reference (swap entry) which has just been located in the inode's > radix_tree, just before this hunk: which would be tracked by page_swapcount > 1 (there's also a page swapcache bit in the swap_map along with the count, > corresponding to the reference from the swapcache page itself, but that's > not included in page_swapcount). > > > But such a thing has no relevance when trying to determine > > the state of the page and/or data structures which refer to it. > > I don't understand you there, but maybe it won't matter. > > > > > Also, what are we trying to determine here with this test? Whether the > > page was removed from swapcache under our feet? Presumably not, as it > > is locked. > > > > So perhaps you could spell out in more detail what we're trying to do > > here, and what contributes to page_swapcount() here? > > The danger here is that the inode we're dealing with has gone through > shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was > certainly in use before, and shmem_swaplist_mutex (together with inode > being on shmem_swaplist) holds it up from being evicted and freed; but > once we drop the mutex, it could go away at any moment. We cannot > determine that by looking at struct inode or struct address_space or > struct shmem_inode_info, they're all part of what would be freed; > but we cannot proceed to shmem_add_to_page_cache() once they're freed. > How to tell whether it's been freed? > > Once upon a time I "solved" it with igrab() and iput(), but Konstantin > demonstrated how that gives no safety against unmounting, and I remain > reluctant to go back to relying upon filesystem semantics to solve this. > > It occurred to me that the inode cannot be freed until that radix_tree > entry has been removed (by shmem_evict_inode's shmem_truncate_range), > and the act or removing that entry (free_swap_and_cache) brings > page_swapcount down from 1 to 0. > > You're thinking that the page cannot be removed from swapcache while > we hold page lock: correct, but... free_swap_and_cache() only does a > trylock_page(), and happily leaves the swapcache page to be garbage > collected later if it cannot get the page lock. (And I certainly > would not want to change it to wait for page lock.) So, the inode > can get evicted while the page is still in swapcache: the page lock > gives no protection against that, until the page itself gets into > the radix_tree. > > I doubt that writing this essay into a comment there will be the > right thing to do (and I may still be losing you); but I shall try > to rewrite it, and if there's one missing fact that needs highlighting, > it probably is that last, that free_swap_and_cache() only does a trylock, > so our page lock does not protect the inode from eviction. > > (At this moment, I can't think what is the relevance of my comment > "we do hold page lock on the PageSwapCache page": in other contexts it > would be important, but here in swapoff we know that that swap cannot > get reused, or not before we're done.) > > > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > > > struct list_head *this, *next; > > > struct shmem_inode_info *info; > > > int found = 0; > > > - int error; > > > + int error = 0; > > > + > > > + /* > > > + * There's a faint possibility that swap page was replaced before > > > + * caller locked it: it will come back later with the right page. > > > > So a caller locked the page then failed to check that it's still the > > right sort of page? Shouldn't the caller locally clean up its own mess > > rather than requiring a callee to know about the caller's intricate > > shortcomings? > > The caller being try_to_unuse(). You're certainly not the first to argue > that way. Perhaps I'm a bit perverse, in letting code which works even > in the surprising cases, remain as it is without weeding out those > surprising cases. And on this occasion didn't want to add an additional > dependence on a slight subtle change in mm/swapfile.c functionality. > > Hmm, yes, I do still prefer to have the check here in shmem.c: > particularly since it is this "shmem_replace_page" business which is > increasing the likelihood of such a race, and making further demands > on it (if we're going to make the copied page PageSwapCache, then we > need to be sure that the page it's replacing was PageSwapCache - though > that's something I need to think through again in the light of the race > which I thought of in responding to Cong). > > > > + newpage = shmem_alloc_page(gfp, info, index); > > > + if (!newpage) > > > + return -ENOMEM; > > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > > > + > > > + *pagep = newpage; > > > + page_cache_get(newpage); > > > + copy_highpage(newpage, oldpage); > > > > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? > > Ooh, I'm pretty sure you're right that we do need flush_dcache_page() > there: good catch, thank you. We can't use copy_user_highpage() because > in general we don't know any address and vma; but should be following the > shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate. > > > > > shmem_replace_page() is a fairly generic and unexceptional sounding > > thing. Methinks shmem_substitute_page() would be a better name. > > Okay, shmem_replace_page() seemed appropriate to me (especially thinking > of it as "re-place"), but I don't mind changing to shmem_substitute_page(). > > The flush_dcache_page() addition is important, but until people are > using GMA500 on ARM or something (I doubt that combination) with more > than 4GB, this code is not coming into play - so I'm not breaking anyone's > system if it sneaks into linux-next before I fix that. > > The main thing I need to think through quietly is the slippery swap race: > I'll send you an incremental patch to fix all these up once I'm satisfied > on that. I promised you an incremental, but that's not really possible because of the name changes from "replace" to "substitute". So I'll be sending you a v2 patch in a moment, to replace (or substitute for) the original 1/10. It responds to feedback comment: 1. "substitute" instead of "replace" [akpm] 2. more explanation of page_swapcount test [akpm] 3. flush_dcache_page after copy_highpage [akpm] 4. removal of excessive VM_BUG_ONs [wangcong] 5. check page_private before and error path within substitute_page [hughd] See below for a diff from v1 for review, omitting replace->substitute mods. Please don't be disappointed if I send you a further patch to shmem_substitute_page() in the weeks ahead: although the page_private checks I've added in this one make it very very very unlikely, and its consequence very probably benign, there is still a surprising (never observed) race by which shmem_getpage_gfp() could get hold of someone else's swap. It's correctly resolved by shmem_add_to_page_cache(), but by that time we have already done a mem_cgroup charge, and now also this substitution. It would be better to rearrange a little here, to eliminate all chance of that surprise: I hoped to complete that earlier, but now think I'd better get the safer intermediate version to you first. Thanks, Hugh --- mm/shmem.c | 57 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 37 insertions(+), 20 deletions(-) --- 3045N.orig/mm/shmem.c 2012-05-17 16:28:43.278076430 -0700 +++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700 @@ -636,10 +636,21 @@ static int shmem_unuse_inode(struct shme mutex_lock(&shmem_swaplist_mutex); /* * We needed to drop mutex to make that restrictive page - * allocation; but the inode might already be freed by now, - * and we cannot refer to inode or mapping or info to check. - * However, we do hold page lock on the PageSwapCache page, - * so can check if that still has our reference remaining. + * allocation, but the inode might have been freed while we + * dropped it: although a racing shmem_evict_inode() cannot + * complete without emptying the radix_tree, our page lock + * on this swapcache page is not enough to prevent that - + * free_swap_and_cache() of our swap entry will only + * trylock_page(), removing swap from radix_tree whatever. + * + * We must not proceed to shmem_add_to_page_cache() if the + * inode has been freed, but of course we cannot rely on + * inode or mapping or info to check that. However, we can + * safely check if our swap entry is still in use (and here + * it can't have got reused for another page): if it's still + * in use, then the inode cannot have been freed yet, and we + * can safely proceed (if it's no longer in use, that tells + * nothing about the inode, but we don't need to unuse swap). */ if (!page_swapcount(*pagep)) error = -ENOENT; @@ -683,9 +694,9 @@ int shmem_unuse(swp_entry_t swap, struct /* * There's a faint possibility that swap page was substituted before - * caller locked it: it will come back later with the right page. + * caller locked it: caller will come back later with the right page. */ - if (unlikely(!PageSwapCache(page))) + if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) goto out; /* @@ -916,21 +927,15 @@ static int shmem_substitute_page(struct newpage = shmem_alloc_page(gfp, info, index); if (!newpage) return -ENOMEM; - VM_BUG_ON(shmem_should_substitute_page(newpage, gfp)); - *pagep = newpage; page_cache_get(newpage); copy_highpage(newpage, oldpage); + flush_dcache_page(newpage); - VM_BUG_ON(!PageLocked(oldpage)); __set_page_locked(newpage); - VM_BUG_ON(!PageUptodate(oldpage)); SetPageUptodate(newpage); - VM_BUG_ON(!PageSwapBacked(oldpage)); SetPageSwapBacked(newpage); - VM_BUG_ON(!swap_index); set_page_private(newpage, swap_index); - VM_BUG_ON(!PageSwapCache(oldpage)); SetPageSwapCache(newpage); /* @@ -940,13 +945,24 @@ static int shmem_substitute_page(struct spin_lock_irq(&swap_mapping->tree_lock); error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, newpage); - __inc_zone_page_state(newpage, NR_FILE_PAGES); - __dec_zone_page_state(oldpage, NR_FILE_PAGES); + if (!error) { + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + } spin_unlock_irq(&swap_mapping->tree_lock); - BUG_ON(error); - mem_cgroup_replace_page_cache(oldpage, newpage); - lru_cache_add_anon(newpage); + if (unlikely(error)) { + /* + * Is this possible? I think not, now that our callers check + * both PageSwapCache and page_private after getting page lock; + * but be defensive. Reverse old to newpage for clear and free. + */ + oldpage = newpage; + } else { + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + *pagep = newpage; + } ClearPageSwapCache(oldpage); set_page_private(oldpage, 0); @@ -954,7 +970,7 @@ static int shmem_substitute_page(struct unlock_page(oldpage); page_cache_release(oldpage); page_cache_release(oldpage); - return 0; + return error; } /* @@ -1025,7 +1041,8 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); - if (!PageSwapCache(page) || page->mapping) { + if (!PageSwapCache(page) || page_private(page) != swap.val || + page->mapping) { error = -EEXIST; /* try again */ goto failed; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759938Ab2ESBok (ORCPT ); Fri, 18 May 2012 21:44:40 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:37733 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933301Ab2ESBod (ORCPT ); Fri, 18 May 2012 21:44:33 -0400 Date: Fri, 18 May 2012 18:44:17 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Christoph Hellwig , KAMEZAWA Hiroyuki , Alan Cox , Stephane Marchesin , Andi Kleen , Dave Airlie , Daniel Vetter , Rob Clark , Cong Wang , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v2 1/10] shmem: substitute page if mapping excludes its zone In-Reply-To: Message-ID: References: <20120514161330.def0ac52.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We would like to continue to support GMA500, so now add a new shmem_should_substitute_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_substitute_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_substitute_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins Acked-by: KAMEZAWA Hiroyuki --- I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest in the i915 Sandybridge issue; but reiterate that this patch does nothing for that case. include/linux/swap.h | 6 + mm/memcontrol.c | 17 +++- mm/shmem.c | 158 ++++++++++++++++++++++++++++++++++++----- mm/swapfile.c | 2 4 files changed, 159 insertions(+), 24 deletions(-) --- 3045N.orig/include/linux/swap.h 2012-05-17 16:30:20.222078994 -0700 +++ 3045N/include/linux/swap.h 2012-05-17 16:30:29.786078930 -0700 @@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t, extern unsigned int count_swap_pages(int, int); extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); +extern int page_swapcount(struct page *); extern int reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; @@ -448,6 +449,11 @@ static inline void delete_from_swap_cach { } +static inline int page_swapcount(struct page *page) +{ + return 0; +} + #define reuse_swap_page(page) (page_mapcount(page) == 1) static inline int try_to_free_swap(struct page *page) --- 3045N.orig/mm/memcontrol.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/memcontrol.c 2012-05-17 16:30:29.786078930 -0700 @@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; struct page_cgroup *pc; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc pc = lookup_page_cgroup(oldpage); /* fix accounting on old pages */ lock_page_cgroup(pc); - memcg = pc->mem_cgroup; - mem_cgroup_charge_statistics(memcg, false, -1); - ClearPageCgroupUsed(pc); + if (PageCgroupUsed(pc)) { + memcg = pc->mem_cgroup; + mem_cgroup_charge_statistics(memcg, false, -1); + ClearPageCgroupUsed(pc); + } unlock_page_cgroup(pc); + /* + * When called from shmem_substitute_page(), in some cases the + * oldpage has already been charged, and in some cases not. + */ + if (!memcg) + return; + if (PageSwapBacked(oldpage)) type = MEM_CGROUP_CHARGE_TYPE_SHMEM; --- 3045N.orig/mm/shmem.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700 @@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i } #endif +static bool shmem_should_substitute_page(struct page *page, gfp_t gfp); +static int shmem_substitute_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index); static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); @@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino * If swap found in inode, free it and move page from swapcache to filecache. */ static int shmem_unuse_inode(struct shmem_inode_info *info, - swp_entry_t swap, struct page *page) + swp_entry_t swap, struct page **pagep) { struct address_space *mapping = info->vfs_inode.i_mapping; void *radswap; pgoff_t index; - int error; + gfp_t gfp; + int error = 0; radswap = swp_to_radix_entry(swap); index = radix_tree_locate_item(&mapping->page_tree, radswap); @@ -625,22 +629,48 @@ static int shmem_unuse_inode(struct shme if (shmem_swaplist.next != &info->swaplist) list_move_tail(&shmem_swaplist, &info->swaplist); + gfp = mapping_gfp_mask(mapping); + if (shmem_should_substitute_page(*pagep, gfp)) { + mutex_unlock(&shmem_swaplist_mutex); + error = shmem_substitute_page(pagep, gfp, info, index); + mutex_lock(&shmem_swaplist_mutex); + /* + * We needed to drop mutex to make that restrictive page + * allocation, but the inode might have been freed while we + * dropped it: although a racing shmem_evict_inode() cannot + * complete without emptying the radix_tree, our page lock + * on this swapcache page is not enough to prevent that - + * free_swap_and_cache() of our swap entry will only + * trylock_page(), removing swap from radix_tree whatever. + * + * We must not proceed to shmem_add_to_page_cache() if the + * inode has been freed, but of course we cannot rely on + * inode or mapping or info to check that. However, we can + * safely check if our swap entry is still in use (and here + * it can't have got reused for another page): if it's still + * in use, then the inode cannot have been freed yet, and we + * can safely proceed (if it's no longer in use, that tells + * nothing about the inode, but we don't need to unuse swap). + */ + if (!page_swapcount(*pagep)) + error = -ENOENT; + } + /* * We rely on shmem_swaplist_mutex, not only to protect the swaplist, * but also to hold up shmem_evict_inode(): so inode cannot be freed * beneath us (pagelock doesn't help until the page is in pagecache). */ - error = shmem_add_to_page_cache(page, mapping, index, + if (!error) + error = shmem_add_to_page_cache(*pagep, mapping, index, GFP_NOWAIT, radswap); - /* which does mem_cgroup_uncharge_cache_page on error */ - if (error != -ENOMEM) { /* * Truncation and eviction use free_swap_and_cache(), which * only does trylock page: if we raced, best clean up here. */ - delete_from_swap_cache(page); - set_page_dirty(page); + delete_from_swap_cache(*pagep); + set_page_dirty(*pagep); if (!error) { spin_lock(&info->lock); info->swapped--; @@ -660,7 +690,14 @@ int shmem_unuse(swp_entry_t swap, struct struct list_head *this, *next; struct shmem_inode_info *info; int found = 0; - int error; + int error = 0; + + /* + * There's a faint possibility that swap page was substituted before + * caller locked it: caller will come back later with the right page. + */ + if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) + goto out; /* * Charge page using GFP_KERNEL while we can wait, before taking @@ -676,7 +713,7 @@ int shmem_unuse(swp_entry_t swap, struct list_for_each_safe(this, next, &shmem_swaplist) { info = list_entry(this, struct shmem_inode_info, swaplist); if (info->swapped) - found = shmem_unuse_inode(info, swap, page); + found = shmem_unuse_inode(info, swap, &page); else list_del_init(&info->swaplist); cond_resched(); @@ -685,8 +722,6 @@ int shmem_unuse(swp_entry_t swap, struct } mutex_unlock(&shmem_swaplist_mutex); - if (!found) - mem_cgroup_uncharge_cache_page(page); if (found < 0) error = found; out: @@ -856,6 +891,89 @@ static inline struct mempolicy *shmem_ge #endif /* + * When a page is moved from swapcache to shmem filecache (either by the + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of + * shmem_unuse_inode()), it may have been read in earlier from swap, in + * ignorance of the mapping it belongs to. If that mapping has special + * constraints (like the gma500 GEM driver, which requires RAM below 4GB), + * we may need to copy to a suitable page before moving to filecache. + * + * In a future release, this may well be extended to respect cpuset and + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); + * but for now it is a simple matter of zone. + */ +static bool shmem_should_substitute_page(struct page *page, gfp_t gfp) +{ + return page_zonenum(page) > gfp_zone(gfp); +} + +static int shmem_substitute_page(struct page **pagep, gfp_t gfp, + struct shmem_inode_info *info, pgoff_t index) +{ + struct page *oldpage, *newpage; + struct address_space *swap_mapping; + pgoff_t swap_index; + int error; + + oldpage = *pagep; + swap_index = page_private(oldpage); + swap_mapping = page_mapping(oldpage); + + /* + * We have arrived here because our zones are constrained, so don't + * limit chance of success by further cpuset and node constraints. + */ + gfp &= ~GFP_CONSTRAINT_MASK; + newpage = shmem_alloc_page(gfp, info, index); + if (!newpage) + return -ENOMEM; + + page_cache_get(newpage); + copy_highpage(newpage, oldpage); + flush_dcache_page(newpage); + + __set_page_locked(newpage); + SetPageUptodate(newpage); + SetPageSwapBacked(newpage); + set_page_private(newpage, swap_index); + SetPageSwapCache(newpage); + + /* + * Our caller will very soon move newpage out of swapcache, but it's a + * nice clean interface for us to substitute newpage for oldpage there. + */ + spin_lock_irq(&swap_mapping->tree_lock); + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, + newpage); + if (!error) { + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + } + spin_unlock_irq(&swap_mapping->tree_lock); + + if (unlikely(error)) { + /* + * Is this possible? I think not, now that our callers check + * both PageSwapCache and page_private after getting page lock; + * but be defensive. Reverse old to newpage for clear and free. + */ + oldpage = newpage; + } else { + mem_cgroup_replace_page_cache(oldpage, newpage); + lru_cache_add_anon(newpage); + *pagep = newpage; + } + + ClearPageSwapCache(oldpage); + set_page_private(oldpage, 0); + + unlock_page(oldpage); + page_cache_release(oldpage); + page_cache_release(oldpage); + return error; +} + +/* * shmem_getpage_gfp - find page in cache, or get from swap, or allocate * * If we allocate a new one we do not mark it dirty. That's up to the @@ -923,19 +1041,21 @@ repeat: /* We have to do this with page locked to prevent races */ lock_page(page); + if (!PageSwapCache(page) || page_private(page) != swap.val || + page->mapping) { + error = -EEXIST; /* try again */ + goto failed; + } if (!PageUptodate(page)) { error = -EIO; goto failed; } wait_on_page_writeback(page); - /* Someone may have already done it for us */ - if (page->mapping) { - if (page->mapping == mapping && - page->index == index) - goto done; - error = -EEXIST; - goto failed; + if (shmem_should_substitute_page(page, gfp)) { + error = shmem_substitute_page(&page, gfp, info, index); + if (error) + goto failed; } error = mem_cgroup_cache_charge(page, current->mm, @@ -998,7 +1118,7 @@ repeat: if (sgp == SGP_DIRTY) set_page_dirty(page); } -done: + /* Perhaps the file has been truncated since we checked */ if (sgp != SGP_WRITE && ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { --- 3045N.orig/mm/swapfile.c 2012-05-17 16:30:20.226078835 -0700 +++ 3045N/mm/swapfile.c 2012-05-17 16:30:29.790078925 -0700 @@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s * This does not give an exact answer when swap count is continued, * but does include the high COUNT_CONTINUED flag to allow for that. */ -static inline int page_swapcount(struct page *page) +int page_swapcount(struct page *page) { int count = 0; struct swap_info_struct *p; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932739Ab2EVAIe (ORCPT ); Mon, 21 May 2012 20:08:34 -0400 Received: from mail-we0-f174.google.com ([74.125.82.174]:34915 "EHLO mail-we0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932253Ab2EVAIc convert rfc822-to-8bit (ORCPT ); Mon, 21 May 2012 20:08:32 -0400 MIME-Version: 1.0 In-Reply-To: References: Date: Mon, 21 May 2012 17:08:29 -0700 X-Google-Sender-Auth: QNXI5dNPEA2hExzCV4LQhnB2Kdc Message-ID: Subject: Re: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE From: john stultz To: Hugh Dickins , =?ISO-8859-1?Q?Arve_Hj=F8nnev=E5g?= Cc: Andrew Morton , Christoph Hellwig , Cong Wang , Al Viro , Colin Cross , John Stulz , Greg Kroah-Hartman , "Theodore Ts'o" , Andreas Dilger , Mark Fasheh , Joel Becker , Dave Chinner , Ben Myers , Michael Kerrisk , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, May 12, 2012 at 5:13 AM, Hugh Dickins wrote: > Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() > to use do_fallocate() instead of vmtruncate_range(): which extends > madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. > > There is one more user of vmtruncate_range() in our tree, staging/android's > ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned > areas are already unmapped - I don't know - then it would do better to use > shmem_truncate_range() directly). I suspect shmem_truncate_range directly would be the right approach, but am not totally sure. Arve: Any thoughts? Hugh: Do you have a git tree with this set available somewhere? I was working on my own tmpfs support for FALLOC_FL_PUNCH_HOLE, along with my volatile range work, so I'd like to rebase on top of your work here. thanks -john > > Based-on-patch-by: Cong Wang > Signed-off-by: Hugh Dickins > --- > drivers/staging/android/ashmem.c | 8 +++++--- > mm/madvise.c | 15 +++++++-------- > 2 files changed, 12 insertions(+), 11 deletions(-) > > --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564056626 -0700 > +++ 3045N/drivers/staging/android/ashmem.c 2012-05-05 10:46:25.692062478 -0700 > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker > > mutex_lock(&ashmem_mutex); > list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) { > - struct inode *inode = range->asma->file->f_dentry->d_inode; > loff_t start = range->pgstart * PAGE_SIZE; > - loff_t end = (range->pgend + 1) * PAGE_SIZE - 1; > + loff_t end = (range->pgend + 1) * PAGE_SIZE; > > - vmtruncate_range(inode, start, end); > + do_fallocate(range->asma->file, > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + start, end - start); > range->purged = ASHMEM_WAS_PURGED; > lru_del(range); > > --- 3045N.orig/mm/madvise.c 2012-05-05 10:42:33.572056784 -0700 > +++ 3045N/mm/madvise.c 2012-05-05 10:46:25.692062478 -0700 > @@ -11,8 +11,10 @@ > #include > #include > #include > +#include > #include > #include > +#include > > /* > * Any behaviour which results in changes to the vma->vm_flags needs to > @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are > struct vm_area_struct **prev, > unsigned long start, unsigned long end) > { > - struct address_space *mapping; > - loff_t offset, endoff; > + loff_t offset; > int error; > > *prev = NULL; /* tell sys_madvise we drop mmap_sem */ > @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are > if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE)) > return -EACCES; > > - mapping = vma->vm_file->f_mapping; > - > offset = (loff_t)(start - vma->vm_start) > + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); > - endoff = (loff_t)(end - vma->vm_start - 1) > - + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); > > - /* vmtruncate_range needs to take i_mutex */ > + /* filesystem's fallocate may need to take i_mutex */ > up_read(¤t->mm->mmap_sem); > - error = vmtruncate_range(mapping->host, offset, endoff); > + error = do_fallocate(vma->vm_file, > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + offset, end - start); > down_read(¤t->mm->mmap_sem); > return error; > } > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/