* [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS @ 2011-04-27 18:13 Andrea Righi [not found] ` <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> 0 siblings, 1 reply; 10+ messages in thread From: Andrea Righi @ 2011-04-27 18:13 UTC (permalink / raw) To: Andrew Morton Cc: Dave Chinner, Mike Frysinger, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Introduce a new fadvise flag to drop page cache pages of a single filesystem. At the moment it is possible to drop page cache pages via /proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED). The first method drops the whole page cache while the second can be used to drop page cache pages of a single file descriptor. However, there's not a simple way to drop all the pages of a filesystem (we could scan all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but this solution obviously doesn't scale well). This functionality requires root privilege to avoid potential DoS in the system (i.e., a hard loop of posix_fadvise(POSIX_FADV_DONTNEED_FS) on the root filesystem). A practical example: # ls -lh /mnt/sda/zero /mnt/sdb/zero -rw-r--r-- 1 root root 16M 2011-04-20 10:20 /mnt/sda/zero -rw-r--r-- 1 root root 16M 2011-04-20 10:20 /mnt/sdb/zero $ grep ^Cached /proc/meminfo Cached: 5660 kB $ md5sum /mnt/sda/zero /mnt/sdb/zero 2c7ab85a893283e98c931e9511add182 /mnt/sda/zero 2c7ab85a893283e98c931e9511add182 /mnt/sdb/zero $ grep ^Cached /proc/meminfo Cached: 38544 kB $ sudo ./drop-pagecache /mnt/sda/ $ grep ^Cached /proc/meminfo Cached: 22440 kB $ sudo ./drop-pagecache /mnt/sdb/ $ grep ^Cached /proc/meminfo Cached: 5056 kB A previous RFC about this topic can be found here: http://marc.info/?l=linux-kernel&m=130385374902114&w=2 ChangeLog (v1 -> v2): * use the same value for POSIX_FADV_DONTNEED_FS on all architectures * check CAP_SYS_ADMIN capability instead of checking the EUID value Signed-off-by: Andrea Righi <andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> --- fs/drop_caches.c | 2 +- include/linux/fadvise.h | 1 + include/linux/mm.h | 2 ++ mm/fadvise.c | 7 +++++++ 4 files changed, 11 insertions(+), 1 deletions(-) diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 98b77c8..59d6caa 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -13,7 +13,7 @@ /* A global variable is a bit ugly, but it keeps the code simple */ int sysctl_drop_caches; -static void drop_pagecache_sb(struct super_block *sb, void *unused) +void drop_pagecache_sb(struct super_block *sb, void *unused) { struct inode *inode, *toput_inode = NULL; diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..ab39117 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -17,5 +17,6 @@ #define POSIX_FADV_DONTNEED 4 /* Don't need these pages. */ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */ #endif +#define POSIX_FADV_DONTNEED_FS 8 /* Don't need these filesystem pages. */ #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 692dbae..004cdbc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -21,6 +21,7 @@ struct anon_vma; struct file_ra_state; struct user_struct; struct writeback_control; +struct super_block; #ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */ extern unsigned long max_mapnr; @@ -1602,6 +1603,7 @@ int in_gate_area_no_mm(unsigned long addr); #define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_mm(addr);}) #endif /* __HAVE_ARCH_GATE_AREA */ +void drop_pagecache_sb(struct super_block *sb, void *unused); int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, diff --git a/mm/fadvise.c b/mm/fadvise.c index 8d723c9..15155e7 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -57,6 +57,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) case POSIX_FADV_WILLNEED: case POSIX_FADV_NOREUSE: case POSIX_FADV_DONTNEED: + case POSIX_FADV_DONTNEED_FS: /* no bad return value, but ignore advice */ break; default: @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + case POSIX_FADV_DONTNEED_FS: + if (capable(CAP_SYS_ADMIN)) + drop_pagecache_sb(file->f_dentry->d_sb, NULL); + else + ret = -EPERM; + break; default: ret = -EINVAL; } -- 1.7.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
[parent not found: <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org>]
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> @ 2011-04-27 18:25 ` Mike Frysinger 2011-04-28 9:35 ` Andrea Righi 2011-04-27 18:33 ` Matthew Wilcox 2011-05-04 21:44 ` Andrew Morton 2 siblings, 1 reply; 10+ messages in thread From: Mike Frysinger @ 2011-04-27 18:25 UTC (permalink / raw) To: Andrea Righi Cc: Andrew Morton, Dave Chinner, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Apr 27, 2011 at 14:13, Andrea Righi wrote: > Introduce a new fadvise flag to drop page cache pages of a single > filesystem. > > At the moment it is possible to drop page cache pages via > /proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED). > > The first method drops the whole page cache while the second can be used > to drop page cache pages of a single file descriptor. However, there's > not a simple way to drop all the pages of a filesystem (we could scan > all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but > this solution obviously doesn't scale well). what if you open the mount point and use POSIX_FADV_DONTNEED on that dir handle ? if you required write access for that level, it'd also implicitly take care of the permission issue. but maybe this is just trying to fit existing code in the wrong way. -mike ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS 2011-04-27 18:25 ` Mike Frysinger @ 2011-04-28 9:35 ` Andrea Righi 0 siblings, 0 replies; 10+ messages in thread From: Andrea Righi @ 2011-04-28 9:35 UTC (permalink / raw) To: Mike Frysinger Cc: Andrew Morton, Dave Chinner, Al Viro, Arnd Bergmann, linux-fsdevel, linux-api, linux-kernel On Wed, Apr 27, 2011 at 02:25:17PM -0400, Mike Frysinger wrote: > On Wed, Apr 27, 2011 at 14:13, Andrea Righi wrote: > > Introduce a new fadvise flag to drop page cache pages of a single > > filesystem. > > > > At the moment it is possible to drop page cache pages via > > /proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED). > > > > The first method drops the whole page cache while the second can be used > > to drop page cache pages of a single file descriptor. However, there's > > not a simple way to drop all the pages of a filesystem (we could scan > > all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but > > this solution obviously doesn't scale well). > > what if you open the mount point and use POSIX_FADV_DONTNEED on that > dir handle ? if you required write access for that level, it'd also > implicitly take care of the permission issue. but maybe this is just > trying to fit existing code in the wrong way. > -mike I still prefer the capability check. I think it's much more simple from the userspace point of view to be able to specify any file or directory instead of being forced to retrieve the mountpoint. However, an advantage with the approach you're proposing is that a non-privileged user can drop the page cache of a filesystem if it has write permission in the root of that filesystem. mmmh.. I don't see big problems also with the interface you propose, if you all think it's better I can implement this in the next version. Thanks, -Andrea ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> 2011-04-27 18:25 ` Mike Frysinger @ 2011-04-27 18:33 ` Matthew Wilcox [not found] ` <20110427183308.GA16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> 2011-04-28 9:29 ` Andrea Righi 2011-05-04 21:44 ` Andrew Morton 2 siblings, 2 replies; 10+ messages in thread From: Matthew Wilcox @ 2011-04-27 18:33 UTC (permalink / raw) To: Andrea Righi Cc: Andrew Morton, Dave Chinner, Mike Frysinger, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote: > @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) > invalidate_mapping_pages(mapping, start_index, > end_index); > break; > + case POSIX_FADV_DONTNEED_FS: > + if (capable(CAP_SYS_ADMIN)) > + drop_pagecache_sb(file->f_dentry->d_sb, NULL); > + else > + ret = -EPERM; > + break; > default: > ret = -EINVAL; > } Mmm ... what if I open /dev/sdxyz and call fadvise() on it? I think you end up flushing /dev's page cache entries, instead of the filesystem which is on /dev/sdxyz. If I understand correctly, you want mapping->host->i_sb instead of file->f_dentry->d_sb. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20110427183308.GA16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org>]
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <20110427183308.GA16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> @ 2011-04-27 18:39 ` Mike Frysinger 2011-04-27 18:47 ` Matthew Wilcox 0 siblings, 1 reply; 10+ messages in thread From: Mike Frysinger @ 2011-04-27 18:39 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrea Righi, Andrew Morton, Dave Chinner, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Apr 27, 2011 at 14:33, Matthew Wilcox wrote: > On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote: >> @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) >> invalidate_mapping_pages(mapping, start_index, >> end_index); >> break; >> + case POSIX_FADV_DONTNEED_FS: >> + if (capable(CAP_SYS_ADMIN)) >> + drop_pagecache_sb(file->f_dentry->d_sb, NULL); >> + else >> + ret = -EPERM; >> + break; >> default: >> ret = -EINVAL; >> } > > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? I think > you end up flushing /dev's page cache entries, instead of the filesystem > which is on /dev/sdxyz. i was thinking of that, but was trying to come up with situations where there might not have a node to work on. fs's in a file go through loop devs, dm/lvm have ones created, and flash fs's still have a mtd block. how about network based fs's ? how you going to signal dropping of pages for nfs or cifs or fuse ones ? -mike ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS 2011-04-27 18:39 ` Mike Frysinger @ 2011-04-27 18:47 ` Matthew Wilcox [not found] ` <20110427184756.GB16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> 0 siblings, 1 reply; 10+ messages in thread From: Matthew Wilcox @ 2011-04-27 18:47 UTC (permalink / raw) To: Mike Frysinger Cc: Andrea Righi, Andrew Morton, Dave Chinner, Al Viro, Arnd Bergmann, linux-fsdevel, linux-api, linux-kernel On Wed, Apr 27, 2011 at 02:39:53PM -0400, Mike Frysinger wrote: > > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? ??I think > > you end up flushing /dev's page cache entries, instead of the filesystem > > which is on /dev/sdxyz. > > i was thinking of that, but was trying to come up with situations > where there might not have a node to work on. fs's in a file go > through loop devs, dm/lvm have ones created, and flash fs's still have > a mtd block. how about network based fs's ? how you going to signal > dropping of pages for nfs or cifs or fuse ones ? For a regular file, mapping->host->i_sb points to the superblock this file is on. For a device, mapping->host->i_sb points to the superblock corresponding to this device. So it's always what we want. (hm, what about block devices not currently mounted? do we need to check whether mapping->host is NULL?) -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20110427184756.GB16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org>]
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <20110427184756.GB16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> @ 2011-04-27 18:49 ` Mike Frysinger 0 siblings, 0 replies; 10+ messages in thread From: Mike Frysinger @ 2011-04-27 18:49 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrea Righi, Andrew Morton, Dave Chinner, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Apr 27, 2011 at 14:47, Matthew Wilcox wrote: > On Wed, Apr 27, 2011 at 02:39:53PM -0400, Mike Frysinger wrote: >> > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? ??I think >> > you end up flushing /dev's page cache entries, instead of the filesystem >> > which is on /dev/sdxyz. >> >> i was thinking of that, but was trying to come up with situations >> where there might not have a node to work on. fs's in a file go >> through loop devs, dm/lvm have ones created, and flash fs's still have >> a mtd block. how about network based fs's ? how you going to signal >> dropping of pages for nfs or cifs or fuse ones ? > > For a regular file, mapping->host->i_sb points to the superblock this > file is on. For a device, mapping->host->i_sb points to the superblock > corresponding to this device. So it's always what we want. sorry, wrong question. i misread your original post (suggesting we should be calling fadvise on the block instead of an arbitrary dir handle). -mike ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS 2011-04-27 18:33 ` Matthew Wilcox [not found] ` <20110427183308.GA16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> @ 2011-04-28 9:29 ` Andrea Righi 1 sibling, 0 replies; 10+ messages in thread From: Andrea Righi @ 2011-04-28 9:29 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, Dave Chinner, Mike Frysinger, Al Viro, Arnd Bergmann, linux-fsdevel, linux-api, linux-kernel On Wed, Apr 27, 2011 at 12:33:08PM -0600, Matthew Wilcox wrote: > On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote: > > @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) > > invalidate_mapping_pages(mapping, start_index, > > end_index); > > break; > > + case POSIX_FADV_DONTNEED_FS: > > + if (capable(CAP_SYS_ADMIN)) > > + drop_pagecache_sb(file->f_dentry->d_sb, NULL); > > + else > > + ret = -EPERM; > > + break; > > default: > > ret = -EINVAL; > > } > > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? I think > you end up flushing /dev's page cache entries, instead of the filesystem > which is on /dev/sdxyz. > > If I understand correctly, you want mapping->host->i_sb instead of > file->f_dentry->d_sb. I did some tests, but I don't get the expected behaviour. In all cases both if I use mapping->host->i_sb or file->f_dentry->d_sb when I call fadvise() on any block device all the blockdev pages are dropped ("Buffers" from /proc/meminfo), but page cache pages are not touched: # df -hT / Filesystem Type Size Used Avail Use% Mounted on /dev/sda1 ext4 29G 20G 7.4G 73% / # grep "^Cached\|Buffers" /proc/meminfo Buffers: 79772 kB Cached: 32440 kB # sudo drop-pagecache /dev/sda1 # grep "^Cached\|Buffers" /proc/meminfo Buffers: 228 kB Cached: 32440 kB # sudo drop-pagecache / # grep "^Cached\|Buffers" /proc/meminfo Buffers: 228 kB Cached: 4884 kB Thanks, -Andrea ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> 2011-04-27 18:25 ` Mike Frysinger 2011-04-27 18:33 ` Matthew Wilcox @ 2011-05-04 21:44 ` Andrew Morton [not found] ` <20110504144411.7f32c00c.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2 siblings, 1 reply; 10+ messages in thread From: Andrew Morton @ 2011-05-04 21:44 UTC (permalink / raw) To: Andrea Righi Cc: Dave Chinner, Mike Frysinger, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, 27 Apr 2011 20:13:47 +0200 Andrea Righi <andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> wrote: > Introduce a new fadvise flag to drop page cache pages of a single > filesystem. I'm going to object to this on general principle. We shouldn't toss new features into the kernel API just because we can. Each feature needs a really good argument to justify its inclusion, and I don't believe that the case has been made for this feature. IOW, who's going to die if we don't merge it? ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20110504144411.7f32c00c.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS [not found] ` <20110504144411.7f32c00c.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> @ 2011-05-04 22:09 ` Andrea Righi 0 siblings, 0 replies; 10+ messages in thread From: Andrea Righi @ 2011-05-04 22:09 UTC (permalink / raw) To: Andrew Morton Cc: Dave Chinner, Mike Frysinger, Al Viro, Arnd Bergmann, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, May 04, 2011 at 02:44:11PM -0700, Andrew Morton wrote: > On Wed, 27 Apr 2011 20:13:47 +0200 > Andrea Righi <andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> wrote: > > > Introduce a new fadvise flag to drop page cache pages of a single > > filesystem. > > I'm going to object to this on general principle. We shouldn't toss > new features into the kernel API just because we can. Each feature > needs a really good argument to justify its inclusion, and I don't > believe that the case has been made for this feature. > > IOW, who's going to die if we don't merge it? OK. :) -Andrea ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-05-04 22:09 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-04-27 18:13 [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS Andrea Righi [not found] ` <1303928027-5100-1-git-send-email-andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org> 2011-04-27 18:25 ` Mike Frysinger 2011-04-28 9:35 ` Andrea Righi 2011-04-27 18:33 ` Matthew Wilcox [not found] ` <20110427183308.GA16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> 2011-04-27 18:39 ` Mike Frysinger 2011-04-27 18:47 ` Matthew Wilcox [not found] ` <20110427184756.GB16716-6jwH94ZQLHl74goWV3ctuw@public.gmane.org> 2011-04-27 18:49 ` Mike Frysinger 2011-04-28 9:29 ` Andrea Righi 2011-05-04 21:44 ` Andrew Morton [not found] ` <20110504144411.7f32c00c.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2011-05-04 22:09 ` Andrea Righi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).