* invalidate the buffer heads of a block device

From: Thanos Makatos
Date: Mon, 29 Sep 2014 16:24 UTC
To: linux-fsdevel@vger.kernel.org
Cc: Ross Lagerwall, Felipe Franciosi

I'm looking for ways to achieve _read_ caching of a block device (usually
iSCSI) that is attached to multiple hosts. The problem is that sometimes
there will be some writes on that block device on a particular, known host
using O_DIRECT, and that requires all other hosts to invalidate their
buffer caches for that block device. I'd prefer to use standard
tools/procedures rather than hacking things, but if I do have to implement
something to solve this problem I'd prefer it to be something that can be
accepted upstream.

It looks like closing the block device guarantees that the buffer cache is
invalidated, and I can guarantee that all _my_ processes close the block
device to achieve this. However, if there is at least one other process I
don't know of that is keeping an open file handle against that block
device, the buffer cache won't be invalidated, and that will result in
corruption. So this solution doesn't seem to work.

I had a look at the BLKFLSBUF ioctl, which seems to be designed to do the
job, except that it doesn't work if a process has memory-mapped the block
device, and AFAIK there's no way to disallow memory mapping of a block
device. Again, that looks like a deal-breaker.

If the above observations are correct, it seems that I have to either
extend BLKFLSBUF to somehow invalidate such memory maps (I'm completely
ignorant in that field; is it even possible?), or look for other
solutions. Could some configuration of dm-cache, bcache, or some other
component solve my problem? posix_fadvise with POSIX_FADV_DONTNEED seems
to be just a hint to the kernel; there are no guarantees that cached data
will be discarded. I've also thought of using a virtual block device
driver that exclusively opens the actual network block device and lets
user-space applications use the virtualised one, so that I have more
control and can enforce things.

Are there any other documents/resources I could read?
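[For reference, a minimal C sketch of the BLKFLSBUF approach examined
above, as one might issue it from userspace; the device path is a
placeholder assumption and error handling is minimal:]

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKFLSBUF */

int main(void)
{
        int fd = open("/dev/sdb", O_RDONLY);    /* placeholder device path */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Writes back dirty buffers and drops clean cached pages for the
         * whole device -- but, as noted above, pages kept alive by a
         * process that has mmap()ed the device survive this. */
        if (ioctl(fd, BLKFLSBUF, 0) < 0)
                perror("ioctl(BLKFLSBUF)");
        close(fd);
        return 0;
}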
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Mon, 29 Sep 2014 18:11 UTC
To: Thanos Makatos
Cc: linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Mon 29-09-14 16:24:35, Thanos Makatos wrote:
> I'm looking for ways to achieve _read_ caching of a block device
> (usually iSCSI) that is attached to multiple hosts. The problem is that
> sometimes there will be some writes on that block device on a
> particular, known host using O_DIRECT, and that requires all other
> hosts to invalidate their buffer caches for that block device. I'd
> prefer to use standard tools/procedures rather than hacking things, but
> if I do have to implement something to solve this problem I'd prefer it
> to be something that can be accepted upstream.
>
> It looks like closing the block device guarantees that the buffer cache
> is invalidated, and I can guarantee that all _my_ processes close the
> block device to achieve this. However, if there is at least one other
> process I don't know of that is keeping an open file handle against
> that block device, the buffer cache won't be invalidated, and that will
> result in corruption. So this solution doesn't seem to work.

Well, I wouldn't really advise depending on the buffer cache being
flushed when all openers close the device.

> I had a look at the BLKFLSBUF ioctl, which seems to be designed to do
> the job, except that it doesn't work if a process has memory-mapped the
> block device, and AFAIK there's no way to disallow memory mapping of a
> block device. Again, that looks like a deal-breaker.

Yeah, plus it doesn't work when a page is dirty / under writeback,
although that doesn't seem to be an issue for your use case.

> If the above observations are correct, it seems that I have to either
> extend BLKFLSBUF to somehow invalidate such memory maps (I'm completely
> ignorant in that field; is it even possible?), or look for other
> solutions.

Well, you could unmap the pages of the block device that are mapped in
your new ioctl, but I don't think it's easily possible to disallow
userspace to fault the pages back behind your back. And that could be a
problem for you.

> Could some configuration of dm-cache, bcache, or some other component
> solve my problem? posix_fadvise with POSIX_FADV_DONTNEED seems to be
> just a hint to the kernel; there are no guarantees that cached data
> will be discarded. I've also thought of using a virtual block device
> driver that exclusively opens the actual network block device and lets
> user-space applications use the virtualised one, so that I have more
> control and can enforce things.

Hum, I'm not aware of any approach which would do what you need.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* RE: invalidate the buffer heads of a block device

From: Thanos Makatos
Date: Tue, 30 Sep 2014 08:58 UTC
To: 'Jan Kara'
Cc: linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

> > If the above observations are correct, it seems that I have to either
> > extend BLKFLSBUF to somehow invalidate such memory maps (I'm
> > completely ignorant in that field; is it even possible?), or look for
> > other solutions.
> Well, you could unmap the pages of the block device that are mapped in
> your new ioctl, but I don't think it's easily possible to disallow
> userspace to fault the pages back behind your back. And that could be a
> problem for you.

I'm not sure I understand what "I don't think it's easily possible to
disallow userspace to fault the pages back behind your back" means; could
you explain?

I definitely don't care what happens to the process that dared to
memory-map the block device; this configuration is not supported in my
environment. Is this what you mean?
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Tue, 30 Sep 2014 09:19 UTC
To: Thanos Makatos
Cc: 'Jan Kara', linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue 30-09-14 08:58:54, Thanos Makatos wrote:
> > > [...]
> > Well, you could unmap the pages of the block device that are mapped
> > in your new ioctl, but I don't think it's easily possible to disallow
> > userspace to fault the pages back behind your back. And that could be
> > a problem for you.
>
> I'm not sure I understand what "I don't think it's easily possible to
> disallow userspace to fault the pages back behind your back" means;
> could you explain?

I mean by that that even if you evict a page from the page cache, it can
be immediately reloaded from the device. So you have to be careful to
provide new data once you decide to go and evict the page cache. But
that's obvious, I guess.

> I definitely don't care what happens to the process that dared to
> memory-map the block device; this configuration is not supported in my
> environment. Is this what you mean?

OK, so it isn't supported, but you don't want such an application to be
able to screw up your other processes, am I right?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* RE: invalidate the buffer heads of a block device

From: Thanos Makatos
Date: Tue, 30 Sep 2014 09:39 UTC
To: 'Jan Kara'
Cc: linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

> On Tue 30-09-14 08:58:54, Thanos Makatos wrote:
> > [...]
> I mean by that that even if you evict a page from the page cache, it
> can be immediately reloaded from the device. So you have to be careful
> to provide new data once you decide to go and evict the page cache. But
> that's obvious, I guess.
>
> > I definitely don't care what happens to the process that dared to
> > memory-map the block device; this configuration is not supported in
> > my environment. Is this what you mean?
> OK, so it isn't supported, but you don't want such an application to be
> able to screw up your other processes, am I right?

I realise I haven't given enough context to my scenario, so I'll try to
explain myself better: there is a known set of processes that use the
block device in a very specific and predictable way (never via mmap), and
they are fully under my control. I don't care what happens to any other
process; I don't even care if it crashes, sees the wrong data, or
segfaults.

When I issue the ioctl (modified to evict pages from the page cache), I
can guarantee that all of my processes do not access the block device; I
can even force them to close it while the modified ioctl executes. You
say that "even if you evict a page from the page cache, it can be
immediately reloaded from the device"; does this mean that the "unknown"
process can access the block device via mmap and fetch "stale" pages
while I'm executing the ioctl, effectively undoing what the ioctl did? If
this is the case, the buffer cache will contain the wrong data and that
will result in corruption.
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Tue, 30 Sep 2014 09:55 UTC
To: Thanos Makatos
Cc: 'Jan Kara', linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue 30-09-14 09:39:17, Thanos Makatos wrote:
> [...]
>
> I realise I haven't given enough context to my scenario, so I'll try to
> explain myself better: there is a known set of processes that use the
> block device in a very specific and predictable way (never via mmap),
> and they are fully under my control. I don't care what happens to any
> other process; I don't even care if it crashes, sees the wrong data, or
> segfaults.

OK.

> When I issue the ioctl (modified to evict pages from the page cache), I
> can guarantee that all of my processes do not access the block device;
> I can even force them to close it while the modified ioctl executes.
> You say that "even if you evict a page from the page cache, it can be
> immediately reloaded from the device"; does this mean that the
> "unknown" process can access the block device via mmap and fetch
> "stale" pages while I'm executing the ioctl, effectively undoing what
> the ioctl did? If this is the case, the buffer cache will contain the
> wrong data and that will result in corruption.

Yes, the "unknown" process can read data back into the page cache while
your ioctl is running, thus undoing your work. But if the writes are
already visible on the *device* at the moment you run the ioctl, then the
"unknown" process will just fetch new data and everything is fine... If
you need to evict the page cache *before* new data is visible on the
device, then you need to suspend the device first so that it doesn't
serve any IO, then evict the page cache, then make the new data visible
on the device, and finally resume the device. Suspend/resume of the
device can be handled by device mapper (it does these tricks when you are
changing the topology of the device on the fly).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
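[A small C sketch of the simple ordering Jan describes: the writer host
makes its O_DIRECT write durable before any reader host evicts its cache.
The device path, block size, and the use of BLKFLSBUF as the eviction
call are illustrative assumptions; error handling is omitted for
brevity:]

#define _GNU_SOURCE     /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define BLKSZ 4096

/* Writer host: O_DIRECT write, then make sure it has reached the device
 * before telling any reader to invalidate. */
static void writer_update(const char *dev)
{
        void *buf = NULL;
        int fd = open(dev, O_WRONLY | O_DIRECT);

        posix_memalign(&buf, BLKSZ, BLKSZ);     /* O_DIRECT needs alignment */
        memset(buf, 0xab, BLKSZ);
        pwrite(fd, buf, BLKSZ, 0);
        fdatasync(fd);          /* the write is now visible on the device */
        close(fd);
        free(buf);
}

/* Reader host: runs strictly after writer_update() has returned, so the
 * eviction cannot be undone by stale data still in flight. */
static void reader_invalidate(const char *dev)
{
        int fd = open(dev, O_RDONLY);

        ioctl(fd, BLKFLSBUF, 0);   /* or the extended ioctl discussed below */
        close(fd);
}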
* RE: invalidate the buffer heads of a block device

From: Thanos Makatos
Date: Tue, 30 Sep 2014 10:11 UTC
To: 'Jan Kara'
Cc: linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

> Yes, the "unknown" process can read data back into the page cache while
> your ioctl is running, thus undoing your work. But if the writes are
> already visible on the *device* at the moment you run the ioctl, then
> the "unknown" process will just fetch new data and everything is
> fine... If you need to evict the page cache *before* new data is
> visible on the device, then you need to suspend the device first so
> that it doesn't serve any IO, then evict the page cache, then make the
> new data visible on the device, and finally resume the device.
> Suspend/resume of the device can be handled by device mapper (it does
> these tricks when you are changing the topology of the device on the
> fly).

The writes will be visible to the device *before* the ioctl runs, so I
have one less thing to worry about!

Regarding extending the ioctl to invalidate the page cache, do you have
any suggestions where I could start looking? Would such a new ioctl have
any chance to be accepted upstream?
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Tue, 30 Sep 2014 10:48 UTC
To: Thanos Makatos
Cc: 'Jan Kara', linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue 30-09-14 10:11:32, Thanos Makatos wrote:
> > [...]
>
> The writes will be visible to the device *before* the ioctl runs, so I
> have one less thing to worry about!
>
> Regarding extending the ioctl to invalidate the page cache, do you have
> any suggestions where I could start looking?

You just need to call invalidate_inode_pages2(). That is going to do all
you need.

> Would such a new ioctl have any chance to be accepted upstream?

I believe a possibility for a file to be fully flushed from the page
cache is useful at times, and if you present your usecase well there are
reasonable chances it will get accepted upstream.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
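[A rough kernel-side sketch of Jan's suggestion: a handler that evicts a
block device's page cache via invalidate_inode_pages2(). The function
name and the idea of wiring it up as a new ioctl are assumptions for
illustration, not an existing interface:]

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical handler for an extended "invalidate pages" ioctl on a
 * block device; the ioctl number itself is left out of this sketch. */
static int blkdev_invalidate_pages(struct block_device *bdev)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;

        /* Unlike the BLKFLSBUF path, invalidate_inode_pages2() also
         * unmaps pages that userspace has mmap()ed, so they cannot keep
         * stale data alive; it returns an error (-EBUSY) if a page
         * cannot be invalidated. */
        return invalidate_inode_pages2(mapping);
}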
* Re: invalidate the buffer heads of a block device

From: Zach Brown
Date: Tue, 30 Sep 2014 20:53 UTC
To: Jan Kara
Cc: Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue, Sep 30, 2014 at 12:48:45PM +0200, Jan Kara wrote:
> On Tue 30-09-14 10:11:32, Thanos Makatos wrote:
> >
> > Regarding extending the ioctl to invalidate the page cache, do you
> > have any suggestions where I could start looking?
> You just need to call invalidate_inode_pages2(). That is going to do
> all you need.
>
> > Would such a new ioctl have any chance to be accepted upstream?
> I believe a possibility for a file to be fully flushed from the page
> cache is useful at times, and if you present your usecase well there
> are reasonable chances it will get accepted upstream.

Agreed, this seems reasonable. How many times have we all dropped our
entire cache just 'cause we didn't have a more precise tool?

$ grep -ri drop_caches xfstests/
xfstests/src/fsync-tester.c:    if ((fd = open("/proc/sys/vm/drop_caches", O_WRONLY)) < 0) {
xfstests/src/stale_handle.c:    system("echo 3 > /proc/sys/vm/drop_caches");
xfstests/common/quota:          echo 3 > /proc/sys/vm/drop_caches

The last one even says:

        # XXX: really need an ioctl instead of this big hammer
        echo 3 > /proc/sys/vm/drop_caches

:)

- z
* Re: invalidate the buffer heads of a block device

From: Trond Myklebust
Date: Tue, 30 Sep 2014 21:13 UTC
To: Zach Brown
Cc: Jan Kara, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue, Sep 30, 2014 at 4:53 PM, Zach Brown <zab@zabbo.net> wrote:
> On Tue, Sep 30, 2014 at 12:48:45PM +0200, Jan Kara wrote:
>> On Tue 30-09-14 10:11:32, Thanos Makatos wrote:
>> >
>> > Regarding extending the ioctl to invalidate the page cache, do you
>> > have any suggestions where I could start looking?
>> You just need to call invalidate_inode_pages2(). That is going to do
>> all you need.
>>
>> > Would such a new ioctl have any chance to be accepted upstream?
>> I believe a possibility for a file to be fully flushed from the page
>> cache is useful at times, and if you present your usecase well there
>> are reasonable chances it will get accepted upstream.
>
> Agreed, this seems reasonable. How many times have we all dropped our
> entire cache just 'cause we didn't have a more precise tool?
>
> $ grep -ri drop_caches xfstests/
> xfstests/src/fsync-tester.c:    if ((fd = open("/proc/sys/vm/drop_caches", O_WRONLY)) < 0) {
> xfstests/src/stale_handle.c:    system("echo 3 > /proc/sys/vm/drop_caches");
> xfstests/common/quota:          echo 3 > /proc/sys/vm/drop_caches
>
> The last one even says:
>
>         # XXX: really need an ioctl instead of this big hammer
>         echo 3 > /proc/sys/vm/drop_caches
>
> :)

It would definitely be useful for NFS; however, we'd want the option of
clearing the cached metadata too (acls, mode bits, owner/group owner,
etc.).

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Wed, 1 Oct 2014 09:05 UTC
To: Trond Myklebust
Cc: Zach Brown, Jan Kara, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Tue 30-09-14 17:13:19, Trond Myklebust wrote:
> On Tue, Sep 30, 2014 at 4:53 PM, Zach Brown <zab@zabbo.net> wrote:
> > [...]
>
> It would definitely be useful for NFS; however, we'd want the option of
> clearing the cached metadata too (acls, mode bits, owner/group owner,
> etc.).

Hum, I can imagine you might be somehow successful in flushing ACLs, but
how would you like to flush mode or owner? It's not like you can just
reload the inode from disk and overwrite what you have in memory. What
would look sane is to push the inode with all the metadata it caches out
of memory, but once someone holds the inode (and the ioctl() itself will
have a file handle of the inode), you just cannot. So I'm not sure if you
meant something different or if this was just a wish without deeper
thought.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: invalidate the buffer heads of a block device

From: Trond Myklebust
Date: Wed, 1 Oct 2014 11:50 UTC
To: Jan Kara
Cc: Zach Brown, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Wed, Oct 1, 2014 at 5:05 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 30-09-14 17:13:19, Trond Myklebust wrote:
>> [...]
>> It would definitely be useful for NFS; however, we'd want the option
>> of clearing the cached metadata too (acls, mode bits, owner/group
>> owner, etc.).
> Hum, I can imagine you might be somehow successful in flushing ACLs,
> but how would you like to flush mode or owner? It's not like you can
> just reload the inode from disk and overwrite what you have in memory.
> What would look sane is to push the inode with all the metadata it
> caches out of memory, but once someone holds the inode (and the ioctl()
> itself will have a file handle of the inode), you just cannot. So I'm
> not sure if you meant something different or if this was just a wish
> without deeper thought.

You can and you _must_ reload the inode if it changes on the server.
That's what makes distributed filesystems "interesting" as far as
caching goes. What we do in the cases where we think the file metadata
may have changed is to mark the existing cached metadata as being stale,
so that we know that we need to revalidate before trying to use it.

So being able to tell in the ioctl() whether or not the application
thinks the data or metadata (or both) may have changed on the remote
disk is actually a very useful feature if you are trying to do things
like distributed compiles.

I've had an experimental NFS-only ioctl()-based caching interface
kicking around in my git repository for a couple of years now
(http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/ioctl).
If we are planning on doing something at the VFS level, then it would be
nice to at least duplicate the functionality from the first 2 patches.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
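[For the staleness marking Trond describes, a rough sketch based on the
NFS client internals of that era: setting invalid bits on the inode so
the next access revalidates the attributes against the server. The
function name is made up, and this is not the code from the experimental
branch linked above:]

#include <linux/fs.h>
#include <linux/nfs_fs.h>

/* Hypothetical helper: flag the inode's cached attributes, access
 * results, and ACLs as stale; the next user triggers a GETATTR-style
 * revalidation instead of trusting the cache. */
static void nfs_mark_metadata_stale(struct inode *inode)
{
        spin_lock(&inode->i_lock);
        NFS_I(inode)->cache_validity |= NFS_INO_INVALID_ATTR |
                                        NFS_INO_INVALID_ACCESS |
                                        NFS_INO_INVALID_ACL;
        spin_unlock(&inode->i_lock);
}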
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Wed, 1 Oct 2014 14:07 UTC
To: Trond Myklebust
Cc: Jan Kara, Zach Brown, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Wed 01-10-14 07:50:21, Trond Myklebust wrote:
> On Wed, Oct 1, 2014 at 5:05 AM, Jan Kara <jack@suse.cz> wrote:
> > [...]
> > Hum, I can imagine you might be somehow successful in flushing ACLs,
> > but how would you like to flush mode or owner? It's not like you can
> > just reload the inode from disk and overwrite what you have in
> > memory. What would look sane is to push the inode with all the
> > metadata it caches out of memory, but once someone holds the inode
> > (and the ioctl() itself will have a file handle of the inode), you
> > just cannot. So I'm not sure if you meant something different or if
> > this was just a wish without deeper thought.
>
> You can and you _must_ reload the inode if it changes on the server.
> That's what makes distributed filesystems "interesting" as far as
> caching goes. What we do in the cases where we think the file metadata
> may have changed is to mark the existing cached metadata as being
> stale, so that we know that we need to revalidate before trying to use
> it.

Ah OK, I somehow thought about revalidation automatically happening for
any filesystem and found that impossible. I can imagine that doing this
specifically for NFS, which is designed to handle such things, is
possible :)

> So being able to tell in the ioctl() whether or not the application
> thinks the data or metadata (or both) may have changed on the remote
> disk is actually a very useful feature if you are trying to do things
> like distributed compiles.
>
> I've had an experimental NFS-only ioctl()-based caching interface
> kicking around in my git repository for a couple of years now
> (http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/ioctl).
> If we are planning on doing something at the VFS level, then it would
> be nice to at least duplicate the functionality from the first 2
> patches.

Thanks for the pointer. The first two patches look useful and I think we
can design the ioctl interface to accommodate the NFS usecase (although
most filesystems will just return EOPNOTSUPP in case someone asks for
metadata invalidation).

The interface in your patches looks mostly OK, I'd just rather use
'flags' than 'cmd'. So something like:

        struct invalidate_arg {
                u64 flags;
                u64 start;
                u64 len;
        };

Where 'flags' can be:

        INVAL_METADATA - invalidate the inode itself, acls, etc., if the
                         fs supports it (this can be more fine-grained
                         if that's useful. Is it?)
        INVAL_DATA     - invalidate data in the range <start, start+len)

We can have a fs callback that gets called for this ioctl to do the
work; most filesystems would just leave it at NULL, which would fall
back to the default function which just handles INVAL_DATA and returns
EOPNOTSUPP if INVAL_METADATA is set. NFS could then implement its own
thing for INVAL_METADATA and call the generic function for handling
INVAL_DATA. Thoughts?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
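[A sketch of how the proposed interface might be encoded, with the
default VFS-level behaviour Jan outlines. The flag values, the __u64
spellings, and the helper name are assumptions layered on the struct
above; only the struct layout and the flag semantics come from the
discussion:]

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/types.h>

#define INVAL_METADATA  (1ULL << 0)     /* inode itself, acls, ... */
#define INVAL_DATA      (1ULL << 1)     /* page cache in <start, start+len) */

struct invalidate_arg {
        __u64 flags;
        __u64 start;    /* byte offset */
        __u64 len;      /* byte length */
};

/* Default behaviour when a filesystem supplies no callback: handle
 * INVAL_DATA, refuse INVAL_METADATA. Range arithmetic assumes
 * start + len does not overflow; a real patch would validate this. */
static int generic_invalidate(struct inode *inode,
                              const struct invalidate_arg *arg)
{
        if (arg->flags & INVAL_METADATA)
                return -EOPNOTSUPP;
        if ((arg->flags & INVAL_DATA) && arg->len)
                return invalidate_inode_pages2_range(inode->i_mapping,
                                arg->start >> PAGE_SHIFT,
                                (arg->start + arg->len - 1) >> PAGE_SHIFT);
        return 0;
}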
* Re: invalidate the buffer heads of a block device

From: Trond Myklebust
Date: Wed, 1 Oct 2014 14:47 UTC
To: Jan Kara
Cc: Zach Brown, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Wed, Oct 1, 2014 at 10:07 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 01-10-14 07:50:21, Trond Myklebust wrote:
>> [...]
>
> Ah OK, I somehow thought about revalidation automatically happening for
> any filesystem and found that impossible. I can imagine that doing this
> specifically for NFS, which is designed to handle such things, is
> possible :)
>
>> I've had an experimental NFS-only ioctl()-based caching interface
>> kicking around in my git repository for a couple of years now
>> (http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/ioctl).
>> If we are planning on doing something at the VFS level, then it would
>> be nice to at least duplicate the functionality from the first 2
>> patches.
> Thanks for the pointer. The first two patches look useful and I think
> we can design the ioctl interface to accommodate the NFS usecase
> (although most filesystems will just return EOPNOTSUPP in case someone
> asks for metadata invalidation).
>
> The interface in your patches looks mostly OK, I'd just rather use
> 'flags' than 'cmd'. So something like:
>
>         struct invalidate_arg {
>                 u64 flags;
>                 u64 start;
>                 u64 len;
>         };
>
> Where 'flags' can be:
>
>         INVAL_METADATA - invalidate the inode itself, acls, etc., if
>                          the fs supports it (this can be more
>                          fine-grained if that's useful. Is it?)
>         INVAL_DATA     - invalidate data in the range <start, start+len)

If these are flags rather than a command, then is the intention that
they can be ORed together, so that you can force a metadata+data
revalidate in a single function call?

> We can have a fs callback that gets called for this ioctl to do the
> work; most filesystems would just leave it at NULL, which would fall
> back to the default function which just handles INVAL_DATA and returns
> EOPNOTSUPP if INVAL_METADATA is set. NFS could then implement its own
> thing for INVAL_METADATA and call the generic function for handling
> INVAL_DATA. Thoughts?

Sounds fine to me.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
* Re: invalidate the buffer heads of a block device

From: Jan Kara
Date: Wed, 1 Oct 2014 15:15 UTC
To: Trond Myklebust
Cc: Jan Kara, Zach Brown, Thanos Makatos, linux-fsdevel@vger.kernel.org, Ross Lagerwall, Felipe Franciosi

On Wed 01-10-14 10:47:05, Trond Myklebust wrote:
> On Wed, Oct 1, 2014 at 10:07 AM, Jan Kara <jack@suse.cz> wrote:
> > [...]
> >
> > Where 'flags' can be:
> >
> >         INVAL_METADATA - invalidate the inode itself, acls, etc., if
> >                          the fs supports it (this can be more
> >                          fine-grained if that's useful. Is it?)
> >         INVAL_DATA     - invalidate data in the range <start, start+len)
>
> If these are flags rather than a command, then is the intention that
> they can be ORed together, so that you can force a metadata+data
> revalidate in a single function call?

Yes, that is my intention.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
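[Given that the flags are intended to be ORed together, a hypothetical
userspace caller forcing a combined data-plus-metadata revalidation might
look like this. The ioctl number, the file path, and the definitions
merely restate the illustrative sketch above; none of this is a shipped
kernel ABI:]

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Re-stated illustrative definitions from the sketch earlier in the
 * thread. */
struct invalidate_arg {
        uint64_t flags;
        uint64_t start;
        uint64_t len;
};

#define INVAL_METADATA  (1ULL << 0)
#define INVAL_DATA      (1ULL << 1)

/* Made-up ioctl number, for the sketch only. */
#define FS_IOC_INVALIDATE _IOW('f', 0x42, struct invalidate_arg)

int main(void)
{
        struct invalidate_arg arg = {
                .flags = INVAL_METADATA | INVAL_DATA,   /* flags OR together */
                .start = 0,
                .len   = UINT64_MAX,    /* effectively the whole file */
        };
        int fd = open("/mnt/nfs/shared.dat", O_RDONLY); /* placeholder path */

        if (fd < 0 || ioctl(fd, FS_IOC_INVALIDATE, &arg) < 0)
                perror("invalidate");
        if (fd >= 0)
                close(fd);
        return 0;
}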