* Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Daegyu Han @ 2019-09-16 14:33 UTC
To: linux-fsdevel

Hi Linux file system experts,

I want to share an ext4 file system on a storage server with multiple
initiators (Nodes A and B) using NVMeoF.
Node A will write files to ext4 on the storage server, and I will mount
it read-only on Node B.

Actually, the reason I am doing this is for a prototype test.

On Node B, I can't see the dentry and inode of a file written on Node A
unless I remount (umount and then mount) the file system.

Why is that?

I think if there is a file system cache (dentry, inode) on Node B, then
disk I/O will occur to read the data written by Node A.

Curiously, if I drop caches on Node B and run blockdev --flushbufs, then
I can access the file written by Node A.

I checked the kernel code and found that flushbufs invokes
sync_filesystem(), which flushes the superblock and all dirty file
system caches.

Should the superblock data structure be flushed (updated) when
accessing the disk inode?

I wonder why this happens.

Regards,

* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Eric Sandeen @ 2019-09-16 19:23 UTC
To: Daegyu Han, linux-fsdevel

On 9/16/19 9:33 AM, Daegyu Han wrote:
> Hi Linux file system experts,
>
> I want to share an ext4 file system on a storage server with multiple
> initiators (Nodes A and B) using NVMeoF.
> Node A will write files to ext4 on the storage server, and I will mount
> it read-only on Node B.
>
> Actually, the reason I am doing this is for a prototype test.
>
> On Node B, I can't see the dentry and inode of a file written on Node A
> unless I remount (umount and then mount) the file system.
>
> Why is that?

Caching, metadata journaling, etc.

What you are trying to do will not work.

> I think if there is a file system cache (dentry, inode) on Node B, then
> disk I/O will occur to read the data written by Node A.

Why would it? There is no coordination between the nodes. ext4 is
not a clustered filesystem.

> Curiously, if I drop caches on Node B and run blockdev --flushbufs, then
> I can access the file written by Node A.
>
> I checked the kernel code and found that flushbufs invokes
> sync_filesystem(), which flushes the superblock and all dirty file
> system caches.
>
> Should the superblock data structure be flushed (updated) when
> accessing the disk inode?

It has nothing to do with the superblock.

> I wonder why this happens.

ext4 cannot be used for what you're trying to do.

-Eric

* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Daegyu Han @ 2019-09-17 0:44 UTC
To: Eric Sandeen
Cc: linux-fsdevel

It started with my curiosity.
I know this is not the right way to use a local filesystem, and some
people may find it strange.
I just wanted to set up the situation and experiment with it.

I thought it would work if I flushed Node B's cached file system
metadata by dropping the caches, but it didn't.

I googled for something other than the unmount and mount process,
and I saw a StackOverflow article suggesting blockdev --flushbufs to
make the file system sync.

So I ran blockdev --flushbufs after dropping the caches.
However, I still do not know why I can then read the data stored in the
shared storage via Node B.

Thank you,

* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Theodore Y. Ts'o @ 2019-09-17 12:54 UTC
To: Daegyu Han
Cc: Eric Sandeen, linux-fsdevel

On Tue, Sep 17, 2019 at 09:44:00AM +0900, Daegyu Han wrote:
> It started with my curiosity.
> I know this is not the right way to use a local filesystem, and some
> people may find it strange.
> I just wanted to set up the situation and experiment with it.
>
> I thought it would work if I flushed Node B's cached file system
> metadata by dropping the caches, but it didn't.
>
> I googled for something other than the unmount and mount process,
> and I saw a StackOverflow article suggesting blockdev --flushbufs to
> make the file system sync.
>
> So I ran blockdev --flushbufs after dropping the caches.
> However, I still do not know why I can then read the data stored in the
> shared storage via Node B.

There are many problems, but the primary one is that Node B has
caches. If it has a cached version of the inode table block, why
should it reread it after Node A has modified it? The VFS also has
negative dentry caches. These are very important for search path
performance. Consider, for example, a compiler that may need to look
in many directories for a particular header file. If the C program has:

	#include "amazing.h"

The C compiler may need to look in a dozen or more directories trying
to find the header file amazing.h. And each successive C compiler
process will need to keep looking in all of those same directories.
So the kernel keeps a "negative cache": if /usr/include/amazing.h
doesn't exist, it won't ask the file system when the 2nd, 3rd, 4th,
5th, ... compiler process tries to open /usr/include/amazing.h.

You can disable all of the caches, but that makes the file system
terribly, terribly slow. Network file systems have schemes whereby
they can safely cache, since the network file system protocol has a
way to tell the client that its cached information must be reread.
Local disk file systems don't have anything like this.

There are shared-disk file systems that are designed for
multi-initiator setups. Examples of this include gfs and ocfs2 in
Linux. You will find that they often trade performance for
scalability to support multiple initiators.

You can use ext4 for fallback schemes, where the primary server has
exclusive access to the disk, and when the primary dies, the fallback
server can take over. The ext4 multi-mount protection scheme is
designed for those sorts of use cases, and it's used by Lustre
servers. But only one system is actively reading or writing to the
disk at a time, and the fallback server has to replay the journal and
ensure that the primary server won't "come back to life". Those are
sometimes called STONITH schemes ("shoot the other node in the head"),
and might involve network-controlled power strips, etc.

Regards,

					- Ted
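
The caching behaviour described above can be sketched with a toy model.
This is only an illustration under loud assumptions, not kernel code:
SharedDisk, Node, and every method name below are hypothetical, and a
Python dict stands in for the dentry/inode caches. It shows why Node B
keeps answering "no such file" from its negative cache after Node A has
created the file, and why dropping Node B's cache makes the file appear.

```python
class SharedDisk:
    """Hypothetical stand-in for the block device both nodes can see."""

    def __init__(self):
        self.files = {}  # file name -> file contents


class Node:
    """One initiator with a private lookup cache and no coordination."""

    _MISSING = object()  # marker for a cached "does not exist" answer

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}      # name -> contents, or _MISSING (negative entry)
        self.disk_reads = 0  # how many lookups actually went to the disk

    def write(self, name, data):
        # The writer updates the disk and its own cache only.
        self.disk.files[name] = data
        self.cache[name] = data

    def lookup(self, name):
        if name in self.cache:
            # Served from cache, even if another node changed the disk.
            cached = self.cache[name]
            return None if cached is Node._MISSING else cached
        self.disk_reads += 1
        data = self.disk.files.get(name, Node._MISSING)
        self.cache[name] = data  # caches misses as well as hits
        return None if data is Node._MISSING else data

    def drop_caches(self):
        # Roughly what a remount or cache drop achieves in this toy model.
        self.cache.clear()


disk = SharedDisk()
node_a, node_b = Node(disk), Node(disk)

first = node_b.lookup("amazing.h")   # goes to disk, caches the miss
node_a.write("amazing.h", "int x;")  # Node B is never told about this
stale = node_b.lookup("amazing.h")   # answered from the negative cache
node_b.drop_caches()
fresh = node_b.lookup("amazing.h")   # goes to disk again, finds the file
```

In this sketch nothing ever tells Node B to forget its cached answer,
which is the missing piece that network and shared-disk file system
protocols provide. The model also ignores journaling and in-flight
writes on Node A, so it understates the problems with a real ext4
mount.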

* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Daegyu Han @ 2019-09-17 15:38 UTC
To: Theodore Y. Ts'o
Cc: linux-fsdevel

Thank you for the clear explanation.

Best regards,

Daegyu

* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
From: Christoph Hellwig @ 2019-09-17 6:48 UTC
To: Daegyu Han
Cc: linux-fsdevel

You might want to look into the pNFS block layout instead to do this
safely. It is supported with XFS out of the box, but adding ext4
support shouldn't be all that hard.