* [RFC] mount flag "direct"
@ 2002-09-03 15:01 Peter T. Breuer
2002-09-03 15:13 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:01 UTC (permalink / raw)
To: linux kernel
I'll rephrase this as an RFC, since I want help and comments.
Scenario:
I have a driver which accesses a "disk" at the block level, to which
another driver on another machine is also writing. I want to have
an arbitrary FS on this device which can be read from and written to
from both kernels, and I want support at the block level for this idea.
Question:
What do people think of adding a "direct" option to mount, with the
semantics that the VFS then makes all opens on files on the FS mounted
"direct" use O_DIRECT, which means that file r/w is not cached in VMS,
but instead goes straight to and from the real device? Is this enough
or nearly enough for what I have in mind?
Rationale:
No caching means that each kernel doesn't go off with its own idea of
what is on the disk in a file, at least. Dunno about directories and
metadata.
Wish:
If that mount option looks promising, can somebody make provision for
it in the kernel? Details to be ironed out later?
What I have explored or will explore:
1) I have put shared zoned read/write locks on the remote resource, so each
kernel request locks precisely the "disk" area that it should, in
precisely the mode it should, for precisely the duration of each block
layer request.
2) I have maintained request write order from individual kernels.
3) IMO I should also intercept and share the FS superblock lock, but that's
for later, and please tell me about it. What about dentries? Does
O_DIRECT get rid of them? What happens with mkdir?
4) I would LIKE the kernel to emit a "tag request" on the underlying
device before and after every atomic FS operation, so that I can maintain
FS atomicity at the block level. Please comment. Can somebody make this
happen, please? Or do I add the functionality to VFS myself? Where?
I have patched the kernel to support mount -o direct, creating MS_DIRECT
and MNT_DIRECT flags for the purpose. And it works. But I haven't
dared do too much to the remote FS by way of testing yet. I have
confirmed that individual file contents can be changed without problem
when the file size does not change.
Comments?
Here is the tiny proof of concept patch for VFS that implements the
"direct" mount option.
Peter
The idea embodied in this patch is that if we get the MS_DIRECT flag when
the vfs do_mount() is called, we pass it across into the mnt flags used
by do_add_mount() as MNT_DIRECT and thus make it a permanent part of the
vfsmnt object that is the mounted fs. Then, in the generic
dentry_open() call for any file, we examine the flags on the mnt
parameter and set the O_DIRECT flag on the file pointer if MNT_DIRECT
is set on the vfsmnt object.
That makes all file opens O_DIRECT on the file system in question,
and makes all file accesses uncached by the VM.
The patch in itself works fine.
--- linux-2.5.31/fs/open.c.pre-o_direct Mon Sep 2 20:36:11 2002
+++ linux-2.5.31/fs/open.c Mon Sep 2 17:12:08 2002
@@ -643,6 +643,9 @@
if (error)
goto cleanup_file;
}
+ if (mnt->mnt_flags & MNT_DIRECT)
+ f->f_flags |= O_DIRECT;
+
f->f_ra.ra_pages = inode->i_mapping->backing_dev_info->ra_pages;
f->f_dentry = dentry;
f->f_vfsmnt = mnt;
--- linux-2.5.31/fs/namespace.c.pre-o_direct Mon Sep 2 20:37:39 2002
+++ linux-2.5.31/fs/namespace.c Mon Sep 2 17:12:04 2002
@@ -201,6 +201,7 @@
{ MS_MANDLOCK, ",mand" },
{ MS_NOATIME, ",noatime" },
{ MS_NODIRATIME, ",nodiratime" },
+ { MS_DIRECT, ",direct" },
{ 0, NULL }
};
static struct proc_fs_info mnt_info[] = {
@@ -734,7 +741,9 @@
mnt_flags |= MNT_NODEV;
if (flags & MS_NOEXEC)
mnt_flags |= MNT_NOEXEC;
- flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV);
+ if (flags & MS_DIRECT)
+ mnt_flags |= MNT_DIRECT;
+ flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_DIRECT);
/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- linux-2.5.31/include/linux/mount.h.pre-o_direct Mon Sep 2 20:31:16 2002
+++ linux-2.5.31/include/linux/mount.h Mon Sep 2 18:06:14 2002
@@ -17,6 +17,7 @@
#define MNT_NOSUID 1
#define MNT_NODEV 2
#define MNT_NOEXEC 4
+#define MNT_DIRECT 256
struct vfsmount
{
--- linux-2.5.31/include/linux/fs.h.pre-o_direct Mon Sep 2 20:32:05 2002
+++ linux-2.5.31/include/linux/fs.h Mon Sep 2 18:05:57 2002
@@ -104,6 +104,9 @@
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+
+#define MS_DIRECT 256 /* Make all opens be O_DIRECT */
+
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
@ 2002-09-03 15:13 ` Rik van Riel
  2002-09-03 15:53   ` Maciej W. Rozycki
  2002-09-03 15:16 ` jbradford
  2002-09-03 15:37 ` Anton Altaparmakov
  2 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-03 15:13 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

And what if they both allocate the same disk block to another file,
simultaneously?

A mount option isn't enough to achieve your goal. It looks like you
want GFS or OCFS.

Info about GFS can be found at:

	http://www.opengfs.org/
	http://www.sistina.com/   (commercial GFS)

Dunno where Oracle's cluster fs is documented.

regards,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:13 ` Rik van Riel
@ 2002-09-03 15:53   ` Maciej W. Rozycki
  2002-09-03 16:04     ` Peter T. Breuer
  0 siblings, 1 reply; 28+ messages in thread
From: Maciej W. Rozycki @ 2002-09-03 15:53 UTC (permalink / raw)
To: Rik van Riel; +Cc: Peter T. Breuer, linux kernel

On Tue, 3 Sep 2002, Rik van Riel wrote:

> > Rationale:
> > No caching means that each kernel doesn't go off with its own idea of
> > what is on the disk in a file, at least. Dunno about directories and
> > metadata.
>
> And what if they both allocate the same disk block to another
> file, simultaneously ?

You need a mutex then. For SCSI devices a reservation is the way to go
-- the RESERVE/RELEASE commands are mandatory for direct-access devices,
so they should work universally for disks.

--
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:53 ` Maciej W. Rozycki
@ 2002-09-03 16:04   ` Peter T. Breuer
  2002-09-03 16:08     ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 16:04 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Rik van Riel, Peter T. Breuer, linux kernel

"A month of sundays ago Maciej W. Rozycki wrote:"
> On Tue, 3 Sep 2002, Rik van Riel wrote:
> > And what if they both allocate the same disk block to another
> > file, simultaneously ?
>
> You need a mutex then. For SCSI devices a reservation is the way to go
> -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> so they should work universally for disks.

Is there provision in VFS for this operation? (i.e. care to point me at
an entry point? I just grepped for "reserve" and came up with nothing
useful).

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 16:04 ` Peter T. Breuer
@ 2002-09-03 16:08   ` Rik van Riel
  0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-03 16:08 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Maciej W. Rozycki, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago Maciej W. Rozycki wrote:"
> > On Tue, 3 Sep 2002, Rik van Riel wrote:
> > > And what if they both allocate the same disk block to another
> > > file, simultaneously ?
> >
> > You need a mutex then. For SCSI devices a reservation is the way to go
> > -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> > so they should work universally for disks.
>
> Is there provision in VFS for this operation?

No. Everybody but you seems to agree these things should be filesystem
specific and not in the VFS.

> (i.e. care to point me at an entry point? I just grepped for "reserve"
> and came up with nothing useful).

Good.

cheers,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
  2002-09-03 15:13 ` Rik van Riel
@ 2002-09-03 15:16 ` jbradford
  2002-09-03 15:37 ` Anton Altaparmakov
  2 siblings, 0 replies; 28+ messages in thread
From: jbradford @ 2002-09-03 15:16 UTC (permalink / raw)
To: ptb; +Cc: linux-kernel

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

Somewhat related to this: is there currently, or could you include in
what you're working on now, a sane way for two or more machines to
access a SCSI drive on a shared SCSI bus? In other words, several host
adaptors in different machines are all connected to one SCSI bus, and
can all access a single hard disk. At the moment, you can only do this
if all machines mount the disk read-only.

John.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
  2002-09-03 15:13 ` Rik van Riel
  2002-09-03 15:16 ` jbradford
@ 2002-09-03 15:37 ` Anton Altaparmakov
  2002-09-03 15:44   ` Peter T. Breuer
  2 siblings, 1 reply; 28+ messages in thread
From: Anton Altaparmakov @ 2002-09-03 15:37 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> I'll rephrase this as an RFC, since I want help and comments.
>
> Scenario:
> I have a driver which accesses a "disk" at the block level, to which
> another driver on another machine is also writing. I want to have
> an arbitrary FS on this device which can be read from and written to
> from both kernels, and I want support at the block level for this idea.

You cannot have an arbitrary fs. The two fs drivers must coordinate
with each other in order for your scheme to work. Just think about if
the two fs drivers work on the same file simultaneously and both start
growing the file at the same time. All hell would break loose.

For your scheme to work, the fs drivers need to communicate with each
other in order to attain atomicity of cluster and inode
(de-)allocations, etc. Basically you need a clustered fs for this to
work. GFS springs to mind but I never really looked at it...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:37 ` Anton Altaparmakov
@ 2002-09-03 15:44   ` Peter T. Breuer
  2002-09-03 16:23     ` Lars Marowsky-Bree
  2002-09-03 18:20     ` Daniel Phillips
  0 siblings, 2 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:44 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Anton Altaparmakov wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
>
> > I'll rephrase this as an RFC, since I want help and comments.
> >
> > Scenario:
> > I have a driver which accesses a "disk" at the block level, to which
> > another driver on another machine is also writing. I want to have
> > an arbitrary FS on this device which can be read from and written to
> > from both kernels, and I want support at the block level for this idea.
>
> You cannot have an arbitrary fs. The two fs drivers must coordinate with
> each other in order for your scheme to work. Just think about if the two
> fs drivers work on the same file simultaneously and both start growing the
> file at the same time. All hell would break lose.

Thanks! Rik also mentioned that objection! That's good. You both "only"
see the same problem, so there can't be many more like it.. I replied
thusly:

OK - reply: it appears that in order to allocate away free space, one
must first "grab" that free space using a shared lock. That's
perfectly feasible.

> For your scheme to work, the fs drivers need to communicate with each
> other in order to attain atomicity of cluster and inode (de-)allocations,
> etc.

Yes. They must create atomic FS operations at the VFS level (grabbing
unallocated space is one of them) and I must share the locks for those
ops.

> Basically you need a clustered fs for this to work. GFS springs to

No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
support necessary :-).

That's really what my question is driving at. I see that I need to
make VFS ops communicate "tag requests" to the block layer, in order
to implement locking. Now you and Rik have pointed out one operation
that needs locking. My next question is obviously: can you point me
more or less precisely at this operation in the VFS layer? I've only
started studying it and I am relatively unfamiliar with it.

Thanks.

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:44 ` Peter T. Breuer
@ 2002-09-03 16:23   ` Lars Marowsky-Bree
  2002-09-03 16:41     ` Peter T. Breuer
  1 sibling, 1 reply; 28+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-03 16:23 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-03T17:44:10, "Peter T. Breuer" <ptb@it.uc3m.es> said:

> No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> support necessary :-).
>
> That's really what my question is driving at. I see that I need to
> make VFS ops communicate "tag requests" to the block layer, in
> order to implement locking. Now you and Rik have pointed out one
> operation that needs locking. My next question is obviously: can you
> point me more or less precisely at this operation in the VFS layer?
> I've only started studying it and I am relatively unfamiliar with it.

Your approach is not feasible. Distributed filesystems have a lot of
subtle pitfalls - locking, cache coherency, journal replay to name a
few - which you can hardly solve at the VFS layer.

Good reading would be any sort of entry literature on clustering. I
would recommend "In search of clusters" and many of the whitepapers
Google will turn up for you, as well as the OpenGFS source.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 16:23 ` Lars Marowsky-Bree
@ 2002-09-03 16:41   ` Peter T. Breuer
  2002-09-03 17:07     ` David Lang
  2002-09-03 17:26     ` Rik van Riel
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 16:41 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> "Peter T. Breuer" <ptb@it.uc3m.es> said:
>
> > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > support necessary :-).
> >
> > That's really what my question is driving at. I see that I need to
> > make VFS ops communicate "tag requests" to the block layer, in
> > order to implement locking. Now you and Rik have pointed out one
> > operation that needs locking. My next question is obviously: can you
> > point me more or less precisely at this operation in the VFS layer?
> > I've only started studying it and I am relatively unfamiliar with it.
>
> Your approach is not feasible.

But you have to be specific about why not. I've responded to the
particular objections so far.

> Distributed filesystems have a lot of subtle pitfalls - locking, cache

Yes, thanks, I know.

> coherency, journal replay to name a few - which you can hardly solve at the

My simple suggestion is not to cache. I am of the opinion that in
principle that solves all coherency problems, since there would be no
stored state that needs to "cohere". The question is how to identify
and remove the state that is currently cached.

As to journal replay, there will be no journalling - if it breaks it
breaks and somebody (fsck) can go fix it. I don't want to get anywhere
near complicated.

> VFS layer.
>
> Good reading would be any sort of entry literature on clustering, I would

Please don't condescend! I am honestly not in need of education :-).

> recommend "In search of clusters" and many of the whitepapers Google will turn
> up for you, as well as the OpenGFS source.

(Puhleeese!)

We already know that we can have a perfectly fine and arbitrary
shared file system, shared only at the block level, if we

1) permit no new dirs or files to be made (disable O_CREAT or something
   like)
2) do all r/w on files with O_DIRECT
3) do file extensions via a new generic VFS "reserve" operation
4) have shared mutexes on all vfs ops, implemented by passing
   down a special "tag" request to the block layer.
5) maintain read+write order at the shared resource.

I have already implemented 2, 4, 5.

The question is how to extend the range of useful operations. For the
moment I would be happy simply to go ahead and implement 1) and 3),
while taking serious strong advice on what to do about directories.

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 16:41 ` Peter T. Breuer
@ 2002-09-03 17:07   ` David Lang
  2002-09-03 17:30     ` Peter T. Breuer
  3 siblings, 1 reply; 28+ messages in thread
From: David Lang @ 2002-09-03 17:07 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

Peter, the thing that you seem to be missing is that direct mode only
works for writes, it doesn't force a filesystem to go to the hardware
for reads.

For many filesystems you cannot turn off their internal caching of
data (metadata for some, all data for others), so to implement what
you are after you will have to modify the filesystem to not cache
anything. Since you aren't going to do this for every filesystem, you
end up only having this option on the one(s) that you modify.

If you have a single (or even just a few) filesystems that have this
option you may as well include the locking/syncing software in them
rather than modifying the VFS layer.

David Lang

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Date: Tue, 3 Sep 2002 18:41:49 +0200 (MET DST)
> From: Peter T. Breuer <ptb@it.uc3m.es>
> To: Lars Marowsky-Bree <lmb@suse.de>
> Cc: Peter T. Breuer <ptb@it.uc3m.es>,
>     linux kernel <linux-kernel@vger.kernel.org>
> Subject: Re: [RFC] mount flag "direct"
>
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > "Peter T. Breuer" <ptb@it.uc3m.es> said:
> >
> > > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > > support necessary :-).
> > >
> > > That's really what my question is driving at. I see that I need to
> > > make VFS ops communicate "tag requests" to the block layer, in
> > > order to implement locking. Now you and Rik have pointed out one
> > > operation that needs locking. My next question is obviously: can you
> > > point me more or less precisely at this operation in the VFS layer?
> > > I've only started studying it and I am relatively unfamiliar with it.
> >
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.
>
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.
>
> As to journal replay, there will be no journalling - if it breaks it
> breaks and somebody (fsck) can go fix it. I don't want to get anywhere
> near complicated.
>
> > VFS layer.
> >
> > Good reading would be any sort of entry literature on clustering, I would
>
> Please don't condescend! I am honestly not in need of education :-).
>
> > recommend "In search of clusters" and many of the whitepapers Google will turn
> > up for you, as well as the OpenGFS source.
>
> (Puhleeese!)
>
> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level if we
>
> 1) permit no new dirs or files to be made (disable O_CREAT or something
>    like)
> 2) do all r/w on files with O_DIRECT
> 3) do file extensions via a new generic VFS "reserve" operation
> 4) have shared mutexes on all vfs op, implemented by passing
>    down a special "tag" request to the block layer.
> 5) maintain read+write order at the shared resource.
>
> I have already implemented 2,4,5.
>
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.
>
> Peter
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 17:07 ` David Lang
@ 2002-09-03 17:30   ` Peter T. Breuer
  2002-09-03 17:40     ` David Lang
  2002-09-04  5:57     ` Helge Hafting
  0 siblings, 2 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 17:30 UTC (permalink / raw)
To: David Lang; +Cc: linux kernel

"A month of sundays ago David Lang wrote:"
> Peter, the thing that you seem to be missing is that direct mode only
> works for writes, it doesn't force a filesystem to go to the hardware for
> reads.

Yes it does. I've checked! Well, at least I've checked that writing
then reading causes the reads to get to the device driver. I haven't
checked what reading twice does.

If it doesn't cause the data to be read twice, then it ought to, and
I'll fix it (given half a clue as extra pay ..:-)

> for many filesystems you cannot turn off their internal caching of data
> (metadata for some, all data for others)

Well, let's take things one at a time. Put in a VFS mechanism and then
convert some FSs to use it.

> so to implement what you are after you will have to modify the filesystem
> to not cache anything, since you aren't going to do this for every

Yes.

> filesystem you end up only haivng this option on the one(s) that you
> modify.

I intend to make the generic mechanism attractive.

> if you have a single (or even just a few) filesystems that have this
> option you may as well include the locking/syncing software in them rather
> then modifying the VFS layer.

Why? Are you advocating a particular approach? Yes, I agree that that
is a possible way to go - but I will want the extra VFS ops anyway,
and will want to modify the particular fs to use them, no?

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 17:30 ` Peter T. Breuer
@ 2002-09-03 17:40   ` David Lang
  1 sibling, 0 replies; 28+ messages in thread
From: David Lang @ 2002-09-03 17:40 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes, it doesn't force a filesystem to go to the hardware for
> > reads.
>
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.
>
> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)

Writing then reading the same file may cause it to be read from the
disk, but reading /foo/bar then reading /foo/bar again will not cause
two reads of all data. Some filesystems go to a lot of work to organize
the metadata in memory in particular to access things more efficiently;
you will have to go into each filesystem and modify them to not do
this.

In addition you will have lots of potential races as one system reads
a block of data, modifies it, then writes it while the other system
does the same thing. You cannot easily detect this in the low level
drivers as these are separate calls from the filesystem, and even if
you do, what error message will you send to the second system? There's
no error that says 'the disk has changed under you, back up and
re-read it before you modify it'.

Yes, this is stuff that could be added to all filesystems, but will
the filesystem maintainers let you do this major surgery to their
systems? For example the XFS and JFS teams are going to a lot of
effort to maintain their systems to be compatible with other OS's;
they probably won't appreciate all the extra conditionals that you
will need to put in to do all of this.

Even for ext2 there are people (including Linus, I believe) that are
saying that major new features should not be added to ext2, but to a
new filesystem forked off of ext2 (ext3 for example, or a fork of it).

David Lang

> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
>
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
>
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
>
> Yes.
>
> > filesystem you end up only haivng this option on the one(s) that you
> > modify.
>
> I intend to make the generic mechanism attractive.
>
> > if you have a single (or even just a few) filesystems that have this
> > option you may as well include the locking/syncing software in them rather
> > then modifying the VFS layer.
>
> Why? Are you advocating a particular approach? Yes, I agree that that
> is a possible way to go - but I will want the extra VFS ops anyway,
> and will want to modify the particular fs to use them, no?
>
> Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 17:30 ` Peter T. Breuer
  2002-09-03 17:40 ` David Lang
@ 2002-09-04  5:57   ` Helge Hafting
  2002-09-04  6:21     ` Peter T. Breuer
  1 sibling, 1 reply; 28+ messages in thread
From: Helge Hafting @ 2002-09-04 5:57 UTC (permalink / raw)
To: ptb; +Cc: linux kernel

"Peter T. Breuer" wrote:
>
> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes, it doesn't force a filesystem to go to the hardware for
> > reads.
>
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.

You tried reading from a file? For how long are you going to work on
that data you read? The other machine may ruin it anytime, even
instantly after you read it.

Now, try "ls -l" twice instead of reading from a file. Notice that no
io happens the second time. Here we're reading metadata instead of
file data. This sort of stuff is cached in separate caches that assume
nothing else modifies the disk.

> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)
>
> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
>
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
>
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
>
> Yes.
>
> > filesystem you end up only haivng this option on the one(s) that you
> > modify.
>
> I intend to make the generic mechanism attractive.

It won't be attractive, for the simple reason that a no-cache fs will
be devastatingly slow. A program that reads a file one byte at a time
will do 1024 disk accesses to read a single kilobyte. And it will do
that again if you run it again. Nobody will have time to wait for
this, and this alone makes your idea useless. To get an idea, try
booting with mem=4M and suffer; a cacheless fs will be much, much
worse than that. Using nfs or similar will be so much faster.

Existing network fs'es work around complexities by using one machine
as disk server; the others simply transfer requests to and from that
machine and let it sort things out alone.

The main reason I can imagine for letting two machines write to the
*same* disk is performance. Going cacheless won't give you that. But
you *can* beat nfs and friends by going for a "distributed ext2" or
similar, where the participating machines talk to each other about who
writes where. Each machine locks down the blocks they want to cache,
with either a shared read lock or an exclusive write lock.

There are a lot of performance tricks you may use, such as
pre-reserving some free blocks for each machine, some ranges of inodes
and so on, so each can modify those without asking the others. Then
re-distribute stuff occasionally so nobody runs out while the others
have plenty.

Helge Hafting

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct" 2002-09-04 5:57 ` Helge Hafting @ 2002-09-04 6:21 ` Peter T. Breuer 2002-09-04 6:49 ` Helge Hafting 0 siblings, 1 reply; 28+ messages in thread From: Peter T. Breuer @ 2002-09-04 6:21 UTC (permalink / raw) To: Helge Hafting; +Cc: ptb, linux kernel "A month of sundays ago Helge Hafting wrote:" > "Peter T. Breuer" wrote: > > "A month of sundays ago David Lang wrote:" > > > Peter, the thing that you seem to be missing is that direct mode only > > > works for writes, it doesn't force a filesystem to go to the hardware for > > > reads. > > > > Yes it does. I've checked! Well, at least I've checked that writing > > then reading causes the reads to get to the device driver. I haven't > > checked what reading twice does. > > You tried reading from a file? For how long are you going to Yes I did. And I tried readingtwice too, and it reads twice at device level. > work on that data you read? The other machine may ruin it anytime, Well, as long as I want to. What's the problem? I read file X at time T and got data Y. That's all I need. > even instantly after you read it. So what? > Now, try "ls -l" twice instead of reading from a file. Notice > that no io happens the second time. Here we're reading Directory data is cached. > metadata instead of file data. This sort of stuff > is cached in separate caches that assumes nothing > else modifies the disk. True, and I'm happy to change it. I don't think we always had a directory cache. > > > filesystem you end up only haivng this option on the one(s) that you > > > modify. > > > > I intend to make the generic mechanism attractive. > > It won't be attractive, for the simple reason that a no-cache fs > will be devastatingly slow. A program that read a file one byte at A generic mechanism is not a "no cache fs". It's a generic mechanism. > Nobody will have time to wait for this, and this alone makes your Try arguing logically. 
I really don't like it when people invent their own straw men and then
proceed to reason as though they were *mine*.

> The main reason I can imagine for letting two machines write to
> the *same* disk is performance. Going cacheless won't give you

Then imagine some more. I'm not responsible for your imagination ...

> that. But you *can* beat nfs and friends by going for
> a "distributed ext2" or similar where the participating machines
> talk to each other about who writes where.
> Each machine locks down the blocks they want to cache, with
> either a shared read lock or an exclusive write lock.

That's already done.

> There is a lot of performance tricks you may use, such as

No tricks. Let's be simple.

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-04  6:21 ` Peter T. Breuer
@ 2002-09-04  6:49 ` Helge Hafting
  2002-09-04  9:15 ` Peter T. Breuer
  0 siblings, 1 reply; 28+ messages in thread

From: Helge Hafting @ 2002-09-04 6:49 UTC (permalink / raw)
To: ptb; +Cc: linux kernel

"Peter T. Breuer" wrote:
>
> "A month of sundays ago Helge Hafting wrote:"
> > "Peter T. Breuer" wrote:
> > > "A month of sundays ago David Lang wrote:"
> > > > Peter, the thing that you seem to be missing is that direct mode only
> > > > works for writes, it doesn't force a filesystem to go to the hardware
> > > > for reads.
> > >
> > > Yes it does. I've checked! Well, at least I've checked that writing
> > > then reading causes the reads to get to the device driver. I haven't
> > > checked what reading twice does.
> >
> > You tried reading from a file? For how long are you going to
>
> Yes I did. And I tried reading twice too, and it reads twice at device
> level.
>
> > work on that data you read? The other machine may ruin it anytime,
>
> Well, as long as I want to. What's the problem? I read file X at time
> T and got data Y. That's all I need.

No problem if all you do is use file data. A serious problem if
the stuff you read is used to make a decision about where
to write something else on that shared disk. For example:
the fs needs to extend a file. It reads the free block bitmap,
and finds a free block. Then it overwrites that free block,
and also writes back a changed block bitmap. Unfortunately
some other machine just did the same thing, and you
now have a crosslinked and corrupt file.

There are several similar scenarios. You can't really talk
about "not caching". Once you read something into
memory it is "cached" in memory, even if you only use it once
and then re-read it whenever you need it later.

> > even instantly after you read it.
>
> So what?

See above.

> > Now, try "ls -l" twice instead of reading from a file. Notice
> > that no io happens the second time.
> > Here we're reading
>
> Directory data is cached.
>
> > metadata instead of file data. This sort of stuff
> > is cached in separate caches that assume nothing
> > else modifies the disk.
>
> True, and I'm happy to change it. I don't think we always had a
> directory cache.
>
> > > > filesystem you end up only having this option on the one(s) that you
> > > > modify.
> > >
> > > I intend to make the generic mechanism attractive.
> >
> > It won't be attractive, for the simple reason that a no-cache fs
> > will be devastatingly slow. A program that read a file one byte at
>
> A generic mechanism is not a "no cache fs". It's a generic mechanism.
>
> > Nobody will have time to wait for this, and this alone makes your
>
> Try arguing logically. I really don't like it when people invent their
> own straw men and then proceed to reason as though they were *mine*.

Maybe I wasn't clear. What I say is that a fs that doesn't cache
anything, in order to avoid cache coherency problems, will be
too slow for generic use (such as two desktop computers
sharing a single main disk with applications and data).
Perhaps it is really useful for some special purpose; I haven't
seen you say what you want this for, so I assumed general use.

There is nothing illogical about performance problems. A cacheless
system may _work_ and it might be simple, but it is also _useless_
for a lot of common situations where cached fs'es work fine.

> > The main reason I can imagine for letting two machines write to
> > the *same* disk is performance. Going cacheless won't give you
>
> Then imagine some more. I'm not responsible for your imagination ...

You tell. You keep asking why your idea won't work and I
give you "performance problems" _even_ if you sort out the
correctness issues with no other cost than the lack of cache.

Please tell what you think it can be used for. I do not say
it is useless for everything, although it certainly is useless
for the purposes I can come up with.
The only uses *I* find for a shared writeable (but uncachable) disk
are so special that I wouldn't bother putting a fs like ext2 on it.
Sharing a raw block device is doable today if you let the programs
using it keep track of data themselves instead of using a fs.
This isn't what you want though. It could be interesting to know
what you want, considering what your solution looks like.

Helge Hafting

^ permalink raw reply	[flat|nested] 28+ messages in thread
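[Editor's aside: the crosslinking scenario Helge describes - two machines each read the free-block bitmap, allocate from their private copy, and write the copy back - is easy to show in miniature. This is a hypothetical sketch with invented names, not real filesystem code; the "disk" is just an array.]

```c
/* Two uncoordinated allocators performing read-modify-write on a shared
 * free-block bitmap. With no lock spanning the whole read-allocate-write
 * sequence, both can claim the same block: a crosslinked file. */
#include <assert.h>
#include <string.h>

enum { NB = 8 };

static unsigned char disk_bitmap[NB];   /* the "on-disk" bitmap; 1 = in use */

/* allocate from a private snapshot of the bitmap, then write it back */
int alloc_block(unsigned char *snapshot)
{
    int i;
    for (i = 0; i < NB; i++) {
        if (!snapshot[i]) {
            snapshot[i] = 1;
            memcpy(disk_bitmap, snapshot, NB);  /* write-back to "disk" */
            return i;
        }
    }
    return -1;  /* no free block */
}
```

The fix discussed in the thread is to hold a lock on the bitmap from the read until the write-back completes, so the two snapshots can never be taken concurrently.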
* Re: [RFC] mount flag "direct"
  2002-09-04  6:49 ` Helge Hafting
@ 2002-09-04  9:15 ` Peter T. Breuer
  2002-09-04 11:34 ` Helge Hafting
  0 siblings, 1 reply; 28+ messages in thread

From: Peter T. Breuer @ 2002-09-04 9:15 UTC (permalink / raw)
To: Helge Hafting; +Cc: ptb, linux kernel

"A month of sundays ago Helge Hafting wrote:"
> No problem if all you do is use file data. A serious problem if
> the stuff you read is used to make a decision about where
> to write something else on that shared disk. For example:
> the fs needs to extend a file. It reads the free block bitmap,
> and finds a free block. Then it overwrites that free block,
> and also writes back a changed block bitmap. Unfortunately

That's the exact problem that's already been mentioned twice, and I'm
confident of that one being solved. Lock the whole FS if necessary, but
read the bitmap and lock the bitmap on disk until the extension is
finished and the bitmap is written back.

It has been suggested that the VFS support a "reserve/release blocks"
operation. It would simply mark the ondisk bitmap bits as used and add
them to our available list. Then every file extension or creation would
need to be preceded by a reserve command, or fail, according to policy.

> some other machine just did the same thing, and you
> now have a crosslinked and corrupt file.

There is no problem locking and serializing groups of read/write
accesses. Please stop harping on about THAT at least :-). What is a
problem is marking the groups of accesses.

> There are several similar scenarios. You can't really talk
> about "not caching". Once you read something into
> memory it is "cached" in memory, even if you only use it once
> and then re-read it whenever you need it later.

That's fine. And I don't see what needs to be reread. You had this
problem once with smp, and you beat it with locks.

> > A generic mechanism is not a "no cache fs". It's a generic mechanism.
> > > Nobody will have time to wait for this, and this alone makes your
> >
> > Try arguing logically. I really don't like it when people invent their
> > own straw men and then proceed to reason as though they were *mine*.
>
> Maybe I wasn't clear. What I say is that a fs that doesn't cache
> anything in order to avoid cache coherency problems will be
> too slow for generic use. (Such as two desktop computers

Quite possibly, but not too slow for reading data in and writing data
out, at gigabyte/s rates overall, which is what the intention is.
That's not general use. And even if it were general use, it would still
be pretty acceptable _in general_.

> > Then imagine some more. I'm not responsible for your imagination ...
>
> You tell. You keep asking why your idea won't work and I
> give you "performance problems" _even_ if you sort out the
> correctness issues with no other cost than the lack of cache.

The correctness issues are the only important ones, once we have
correct and fast shared read and write to (existing) files.

> it is useless for everything, although it certainly is useless
> for the purposes I can come up with. The only uses *I* find
> for a shared writeable (but uncachable) disk are so special that
> I wouldn't bother putting a fs like ext2 on it. Sharing a
> raw block device is doable today if you let the programs

It's far too inconvenient to be totally without a FS. What we want is a
normal FS, but slower at some things, and faster at others, but correct
and shared. It's an approach. The calculations show clearly that r/w
(once!) to existing files are the only performance issues. The rest is
decor. But decor that is very nice to have around.

Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread
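[Editor's aside: the "reserve/release blocks" operation Peter proposes above - mark on-disk bitmap bits used in one locked step, then allocate from a private list - might look roughly like this. The API is hypothetical, not an existing VFS interface.]

```c
/* Hypothetical per-node block reservation: grab a batch of free blocks
 * from the shared bitmap under one (imagined) cluster-wide lock, then
 * hand them out locally with no further coordination. Not real VFS code. */
#include <assert.h>

enum { NBLOCKS = 32, BATCH = 4 };

static unsigned char shared_bitmap[NBLOCKS];   /* 1 = in use on "disk" */

struct reservation {
    int blocks[BATCH];
    int count;
};

/* done once under the bitmap lock: mark the bits used on disk at once */
int reserve_blocks(struct reservation *r)
{
    int i;
    r->count = 0;
    for (i = 0; i < NBLOCKS && r->count < BATCH; i++) {
        if (!shared_bitmap[i]) {
            shared_bitmap[i] = 1;
            r->blocks[r->count++] = i;
        }
    }
    return r->count;
}

/* later file extensions draw from the private reservation, lock-free */
int alloc_reserved(struct reservation *r)
{
    return r->count ? r->blocks[--r->count] : -1;
}
```

Because the bits are marked used on disk at reservation time, two nodes' reservations can never overlap, which is what defuses the bitmap race from earlier in the thread.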
* Re: [RFC] mount flag "direct"
  2002-09-04  9:15 ` Peter T. Breuer
@ 2002-09-04 11:34 ` Helge Hafting
  0 siblings, 0 replies; 28+ messages in thread

From: Helge Hafting @ 2002-09-04 11:34 UTC (permalink / raw)
To: ptb; +Cc: linux-kernel

"Peter T. Breuer" wrote:
> There is no problem locking and serializing groups of
> read/write accesses. Please stop harping on about THAT at
> least :-). What is a problem is marking the groups of accesses.

Sorry, I now see you dealt with that in other threads.

> That's fine. And I don't see what needs to be reread. You had this
> problem once with smp, and you beat it with locks.

Consider that taking a lock on a SMP machine is a fairly fast
operation. Taking a lock shared over a network probably takes about
100-1000 times as long. People submit patches for shaving a single
instruction off the SMP locks, for performance. The locking is removed
on UP, because it makes a difference even though the lock is never busy
in the UP case.

A much slower lock will either hurt performance a lot, or force a
coarse granularity. The time spent on locking had better be a small
fraction of total time, or you won't get your high performance. A
coarse granularity will limit your software so the different machines
mostly use different parts of the shared disks, or you'll lose the
parallelism. I guess that is fine with you then.

> > it is useless for everything, although it certainly is useless
> > for the purposes I can come up with. The only uses *I* find
> > for a shared writeable (but uncachable) disk are so special that
> > I wouldn't bother putting a fs like ext2 on it.
>
> It's far too inconvenient to be totally without a FS. What we
> want is a normal FS, but slower at some things, and faster at others,
> but correct and shared. It's an approach. The calculations show
> clearly that r/w (once!) to existing files are the only performance
> issues. The rest is decor. But decor that is very nice to have around.

Ok.
If r/w _once_ is what matters, then surely you don't need cache. I
consider that a rather unusual case though, which is why you'll have a
hard time getting this into the standard kernel. But maybe you don't
need that?

Still, you should consider writing a fs of your own. It is a _small_
job compared to implementing your locking system in existing
filesystems. Remember that those filesystems are optimized for a common
case of a few cpus, where you may take and release hundreds or
thousands of locks per second, and where data transfers are often small
and repetitive. Caching is so useful for this case that current fs code
is designed around it.

With a fs of your own you won't have to worry about maintainers
changing the rest of the fs code. That sort of thing is hard to keep up
with, given the massive changes you'll need for your sort of
distributed fs. A single-purpose fs isn't such a big job; you can leave
out design considerations that don't apply to your case.

Helge Hafting

^ permalink raw reply	[flat|nested] 28+ messages in thread
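[Editor's aside: Helge's lock-cost arithmetic can be made concrete with a toy model. All numbers below are invented for illustration - say ~1 ms for a network lock round trip versus ~1 us for an SMP spinlock and tens of microseconds to move a small block of data; the point is only that cluster-lock latency must be amortised over large transfers or coarse granularity.]

```c
/* Toy model: fraction of wall time spent on locking, given the lock's
 * latency and the payload transfer time it protects. Numbers invented. */
double lock_overhead_fraction(double lock_us, double transfer_us)
{
    return lock_us / (lock_us + transfer_us);
}
```

With a 1 ms lock guarding a 4 KB transfer (~40 us at 100 MB/s) the machine spends over 95% of its time locking; guard a 100 ms bulk transfer with the same lock and the overhead drops to ~1%, which is the coarse-granularity trade-off Helge describes.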
* Re: [RFC] mount flag "direct"
  2002-09-03 16:41 ` Peter T. Breuer
  2002-09-03 17:07 ` David Lang
@ 2002-09-03 17:26 ` Rik van Riel
  2002-09-03 18:02 ` Andreas Dilger
  2002-09-03 17:29 ` Jan Harkes
  2002-09-03 18:31 ` Daniel Phillips
  3 siblings, 1 reply; 28+ messages in thread

From: Rik van Riel @ 2002-09-03 17:26 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.

[snip]

> Please don't condescend! I am honestly not in need of education :-).

You make it sound like you bet your masters degree on
doing a distributed filesystem without filesystem support ;)

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"
http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 17:26 ` Rik van Riel
@ 2002-09-03 18:02 ` Andreas Dilger
  2002-09-03 18:44 ` Daniel Phillips
  0 siblings, 1 reply; 28+ messages in thread

From: Andreas Dilger @ 2002-09-03 18:02 UTC (permalink / raw)
To: Rik van Riel; +Cc: Peter T. Breuer, Lars Marowsky-Bree, linux kernel

On Sep 03, 2002 14:26 -0300, Rik van Riel wrote:
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > > Your approach is not feasible.
> >
> > But you have to be specific about why not. I've responded to the
> > particular objections so far.
>
> You make it sound like you bet your masters degree on
> doing a distributed filesystem without filesystem support ;)

Actually, we are using ext3 pretty much as-is for our backing-store for
Lustre. The same is true of InterMezzo, and NFS, for that matter. All
of them live on top of a standard "local" filesystem, which doesn't
know the things that happen above it to make it a network filesystem
(locking, etc).

That isn't to say that I agree with just taking a local filesystem and
putting it on a shared block device and expecting it to work with only
the normal filesystem code. We do all of our locking above the fs
level, but we do have some help in the VFS (intent-based lookup, patch
in the Lustre CVS repository, if people are interested).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 18:02 ` Andreas Dilger
@ 2002-09-03 18:44 ` Daniel Phillips
  0 siblings, 0 replies; 28+ messages in thread

From: Daniel Phillips @ 2002-09-03 18:44 UTC (permalink / raw)
To: Andreas Dilger, Rik van Riel
Cc: Peter T. Breuer, Lars Marowsky-Bree, linux kernel

On Tuesday 03 September 2002 20:02, Andreas Dilger wrote:
> Actually, we are using ext3 pretty much as-is for our backing-store
> for Lustre. The same is true of InterMezzo, and NFS, for that matter.
> All of them live on top of a standard "local" filesystem, which doesn't
> know the things that happen above it to make it a network filesystem
> (locking, etc).

To put this in simplistic terms, this works because you treat the
underlying filesystem simply as a storage device, a slightly funky kind
of disk.

--
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 16:41 ` Peter T. Breuer
  2002-09-03 17:07 ` David Lang
  2002-09-03 17:26 ` Rik van Riel
@ 2002-09-03 17:29 ` Jan Harkes
  2002-09-03 18:31 ` Daniel Phillips
  3 siblings, 0 replies; 28+ messages in thread

From: Jan Harkes @ 2002-09-03 17:29 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-kernel

On Tue, Sep 03, 2002 at 06:41:49PM +0200, Peter T. Breuer wrote:
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.
>
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

That is a very simple suggestion, but not feasible, because there will
always be 'cached copies' floating around. Even if you remove the
dcache (directory lookups) and icache (inode cache) in the kernel, both
filesystems will still need to look at the data in order to modify it.

Looking at the data involves creating an in-memory representation of
the object. If there is no locking, then when one filesystem modifies
the object, the other filesystem is looking at (and possibly modifying)
stale data, which causes consistency problems.

> > Good reading would be any sort of entry literature on clustering, I would
>
> Please don't condescend! I am honestly not in need of education :-).

I'm afraid that all of this has been very well documented; another
example would be Tanenbaum's "Distributed Systems", especially the
chapter on various consistency models, which is a nice read.
> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level, if we
>
> 1) permit no new dirs or files to be made (disable O_CREAT or something
>    like)
> 2) do all r/w on files with O_DIRECT
> 3) do file extensions via a new generic VFS "reserve" operation
> 4) have shared mutexes on all vfs ops, implemented by passing
>    down a special "tag" request to the block layer.
> 5) maintain read+write order at the shared resource.

Can I quote your 'puhleese' here?

Inodes share the same on-disk blocks, so when one inode is changed
(setattr, truncate) and written back to disk, it affects all other
inodes stored in the same block. So the shared mutexes at the VFS level
don't cover the necessary locking.

Each time you add another point to work around the latest argument,
someone will surely give you another argument, until you end up with a
system that is no longer practical. And then probably even slower,
because you absolutely cannot allow the FS to trust _any_ data without
a locked read or write off the disk (or across the network). And
because you seem to like cpu consistency that much, this even involves
the data that happens to be 'cached' in the CPU.

> I have already implemented 2, 4, 5.
>
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.

Perhaps the fact that directories (and journalled filesystems) aren't
already solved is an indication that the proposed 'solution' is flawed?

Filesystems were designed to trust the disk as 'stable storage', i.e.
anything that was read or recently written will be the same. NFS
already weakens this model slightly. AFS and Coda go even further: we
only guarantee that changes are propagated when a file is closed. There
is a callback mechanism to invalidate cached copies.
But even when we open a file, it could still have been changed within
the past 1/2 RTT. This is a window we intentionally live with, because
it avoids the full RTT hit we would have if we had to go to the server
on every file open. It is the latency that kills you when you can't
cache.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread
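[Editor's aside: Jan's point that inodes share on-disk blocks is easy to see with ext2-style numbers. The figures here are illustrative (1024-byte blocks, 128-byte inodes): writing one inode back rewrites its seven neighbours, so a per-inode lock at the VFS level does not protect the shared block.]

```c
/* Illustrative layout arithmetic: which disk block holds which inode.
 * Sizes are typical ext2 defaults, chosen for illustration only. */
enum { BLOCK_SIZE = 1024, INODE_SIZE = 128 };

int inodes_per_block(void)
{
    return BLOCK_SIZE / INODE_SIZE;     /* 8 inodes share each block */
}

/* disk block (within one inode table) holding inode number ino, 0-based */
int inode_block(int ino)
{
    return ino / inodes_per_block();
}
```

Any write-back of inode 0 therefore also rewrites inodes 1 through 7, which is why block-granularity locking (or locking the inode table block itself) is the minimum needed on a shared disk.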
* Re: [RFC] mount flag "direct"
  2002-09-03 16:41 ` Peter T. Breuer
  ` (2 preceding siblings ...)
  2002-09-03 17:29 ` Jan Harkes
@ 2002-09-03 18:31 ` Daniel Phillips
  3 siblings, 0 replies; 28+ messages in thread

From: Daniel Phillips @ 2002-09-03 18:31 UTC (permalink / raw)
To: ptb, Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

On Tuesday 03 September 2002 18:41, Peter T. Breuer wrote:
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at
> > the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

Well, for example, you would not be able to have the same file open in
two different kernels because the inode would be cached. So you'd have
to close the root directory on one kernel before the other could access
any file. Not only would that be horribly inefficient, you would
*still* need to implement a locking protocol between the two kernels to
make it work. There's no magic way of making this easy.

--
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-03 15:44 ` Peter T. Breuer
  2002-09-03 16:23 ` Lars Marowsky-Bree
@ 2002-09-03 18:20 ` Daniel Phillips
  1 sibling, 0 replies; 28+ messages in thread

From: Daniel Phillips @ 2002-09-03 18:20 UTC (permalink / raw)
To: ptb, Anton Altaparmakov; +Cc: Peter T. Breuer, linux kernel

On Tuesday 03 September 2002 17:44, Peter T. Breuer wrote:
> > > Scenario:
> > > I have a driver which accesses a "disk" at the block level, to which
> > > another driver on another machine is also writing. I want to have
> > > an arbitrary FS on this device which can be read from and written to
> > > from both kernels, and I want support at the block level for this idea.
> >
> > You cannot have an arbitrary fs. The two fs drivers must coordinate with
> > each other in order for your scheme to work. Just think about if the two
> > fs drivers work on the same file simultaneously and both start growing the
> > file at the same time. All hell would break loose.
>
> Thanks!
>
> Rik also mentioned that objection! That's good. You both "only" see
> the same problem, so there can't be many more like it..

(intentionally misinterpreting) No indeed, there aren't many problems
like it, in terms of sheer complexity.

--
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread
[parent not found: <20020907164631.GA17696@marowsky-bree.de>]
* Re: [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)]
  [not found] <20020907164631.GA17696@marowsky-bree.de>
@ 2002-09-07 19:59 ` Peter T. Breuer
  2002-09-07 21:14 ` [RFC] mount flag "direct" Lars Marowsky-Bree
  0 siblings, 1 reply; 28+ messages in thread

From: Peter T. Breuer @ 2002-09-07 19:59 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> as per your request, I am forwarding this mail to you again. The main point

Thanks.

> you'll find is that yes, I believe that your idea _can_ be made to work. Quite
> frankly, there are very few ideas which _cannot_ be made to work. The
> interesting question is whether it is worth it to take a particular route or not.
>
> And let me say that I find it at least slightly rude to "miss" mail in a
> discussion; if you are so important that you get so much mail every day, maybe
> a public discussion on a mailing list isn't the proper way to go about
> something...

Well, I'm sorry. I did explain that I am travelling, I think! And it
is even very hard to connect occasionally (it requires me finding an
airport kiosk or an internet cafe), and then I have to _pay_ for the
time to compose a reply, and so on! Well, if I don't read your mail for
one day, then it will be filed somewhere for me by procmail, and I
haven't been able to check any filings ..

> > > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > > would you make readdir work?
>
> > Well, one has to read it from scratch. I'll set about seeing how to do.
> > Clues welcome.
>
> Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> PvFS etc.

Eh, I thought I saw this - didn't I reply?

> Any of them will appreciate the good work of a bright fellow.

Well, I know of some of these.
Intermezzo I've tried lately and found near impossible to set up and
work with (still, a great improvement over coda, which was absolutely
impossible, to within an atom's breadth). And it's nowhere near got the
right orientation. Lustre people have been pointing me at. What
happened to petal?

> Noone appreciates reinventing the wheel another time, especially if - for
> simplification - it starts out as a square.

But what I suggest is finding a simple way to turn an existing FS into
a distributed one. I.e. NOT reinventing the wheel. All those other
people are reinventing a wheel, for some reason :-).

> You tell me why Distributed Filesystems are important. I fully agree.
>
> You fail to give a convincing reason why that must be made to work with
> "all" conventional filesystems, especially given the constraints this implies.

Because that's the simplest thing to do.

> Conventional wisdom seems to be that this can much better be handled specially
> by special filesystems, who can do finer grained locking etc because they
> understand the on disk structures, can do distributed journal recovery etc.

Well, how about allowing get_block to return an extra argument, which
is the ondisk placement of the inode(s) concerned, so that the vfs can
issue a lock request for them before the i/o starts. Let the FS return
the list of metadata things to lock, and maybe a semaphore to start the
i/o with. There you are: instant distribution. It works for those fs's
which cooperate. Make sure the FS can indicate whether it replied or
not.

> What you are starting would need at least 3-5 years to catch up with what
> people currently already can do, and they'll improve in this time too.

Maybe 3-4 weeks more like. The discussion is helping me get a picture,
and when I'm back next week I'll try something. Then, unfortunately, I
am away again from the 18th ...

> I've seen your academic track record and it is surely impressive. I am not

I didn't even know it was available anywhere!
(Or that it was impressive - thank you.)

> saying that your approach won't work within the constraints. Given enough
> thrust, pigs fly. I'm just saying that it would be nice to learn what reasons
> you have for this, because I believe that "within the constraints" makes your
> proposal essentially useless (see the other mails).
>
> In particular, they make them useless for the requirements you seem to have. A
> petabyte filesystem without journaling? A petabyte filesystem with a single
> write lock? Gimme a break.

Journalling? Well, now you mention it, that would seem to be nice. But
my experience with journalling FS's so far tells me that they break
more horribly than normal. Also, 1PB or so is the aggregate, not the
size of each FS on the local nodes. I don't think you can diagnose
"journalling" from the numbers. I am even rather loath to journal,
given what I have seen.

> Please, do the research and tell us what features you desire to have which are
> currently missing, and why implementing them essentially from scratch is

No features. Just take any FS that currently works, and see if you can
distribute it. Get rid of all fancy features along the way.

You mean "what's wrong with X"? Well, it won't be mainstream, for a
start, and that's surely enough. The projects involved are huge, and
they need to minimize risk, and maximize flexibility. This is CERN, by
the way.

> preferrable to extending existing solutions.
>
> You are dancing around all the hard parts. "Don't have a distributed lock
> manager, have one central lock." Yeah, right, has scaled _really_ well in the
> past. Then you figure this one out, and come up with a lock-bitmap on the
> device itself for locking subtrees of the fs. Next you are going to realize
> that a single block is not scalable either because one needs exclusive write

I made suggestions, hoping that the suggestions would elicit a response
of some kind.
I need to explore as much as I can and get as much as I can back
without "doing it first", because I need the insight you can offer. I
don't have the experience in this area, and I have the experience to
know that I would need years of experience with that code to be able to
generate the semantics from scratch. I'm happy with what I'm getting. I
hope I'll be able to return soon with a trial patch.

> lock to it, 'cause you can't just rewrite a single bit. You might then begin
> to explore that a single bit won't cut it, because for recovery you'll need to
> be able to pinpoint all locks a node had and recover them. Then you might
> begin to think about the difficulties in distributed lock management and

There is no difficulty with that - there are no distributed locks. All
locks are held on the server of the disk (I decided not to be
complicated to begin with, as a matter of principle early in life ;-).

> recovery. ("Transaction processing" is an exceptionally good book on that I
> believe)

Thanks, but I don't feel like rolling it out and rolling it back!

> I bet you a dinner that what you are going to come up with will look
> frighteningly like one of the solutions which already exist; so why not

Maybe.

> research them first in depth and start working on the one you like most,
> instead of wasting time on an academic exercise?

Because I don't agree with your assessment of what I should waste my
time on. Though I'm happy to take it into account! Maybe twenty years
ago now I wrote my first disk based file system (for functional
databases) and I didn't like debugging it then! I positively hate the
thought of flattening trees and relating indices and pointers now :-).

> > So, start thinking about general mechanisms to do distributed storage.
> > Not particular FS solutions.
>
> Distributed storage needs a way to access it; in the Unix paradigm,
> "everything is a file", that implies a distributed filesystem.
Other > approaches would include accessing raw blocks and doing the locking in the > application / via a DLM (ie, what Oracle RAC does). Yep, but we want "normality", just normality writ a bit larger than normal. > Lars Marowsky-Br_e <lmb@suse.de> Thanks for the input. I don't know what I was supposed to take away from it though! Peter ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] mount flag "direct"
  2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
@ 2002-09-07 21:14 ` Lars Marowsky-Bree
  2002-09-08  9:23 ` Peter T. Breuer
  0 siblings, 1 reply; 28+ messages in thread

From: Lars Marowsky-Bree @ 2002-09-07 21:14 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-07T21:59:20, "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
>
> Eh, I thought I saw this - didn't I reply?

No, you didn't.

> > Noone appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
>
> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

Well, actually they aren't exactly. The hard part in a "distributed
filesystem" isn't the filesystem itself, while it is very necessary of
course. The locking, synchronization and cluster infrastructure is
where the real difficulty tends to arise.

Yes, it can be argued whether it is in fact easier to create a
filesystem from scratch with clustering in mind (so it is "optimised"
for being able to do fine-grained locking etc), or whether propping a
generic clustering layer on top of existing ones is. The guesstimate of
those involved in the past has seemed to suggest that the first is the
case. And I also tend to think this to be the case, but I've been
wrong.

That would - indeed - be very helpful research to do. I would start by
comparing the places where those specialized fs's actually are doing
cluster related stuff and checking whether it can be abstracted,
generalized and improved. In any case, trying to pick apart OpenGFS for
example will provide you more insight into the problem area than a
discussion on l-k.
If you want to look into "turn a local fs into a cluster fs", SGI has a
"clustered XFS"; I'm not too sure how public that extension is, but the
hooks might be in the common XFS core.

Now, going on with the gedankenexperiment: given a distributed lock
manager (IBM open-sourced one of theirs, though it is not currently
perfectly working ;), the locking primitives in the filesystems could
"simply" be changed from local-node SMP spinlocks to cluster-wide
locks. That _should_ to a large degree take care of the locking. What
remains is the invalidation of cache pages; I would expect similar
problems to have arisen in NC-NUMA style systems, so looking there
should provide hints.

> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this
> > implies.
> Because that's the simplest thing to do.

Why? I disagree. You will have to modify existing filesystems quite a
bit to work _efficiently_ in a cluster environment; not even the
on-disk layout is guaranteed to stay consistent as soon as you add
per-node journals etc. The real complexity is in the distributed
nature, in particular the recovery (see below).

The "simplest thing to do" might be to take your funding and give it to
the OpenGFS group, or have someone fix the Oracle Cluster FS.

> > In particular, they make them useless for the requirements you seem to
> > have. A petabyte filesystem without journaling? A petabyte filesystem
> > with a single write lock? Gimme a break.
> Journalling? Well, now you mention it, that would seem to be nice.

"Nice"? ;-) You gotta be kidding. If you don't have journaling,
distributed recovery becomes near impossible - at least I don't have a
good idea of how to do it if you don't know what the node had been
working on prior to its failure. If "take down the entire filesystem on
all nodes, run fsck" is your answer to that, I will start laughing in
your face.
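[Editor's note: the spinlock-to-cluster-lock conversion sketched above can be illustrated in miniature. The `ClusterLock` class below is entirely hypothetical - it is process-local and uses a condition variable where a real implementation would send grant/convert requests to a DLM - but it shows the mode-compatibility idea (PR = protected read, shared; EX = exclusive) that such a conversion would rest on:]

```python
import threading

# Classic DLM-style mode compatibility: readers share, writers don't.
COMPATIBLE = {
    ("PR", "PR"): True,
    ("PR", "EX"): False,
    ("EX", "PR"): False,
    ("EX", "EX"): False,
}

class ClusterLock:
    """Hypothetical stand-in for a DLM resource lock. The filesystem's
    call sites stay put; only the implementation behind them changes
    from a local spinlock to a named cluster-wide resource."""

    def __init__(self, name):
        self.name = name          # resource name, e.g. "inode-1234"
        self.holders = []         # modes currently granted
        self._cv = threading.Condition()

    def acquire(self, mode):
        with self._cv:
            # Wait until the requested mode is compatible with all holders.
            while any(not COMPATIBLE[(h, mode)] for h in self.holders):
                self._cv.wait()
            self.holders.append(mode)

    def release(self, mode):
        with self._cv:
            self.holders.remove(mode)
            self._cv.notify_all()

# Where the fs used a local spinlock on an inode, it would now do:
lk = ClusterLock("inode-1234")
lk.acquire("PR")   # many nodes may hold PR concurrently
lk.release("PR")
lk.acquire("EX")   # an EX holder excludes everyone else
lk.release("EX")
```

[What this sketch deliberately omits - lock conversion, lock value blocks, and recovery when a holder dies - is exactly where, per the discussion, the real difficulty lives.]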
Because then your requirements are kind of from outer space and will
certainly not reflect a large part of the user base.

> > Please, do the research and tell us what features you desire to have
> > which are currently missing, and why implementing them essentially from
> > scratch is
> No features.

So they implement what you need, but you don't like them because there
are just so few of them to choose from? Interesting.

> Just take any FS that currently works, and see if you can distribute it.
> Get rid of all fancy features along the way. The projects involved are
> huge, and they need to minimize risk, and maximize flexibility. This is
> CERN, by the way.

Well, you are taking quite a risk trying to run a
not-aimed-at-distributed-environments fs and trying to make it
distributed by force. I _believe_ that you are missing where the real
trouble lurks.

You maximize flexibility for mediocre solutions; little caching, no
journaling etc. What does this supposed "flexibility" buy you? Is there
any real value in it, or is it a "because!"?

> You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> and that's surely enough.

I have pulled these two sentences out because I don't get them. What
"X" are you referring to?

> of some kind. I need to explore as much as I can and get as much as I
> can back without "doing it first", because I need the insight you can
> offer.

The insight I can offer you is: look at OpenGFS, see and understand
what it does, why and how. Then try to come up with a generic approach
for putting this on top of a generic filesystem, without making it
useless. Then I shall be amazed.

> There is no difficulty with that - there are no distributed locks. All
> locks are held on the server of the disk (I decided not to be complicated
> to begin with as a matter of principle early in life ;-).

Maybe you and I have a different idea of a "distributed fs". I thought
you had a central pool of disks.
You want there to be local disks at each server, and other nodes can
read locally and have it appear as one big, single filesystem? You'll
still have to deal with node failure, though. Interesting.

One might consider peeling apart meta-data (which always goes through
the "home" node) and data (which goes directly to disk via the SAN); if
necessary, the reply to the meta-data request to the home node could
tell the node where to write/read. This smells a lot like cXFS and co
with a central metadata server.

> > recovery. ("Transaction Processing" is an exceptionally good book on
> > that I believe)
> Thanks but I don't feel like rolling it out and rolling it back!

Please explain how you'll recover anywhere close to "fast" or even
"acceptable" without transactions. Even if you don't have to fsck the
petabyte filesystem completely, do a benchmark on how long e2fsck takes
on, oh, 50 GB only.

> Thanks for the input. I don't know what I was supposed to take away
> from it though!

I apologize and am sorry if you didn't notice.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister
* Re: [RFC] mount flag "direct"
  2002-09-07 21:14 ` [RFC] mount flag "direct" Lars Marowsky-Bree
@ 2002-09-08  9:23 ` Peter T. Breuer
  2002-09-08  9:59   ` Lars Marowsky-Bree
  0 siblings, 1 reply; 28+ messages in thread

From: Peter T. Breuer @ 2002-09-08 9:23 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> > > In particular, they make them useless for the requirements you seem to
> > > have. A petabyte filesystem without journaling? A petabyte filesystem
> > > with a single write lock? Gimme a break.
> > Journalling? Well, now you mention it, that would seem to be nice.
>
> "Nice"? ;-) You gotta be kidding. If you don't have journaling,
> distributed recovery becomes near impossible - at least I don't have a
> good idea on how to

It's OK. The calculations are duplicated and the FSs are too. The
calculation is highly parallel.

> do it if you don't know what the node had been working on prior to its
> failure.

Yes we do. Its place in the topology of the network dictates what it
was working on, and anyway that's just a standard parallelism "barrier"
problem.

> Well, you are taking quite a risk trying to run a
> not-aimed-at-distributed-environments fs and trying to make it
> distributed by force. I _believe_ that you are missing where the real
> trouble lurks.

There is no risk, because, as you say, we can always use NFS or another
off-the-shelf solution. But 10% better is 10% more experiment for each
timeslot for each group of investigators.

> What does this supposed "flexibility" buy you? Is there any real value
> in it

Ask the people who might scream for 10% more experiment in their 2
weeks.

> > You mean "what's wrong with X"? Well, it won't be mainstream, for a
> > start, and that's surely enough.
>
> I have pulled these two sentences out because I don't get them. What "X"
> are you referring to?

Any X that is not a standard FS. Yes, I agree, not exact.
> The insight I can offer you is: look at OpenGFS, see and understand what
> it does, why and how. Then try to come up with a generic approach on how
> to put this on top of a generic filesystem, without making it useless.
>
> Then I shall be amazed.

I have to catch a plane ..

Peter
* Re: [RFC] mount flag "direct"
  2002-09-08  9:23 ` Peter T. Breuer
@ 2002-09-08  9:59 ` Lars Marowsky-Bree
  2002-09-08 16:46   ` Peter T. Breuer
  0 siblings, 1 reply; 28+ messages in thread

From: Lars Marowsky-Bree @ 2002-09-08 9:59 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On 2002-09-08T11:23:39, "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > do it if you don't know what the node had been working on prior to its
> > failure.
> Yes we do. Its place in the topology of the network dictates what it was
> working on, and anyway that's just a standard parallelism "barrier"
> problem.

I meant wrt what it had been working on in the filesystem. You'll need
to do a full fsck locally if it isn't journaled. Oh well.

Maybe it would help if you outlined your architecture as you see it
right now.

> > Well, you are taking quite a risk trying to run a
> > not-aimed-at-distributed-environments fs and trying to make it
> > distributed by force. I _believe_ that you are missing where the real
> > trouble lurks.
> There is no risk, because, as you say, we can always use nfs or another
> off the shelf solution.

Oh, so the discussion is a purely academic mind experiment; it would
have been helpful if you had told us in the beginning.

> But 10% better is 10% more experiment for each timeslot
> for each group of investigators.
> > What does this supposed "flexibility" buy you? Is there any real value
> > in it
> Ask the people who might scream for 10% more experiment in their 2 weeks.
> > > You mean "what's wrong with X"? Well, it won't be mainstream, for a
> > > start, and that's surely enough.
> > I have pulled these two sentences out because I don't get them. What
> > "X" are you referring to?
> Any X that is not a standard FS. Yes, I agree, not exact.

So, your extensions are going to be "more" mainstream than OpenGFS /
OCFS etc? What the hell have you been smoking?

It has become apparent in the discussion that you are optimizing for a
very rare special case.
OpenGFS, Lustre etc at least try to remain usable for generic
filesystem operation.

"Not mainstream" is what's wrong with _your_ approach, not with those
"off the shelf" solutions.

And your special "optimisations" (like no caching, no journaling...)
are supposed to be 10% _faster_ overall than these, which are - to a
certain extent - optimised for this case from the ground up?

One of us isn't listening while clue is knocking. Now it might be me,
but then I apologize for having wasted your time and will stand
corrected as soon as you have produced working code.

Until then, have fun. I feel like I am wasting both your and my time,
and this isn't strictly necessary.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister
* Re: [RFC] mount flag "direct"
  2002-09-08  9:59 ` Lars Marowsky-Bree
@ 2002-09-08 16:46 ` Peter T. Breuer
  0 siblings, 0 replies; 28+ messages in thread

From: Peter T. Breuer @ 2002-09-08 16:46 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-08T11:23:39, "Peter T. Breuer" <ptb@it.uc3m.es> said:
> > > do it if you don't know what the node had been working on prior to its
> > > failure.
> > Yes we do. Its place in the topology of the network dictates what it
> > was working on, and anyway that's just a standard parallelism "barrier"
> > problem.
>
> I meant wrt what it had been working on in the filesystem. You'll need to
> do a full fsck locally if it isn't journaled. Oh well.

Well, something like that anyway.

> Maybe it would help if you outlined your architecture as you see it right
> now.

I did in another post, I think. A torus with local 4-way direct
connectivity, with each node connected to three neighbours and
exporting one local resource and importing three more from its
neighbours. All shared. Add raid to taste.

> > There is no risk, because, as you say, we can always use nfs or another
> > off the shelf solution.
>
> Oh, so the discussion is a purely academic mind experiment; it would have
> been

Puhleeese try not to go off the deep end at an innocent observation.
Take the novocaine or something. I am just pointing out that there are
obvious safe fallbacks, AND ...

> helpful if you told us in the beginning.
> > But 10% better is 10% more experiment for each timeslot
> > for each group of investigators.

You see?

> > > you referring to?
> > Any X that is not a standard FS. Yes, I agree, not exact.
>
> So, your extensions are going to be "more" mainstream than OpenGFS / OCFS
> etc?

Quite possibly/probably. Let's see how it goes, shall we?

Do you want to shoot down returning the index of the inode in get_block
so that we can take a write lock on that index before the I/O to the
file takes place? Not sufficient in itself, but enough to be going on
with, and enough for FSs that are reasonable in what they do. Then we
need to drop the dcache entry nonlocally.

> What the hell have you been smoking?

Unfortunately nothing at all, let alone worthwhile.

> It has become apparent in the discussion that you are optimizing for a
> very

To you, perhaps, not to me. What I am thinking about is a data analysis
farm, handling about 20GB/s of input data in real time, with numbers of
nodes measured in the thousands, and network raided internally. Well,
you'd need a thousand nodes on the first ring alone just to stream to
disk at 20MB/s per node, and that will generate three to six times that
amount of internal traffic just from the raid. So aggregate bandwidth
in the first analysis ring has to be of the order of 100GB/s. If the
needs are special, it's because of the magnitude of the numbers, not
because of any special quality.

> rare special case. OpenGFS, Lustre etc at least try to remain usable for
> generic filesystem operation.
>
> That it won't be mainstream is what's wrong with _your_ approach, not
> with those "off the shelf" solutions.

I'm willing to look at everything.

> And your special "optimisations" (like, no caching, no journaling...) are
> supposed to be 10% _faster_ overall than these which are - to a certain
> extent

Yep. Caching looks irrelevant because we read once and write once, by
and large. You could argue that we write once and read once, which
would make caching sensible, but the data streams are so large as to
make it likely that caches would be flooded out anyway. Buffering would
be irrelevant except inasmuch as it allows for asynchronous operation.
And the network is so involved in this that I would really like to get
rid of the current VMS however I could (it causes pulsing behaviour,
which is most disagreeable).

> - from the ground up optimised for this case?
>
> One of us isn't listening while clue is knocking.

You have an interesting bedtime story manner.

> Now it might be me, but then I apologize for having wasted your time and
> will stand corrected as soon as you have produced working code.

Shrug.

> Until then, have fun. I feel like I am wasting both your and my time, and
> this isn't strictly necessary.

!! There's no argument. I'm simply looking for entry points to the
code. I've got a lot of good information, especially from Anton (and
other people!), that I can use straight off. My thanks for the
insights.

Peter
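[Editor's note: the get_block idea above - take a write lock keyed on the block index before the I/O, then invalidate remotely - might be sketched as follows. The lock table, the journal list, and the invalidation hook are all hypothetical illustrations, not kernel code:]

```python
import threading
from collections import defaultdict

class BlockLockTable:
    """Per-block-index write locks, keyed by whatever index a
    get_block-style lookup would return. Purely illustrative."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def wlock(self, index):
        with self._guard:          # serialize lock-table lookups
            lk = self._locks[index]
        lk.acquire()               # then take the per-block lock
        return lk

table = BlockLockTable()
log = []  # records the order of operations, for the sketch only

def write_block(index, data):
    lk = table.wlock(index)        # lock before the I/O, as proposed
    try:
        log.append(("write", index, data))
        # ... submit the request to the block layer here ...
    finally:
        lk.release()
        # A real version would now tell the other nodes to drop their
        # dcache/page-cache entries covering this block.
        log.append(("invalidate", index))

write_block(7, b"data")
```

[As the thread notes, this alone is not sufficient - directory operations and metadata still need their own serialization - but it captures the per-index locking step.]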
end of thread
Thread overview: 28+ messages
-- links below jump to the message on this page --
2002-09-03 15:01 [RFC] mount flag "direct" Peter T. Breuer
2002-09-03 15:13 ` Rik van Riel
2002-09-03 15:53 ` Maciej W. Rozycki
2002-09-03 16:04 ` Peter T. Breuer
2002-09-03 16:08 ` Rik van Riel
2002-09-03 15:16 ` jbradford
2002-09-03 15:37 ` Anton Altaparmakov
2002-09-03 15:44 ` Peter T. Breuer
2002-09-03 16:23 ` Lars Marowsky-Bree
2002-09-03 16:41 ` Peter T. Breuer
2002-09-03 17:07 ` David Lang
2002-09-03 17:30 ` Peter T. Breuer
2002-09-03 17:40 ` David Lang
2002-09-04 5:57 ` Helge Hafting
2002-09-04 6:21 ` Peter T. Breuer
2002-09-04 6:49 ` Helge Hafting
2002-09-04 9:15 ` Peter T. Breuer
2002-09-04 11:34 ` Helge Hafting
2002-09-03 17:26 ` Rik van Riel
2002-09-03 18:02 ` Andreas Dilger
2002-09-03 18:44 ` Daniel Phillips
2002-09-03 17:29 ` Jan Harkes
2002-09-03 18:31 ` Daniel Phillips
2002-09-03 18:20 ` Daniel Phillips
[not found] <20020907164631.GA17696@marowsky-bree.de>
2002-09-07 19:59 ` [lmb@suse.de: Re: [RFC] mount flag "direct" (fwd)] Peter T. Breuer
2002-09-07 21:14 ` [RFC] mount flag "direct" Lars Marowsky-Bree
2002-09-08 9:23 ` Peter T. Breuer
2002-09-08 9:59 ` Lars Marowsky-Bree
2002-09-08 16:46 ` Peter T. Breuer