* fixing redundant network opens on Linux file creation
From: Steven French @ 2003-01-06 17:25 UTC
To: samba-technical, linux-fsdevel

The creat() system call results (for the Linux kernel) in a call to create (via vfs_create) and then later a call to open (via dentry_open). For the cifs vfs, both of these eventually end up doing a network open of the file from the perspective of the CIFS protocol, which degrades performance (because every creat does one more open and close than is ideal). In the cifs protocol, file creation is handled as a flag on the open request, so create has the side effect of opening the file. Unfortunately, since mknod can call vfs_create (presumably without calling open immediately afterwards), it seems like a vfs can't assume that every create will be immediately followed by a file open (server file handle leaks would be possible if such an assumption were made). smbfs in effect ignores the subsequent open, and the nfs vfs doesn't have this problem because it doesn't send a remote open request in nfs_open (since v2 and v3 nfs don't really need an open file handle for file based operations like smb/cifs does).

To improve creat() performance for cifs (without changing namei.c itself) it seems like there are only two obvious alternatives:

1) Have the cifs vfs ignore subsequent opens of the same file (never have more than one open per inode, a la smbfs). This has the disadvantage of making the open flags (and pid) incorrect for subsequent opens; it would cause server problems with handling byte range locks, and it potentially causes problems with other clients accessing a file that was just created via mknod and therefore should not be considered open anymore.

2) Have the cifs vfs do "lazy close" of files, perhaps using the original "opbatch" distributed caching mechanism in the smb/cifs protocol (which cached opens for optimal performance running batch files on network drives) for distributed cache management (so the client will not cause sharing violations if other clients try to access the same file).

I prefer the latter but am working on proving that it works now. Any other approaches?

Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench@us.ibm.com
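The cost described in this message, and the effect of the "lazy close" alternative, can be sketched with a toy request counter. This is a minimal sketch, not cifs code: FakeServer, NaiveClient, LazyCloseClient and flush_parked are hypothetical names, and the model assumes the handle left open by create can simply be handed to the next open of the same path.

```python
"""Toy model of the extra CIFS open on creat(), and of a "lazy close" fix.

All names here (FakeServer, NaiveClient, LazyCloseClient) are hypothetical;
this only counts protocol requests and is not how the cifs vfs is written.
"""


class FakeServer:
    """Counts open and close requests that cross the wire."""

    def __init__(self):
        self.opens = 0
        self.closes = 0
        self.next_handle = 1

    def open(self, path, create=False):
        self.opens += 1
        handle = self.next_handle
        self.next_handle += 1
        return handle

    def close(self, handle):
        self.closes += 1


class NaiveClient:
    """create opens and immediately closes; open then opens again."""

    def __init__(self, server):
        self.server = server

    def vfs_create(self, path):
        handle = self.server.open(path, create=True)
        self.server.close(handle)

    def vfs_open(self, path):
        return self.server.open(path)


class LazyCloseClient:
    """create parks the handle; a following open reuses it."""

    def __init__(self, server):
        self.server = server
        self.parked = {}  # path -> handle left open by create

    def vfs_create(self, path):
        self.parked[path] = self.server.open(path, create=True)

    def vfs_open(self, path):
        handle = self.parked.pop(path, None)
        if handle is None:
            handle = self.server.open(path)
        return handle

    def flush_parked(self):
        """Close handles nobody claimed, e.g. after mknod or on a timer."""
        for handle in self.parked.values():
            self.server.close(handle)
        self.parked.clear()


def creat(client, path):
    """The sequence the message describes: create, then open."""
    client.vfs_create(path)
    return client.vfs_open(path)
```

In this model the naive client sends two opens and one close before the caller can use the file, while the lazy-close client sends one open. The model deliberately ignores open flags and sharing modes, which is exactly where the message expects the real difficulty to be.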
* Re: fixing redundant network opens on Linux file creation
From: Richard Sharpe @ 2003-01-06 18:14 UTC
To: Steven French; +Cc: samba-technical, linux-fsdevel

On Mon, 6 Jan 2003, Steven French wrote:

> The creat() system call results (for the Linux kernel) in calls to create (via vfs_create) then later a call to open (via dentry_open) both of which eventually end up (for the cifs vfs) doing a network open of the file from the perspective of the CIFS protocol which degrades performance (because every creat does one additional open & close than ideal). In the cifs protocol file creation is handled as a flag on the open request so create has a side effect of opening the file. Unfortunately since mknod can call vfs_create (presumably without immediately afterwards calling open), it seems like a vfs can't assume that all creates are necessarily going to be immediately followed by a file open (server file handle leaks would be possible if such an assumption were made). smbfs in effect ignores the subsequent open and the nfs vfs doesn't have this problem because it doesn't send a remote open request in nfs_open (since v2 and v3 nfs doesn't really need an open file handle for file based operations like smb/cifs does). To improve creat() performance for cifs (without changing namei.c itself) it seems like there are only two obvious alternatives:

Isn't creat() a legacy call? I have never used it, and use open(..., O_CREAT,...) instead.

Isn't this just a cost of using legacy calls? Why complicate things overly for a call that might not be used all that much?

> 1) Have the cifs vfs ignore subsequent opens of the same file (never have more than one open per inode - ala smbfs) - which has the disadvantage of making the open flags (and pid) incorrect for subsequent opens and would cause server problems with handling byte range locks and potentially causes problems with other clients accessing a file that was just created via mknod and therefore should not be considered open anymore.
>
> 2) Have the cifs vfs do "lazy close" of files - perhaps using the original "opbatch" distributed caching mechanism in the smb/cifs protocol (which cached opens for optimal performance running batch files on network drives) for distributed cache management (so the client will not cause sharing violations if other clients try to access the same file).
>
> I prefer the latter but am working on proving that it works now. Any other approaches?
>
> Steve French
> Senior Software Engineer
> Linux Technology Center - IBM Austin
> phone: 512-838-2294
> email: sfrench@us.ibm.com

--
Regards
-----
Richard Sharpe, rsharpe[at]ns.aus.com, rsharpe[at]samba.org, sharpe[at]ethereal.com, http://www.richardsharpe.com
* Re: fixing redundant network opens on Linux file creation
From: Jan Hudec @ 2003-01-06 17:59 UTC
To: Richard Sharpe; +Cc: Steven French, samba-technical, linux-fsdevel

On Mon, Jan 06, 2003 at 10:14:10AM -0800, Richard Sharpe wrote:

> On Mon, 6 Jan 2003, Steven French wrote:
>
> > The creat() system call results (for the Linux kernel) in calls to create (via vfs_create) then later a call to open (via dentry_open) both of which eventually end up (for the cifs vfs) doing a network open of the file from the perspective of the CIFS protocol which degrades performance (because every creat does one additional open & close than ideal). In the cifs protocol file creation is handled as a flag on the open request so create has a side effect of opening the file. Unfortunately since mknod can call vfs_create (presumably without immediately afterwards calling open), it seems like a vfs can't assume that all creates are necessarily going to be immediately followed by a file open (server file handle leaks would be possible if such an assumption were made). smbfs in effect ignores the subsequent open and the nfs vfs doesn't have this problem because it doesn't send a remote open request in nfs_open (since v2 and v3 nfs doesn't really need an open file handle for file based operations like smb/cifs does). To improve creat() performance for cifs (without changing namei.c itself) it seems like there are only two obvious alternatives:
>
> Isn't creat() a legacy call? I have never used it, and use open(..., O_CREAT,...) instead.
>
> Isn't this just a cost of using legacy calls? Why complicate things overly for a call that might not be used all that much?

I am not sure what "legacy call" means, but I am pretty sure that creat and open(... O_CREAT) end up calling exactly the same filesystem methods with exactly the same parameters. (First lookup is called, and it does not know what is to happen to the file; then create is called, and it does not know the open mode for the file; and last, open is called with the appropriate mode.)

> > 1) Have the cifs vfs ignore subsequent opens of the same file (never have more than one open per inode - ala smbfs) - which has the disadvantage of making the open flags (and pid) incorrect for subsequent opens and would cause server problems with handling byte range locks and potentially causes problems with other clients accessing a file that was just created via mknod and therefore should not be considered open anymore.
> >
> > 2) Have the cifs vfs do "lazy close" of files - perhaps using the original "opbatch" distributed caching mechanism in the smb/cifs protocol (which cached opens for optimal performance running batch files on network drives) for distributed cache management (so the client will not cause sharing violations if other clients try to access the same file).
> >
> > I prefer the latter but am working on proving that it works now. Any other approaches?

There is a lookup intent patch from the lustre group. It can be found somewhere in the archives. Pushing that (or something along those lines) to mainline and using it would be IMHO most beneficial (because all networking filesystems could benefit from this patch). However, that does not fall into the category "not changing namei.c".

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <bulb@ucw.cz>
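At the system call level, the equivalence claimed above is easy to state: POSIX defines creat(path, mode) as open(path, O_WRONLY | O_CREAT | O_TRUNC, mode). The following is a minimal sketch of that equivalence from user space; the helper names creat and demo are made up for the example (Python has no os.creat), and it says nothing about which filesystem methods the kernel calls underneath.

```python
import os
import tempfile

# POSIX defines creat(path, mode) as open(path, O_WRONLY | O_CREAT | O_TRUNC, mode).
CREAT_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC


def creat(path, mode=0o644):
    """Python has no os.creat; this spells out the equivalent open() call."""
    return os.open(path, CREAT_FLAGS, mode)


def demo():
    """Create a file, write to it, then creat() it again and report sizes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "example.txt")
        fd = creat(path)
        os.write(fd, b"hello")
        os.close(fd)
        size_after_write = os.path.getsize(path)
        # A second creat() on an existing file truncates it (O_TRUNC).
        os.close(creat(path))
        size_after_recreate = os.path.getsize(path)
    return size_after_write, size_after_recreate
```

Since both spellings reach the kernel as an open with O_CREAT set, avoiding creat() in applications would not by itself avoid the create-then-open sequence discussed in this thread.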
* Re: fixing redundant network opens on Linux file creation
From: Bryan Henderson @ 2003-01-06 19:42 UTC
To: Jan Hudec; +Cc: linux-fsdevel, Richard Sharpe, samba-technical, Steven French

>There is a lookup intent patch from the lustre group. It can be found somewhere in the archives. Pushing that (or something along those lines) to mainline and using it would be IMHO most beneficial

Better still would be to add a "create-and-open" VFS call and have namei use it. This solves a number of problems, including the fact that it is impossible to correctly implement an exclusive create and open with a shared filesystem (because between when Linux confirms that the file doesn't exist and when Linux does the VFS create, another system may have created the file).

"Intent," as it's generally understood, is not a promise of future activity -- it's either a hint to improve efficiency or it's a promise to restrict future activity, but it should be possible simply to bail out before exercising that intent. E.g. you can't open a file at the same time as it is looked up just because the looker upper says he intends to open it later.
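The exclusive-create race described above can be sketched as follows. This is a minimal sketch under stated assumptions: SharedServer and racy_exclusive_create are hypothetical names for a server that exposes lookup and create as separate requests, and the interloper callback stands in for a second system acting between the two. The exclusive_create helper shows the local system call that a create-and-open VFS operation would need to back.

```python
import os
import tempfile


def exclusive_create(path):
    """Atomic from the caller's view: fails if the name already exists."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)


class SharedServer:
    """Hypothetical file server with separate lookup and create requests."""

    def __init__(self):
        self.names = set()

    def lookup(self, name):
        return name in self.names

    def create(self, name):
        # Plain create: succeeds whether or not the name exists.
        self.names.add(name)

    def create_exclusive(self, name):
        # Single request: the existence check and the create happen together.
        if name in self.names:
            return False
        self.names.add(name)
        return True


def racy_exclusive_create(server, name, interloper=None):
    """Client-side check followed by a separate create; another client may
    run in between (modelled by the optional interloper callback)."""
    if server.lookup(name):
        return False
    if interloper is not None:
        interloper()
    server.create(name)
    return True  # claims exclusivity even if the interloper won
```

With an interloper that creates the same name between the two steps, racy_exclusive_create still reports success, which is the failure mode the message describes; create_exclusive cannot be fooled that way because the check and the create are one request.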
* Re: fixing redundant network opens on Linux file creation
From: Jan Harkes @ 2003-01-06 19:56 UTC
To: linux-fsdevel

On Mon, Jan 06, 2003 at 11:42:26AM -0800, Bryan Henderson wrote:

> >There is a lookup intent patch from the lustre group. It can be found somewhere in the archives. Pushing that (or something along those lines) to mainline and using it would be IMHO most beneficial
>
> Better still would be to add a "create-and-open" VFS call and have namei use it. This solves a number of problems, including the fact that it is impossible to correctly implement an exclusive create and open with a shared filesystem (because between when Linux confirms that the file doesn't exist and when Linux does the VFS create, another system may have created the file).

But create is a directory operation, while open is an operation on a file. It is just a matter of convenience that open(..., O_CREAT) happens to create the directory entry if it doesn't yet exist. Logically they shouldn't be combined.

Perhaps the exclusive create could lock the object and pass that info on to the associated open. In Coda these objects are named 'virgin files', and they have some special properties, such as being able to write to a file you were allowed to create even when the ACLs are set so that you have no write permission.

I sometimes wish that file creation had been done the other way around.

- Open/create a new, unnamed object, which gives a file handle.
- Link this open handle into the filesystem's namespace.

That way the application can lock the object, or write the data to it, etc. before making it visible to the world. This might have solved some of the possible inconsistencies for networked filesystems and is probably more resilient wrt. symlink attacks.

Jan
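The "create first, name it afterwards" order wished for above can be approximated from user space. This is a minimal sketch, assuming a filesystem that supports hard links; the function name publish is made up, and the object is not truly unnamed here, since it carries a temporary name until it is linked in.

```python
import os
import tempfile


def publish(directory, name, data):
    """Write data to a temporary name, then link it in under its final name.

    Returns True if the final name was created, False if it already existed.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, data)
        os.fsync(fd)
        try:
            # link() fails with EEXIST instead of replacing an existing name.
            os.link(tmp_path, os.path.join(directory, name))
            return True
        except FileExistsError:
            return False
    finally:
        os.close(fd)
        os.unlink(tmp_path)
```

The file becomes visible under its final name only after its contents are written. The False return is the awkward case: someone else took the name in the meantime, and the caller has to decide what to do with the work already done.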
* Re: fixing redundant network opens on Linux file creation
From: Bryan Henderson @ 2003-01-06 21:58 UTC
To: Jan Harkes; +Cc: linux-fsdevel

>But create is a directory operation,

But it isn't. Here's the thing: directories and files are intimately tied together in Unix. I often wish they weren't. If they weren't, as in VMS, creating a directory entry and creating a file would be independent operations. Most of a lookup would be done above the kernel. System calls would address filesystem objects by inode number.

>It is just a matter of convenience that open(..., O_CREAT) happens to create the directory entry if it doesn't yet exist. Logically they shouldn't be combined.

Logically, POSIX shouldn't require them to be combined, but it does.

An alternative to having an atomic create-and-open VFS call would be to define VFS lock/unlock directory calls. For a shared filesystem, a central lock manager would have to coordinate these locks among the various systems -- and deal with the problems of systems crashing or dropping off the network while holding a directory lock. Considerably more implementation work.

>Perhaps having the exclusive create lock the object and pass that info on to the associated open.

I don't see how this is different from just having the create open the file. You still have the call that adds an entry to a directory also doing a file operation (creating a file) and then another file operation (locking the file). Might as well just let it open the file.

>I sometimes wish that file creation would have been done the other way around.
>
>- Open/create a new, unnamed object, which gives a file handle.
>- Link this open handle into the filesystem's namespace.

Assuming the POSIX directory-file binding, this has a similar problem. User asks to open a file and create it if "it" doesn't already exist. namei determines the file doesn't exist, so creates and opens a new, unnamed file. Another system then creates the file and adds it to the directory. namei now goes to add the file it created into the directory, but can't. Now what?

Incidentally, AIX VFS has the create-and-open call (consistent with the system call interface, creating is done just by flags on an open VFS call).
* Re: fixing redundant network opens on Linux file creation
From: Andreas Dilger @ 2003-01-06 21:31 UTC
To: Bryan Henderson
Cc: Jan Hudec, linux-fsdevel, Richard Sharpe, samba-technical, Steven French, Lustre Development Mailing List

On Jan 06, 2003 11:42 -0800, Bryan Henderson wrote:

> >There is a lookup intent patch from the lustre group. It can be found somewhere in the archives. Pushing that (or something along those lines) to mainline and using it would be IMHO most beneficial
>
> "Intent," as it's generally understood, is not a promise of future activity -- it's either a hint to improve efficiency or it's a promise to restrict future activity, but it should be possible simply to bail out before exercising that intent. E.g. you can't open a file at the same time as it is looked up just because the looker upper says he intends to open it later.

In our code, the lookup-with-intent actually performs both of the operations on the server, and it is up to the client methods to detect that the operation was done and deal with it appropriately. We have very well-tested code for 2.4, and the 2.5 code is mostly functional (2.5 is a much neater implementation, but the changes mean that it isn't yet as functional as the 2.4 code).

In the Lustre code, the premise is that the lookup-with-intent operation (called lookup2 for now) does one of:

1) the lookup + operation on the server in one RPC (i.e. lookup+create[+open], lookup+unlink, lookup+rename) and tells the client "I just did this for you, here are the attributes of the new entry and a lock on it if necessary, please fix up your local state to match", and the actual VFS operations are only doing the post-facto state cleanup.

2) OR it returns a lock to the client that grants the client exclusive control over the item(s) in question (normally the parent dir(s)) and lets the client do the operations locally and send the operations to the server separately.

We implement only (1) right now, but the goal is to implement (2) in the future (which would be back to nearly what the VFS currently does, except that we are now granted the locks in advance) so that a client can do many operations locally without the need for getting lots of locks.

For example, in the future, a Lustre client creating a new directory could be granted the lock on that directory, and it could then create files in that directory very efficiently without further RPCs to the server (e.g. untarring a file) until another client revokes the lock(s) and forces the client to flush all of its updates to the server.

For an updated version of the intent patch, see:

ftp://ftp.lustre.org/pub/kernels/patches/2.4.18-hp1_pnnl18_l5.patch
ftp://ftp.lustre.org/pub/kernels/patches/37chaos-l5.patch

The first patch is good for vanilla kernels, and the second for RH 2.4.18-17ish kernels. There is a bit of extra stuff therein which isn't really related to the intent changes.

Cheers, Andreas

PS - I've added lustre-devel to this thread so that the Lustre developers also see any discussion related to the intent changes.

--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
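The two outcomes of lookup-with-intent described in this message can be sketched with a toy client and server. This is a minimal sketch and not Lustre code: Intent, Server, Client, sys_create and every field name are hypothetical, and the model only tracks which side performed the create and how many RPCs that took.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """What the caller is about to do with the name it is looking up."""
    op: str                 # e.g. "create", "unlink", "getattr"
    executed: bool = False  # set when the server already did the op
    lock: str = ""          # set when the server handed back a lock instead
    attrs: dict = field(default_factory=dict)


class Server:
    def __init__(self, execute_intents=True):
        self.execute_intents = execute_intents
        self.entries = {}
        self.rpcs = 0

    def lookup_with_intent(self, parent, name, intent):
        self.rpcs += 1
        if self.execute_intents:
            # Mode (1): do lookup + operation in one RPC.
            if intent.op == "create":
                self.entries.setdefault((parent, name), {"size": 0})
            intent.attrs = dict(self.entries.get((parent, name), {}))
            intent.executed = True
        else:
            # Mode (2): grant a lock on the parent; the client does the work.
            intent.lock = "exclusive:" + parent
        return intent

    def create(self, parent, name):
        self.rpcs += 1
        self.entries.setdefault((parent, name), {"size": 0})


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # local view of (parent, name) -> attrs

    def sys_create(self, parent, name):
        intent = self.server.lookup_with_intent(parent, name, Intent("create"))
        if intent.executed:
            # Post-facto fix-up only: record what the server already did.
            self.cache[(parent, name)] = intent.attrs
        else:
            assert intent.lock, "server must execute the intent or grant a lock"
            self.server.create(parent, name)
            self.cache[(parent, name)] = {"size": 0}
        return self.cache[(parent, name)]
```

In mode (1) the create costs one RPC and the client only records the result. In mode (2) this toy sends a second RPC for every create, whereas the message anticipates a client holding the directory lock and batching many local operations before flushing them.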
* Re: fixing redundant network opens on Linux file creation
From: Bryan Henderson @ 2003-01-06 22:23 UTC
To: Andreas Dilger
Cc: Jan Hudec, linux-fsdevel, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French

>In our code, the lookup-with-intent actually performs both of the operations on the server,

What I don't get is why is the concept of "intent" even involved here? If lookup-with-intent does the lookup and open (and, I guess, create where appropriate), why don't you call it "lookup-and-open" and then skip the subsequent VFS open call?

You also mention the distributed version of the Lustre lookup-with-intent:

>OR it returns a lock to the client that grants the client exclusive control over the item(s) in question (normally the parent dir(s)) and lets the client do the operations locally and send the operations to the server separately.

and the same question applies in that case. While the client may do the open separately, there's no reason it shouldn't do it before returning from the VFS lookup-with-intent call, which means it would be simpler as a lookup-and-open.
* Re: fixing redundant network opens on Linux file creation
From: Andreas Dilger @ 2003-01-06 22:48 UTC
To: Bryan Henderson
Cc: Jan Hudec, linux-fsdevel, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French

On Jan 06, 2003 14:23 -0800, Bryan Henderson wrote:

> >In our code, the lookup-with-intent actually performs both of the operations on the server,
>
> What I don't get is why is the concept of "intent" even involved here? If lookup-with-intent does the lookup and open (and, I guess, create where appropriate), why don't you call it "lookup-and-open" and then skip the subsequent VFS open call?

Because the intent code is much more than just "lookup-and-open". It is also lookup-and-create, lookup-and-mkdir, lookup-and-unlink, lookup-and-setattr, etc. I don't think we want separate VFS ops for every possible VFS op.

Also, in the Linux VFS, the lookup call is the one which is actually doing the locking on the appropriate objects for atomicity purposes, which is actually the critical thing here - we use lookup2 for doing the distributed locking as much as for the RPC savings. As another Lustre developer remarked, "it's a lock-with-intent on the wire, and a lookup-with-intent in the kernel".

> You also mention the distributed version of the Lustre lookup-with-intent:
>
> >OR it returns a lock to the client that grants the client exclusive control over the item(s) in question (normally the parent dir(s)) and lets the client do the operations locally and send the operations to the server separately.
>
> and the same question applies in that case. While the client may do the open separately, there's no reason it shouldn't do it before returning from the VFS lookup-with-intent call, which means it would be simpler as a lookup-and-open.

There are several reasons we still do a VFS open call after we do the lookup-with-intent:

1) like I said above, we don't want to have 2x every VFS op (one with lookup and another without) either in our code or in the VFS proper
2) the amount of changes needed to the VFS would be quite large, if it had to determine whether it should do a lookup-with-intent and no regular op, or the lookup + regular op
3) doing lookup-with-intent allows us to manage the locking internal to the filesystem however we want instead of having to live within the VFS's ideas of locking (e.g. we could split up the locks within a single directory so you could do concurrent creates/renames/unlinks in a single directory if we so choose, and we may).

The thing to focus on here is that lookup2 is as much a locking API as it is a lookup+operation API. I was thinking a clever name for it would be "loockup", but that has some unfortunate connotations ;-).

Cheers, Andreas

--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: fixing redundant network opens on Linux file creation
From: Bryan Henderson @ 2003-01-07 1:06 UTC
To: Andreas Dilger
Cc: Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French

>Because the intent code is much more than just "lookup-and-open". It is also lookup-and-create, lookup-and-mkdir, lookup-and-unlink, lookup-and-setattr, etc. I don't think we want separate VFS ops for every possible VFS op.

That's really orthogonal to this discussion. If you want to conserve the number of VFS operation routines, you can have a single routine with parameters for a dozen different operations whether it is lookup-with-intent or lookup-and-do. Pretty much the only difference in the C code is the name of the routine.

But my discomfort with the lookup-with-intent approach is focused on the open/create operation in particular. From what I can tell, these intents are more than just declaration of intent. They're promises. If the VFS caller did a lookup with intent to create if not found, and then didn't follow through on that intent, I guess that would cause trouble on Lustre since the implementation of lookup-with-intent actually created the file.

That's not the concept of intent declaration as I've seen it everywhere else. Something like "open with write intent" always means either "open the file and I won't do anything but write to it," or "open the file and I'll probably be writing to it," but never "open the file and the next thing you see from me will be a write of 10 bytes at offset 20."

Another thing the structure of this "intent" interface says to me is that a filesystem driver might choose in some cases not to open the file but wait until the open is actually requested. If so, doesn't the filesystem driver have to maintain some cognizance of the thread of file accesses, so it can match up an open with a previous lookup-with-intent and know if that particular open is already done? That kind of state has always been intentionally omitted from the VFS interface.
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
From: Mike Shaver @ 2003-01-07 13:19 UTC
To: Bryan Henderson
Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French

On Jan 06, Bryan Henderson wrote:

> That's really orthogonal to this discussion. If you want to conserve the number of VFS operation routines, you can have a single routine with parameters for a dozen different operations whether it is lookup-with-intent or lookup-and-do. Pretty much the only difference in the C code is the name of the routine.

That may be true, but the invasiveness of the change to the Linux VFS would likely be much greater. Our intent patches are pretty small, and therefore much easier to port between versions, as well as more likely to be integrated into 2.5/2.6.

> But my discomfort with the lookup-with-intent approach is focused on the open/create operation in particular. From what I can tell, these intents are more than just declaration of intent. They're promises. If the VFS caller did a lookup with intent to create if not found, and then didn't follow through on that intent, I guess that would cause trouble on Lustre since the implementation of lookup-with-intent actually created the file.

Do you use "the VFS caller" to mean "the code that calls into the VFS", or "the caller of the intent-handling operations, which is the VFS"? It's my understanding that these changes are transparent to the caller of the VFS, but if the VFS itself were to "abort" halfway we might well have a problem. Not because something created the file, but because we wouldn't necessarily clean up the intent structures correctly. I expect that this is a soluble problem, at the expense of more changes to the VFS.

We haven't seen any problems with "aborted intent" in part because we don't depend on the caller-into-the-VFS to cooperate; the VFS itself completes the intent protocol correctly, every time, in no small part because the intent is declarative and binding, rather than just speculative.

> That's not the concept of intent declaration as I've seen it everywhere else. Something like "open with write intent" always means either "open the file and I won't do anything but write to it," or "open the file and I'll probably be writing to it," but never "open the file and the next thing you see from me will be a write of 10 bytes at offset 20."

Is the objection really just to the terminology, then? JFS, VxFS and NetApp seem to use "intent logging" to mean something similar ("I will be doing this next", rather than "I might be doing this next, but maybe not"). Maybe I misunderstand the intent log, though, and the time at which it gets updated. It certainly does seem to describe fact rather than a fallible expectation.

The origin of the intent stuff is really, to my understanding, in the locking: the client requests a lock with the declared intent of performing some other FS operation (getattr, create of a child, etc.). The presence of that intent information, in the form of a fully-specified FS operation, is what permits the server to perform the desired operation on behalf of the client, where system performance would be degraded unacceptably by giving one client an exclusive lock on a contended resource. That we have intent-driven behaviour in lookup/lookup2 is largely due to the fact that it's in lookup that we need to acquire our locks.

> Another thing the structure of this "intent" interface says to me is that a filesystem driver might choose in some cases not to open the file but wait until the open is actually requested. If so, doesn't the filesystem driver have to maintain some cognizance of the thread of file accesses, so it can match up an open with a previous lookup-with-intent and know if that particular open is already done? That kind of state has always been intentionally omitted from the VFS interface.

I think it's that state, specifically, that's represented by the intent parameters added to the various ops. I understand that it was a design compromise motivated in no small part by the desire to minimize changes to the Linux VFS at this stage. I'm not at all certain that we would structure things in this form if we were writing an intent-enabled VFS from first principles.

Mike
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
From: Bryan Henderson @ 2003-01-07 17:28 UTC
To: Mike Shaver
Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French

>Is the objection really just to the terminology, then?

Partly the terminology, and partly the things that the terminology implies. Those who support this name must do so because they expect the interface to have some properties of intent. But I've argued that it cannot be a true intent declaration, and therefore any approximation to intent will cause trouble.

>JFS, VxFS and NetApp seem to use "intent logging" to mean something similar ("I will be doing this next", rather than "I might be doing this next, but maybe not"). Maybe I misunderstand the intent log, though, and the time at which it gets updated. It certainly does seem to describe fact rather than a fallible expectation.

I'm not a big fan of this use of the word "intent" either, and in fact the technique it refers to is often called other things. But it definitely _is_ a case where the intent can be abandoned. That's the whole point -- you log an intent to create a file, but don't actually commit to creating it. If the system should crash before all the corequisites of that creation are complete, the file ends up never having been created. In contrast, the proposed Linux lookup-with-intent scheme appears actually to irrevocably create a file as soon as the "intent" to create it is declared.

>I understand that it was a design compromise motivated in no small part by the desire to minimize changes to the Linux VFS at this stage.

That explanation makes a lot of sense. What it really boils down to is that the parameters aren't so much a declaration of intent as a revelation of the context in which the caller is making the call. In other words, a contravention of modularity. That is always a minimal-lines-of-code solution to a protocol problem. It's a heavy design tradeoff, though.

>Do you use "the VFS caller" to mean "the code that calls into the VFS", or "the caller of the intent-handling operations, which is the VFS"?

One of the irritating things about Linux filesystem discussions is the diversity of terminology. Several of the most key terms, including "VFS", are used to mean multiple very different things. In this case, you are clearly using "VFS" to mean the Linux code found in the 'fs' directory. That's common, but it is also common to use it to refer to the code found in directories such as 'fs/ext2'.

I don't find either of those definitions useful. To me, VFS has always been the name of the protocol that said pieces of code use to talk to each other. And it applies in general to all operating systems that have such an interface inside them. The name "FS" works better for the code in the 'fs' directory (not just because that's what the directory is called, but also because the oldest documents describing it call it that). The term "filesystem driver" is far more descriptive, unambiguous, and universal for the code in fs/ext2. But people most often refer to that code as "a filesystem." Along with 5 other things they use the word "filesystem" for.

Note that in theory, the invoker of the VFS protocol operations could be anything, and the filesystem driver should not care. Even in practice, it is not always the 'fs/' component. Sometimes it is the NFS server code.
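The sense of "intent" used in intent logging, where a logged intent can still be abandoned, can be sketched with a toy write-ahead log. This is a minimal sketch under stated assumptions: IntentLog and recover are hypothetical names, and nothing here reflects how JFS, VxFS or NetApp actually lay out or replay their logs.

```python
class IntentLog:
    """Hypothetical write-ahead log: an intent only counts once committed."""

    def __init__(self):
        self.records = []

    def begin(self, txn_id, action, name):
        self.records.append(("begin", txn_id, action, name))

    def commit(self, txn_id):
        self.records.append(("commit", txn_id))


def recover(log):
    """Replay the log after a crash, applying only committed intents."""
    committed = {rec[1] for rec in log.records if rec[0] == "commit"}
    files = set()
    for rec in log.records:
        if rec[0] == "begin" and rec[1] in committed:
            _, _, action, name = rec
            if action == "create":
                files.add(name)
            elif action == "remove":
                files.discard(name)
    return files
```

An intent that was logged but never committed leaves no trace after recovery. That is the contrast being drawn with lookup-with-intent, where the create is described as taking effect on the server as soon as the intent is sent.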
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation 2003-01-07 17:28 ` Bryan Henderson @ 2003-01-07 18:50 ` Andreas Dilger 2003-01-08 17:52 ` Bryan Henderson 0 siblings, 1 reply; 19+ messages in thread From: Andreas Dilger @ 2003-01-07 18:50 UTC (permalink / raw) To: Bryan Henderson Cc: Mike Shaver, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Steven French On Jan 07, 2003 09:28 -0800, Bryan Henderson wrote: > >JFS, VxFS and > >NetApp seem to use "intent logging" to mean something similar ("I will > >be doing this next", rather than "I might be doing this next, but maybe > >not"). Maybe I misunderstand the intent log, though, and the time at > >which it gets updated. It certainly does seem to describe fact rather > >than a fallible expectation. > > I'm not a big fan of this use of the word "intent" either, and in fact the > technique it refers to is often called other things. But it definitely > _is_ a case where the intent can be abandoned. That's the whole point -- > you log an intent to create a file, but don't actually commit to creating > it. If the system should crash before all the corequisites of that > creation are complete, the file ends up never having been created. In > contrast, the proposed Linux lookup-with-intent scheme appears actually to > irrevocably create a file as soon as the "intent" to create it is declared. I don't see where you are coming from here. Could you be more specific on whether you think the entity declaring an "intent" is user-space, the VFS code in fs/*.c, the filesystem driver code in fs/*/*.c or what? I don't really see where you can "change your mind" in the middle of creating a file, unless there was an error somewhere along the way. If you call sys_mkdir() you have declared an "intent" to create a directory, and the VFS better not arbitrarily decide that it doesn't feel like creating directories today. 
What I am getting at, is that once an application has called a system call, either that system call will do what it was supposed to do (e.g. create, rename, remove, change a file/dir) or it will have an error. Whether that operation was done in the "lookup-with-intent call on server + op fixup on client" or as a lookup+op call on a local filesystem is unrelated to the fact that the operation will complete either way. The "intent" that we are talking about in regards to Lustre is not a "maybe" thing like open(..., O_RDWR) where you may or may not read or write to a file after opening it. The intent is set up at entry to the kernel syscall code, and is destroyed before the syscall returns to user code again. The only two options are that the server acted on the intent and did the operation there and the kernel code on the client handles this, or the server granted a lock to the client, and the kernel code on the client is required to complete the operation itself. Anything else is a bug. > Note that in theory, the invoker of the VFS protocol operations could be > anything, and the filesystem driver should not care. Even in practice, it > is not always the 'fs/' component. Sometimes it is the NFS server code. Or, by no small coincidence, the Lustre target code. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 19+ messages in thread
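The lifecycle described in this message (the intent is built at syscall entry, exactly one of two outcomes follows, and the intent is gone before the syscall returns) can be sketched in C. This is only a sketch under assumptions: `struct lookup_intent_sketch`, the enum values, and `do_syscall_sketch()` are hypothetical names invented for illustration and are not the Lustre patch's actual definitions. The `server_executes` flag stands in for a decision that the real server would make on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation codes, for illustration only. */
enum intent_op { INTENT_MKDIR, INTENT_CREAT };

enum intent_disposition {
    INTENT_PENDING,       /* built at syscall entry, lookup not yet done */
    INTENT_EXECUTED,      /* server performed the operation during lookup */
    INTENT_LOCK_GRANTED   /* server granted a lock; client must finish */
};

struct lookup_intent_sketch {
    enum intent_op op;
    enum intent_disposition disp;
};

/* Counts where the operation was actually carried out. */
struct op_counters {
    int server_ops;
    int client_ops;
};

/* Stand-in for the network lookup that carries the intent. */
static void lookup_with_intent(struct lookup_intent_sketch *it,
                               bool server_executes, struct op_counters *c)
{
    assert(it->disp == INTENT_PENDING);
    if (server_executes) {
        c->server_ops++;
        it->disp = INTENT_EXECUTED;
    } else {
        it->disp = INTENT_LOCK_GRANTED;
    }
}

/* Models one syscall. The intent is a local variable, so it cannot
 * outlive the call. Returns 0 on success, -1 on an impossible state. */
static int do_syscall_sketch(enum intent_op op, bool server_executes,
                             struct op_counters *c)
{
    struct lookup_intent_sketch it = { .op = op, .disp = INTENT_PENDING };

    lookup_with_intent(&it, server_executes, c);

    switch (it.disp) {
    case INTENT_EXECUTED:
        /* Client only fixes up its local state; nothing more to send. */
        break;
    case INTENT_LOCK_GRANTED:
        /* Client holds the lock and must complete the operation itself. */
        c->client_ops++;
        break;
    default:
        return -1;    /* "Anything else is a bug." */
    }
    return 0;
}
```

Either way the operation is carried out exactly once per syscall in this model; the only thing that varies is which side performs it.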
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation 2003-01-07 18:50 ` Andreas Dilger @ 2003-01-08 17:52 ` Bryan Henderson 2003-01-08 19:11 ` Peter Braam 0 siblings, 1 reply; 19+ messages in thread From: Bryan Henderson @ 2003-01-08 17:52 UTC (permalink / raw) To: Andreas Dilger Cc: Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Mike Shaver, Steven French >I don't see where you are coming from here. Could you be more specific on >whether you think the entity declaring an "intent" is user-space, the VFS >code in fs/*.c, the filesystem driver code in fs/*/*.c or what? As a general principle, any of those things could declare intent. In the Lustre design we're talking about, I don't believe any of them does. Hence my objection to the term "intent." Based on that word, I thought at first I might just have missed something in the definition of the interface, but I don't think so anymore. >I don't >really see where you can "change your mind" in the middle of creating a >file, unless there was an error somewhere along the way. I don't either. (And apparently, simple errors are no exception in the Lustre design). Hence, you have declared significantly more than an intent when you did the lookup. >If you call >sys_mkdir() you have declared an "intent" to create a directory Not as "intent" is usually understood. If you call sys_mkdir(), you have commanded the kernel to create the directory. That's a lot different from declaring that you intend to create the directory. I believe the Lustre patch works. I also believe it uses the wrong terminology, creates an interface to filesystem drivers that is brittle and hard to understand, and doesn't solve as wide a range of problems as it could. I believe that what it calls a declaration of intent is really a declaration of what POSIX system call the caller is in the middle of performing. 
On the other hand, it has been pointed out that one of its goals was to minimize the changes to fs/*.c. I agree the patch is a good way to achieve that goal. If it were my decision, I would solve the Lustre problem, and the Samba problem, and some of my own as well, by putting higher level filesystem driver interfaces into Linux, such as some other kernels do. Let the filesystem driver do the whole "lookup, create directory, add directory entry" operation if it wants to, and in that case make just that one call to the filesystem driver and be done. Let the filesystem driver deal with the problems of failures halfway through the sequence. But suggestions I've made to give more power to filesystem drivers have in the past met resistance from those who want to keep centralized control and maintain uniformity among the various filesystem types. ^ permalink raw reply [flat|nested] 19+ messages in thread
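The higher-level interface proposed in this message can be sketched in C as an optional whole-operation hook with a fallback to the step-by-step sequence. This is a minimal sketch under assumptions: `struct dir_ops_sketch`, `mkdir_whole`, and the two sample drivers are hypothetical and do not correspond to any existing Linux interface, and the step-by-step path is reduced to two calls for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver interface, for illustration only. */
struct dir_ops_sketch {
    int (*lookup)(const char *name);       /* returns 0 if the name is free */
    int (*mkdir)(const char *name);        /* creates and links the directory */
    int (*mkdir_whole)(const char *name);  /* optional: does it all, or NULL */
};

/* Counts calls made into a driver, so the two paths can be compared. */
static int driver_calls;

/* A "local" driver that only offers the fine-grained steps. */
static int local_lookup(const char *name) { (void)name; driver_calls++; return 0; }
static int local_mkdir(const char *name)  { (void)name; driver_calls++; return 0; }

/* A "network" driver that prefers one call, e.g. one round trip. */
static int net_mkdir_whole(const char *name) { (void)name; driver_calls++; return 0; }

static const struct dir_ops_sketch local_ops = { local_lookup, local_mkdir, NULL };
static const struct dir_ops_sketch net_ops   = { NULL, NULL, net_mkdir_whole };

/* Generic layer: hand the whole job to the driver if it asks for it,
 * otherwise fall back to the lookup-then-create sequence. */
static int generic_mkdir(const struct dir_ops_sketch *ops, const char *name)
{
    if (ops->mkdir_whole)
        return ops->mkdir_whole(name);  /* driver handles partial failure */

    assert(ops->lookup && ops->mkdir);
    int err = ops->lookup(name);
    if (err)
        return err;
    return ops->mkdir(name);
}
```

The design choice being illustrated is that drivers without the hook keep the generic layer's central handling, while a driver that supplies it takes on responsibility for failures halfway through.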
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation 2003-01-08 17:52 ` Bryan Henderson @ 2003-01-08 19:11 ` Peter Braam 2003-01-09 2:08 ` Bryan Henderson 0 siblings, 1 reply; 19+ messages in thread From: Peter Braam @ 2003-01-08 19:11 UTC (permalink / raw) To: Bryan Henderson Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Mike Shaver, Steven French Hi, I have no objections to a name change. We are not so religious about "intent" as a name. On Wed, Jan 08, 2003 at 10:52:51AM -0700, Bryan Henderson wrote: > >I don't see where you are coming from here. Could you be more specific on > >whether you think the entity declaring an "intent" is user-space, the VFS > >code in fs/*.c, the filesystem driver code in fs/*/*.c or what? > > As a general principle, any of those things could declare intent. In the > Lustre design we're talking about, I don't believe any of them does. Hence > my objection to the term "intent." Based on that word, I thought at first > I might just have missed something in the definition of the interface, but > I don't think so anymore. > > >I don't > >really see where you can "change your mind" in the middle of creating a > >file, unless there was an error somewhere along the way. open with O_CREAT | O_EXCL is a good example. > I don't either. (And apparently, simple errors are no exception in the > Lustre design). Hence, you have declared significantly more than an intent > when you did the lookup. > > >If you call > >sys_mkdir() you have declared an "intent" to create a directory > > Not as "intent" is usually understood. If you call sys_mkdir(), you have > commanded the kernel to create the directory. That's a lot different from > declaring that you intend to create the directory. > > I believe the Lustre patch works. 
I also believe it uses the wrong > terminology, creates an interface to filesystem drivers that is brittle and > hard to understand, and doesn't solve as wide a range of problems as it > could. I believe that what it calls a declaration of intent is really a > declaration of what POSIX system call the caller is in the middle of > performing. > > On the other hand, it has been pointed out that one of its goals was to > minimize the changes to fs/*.c. I agree the patch is a good way to achieve > that goal. > > If it were my decision, I would solve the Lustre problem, and the Samba > problem, and some of my own as well, by putting higher level filesystem > driver interfaces into Linux, such as some other kernels do. > > Let the > filesystem driver do the whole "lookup, create directory, add directory > entry" operation if it wants to, and in that case make just that one call > to the filesystem driver and be done. Let the filesystem driver deal with > the problems of failures halfway through the sequence. > > But suggestions I've made to give more power to filesystem drivers have in > the past met resistance from those who want to keep centralized control and > maintain uniformity among the various filesystem types. That proposal has been made by many other people, everywhere. Of course we could work with that too. Personally I rather like the Linux VFS because it does locking etc: Al Viro has made it very clear that e.g. locking for renames, which is incredibly hard, is best done once (what you call centralized) rather than many times by different file systems. This is the one single reason that we used the "intent" solution: it can make use of the VFS infrastructure better than high level calls. But again, I'm not religious about this -- I am religious about getting correctness for clustering file systems. And we have had to do some other things (like dealing with dentries in highly non-standard ways) to get correctness. And of course, we have many problems left... 
- Peter - ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation 2003-01-08 19:11 ` Peter Braam @ 2003-01-09 2:08 ` Bryan Henderson 2003-01-09 3:36 ` Peter Braam 0 siblings, 1 reply; 19+ messages in thread From: Bryan Henderson @ 2003-01-09 2:08 UTC (permalink / raw) To: Peter Braam Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Mike Shaver, Steven French >I have no objections to a name change. We are not so religious about >"intent" as a name. How religious are you about the idea of having to have BOTH a lookup2() that contains all the information necessary to create a directory if the name is available, AND a subsequent "create directory" call? Because once you remove the word "intent" from the description, that looks even more silly. It is the relationship between those two (sometimes 3) redundant calls that is the real substance in what otherwise appears to be just a naming issue. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation 2003-01-09 2:08 ` Bryan Henderson @ 2003-01-09 3:36 ` Peter Braam 0 siblings, 0 replies; 19+ messages in thread From: Peter Braam @ 2003-01-09 3:36 UTC (permalink / raw) To: Bryan Henderson Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner, Lustre Development Mailing List, Richard Sharpe, samba-technical, Mike Shaver, Steven French Bryan, On Wed, Jan 08, 2003 at 06:08:48PM -0800, Bryan Henderson wrote: > >I have no objections to a name change. We are not so religious about > >"intent" as a name. > > How religious are you about the idea of having to have BOTH a lookup2() > that contains all the information necessary to create a directory if the > name is available, AND a subsequent "create directory" call? Because once > you remove the word "intent" from the description, that looks even more > silly. Good question. For mkdir your solution is much preferable. So no religion here at all. But mkdir is an easy case, possibly the easiest. For open, rename, setattr and dealing with symbolic links we found having the separation of the lookup phase with intents and actual execution to be quite useful, since the symbolic links may bring you back to another file system. > It is the relationship between those two (sometimes 3) redundant calls that > is the real substance in what otherwise appears to be just a naming issue. Yes, and the answer is "sometimes" - in the mkdir case it is (moderately) easy to give the whole task to the file system (symlinks remain hairy), in open, rename, setattr we found a lot of useful VFS functionality between lookup and operation. - Peter - ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: fixing redundant network opens on Linux file creation 2003-01-06 17:25 fixing redundant network opens on Linux file creation Steven French 2003-01-06 18:14 ` Richard Sharpe @ 2003-01-06 22:18 ` Marcos Dione 2003-01-07 9:35 ` Jan Hudec 1 sibling, 1 reply; 19+ messages in thread From: Marcos Dione @ 2003-01-06 22:18 UTC (permalink / raw) To: Steven French; +Cc: samba-technical, linux-fsdevel On Mon, Jan 06, 2003 at 11:25:32AM -0600, Steven French wrote: > The creat() system call results (for the Linux kernel) in calls to create > (via vfs_create) then later a call to open (via dentry_open) both of which > eventually end up (for the cifs vfs) doing a network open of the file from > the perspective of the CIFS protocol which degrades performance (because why not implement create as a separate feature? you can use a different message and mknod(2) on the server. I'm asking 'cause I'll have the same problem when implementing my thesis. -- well-designed technology should allow people the luxury of ignorance -- Eric S. Raymond ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: fixing redundant network opens on Linux file creation 2003-01-06 22:18 ` Marcos Dione @ 2003-01-07 9:35 ` Jan Hudec 0 siblings, 0 replies; 19+ messages in thread From: Jan Hudec @ 2003-01-07 9:35 UTC (permalink / raw) To: Marcos Dione; +Cc: Steven French, samba-technical, linux-fsdevel On Mon, Jan 06, 2003 at 07:18:30PM -0300, Marcos Dione wrote: > On Mon, Jan 06, 2003 at 11:25:32AM -0600, Steven French wrote: > > The creat() system call results (for the Linux kernel) in calls to create > > (via vfs_create) then later a call to open (via dentry_open) both of which > > eventually end up (for the cifs vfs) doing a network open of the file from > > the perspective of the CIFS protocol which degrades performance (because > > why not implement create as a separate feature? you can use a > different message and mknod(2) on the server. > > I'm asking 'cause I'll have the same problem when implementing my > thesis. That won't help. You are still doing two upcalls, it still isn't atomic etc. etc. The problem is that the VFS always calls ->create and then ->open, both for open(O_CREAT) and creat(). ------------------------------------------------------------------------------- Jan 'Bulb' Hudec <bulb@ucw.cz> ^ permalink raw reply [flat|nested] 19+ messages in thread
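The two-call sequence discussed in this message can be modelled in C to show where the extra network open comes from. This is a minimal sketch under assumptions: `struct file_ops_sketch`, `struct wire_stats`, and the `cifs_like_*` functions are hypothetical stand-ins and are not the kernel's or the cifs vfs's real code. The model assumes, following the original post, that creation is a flag on the protocol's open request and that the create step cannot keep its handle because no open is guaranteed to follow.

```c
#include <assert.h>

/* Counts requests that would go over the network. */
struct wire_stats {
    int opens;
    int closes;
};

/* Hypothetical driver hooks, reduced to the two calls under discussion. */
struct file_ops_sketch {
    int (*create)(struct wire_stats *w, const char *name);
    int (*open)(struct wire_stats *w, const char *name);
};

/* Create step: on the wire this is an open with a "create" flag. */
static int cifs_like_create(struct wire_stats *w, const char *name)
{
    (void)name;
    w->opens++;     /* creation rides on the protocol's open request */
    w->closes++;    /* handle released: the caller might be mknod */
    return 0;
}

/* Open step: a second network open of the file that was just created. */
static int cifs_like_open(struct wire_stats *w, const char *name)
{
    (void)name;
    w->opens++;
    return 0;
}

static const struct file_ops_sketch cifs_like_ops = {
    cifs_like_create, cifs_like_open
};

/* Models the generic layer handling creat(): create first, then open. */
static int creat_sketch(const struct file_ops_sketch *ops,
                        struct wire_stats *w, const char *name)
{
    assert(ops->create && ops->open);
    int err = ops->create(w, name);
    if (err)
        return err;
    return ops->open(w, name);
}
```

In this model one creat() costs two network opens and one network close, where a single open with the create flag would have been enough. Sending a different message for the create step would not change the count of round trips.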
end of thread, other threads:[~2003-01-09 3:36 UTC | newest] Thread overview: 19+ messages 2003-01-06 17:25 fixing redundant network opens on Linux file creation Steven French 2003-01-06 18:14 ` Richard Sharpe 2003-01-06 17:59 ` Jan Hudec 2003-01-06 19:42 ` Bryan Henderson 2003-01-06 19:56 ` Jan Harkes 2003-01-06 21:58 ` Bryan Henderson 2003-01-06 21:31 ` Andreas Dilger 2003-01-06 22:23 ` Bryan Henderson 2003-01-06 22:48 ` Andreas Dilger 2003-01-07 1:06 ` Bryan Henderson 2003-01-07 13:19 ` [Lustre-devel] " Mike Shaver 2003-01-07 17:28 ` Bryan Henderson 2003-01-07 18:50 ` Andreas Dilger 2003-01-08 17:52 ` Bryan Henderson 2003-01-08 19:11 ` Peter Braam 2003-01-09 2:08 ` Bryan Henderson 2003-01-09 3:36 ` Peter Braam 2003-01-06 22:18 ` Marcos Dione 2003-01-07 9:35 ` Jan Hudec