* Re: New nanosecond stat patch for 2.5.44
From: Andi Kleen @ 2002-10-28 4:42 UTC
To: Andreas Dilger; +Cc: linux-kernel

Andreas Dilger <adilger@clusterfs.com> writes:

> On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> > Move time_t members in struct stat to struct timespec and allow subsecond
> > timestamps for files. Too big to post on the list, because it edits
> > a lot of file systems and drivers in a straightforward way.
> >
> > This is required for reliable "make" on fast computers.
> >
> > File systems that support nsec storage are currently: XFS, JFS, NFSv3
> > (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> > CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> > newer servers)
>
> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
>    option (I don't care which way the default goes), so that people who
>    don't need/care about the increased resolution don't need the extra
>    space in their inodes and minor extra overhead. To make this a lot
>    easier to code, having something akin to the inode_update_time()
>    which does all of the i_[acm]time updates as appropriate.

You're joking right? That's twelve bytes of more state per struct inode
and I bet even with the most insidious micro benchmark you won't be able
to detect a difference in speed from the basic manipulation.

What could hurt a bit is that the "only flush atime once a second"
optimization is gone currently. The right way to address that would be a
mount option "atime_flush_interval", not a CONFIG.

> 2) Updating i_atime based on comparing the nsec timestamp is going to be
>    a killer. I think AKPM saw dramatic performance improvements when he
>    changed the code to only do the update once/second, and even though
>    you are "only" updating the atime if the times are different, in
>    practice this will be always. Even without the "per superblock interval"
>    you suggest we should probably only update the atime once a second (I
>    don't think anything is keyed off such high resolution atimes, unlike
>    make and mtime/ctime).

Again I wrote about this in my original mail. Please see
ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec.notes

Basically I agree with you that it's a problem (although "killer" seems
to be an exaggeration to me). The right solution IMHO would be to
implement a new super block field / mount option that specifies the
atime flush interval (basically a generalized noatime). Then you can say
you only want it flushed every 60s and the result will be much faster
than what we have currently. Some file systems already implement
intelligent atime flushing (like XFS) and they don't need it.

But I didn't want to mix such a patch into the big patchkit. When the
nsec patchkit is integrated and benchmarks show it is a problem I will
submit a follow up patch instead.

> 3) The fields you are usurping in struct stat are actually there for the
>    Y2038 problem (when time_t wraps). At least that's what Ted said when
>    we were looking into nsec times for ext2/3. Granted, we may all be
>    using 64-bit systems by 2038... I've always thought 64 bits is much
>    too large for time_t, so we could always use 20 or 30 bits for sub-second
>    times, and the remaining bits for extending time_t at the high end,
>    and mask those off for now, but that is a separate issue...

I wrote about this in my original notes (perhaps I should repost them,
I think they are still on the ftp server)

For year 2038 we will need lots of new syscalls: new time(2), new
gettimeofday(2) and lots of others. When all these are added then a new
stat isn't that big a problem. Also glibc currently doesn't know how to
use these fields for y2038, so all user programs need to be relinked
anyways. Again when that happens it's no big issue to add a new stat.
I bet when y2038 comes there will be other reasons for a new stat too.

So it's fine to reuse these fields. The make problem is much more
pressing.

-Andi
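The "atime_flush_interval" mount option proposed in this reply can be pictured with a few lines of C. This is only a minimal sketch under stated assumptions, not code from the patch: the function name atime_needs_flush and its parameters are hypothetical, times are plain second counts, and an interval of 0 is assumed to mean "flush on every change".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: decide whether a changed
 * atime should be written back, given a per-mount flush interval in seconds.
 */
bool atime_needs_flush(int64_t last_flush_sec, int64_t now_sec,
                       int64_t interval_sec)
{
    /* Assumed convention: an interval of 0 (or less) flushes every change. */
    if (interval_sec <= 0)
        return true;

    /* Otherwise flush only once the interval has fully elapsed. */
    return now_sec - last_flush_sec >= interval_sec;
}
```

With interval_sec set to 60, matching the "every 60s" example, an access 59 seconds after the last flush would only update the in-memory atime, and an access 60 seconds after it would be written back.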
* Re: New nanosecond stat patch for 2.5.44
From: Andreas Dilger @ 2002-10-28 5:35 UTC
To: Andi Kleen; +Cc: linux-kernel

On Oct 28, 2002 05:42 +0100, Andi Kleen wrote:
> Andreas Dilger <adilger@clusterfs.com> writes:
> > 1) It would be good if it were possible to select this with a config
> >    option (I don't care which way the default goes), so that people who
> >    don't need/care about the increased resolution don't need the extra
> >    space in their inodes and minor extra overhead. To make this a lot
> >    easier to code, having something akin to the inode_update_time()
> >    which does all of the i_[acm]time updates as appropriate.
>
> You're joking right? That's twelve bytes of more state per struct inode
> and I bet even with the most insidious micro benchmark you won't be
> able to detect a difference in speed from the basic manipulation.

Except that people have a lot of inodes in their slab caches... It's not
so much the processing overhead as the extra memory. struct inode is
bloated enough without adding more into it that isn't necessarily useful
for some people (people who don't have lots of RAM, or don't use any
filesystems which support the higher resolution, or are slow enough that
compiles don't have problems, or don't compile at all)...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: New nanosecond stat patch for 2.5.44
From: Andi Kleen @ 2002-10-28 4:47 UTC
To: Chris Friesen; +Cc: linux-kernel, hpa

Chris Friesen <cfriesen@nortelnetworks.com> writes:

> H. Peter Anvin wrote:
>
> > We probably need to revamp struct stat anyway, to support a larger
> > dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> > at least if we have to redesign the structure.) At that point I would
> > really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> > quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> > personally like struct timespec to look like the above everywhere.
>
> For filesystems can we get away with just the 64-bit nanoseconds? By my
> calculations that gives something like 584 years--do we need to worry
> about files older than that?

The current timestamps on 32bit systems are 32bit. 64bit nanoseconds
would take the same room as 32bit second + 32bit nanosecond. And it would
be incompatible with current glibc (with which the additional nanosecond
fields are perfectly compatible - they are zeroed currently). Also glibc
would need to convert it to a timespec for Solaris compatibility anyways
and need an unnecessary division for that.

The same thing applies to file system storage. Storing in nanoseconds
(like e.g. NTFS or CIFS do - they store 64bit in 100ns units since 1601)
would require slow divisions to convert from the user visible format,
needs the same space and has no advantage as far as I can see.

-Andi
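The division cost mentioned in this reply shows up when a single 64-bit counter has to be split into the seconds and nanoseconds that stat(2) users expect. The following C sketch illustrates this for the 100ns-since-1601 format; the struct and function names are hypothetical, and the epoch offset of 11644473600 seconds between 1601-01-01 and 1970-01-01 is the commonly used constant rather than a value from the thread.

```c
#include <stdint.h>

/* Seconds between 1601-01-01 and 1970-01-01 (commonly used constant). */
#define EPOCH_DIFF_SECS 11644473600LL
/* Number of 100 ns ticks in one second. */
#define TICKS_PER_SEC   10000000ULL

/* Hypothetical seconds/nanoseconds pair used only for this illustration. */
struct sec_nsec {
    int64_t sec;   /* seconds since 1970-01-01 */
    int32_t nsec;  /* nanoseconds within the second */
};

/* Split a 64-bit count of 100 ns ticks since 1601 into sec + nsec. */
struct sec_nsec ticks1601_to_unix(uint64_t ticks)
{
    struct sec_nsec out;

    /* One 64-bit division and one 64-bit remainder per conversion. */
    out.sec  = (int64_t)(ticks / TICKS_PER_SEC) - EPOCH_DIFF_SECS;
    out.nsec = (int32_t)(ticks % TICKS_PER_SEC) * 100;
    return out;
}
```

A timestamp stored directly as seconds plus nanoseconds needs no such arithmetic when it is copied out to user space, which is the point being made above.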
* New nanosecond stat patch for 2.5.44
From: Andi Kleen @ 2002-10-27 12:13 UTC
To: linux-kernel

Move time_t members in struct stat to struct timespec and allow subsecond
timestamps for files. Too big to post on the list, because it edits
a lot of file systems and drivers in a straightforward way.

This is required for reliable "make" on fast computers.

File systems that support nsec storage are currently: XFS, JFS, NFSv3
(if the filesystem on the server supports it), VFAT (not quite nanosecond),
CIFS (unit in 100ns which is above what linux supports), SMBFS (for
newer servers)

This is proposed for 2.6.

Changes against the last version:
- Now always take xtime_lock when accessing the whole of xtime
- Port to 2.5.44
- New filesystems supported: CIFS, AFS

ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec-2.5.44-1.bz2

-Andi
* Re: New nanosecond stat patch for 2.5.44
From: Andreas Dilger @ 2002-10-27 21:49 UTC
To: Andi Kleen; +Cc: linux-kernel

On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> Move time_t members in struct stat to struct timespec and allow subsecond
> timestamps for files. Too big to post on the list, because it edits
> a lot of file systems and drivers in a straightforward way.
>
> This is required for reliable "make" on fast computers.
>
> File systems that support nsec storage are currently: XFS, JFS, NFSv3
> (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> newer servers)

Two notes I might make about this:
1) It would be good if it were possible to select this with a config
   option (I don't care which way the default goes), so that people who
   don't need/care about the increased resolution don't need the extra
   space in their inodes and minor extra overhead. To make this a lot
   easier to code, having something akin to the inode_update_time()
   which does all of the i_[acm]time updates as appropriate.
2) Updating i_atime based on comparing the nsec timestamp is going to be
   a killer. I think AKPM saw dramatic performance improvements when he
   changed the code to only do the update once/second, and even though
   you are "only" updating the atime if the times are different, in
   practice this will be always. Even without the "per superblock interval"
   you suggest we should probably only update the atime once a second (I
   don't think anything is keyed off such high resolution atimes, unlike
   make and mtime/ctime).
3) The fields you are usurping in struct stat are actually there for the
   Y2038 problem (when time_t wraps). At least that's what Ted said when
   we were looking into nsec times for ext2/3. Granted, we may all be
   using 64-bit systems by 2038... I've always thought 64 bits is much
   too large for time_t, so we could always use 20 or 30 bits for sub-second
   times, and the remaining bits for extending time_t at the high end,
   and mask those off for now, but that is a separate issue...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: New nanosecond stat patch for 2.5.44
From: H. Peter Anvin @ 2002-10-27 22:54 UTC
To: linux-kernel

Followup to: <20021027214913.GA17533@clusterfs.com>
By author: Andreas Dilger <adilger@clusterfs.com>
In newsgroup: linux.dev.kernel
>
> 3) The fields you are usurping in struct stat are actually there for the
>    Y2038 problem (when time_t wraps). At least that's what Ted said when
>    we were looking into nsec times for ext2/3. Granted, we may all be
>    using 64-bit systems by 2038... I've always thought 64 bits is much
>    too large for time_t, so we could always use 20 or 30 bits for sub-second
>    times, and the remaining bits for extending time_t at the high end,
>    and mask those off for now, but that is a separate issue...
>

64-bit time_t is nice because you don't *ever* need to worry about
overflow; it's capable of handling times on a galactic lifespan
scale. It's overkill, of course, but it's the *right* kind of
overkill.

We probably need to revamp struct stat anyway, to support a larger
dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
at least if we have to redesign the structure.) At that point I would
really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
personally like struct timespec to look like the above everywhere.

-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com>
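Written out as a C declaration, the layout advocated in this message would look roughly as follows. The field names and types are the ones given in the message; the struct name wide_timespec is a placeholder, since no name is proposed, and the size check assumes no padding between the three fields.

```c
#include <stdint.h>

/* Placeholder name; the message only specifies the three fields. */
struct wide_timespec {
    int64_t  ts_sec;       /* seconds, wide enough to avoid the 2038 wrap */
    uint32_t ts_nsec;      /* nanoseconds within the second */
    int32_t  ts_taidelta;  /* TAI delta, intended to handle leap seconds */
};

/* 8 + 4 + 4 bytes: the three fields pack into 16 bytes without padding. */
_Static_assert(sizeof(struct wide_timespec) == 16,
               "wide_timespec is expected to occupy 16 bytes");
```

Three such timestamps per file (atime, mtime, ctime) would therefore take 48 bytes in a stat-style structure laid out this way.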
* Re: New nanosecond stat patch for 2.5.44
From: Chris Friesen @ 2002-10-28 1:23 UTC
To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:

> We probably need to revamp struct stat anyway, to support a larger
> dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> at least if we have to redesign the structure.) At that point I would
> really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> personally like struct timespec to look like the above everywhere.

For filesystems can we get away with just the 64-bit nanoseconds? By my
calculations that gives something like 584 years--do we need to worry
about files older than that?

Chris

--
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com
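The 584-year figure can be reproduced with a short calculation. The C sketch below assumes an unsigned 64-bit nanosecond counter and an average year of 365.25 days; both are assumptions made for the estimate, and the function name is hypothetical.

```c
#include <stdint.h>

/* Seconds in an average year of 365.25 days (assumption for the estimate). */
#define SECONDS_PER_YEAR (365.25 * 86400.0)

/* Number of years an unsigned 64-bit nanosecond counter can span. */
double u64_nsec_range_years(void)
{
    double max_nsec = (double)UINT64_MAX;  /* roughly 1.8e19 ns */

    return max_nsec / 1e9 / SECONDS_PER_YEAR;
}
```

The result is a little over 584 years. If the counter were signed and centred on 1970, the reach in each direction would be half of that.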
* Re: New nanosecond stat patch for 2.5.44
From: Rob Landley @ 2002-10-28 1:35 UTC
To: Chris Friesen, H. Peter Anvin; +Cc: linux-kernel

On Sunday 27 October 2002 19:23, Chris Friesen wrote:
> H. Peter Anvin wrote:
> > We probably need to revamp struct stat anyway, to support a larger
> > dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> > at least if we have to redesign the structure.) At that point I would
> > really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> > quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> > personally like struct timespec to look like the above everywhere.
>
> For filesystems can we get away with just the 64-bit nanoseconds? By my
> calculations that gives something like 584 years--do we need to worry
> about files older than that?

1) The hard drive is only about 50 years old, so there aren't any files
   older than that at the moment:
   http://www.mdhc.scu.edu/100th/reyjohnson.htm

2) This thing is unlikely to be a problem in our lifetimes, our
   grandchildren's lifetimes, or our great grandchildren's lifetimes
   (barring unforeseen advances in active telomere reconstruction and a
   regenerative interpretation of DNA that somehow looks at it as a
   blueprint rather than a recipe).

3) If any current hardware or software is still in use in the year 2554,
   it will be seriously overdue for an upgrade.

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams,
Illiad, CmdrTaco, liquid nitrogen ice cream, and caffeinated jello.
Well why not?
* Re: New nanosecond stat patch for 2.5.44
From: Gabriel Paubert @ 2002-11-06 13:27 UTC
To: H. Peter Anvin; +Cc: linux-kernel

On 31 Oct 2002, H. Peter Anvin wrote:

> Followup to: <20021027214913.GA17533@clusterfs.com>
> By author: Andreas Dilger <adilger@clusterfs.com>
> In newsgroup: linux.dev.kernel
> >
> > 3) The fields you are usurping in struct stat are actually there for the
> >    Y2038 problem (when time_t wraps). At least that's what Ted said when
> >    we were looking into nsec times for ext2/3. Granted, we may all be
> >    using 64-bit systems by 2038... I've always thought 64 bits is much
> >    too large for time_t, so we could always use 20 or 30 bits for sub-second
> >    times, and the remaining bits for extending time_t at the high end,
> >    and mask those off for now, but that is a separate issue...
> >
>
> 64-bit time_t is nice because you don't *ever* need to worry about
> overflow; it's capable of handling times on a galactic lifespan
> scale. It's overkill, of course, but it's the *right* kind of
> overkill.

Indeed.

>
> We probably need to revamp struct stat anyway, to support a larger
> dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> at least if we have to redesign the structure.) At that point I would
> really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> personally like struct timespec to look like the above everywhere.

I basically agree but I suspect that filesystem writers will not be very
happy if you want to use 16 bytes for each timestamp, especially when 8 of
the bytes (the 32 high order bits from the second count and the TAI-UT
offset) do not change very often. (besides that tv_nsec is defined as a
long, i.e. 64 bit on 64 bit machines and _signed_, stupid if you ask me
but I digress).

The goal as I understand it is to avoid first the possibility of
ambiguous timestamps, but then we have to be careful also not to break
existing applications (although they are already broken wrt leap
seconds).

I don't know how to trim the highly repeated most significant bytes of
the tv_sec field (it's probably file system specific), but 4 bytes can
easily be shaved from the on-disk structure by packing the leap second
information in the high order bits of the nsec field: since the number of
nanoseconds per second is unlikely to ever need more than 30 bits to be
encoded ;-), the 2 most significant bits can be used to encode inserted
leap seconds. Actually 1 bit should be sufficient but some texts claim
that up to 2 leap seconds can be inserted, this has however actually
never happened AFAICT and I believe that NTP for example does not support
2 leap seconds in a row.

Converting this encoding to the format you suggest for stat(2) is
trivial: it only needs a table of leap seconds. I don't care whether it's
in the kernel or in user space: it's small and grows slowly.

For now I have more problems with the fact that gettimeofday and friends
do not properly handle leap seconds and lead to ambiguous timestamps.
Once this problem (a real killer for astronomical data acquisition, leap
seconds are infrequent but they are a problem) is solved, filesystems can
be updated.

What could be important now is to mask the low 30 bits of the nsec field
and declare the 2 MSB reserved so that no kernel is out in the wild that
simply copies the full nsec field to user space.

Regards,
Gabriel.
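The packing suggested in this message can be sketched in C as follows. The low 30 bits hold the nanosecond count (999999999 fits, since 2^30 is 1073741824) and the top 2 bits hold a leap-second indicator. The function names and the exact meaning of the 2-bit value are assumptions made for illustration; the message only states that the two most significant bits would encode inserted leap seconds.

```c
#include <stdint.h>

#define NSEC_MASK  0x3FFFFFFFu  /* low 30 bits: nanoseconds, 0..999999999 */
#define LEAP_SHIFT 30           /* top 2 bits: leap-second indicator */

/* Pack nanoseconds and a 2-bit leap indicator into one 32-bit field. */
uint32_t pack_nsec(uint32_t nsec, unsigned leap)
{
    return (nsec & NSEC_MASK) | ((uint32_t)(leap & 3u) << LEAP_SHIFT);
}

/* Recover the nanosecond count by masking off the 2 reserved bits. */
uint32_t unpack_nsec(uint32_t packed)
{
    return packed & NSEC_MASK;
}

/* Recover the 2-bit leap indicator (assumed: 0 means no leap second). */
unsigned unpack_leap(uint32_t packed)
{
    return packed >> LEAP_SHIFT;
}
```

The masking in unpack_nsec corresponds to the closing request of the message: a kernel that masks before copying the field to user space never leaks the reserved bits, whatever a later on-disk format stores there.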
* Re: New nanosecond stat patch for 2.5.44
From: H. Peter Anvin @ 2002-11-06 18:00 UTC
To: Gabriel Paubert; +Cc: linux-kernel

Gabriel Paubert wrote:
>
> I basically agree but I suspect that filesystem writers will not be very
> happy if you want to use 16 bytes for each timestamp, especially when 8 of
> the bytes (the 32 high order bits from the second count and the TAI-UT
> offset) do not change very often. (besides that tv_nsec is defined as a
> long, i.e. 64 bit on 64 bit machines and _signed_ , stupid if you ask me
> but I digress).
>

The filesystem writers can compact things as they see fit. I'm mostly
talking about the stat(2) format.

-hpa
* Re: New nanosecond stat patch for 2.5.44
From: Horst von Brand @ 2002-10-27 23:16 UTC
To: Andi Kleen, linux-kernel

Andreas Dilger <adilger@clusterfs.com> said:
> On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> > Move time_t members in struct stat to struct timespec and allow subsecond
> > timestamps for files. Too big to post on the list, because it edits
> > a lot of file systems and drivers in a straightforward way.
> >
> > This is required for reliable "make" on fast computers.
> >
> > File systems that support nsec storage are currently: XFS, JFS, NFSv3
> > (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> > CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> > newer servers)
>
> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
>    option (I don't care which way the default goes), so that people who
>    don't need/care about the increased resolution don't need the extra
>    space in their inodes and minor extra overhead. To make this a lot
>    easier to code, having something akin to the inode_update_time()
>    which does all of the i_[acm]time updates as appropriate.

Please don't. Do not create incompatible versions of the same filesystem
just because they were written on kernels compiled with different
configurations. Superblock flags might be OK, but what is the point then?
Better mount flags (mount with/without finegrained timestamps)?

[....]

> 3) The fields you are usurping in struct stat are actually there for the
>    Y2038 problem (when time_t wraps). At least that's what Ted said when
>    we were looking into nsec times for ext2/3. Granted, we may all be
>    using 64-bit systems by 2038... I've always thought 64 bits is much
>    too large for time_t, so we could always use 20 or 30 bits for sub-second
>    times, and the remaining bits for extending time_t at the high end,
>    and mask those off for now, but that is a separate issue...

IMVHO, keeping fields in filesystems' inodes for 36 years in the future
is daydreaming. Not even the filesystems in the just 11 year old Linux
have survived unscathed... and by '38 we'll probably be by ext8 or so,
under 64-bit CPUs.
--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
* Re: New nanosecond stat patch for 2.5.44
From: Andreas Dilger @ 2002-10-28 17:10 UTC
To: Horst von Brand; +Cc: Andi Kleen, linux-kernel

On Oct 27, 2002 20:16 -0300, Horst von Brand wrote:
> Andreas Dilger <adilger@clusterfs.com> said:
> > 1) It would be good if it were possible to select this with a config
> >    option (I don't care which way the default goes), so that people who
> >    don't need/care about the increased resolution don't need the extra
> >    space in their inodes and minor extra overhead. To make this a lot
> >    easier to code, having something akin to the inode_update_time()
> >    which does all of the i_[acm]time updates as appropriate.
>
> Please don't. Do not create incompatible versions of the same filesystem
> just because they were written on kernels compiled with different
> configurations. Superblock flags might be OK, but what is the point then?
> Better mount flags (mount with/without finegrained timestamps)?

I don't say anything about creating incompatible versions of the same
filesystem. Configuring out nsec timestamps is no different than what
we have today. Many filesystems do not support nsec timestamps anyways.

I just see this as one of many hundreds of "tiny" features that are added
to Linux that could easily be made a config option when they are first
added, but all just end up adding a tiny bit of bloat for people that
don't need it.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: New nanosecond stat patch for 2.5.44
From: Bill Davidsen @ 2002-10-29 15:01 UTC
To: Andreas Dilger; +Cc: Andi Kleen, linux-kernel

On Sun, 27 Oct 2002, Andreas Dilger wrote:

> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
>    option (I don't care which way the default goes), so that people who
>    don't need/care about the increased resolution don't need the extra
>    space in their inodes and minor extra overhead. To make this a lot
>    easier to code, having something akin to the inode_update_time()
>    which does all of the i_[acm]time updates as appropriate.

Am I missing something? That would make it two file types, no? I bet
there's more overhead in handling that problem than just writing the time.

> 2) Updating i_atime based on comparing the nsec timestamp is going to be
>    a killer. I think AKPM saw dramatic performance improvements when he
>    changed the code to only do the update once/second, and even though
>    you are "only" updating the atime if the times are different, in
>    practice this will be always. Even without the "per superblock interval"
>    you suggest we should probably only update the atime once a second (I
>    don't think anything is keyed off such high resolution atimes, unlike
>    make and mtime/ctime).

find -anewer seems to use as much resolution as it has. More to the point,
what is the overhead of updating the time when an i/o is done? It would
seem pretty trivial.

If you are willing to give up a flag bit you could store the time in some
native unit (machine type dependent) when an i/o is done, then do the
convert to ns when it's used, such as compare, close, etc. You could have
an inode walker thread do the convert in background if that seems needed.

There are probably other ways to reduce overhead, those just came to mind.
I think it's a pretty low impact problem with some effort on making it so.

> 3) The fields you are usurping in struct stat are actually there for the
>    Y2038 problem (when time_t wraps). At least that's what Ted said when
>    we were looking into nsec times for ext2/3. Granted, we may all be
>    using 64-bit systems by 2038... I've always thought 64 bits is much
>    too large for time_t, so we could always use 20 or 30 bits for sub-second
>    times, and the remaining bits for extending time_t at the high end,
>    and mask those off for now, but that is a separate issue...

As you say, but good that you brought it up!

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: New nanosecond stat patch for 2.5.44
From: Andreas Dilger @ 2002-10-29 16:30 UTC
To: Bill Davidsen; +Cc: Andi Kleen, linux-kernel

On Oct 29, 2002 10:01 -0500, Bill Davidsen wrote:
> On Sun, 27 Oct 2002, Andreas Dilger wrote:
> > 1) It would be good if it were possible to select this with a config
> >    option (I don't care which way the default goes), so that people who
> >    don't need/care about the increased resolution don't need the extra
> >    space in their inodes and minor extra overhead. To make this a lot
> >    easier to code, having something akin to the inode_update_time()
> >    which does all of the i_[acm]time updates as appropriate.
>
> Am I missing something? That would make it two file types, no? I bet
> there's more overhead in handling that problem than just writing the time.

Not necessarily. Most filesystems don't even have space for storing a
sub-second time resolution, so having the extra time resolution is
irrelevant. For filesystems which do have room for sub-second timestamps
they currently just fill in 0 there, and if the sub-second time is here
they will fill in that field, so still no incompatible on-disk formats.

As for ext3 having sub-second timestamps, this will be done in a way
which makes it compatible with older filesystem, so whether those
timestamps are written or not written, the filesystem will still be
readable on older kernels.

The "inode" space that I'm referring to is the in-memory inode struct,
and the presence of that would be determined at compile time. Granted,
it would only be 12 bytes added to the inode, but if you have thousands
or millions of inodes resident you start to feel the pinch.

> > 2) Updating i_atime based on comparing the nsec timestamp is going to be
> >    a killer. I think AKPM saw dramatic performance improvements when he
> >    changed the code to only do the update once/second, and even though
> >    you are "only" updating the atime if the times are different, in
> >    practice this will be always. Even without the "per superblock interval"
> >    you suggest we should probably only update the atime once a second (I
> >    don't think anything is keyed off such high resolution atimes, unlike
> >    make and mtime/ctime).
>
> find -anewer seems to use as much resolution as it has. More to the point,
> what is the overhead of updating the time when an i/o is done? It would
> seem pretty trivial.

It would be trivial if you are already updating the inode (and we should
optimize for this case), but if you are reading a file in 5-byte chunks
and you update the atime a thousand times a second it most certainly IS
a lot of overhead. We currently limit atime updates to 1/second by
checking if the atime has changed or not. The proposed patch checks if
the atime.ts_nsec has changed, and it most certainly will have, so this
will always be updating the atime on disk.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
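The difference between the two checks described in this message can be shown with a small C sketch. The struct and function names are hypothetical and do not come from the patch; the sketch only contrasts a comparison at one-second granularity with a comparison that also looks at the nanosecond part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamp pair used only for this illustration. */
struct ts {
    int64_t sec;
    int32_t nsec;
};

/* Second-granularity check: the atime is rewritten at most once a second. */
bool atime_changed_sec(struct ts old, struct ts now)
{
    return old.sec != now.sec;
}

/* Full-resolution check: true on nearly every access, since nsec moves. */
bool atime_changed_nsec(struct ts old, struct ts now)
{
    return old.sec != now.sec || old.nsec != now.nsec;
}
```

For two accesses within the same second, the first function reports no change and the second one reports a change, which is why the nanosecond comparison would dirty the inode on almost every read.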
* Re: New nanosecond stat patch for 2.5.44
  2002-10-29 16:30 ` Andreas Dilger
@ 2002-10-29 20:37 ` Bill Davidsen
  2002-10-30 0:44 ` Jamie Lokier
  0 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2002-10-29 20:37 UTC
To: Andreas Dilger; +Cc: Andi Kleen, linux-kernel

On Tue, 29 Oct 2002, Andreas Dilger wrote:

> On Oct 29, 2002 10:01 -0500, Bill Davidsen wrote:
> > On Sun, 27 Oct 2002, Andreas Dilger wrote:
> > > 1) It would be good if it were possible to select this with a config
> > > option (I don't care which way the default goes), so that people who
> > > don't need/care about the increased resolution don't need the extra
> > > space in their inodes and minor extra overhead. To make this a lot
> > > easier to code, having something akin to the inode_update_time()
> > > which does all of the i_[acm]time updates as appropriate.
> >
> > Am I missing something? That would make it two file types, no? I bet
> > there's more overhead in handling that problem than just writing the time.
>
> Not necessarily. Most filesystems don't even have space for storing a
> sub-second time resolution, so having the extra time resolution is
> irrelevant. For filesystems which do have room for sub-second timestamps
> they currently just fill in 0 there, and if the sub-second time is here
> they will fill in that field, so still no incompatible on-disk formats.

That was my concern.

> As for ext3 having sub-second timestamps, this will be done in a way
> which makes it compatible with older filesystem, so whether those
> timestamps are written or not written, the filesystem will still be
> readable on older kernels.

I was more thinking of a kernel compiled without the hi-res timer code, if
that should be done as an option.

> The "inode" space that I'm referring to is the in-memory inode struct,
> and the presence of that would be determined at compile time. Granted,
> it would only be 12 bytes added to the inode, but if you have thousands
> or millions of inodes resident you start to feel the pinch.

I admit to being one of the "thousands" people, and even if I have 100k
inodes (more likely to be 10% of that) it's in the order of a MB, and any
machine which has 100k inodes open is likely to be large enough to ignore
a MB. One advantage of keeping the HRT in the in-core inode is that it
allows parallel make to work correctly even on a filesystem which doesn't
have space to save that information.

Feel free to tell me if that last isn't true.

> > > 2) Updating i_atime based on comparing the nsec timestamp is going to be
> > > a killer. I think AKPM saw dramatic performance improvements when he
> > > changed the code to only do the update once/second, and even though
> > > you are "only" updating the atime if the times are different, in
> > > practise this will be always. Even without the "per superblock interval"
> > > you suggest we should probably only update the atime once a second (I
> > > don't think anything is keyed off such high resolution atimes, unlike
> > > make and mtime/ctime).
> >
> > find -anewer seems to use as much resolution as it has. More to the point,
> > what is the overhead of updating the time when an i/o is done? It would
> > seem pretty trivial.
>
> It would be trivial if you are already updating the inode (and we should
> optimize for this case), but if you are reading a file in 5-byte chunks
> and you update the atime a thousand times a second it most certainly IS
> a lot of overhead. We currently limit atime updates to 1/second by
> checking if the atime has changed or not. The proposed patch checks if
> the atime.ts_nsec has changed, and it most certainly will have, so this
> will always be updating the atime on disk.

1 - any program which does unbuffered 5 byte reads is probably going to
beat the machine to death anyway. Then the sysadmin will mount noatime.

2 - The patch isn't written in stone, going back to one per second
shouldn't matter except in the case of network or devices shared between
multiple systems (3.0?). Processes on the same machine would use the
in-core information.

3 - updating once/sec could still be default, with HRT being a mount
option like noatime.

4 - the time could be stored in register values, ticks, or whatever else,
avoiding any conversion to ns. Then the time could be converted only when
the inode was read, written out, etc.

I'd really like your comments on these, you probably see things I've
missed.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: New nanosecond stat patch for 2.5.44
  2002-10-29 20:37 ` Bill Davidsen
@ 2002-10-30 0:44 ` Jamie Lokier
  2002-10-30 21:12 ` Bill Davidsen
  0 siblings, 1 reply; 21+ messages in thread
From: Jamie Lokier @ 2002-10-30 0:44 UTC
To: Bill Davidsen; +Cc: Andreas Dilger, Andi Kleen, linux-kernel

Bill Davidsen wrote:
> I admit to being one of the "thousands" people, and even if I have 100k
> inodes (more likely to be 10% of that) it's in the order of a MB, and any
> machine which has 100k inodes open is likely to be large enough to ignore
> a MB. One advantage of keeping the HRT in the in-core inode is that it
> allows parallel make to work correctly even on a filesystem which doesn't
> have space to save that information.
>
> Feel free to tell me if that last isn't true.

It isn't true if the parallel make actually uses your RAM for
something, thus flushing some of the inodes from RAM.

Admittedly it is no worse than we have at the moment. However, at the
moment it is possible to construct a "make" or other program of that
ilk which can always make a safe decision: if it's ambiguous whether a
file needs to be remade, then remake the file.

As soon as we have inodes time stamp resolution being spontaneously
lowered (because some of the inodes are flushed from RAM and some
aren't), then it's not possible to make a safe program like that
anymore, unless you simply ignore the high resolution time stamps
_all_ the time, even when they are present.

You can just do that - it's correct behaviour. But it would be better
to use the high precision when available, as that reduces the number
of unnecessary remakes.

> 4 - the time could be stored in register values, ticks, or whatever else,
> avoiding any conversion to ns. Then the time could be converted only when
> the inode was read, written out, etc.
>
> I'd really like your comments on these, you probably see things I've
> missed.

I know of exactly one application which depends on atime information:
checking whether you have new mail in your inbox. That's done by
comparing atime and mtime on the mailbox. Mail readers read the file
after writing it, MTAs will simply write it.

For this to function correctly, what's important is that the atime is
updated to be at least the mtime. So for nanosecond atime updates, it
makes sense that the _first_ read following a write should update the
atime -- if not using the current clock, then simply copying the mtime
value.

-- Jamie
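The mailbox rule above can be written down as a small model. This is a sketch under stated assumptions, not how any kernel or mail reader implements it: `has_new_mail`, `atime_after_read` and `mailbox_has_new_mail` are hypothetical names, the once-per-second refresh interval is an assumed default, and only `os.stat` with its `st_atime_ns` and `st_mtime_ns` fields is a real API.

```python
import os

NSEC_PER_SEC = 1_000_000_000


def has_new_mail(atime_ns, mtime_ns):
    """Mailbox heuristic: the file was written after it was last read."""
    return mtime_ns > atime_ns


def atime_after_read(atime_ns, mtime_ns, now_ns, min_interval_ns=NSEC_PER_SEC):
    """Return the atime to record for a read at now_ns (hypothetical policy)."""
    if atime_ns < mtime_ns:
        # First read since the last write: atime must reach at least mtime,
        # otherwise the mailbox would keep looking unread.
        return max(now_ns, mtime_ns)
    if now_ns - atime_ns >= min_interval_ns:
        # Later reads only refresh atime at the rate-limited interval.
        return now_ns
    return atime_ns


def mailbox_has_new_mail(path):
    """Apply the heuristic to a real file via its nanosecond stat fields."""
    st = os.stat(path)
    return has_new_mail(st.st_atime_ns, st.st_mtime_ns)
```

Taking `max(now_ns, mtime_ns)` covers both options mentioned in the message: it uses the current clock when that is ahead of the mtime and falls back to copying the mtime when the clock value is coarser.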
* Re: New nanosecond stat patch for 2.5.44
  2002-10-30 0:44 ` Jamie Lokier
@ 2002-10-30 21:12 ` Bill Davidsen
  2002-10-30 22:17 ` Jamie Lokier
  0 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2002-10-30 21:12 UTC
To: Jamie Lokier; +Cc: Andreas Dilger, Andi Kleen, linux-kernel

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> Bill Davidsen wrote:
> > I admit to being one of the "thousands" people, and even if I have 100k
> > inodes (more likely to be 10% of that) it's in the order of a MB, and any
> > machine which has 100k inodes open is likely to be large enough to ignore
> > a MB. One advantage of keeping the HRT in the in-core inode is that it
> > allows parallel make to work correctly even on a filesystem which doesn't
> > have space to save that information.
> >
> > Feel free to tell me if that last isn't true.
>
> It isn't true if the parallel make actually uses your RAM for
> something, thus flushing some of the inodes from RAM.

Hopefully it is being smart about doing that, or rather not doing that.
But that would be a good thing to add to my responsiveness benchmark, to
access a file, do a stat, and then do another stat later. Thanks for the
idea, I expect to release a new version sometime this weekend.

> Admittedly it is no worse than we have at the moment. However, at the
> moment it is possible to construct a "make" or other program of that
> ilk which can always make a safe decision: if it's ambiguous whether a
> file needs to be remade, then remake the file.
>
> As soon as we have inodes time stamp resolution being spontaneously
> lowered (because some of the inodes are flushed from RAM and some
> aren't), then it's not possible to make a safe program like that
> anymore, unless you simply ignore the high resolution time stamps
> _all_ the time, even when they are present.
>
> You can just do that - it's correct behaviour. But it would be better
> to use the high precision when available, as that reduces the number
> of unnecessary remakes.

I have to think about the point you raise of doing it one way or the other
but not mixing. I had assumed that the inode of a file which was open
would remain in core, and I want to look at the code before I form an
opinion. If the file is not open or the inode is a non-file...

> > 4 - the time could be stored in register values, ticks, or whatever else,
> > avoiding any conversion to ns. Then the time could be converted only when
> > the inode was read, written out, etc.
> >
> > I'd really like your comments on these, you probably see things I've
> > missed.
>
> I know of exactly one application which depends on atime information:
> checking whether you have new mail in your inbox. That's done by
> comparing atime and mtime on the mailbox. Mail readers read the file
> after writing it, MTAs will simply write it.
>
> For this to function correctly, what's important is that the atime is
> updated to be at least the mtime. So for nanosecond atime updates, it
> makes sense that the _first_ read following a write should update the
> atime -- if not using the current clock, then simply copying the mtime
> value.

I think you may have missed the point of (4), some of the overhead of
keeping HRT is the conversion of data to ns from some machine dependent
information. Where possible the base information, such as a register,
could be stored with a flag, avoiding the "convert to ns" CPU usage. The
conversion could be done when the data was used, before save, at the time
of a stat, etc. I have the feeling that would take some of the sting out
of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
was in response to "nobody uses HRT atime" in an earlier post.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: New nanosecond stat patch for 2.5.44
  2002-10-30 21:12 ` Bill Davidsen
@ 2002-10-30 22:17 ` Jamie Lokier
  2002-10-31 0:34 ` H. Peter Anvin
  2002-11-01 1:57 ` Bill Davidsen
  0 siblings, 2 replies; 21+ messages in thread
From: Jamie Lokier @ 2002-10-30 22:17 UTC
To: Bill Davidsen; +Cc: Andreas Dilger, Andi Kleen, linux-kernel

Bill Davidsen wrote:
> I have to think about the point you raise of doing it one way or the other
> but not mixing. I had assumed that the inode of a file which was open
> would remain in core, and I want to look at the code before I form an
> opinion. If the file is not open or the inode is a non-file...

Oh, the inode of a file which is open does remain in core. It's just
that between runs of a program like "make", the files aren't open, are
they?

> I think you may have missed the point of (4), some of the overhead of
> keeping HRT is the conversion of data to ns from some machine dependent
> information. Where possible the base information, such as a register,
> could be stored with a flag, avoiding the "convert to ns" CPU usage. The
> conversion could be done when the data was used, before save, at the time
> of a stat, etc. I have the feeling that would take some of the sting out
> of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
> was in response to "nobody uses HRT atime" in an earlier post.

That's some of the overhead. The other overhead is reading the clock,
which is quite high on x86 when TSC is not available. On a Pentium
with no reliable TSC, I think that the time for a read() system call
is comparable to the time to read the clock.

-- Jamie
* Re: New nanosecond stat patch for 2.5.44
  2002-10-30 22:17 ` Jamie Lokier
@ 2002-10-31 0:34 ` H. Peter Anvin
  2002-11-01 1:57 ` Bill Davidsen
  1 sibling, 0 replies; 21+ messages in thread
From: H. Peter Anvin @ 2002-10-31 0:34 UTC
To: linux-kernel

Followup to: <20021030221724.GA25231@bjl1.asuk.net>
By author: Jamie Lokier <lk@tantalophile.demon.co.uk>
In newsgroup: linux.dev.kernel
>
> That's some of the overhead. The other overhead is reading the clock,
> which is quite high on x86 when TSC is not available. On a Pentium
> with no reliable TSC, I think that the time for a read() system call
> is comparable to the time to read the clock.
>

Typically the way you deal with not having a usably cheap
nanosecond-resolution clock is that you use the best available clock
(say if HZ=1000 you'll increment by 1000000 each timer tick), and then
simply use an atomic counter for the smaller divisions. This makes
the relation "is A newer than B" correct, while avoiding the overhead
of producing exact timestamps below the available resolution.

	-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com>
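The tick-plus-counter scheme described above could look roughly like this. It is a userspace sketch in Python, not the kernel mechanism: `OrderedStampClock` and its `stamp` method are hypothetical names, a `threading.Lock` stands in for the atomic counter, and the caller is assumed to pass in a clock value that is already rounded to the tick (1,000,000 ns for HZ=1000).

```python
import threading


class OrderedStampClock:
    """Coarse tick value plus a per-tick counter for the low-order digits."""

    def __init__(self, tick_ns=1_000_000):
        self.tick_ns = tick_ns
        self._lock = threading.Lock()  # stands in for an atomic counter
        self._tick = -1                # last coarse clock value seen
        self._count = 0                # stamps handed out within that tick

    def stamp(self, coarse_now_ns):
        """Return a timestamp that never goes backwards.

        Stamps issued within one tick increase by 1 ns each, so
        "is A newer than B" holds without reading a precise clock.
        """
        with self._lock:
            if coarse_now_ns > self._tick:
                # New timer tick: restart the sub-tick counter.
                self._tick = coarse_now_ns
                self._count = 0
            elif self._count < self.tick_ns - 1:
                # Same tick: bump the counter, but never spill into the
                # range that belongs to the next tick.
                self._count += 1
            return self._tick + self._count
```

One design choice in this sketch is the saturation branch: if more than `tick_ns` stamps are requested within a single tick, later stamps repeat the largest value for that tick, so ordering degrades to "not older than" instead of colliding with the next tick's values.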
* Re: New nanosecond stat patch for 2.5.44
  2002-10-30 22:17 ` Jamie Lokier
  2002-10-31 0:34 ` H. Peter Anvin
@ 2002-11-01 1:57 ` Bill Davidsen
  2002-11-01 3:32 ` Jamie Lokier
  1 sibling, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2002-11-01 1:57 UTC
To: Jamie Lokier; +Cc: Andreas Dilger, Andi Kleen, linux-kernel

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> Bill Davidsen wrote:
> > I have to think about the point you raise of doing it one way or the other
> > but not mixing. I had assumed that the inode of a file which was open
> > would remain in core, and I want to look at the code before I form an
> > opinion. If the file is not open or the inode is a non-file...
>
> Oh, the inode of a file which is open does remain in core. It's just
> that between runs of a program like "make", the files aren't open, are
> they?

I thought we were talking about parallel make, rather than "between runs."
Your point is valid, but given the certainty that the inode has been
recently used, hopefully the kernel is smart on releasing them.

My first thought is that the commonly used filesystems, other than ext2,
do or will support high resolution time. NFS is its own nasty little
problem.

> > I think you may have missed the point of (4), some of the overhead of
> > keeping HRT is the conversion of data to ns from some machine dependent
> > information. Where possible the base information, such as a register,
> > could be stored with a flag, avoiding the "convert to ns" CPU usage. The
> > conversion could be done when the data was used, before save, at the time
> > of a stat, etc. I have the feeling that would take some of the sting out
> > of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
> > was in response to "nobody uses HRT atime" in an earlier post.
>
> That's some of the overhead. The other overhead is reading the clock,
> which is quite high on x86 when TSC is not available. On a Pentium
> with no reliable TSC, I think that the time for a read() system call
> is comparable to the time to read the clock.

Who uses a CPU without TSC? I guess the embedded folks and the people
using really old systems. There was a suggestion on handling that posted,
but I don't have it handy. Using the field as just a counter was the idea
if I remember correctly.

The NUMA folks have their own set of problems, I won't presume to even
have an opinion on how they solve it, but if it needs doing I'm sure they
can do it.

Thinking out loud:

To avoid overhead, the kernel needs to be smart about when the updated
inode info is written to storage. Perhaps on writes when the data written
actually falls off the elevator or is transferred to a network peer. Until
then the time can stay in memory, if the system goes down write data is
lost, so having the inode reflect the time of the last completed write to
storage isn't wildly wrong mtime.

For reads, having some bounded delay between the time of a system call to
read() and the time saved in the inode is of limited impact, as long as
the time to update the inode to storage doesn't get wildly behind the time
of the read. The one second you mentioned is probably aggressive if
anything. That might have to be a tunable.

I haven't forgotten access via execute, I don't know if it differs from
read in practice.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: New nanosecond stat patch for 2.5.44
  2002-11-01 1:57 ` Bill Davidsen
@ 2002-11-01 3:32 ` Jamie Lokier
  0 siblings, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2002-11-01 3:32 UTC
To: Bill Davidsen; +Cc: Andreas Dilger, Andi Kleen, linux-kernel

Bill Davidsen wrote:
> > Oh, the inode of a file which is open does remain in core. It's just
> > that between runs of a program like "make", the files aren't open, are
> > they?
>
> I thought we were talking about parallel make, rather than "between runs."

A parallel build often does call "make" separately many times, in
parallel but not guaranteed to overlap all file opens. Between those,
the files are closed.

> Your point is valid, but given the certainty that the inode has been
> recently used, hopefully the kernel is smart on releasing them.

That's a "hopefully", and it depends on how much RAM you have as well
as pure luck. I can live with that for building programs at home, but
there are many applications where "hopefully" affecting correctness of
behaviour is not acceptable.

> My first thought is that the commonly used filesystems, other than ext2,
> do or will support high resolution time. NFS is its own nasty little
> problem.

Do they support nanosecond time, though, or do they round it to
microseconds or something like that?

> [stuff about atime]

There seems to be general agreement that atime is not a very important
value, with which I concur. (Why do we even bother with nanosecond
atimes?)

I am only concerned about mtime, which is very useful indeed when we
talk about building things which can detect changes to files.

Andi, I believe there is space in every architecture's stat64 (i.e. all
those that have one) for a word describing the mtime resolution. If I
code a patch to create that field, would you be interested?

-- Jamie
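The "if it's ambiguous, remake" rule raised earlier in the thread, combined with a known mtime resolution, might be sketched as follows. This is an illustration only: `needs_remake` is a hypothetical helper, and `resolution_ns` stands for the resolution word proposed above, which is assumed to be supplied by the caller since no such stat field is confirmed by the thread.

```python
NSEC_PER_SEC = 1_000_000_000


def needs_remake(target_mtime_ns, source_mtime_ns, resolution_ns=1):
    """Conservative rebuild test for timestamps of mixed precision.

    Both timestamps are truncated to the coarsest resolution either of
    them may have been stored with. If the source then looks the same
    age or newer than the target, the case is ambiguous and the target
    is remade.
    """
    target = target_mtime_ns - target_mtime_ns % resolution_ns
    source = source_mtime_ns - source_mtime_ns % resolution_ns
    return source >= target
```

For example, a source written at 5.7 s whose inode was flushed and reread as 5.0 s looks older than a target built at 5.3 s when the raw values are compared. With `resolution_ns=NSEC_PER_SEC` both fall into the same second, so the function answers "remake", which is the safe decision.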
Thread overview: 21+ messages
[not found] <20021027121318.GA2249@averell.suse.lists.linux.kernel>
[not found] ` <20021027214913.GA17533@clusterfs.com.suse.lists.linux.kernel>
2002-10-28 4:42 ` New nanosecond stat patch for 2.5.44 Andi Kleen
2002-10-28 5:35 ` Andreas Dilger
[not found] ` <aphqqo$261$1@cesium.transmeta.com.suse.lists.linux.kernel>
[not found] ` <3DBC9194.5090006@nortelnetworks.com.suse.lists.linux.kernel>
2002-10-28 4:47 ` Andi Kleen
2002-10-27 12:13 Andi Kleen
2002-10-27 21:49 ` Andreas Dilger
2002-10-27 22:54 ` H. Peter Anvin
2002-10-28 1:23 ` Chris Friesen
2002-10-28 1:35 ` Rob Landley
2002-11-06 13:27 ` Gabriel Paubert
2002-11-06 18:00 ` H. Peter Anvin
2002-10-27 23:16 ` Horst von Brand
2002-10-28 17:10 ` Andreas Dilger
2002-10-29 15:01 ` Bill Davidsen
2002-10-29 16:30 ` Andreas Dilger
2002-10-29 20:37 ` Bill Davidsen
2002-10-30 0:44 ` Jamie Lokier
2002-10-30 21:12 ` Bill Davidsen
2002-10-30 22:17 ` Jamie Lokier
2002-10-31 0:34 ` H. Peter Anvin
2002-11-01 1:57 ` Bill Davidsen
2002-11-01 3:32 ` Jamie Lokier