* ext4 features
@ 2006-07-01 16:33 Thomas Glanzmann
2006-07-01 17:07 ` Tomasz Torcz
` (2 more replies)
0 siblings, 3 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-01 16:33 UTC (permalink / raw)
To: Theodore Ts'o, LKML
Hello,
I would like to know which new features are planned to be incorporated by
ext4. So far I only read about supporting bigger filesystems to fit
recent hardware developments. So are there any other big goals for ext4?
What I personally would like to see most in ext4 are
* checksums for data
* and snapshots on filesystem basis
But I guess that this is way out of scope for ext4.
Thomas
* Re: ext4 features
From: Tomasz Torcz @ 2006-07-01 17:07 UTC
To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote:
> Hello,
> I would like to know which new features are planned to be incorporated by
> ext4. So far I only read about supporting bigger filesystems to fit
> recent hardware developments. So are there any other big goals for ext4?
>
> What I personally would like to see most in ext4 are
>
> * checksums for data

Checksums are not very useful for themselves. They are useful when we
have other copy of data (think raid mirroring) so data can be
reconstructed from working copy.

> * and snapshots on filesystem basis

What's wrong with DM snapshots?

--
Tomasz Torcz                 There exists no separation between gods and men:
zdzichu@irc.-nie.spam-.pl    one blends softly casual into the other.
* Re: ext4 features
From: Thomas Glanzmann @ 2006-07-01 17:47 UTC
To: Theodore Ts'o, LKML

Hello,

> Checksums are not very useful for themselves. They are useful when we
> have other copy of data (think raid mirroring) so data can be
> reconstructed from working copy.

it would be possible to identify data corruption.

> What's wrong with DM snapshots?

they're inefficient in matter of disk space consumption because they
don't have a clue of the filesystems that are on top of them.

Thomas
* Re: ext4 features
From: Claudio Martins @ 2006-07-01 18:09 UTC
To: Thomas Glanzmann
Cc: Theodore Ts'o, LKML

On Saturday 01 July 2006 18:47, Thomas Glanzmann wrote:
> Hello,
>
> > Checksums are not very useful for themselves. They are useful when we
> > have other copy of data (think raid mirroring) so data can be
> > reconstructed from working copy.
>
> it would be possible to identify data corruption.
>
> > What's wrong with DM snapshots?
>
> they're inefficient in matter of disk space consumption because they
> don't have a clue of the filesystems that are on top of them.
>

May I recommend that you have a look at NILFS?

http://nilfs.org/en/

The design is built from the ground up to support an almost arbitrary
number of snapshots, and also has other advantages. And it works already.

Regards

Cláudio
* Re: ext4 features
From: Thomas Glanzmann @ 2006-07-01 18:59 UTC
To: Claudio Martins
Cc: Theodore Ts'o, LKML

Hello Cláudio,

> May I recommend that you have a look at NILFS?

thanks a lot for the heads-up. Indeed I was unaware of NILFS. It sounds
very interesting. I give it a snapshot. :-)

Thomas
* Re: ext4 features
From: Tomasz Torcz @ 2006-07-01 18:17 UTC
To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> Hello,
>
> > Checksums are not very useful for themselves. They are useful when we
> > have other copy of data (think raid mirroring) so data can be
> > reconstructed from working copy.
>
> it would be possible to identify data corruption.
>

Yes, but what good is identification? We could only return I/O error.
Ability to fix corruption (like ZFS) is the real killer.

--
Tomasz Torcz                 There exists no separation between gods and men:
zdzichu@irc.-nie.spam-.pl    one blends softly casual into the other.
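The difference between "return I/O error" and "fix corruption" argued above can be sketched in a few lines. This is a toy model only, not how ZFS or any md/RAID driver is implemented: the mirrors are plain in-memory byte arrays, CRC32 stands in for the checksum, and the names checksum() and read_with_repair() are invented for this illustration.

```python
import zlib


def checksum(block: bytes) -> int:
    """CRC32 stands in for whatever checksum a real filesystem would use."""
    return zlib.crc32(block)


def read_with_repair(mirrors: list[bytearray], expected: int) -> bytes:
    """Return the first copy that matches `expected` and heal the bad copies.

    Raises IOError when no copy matches; with a single copy this is the
    only possible outcome of a checksum failure.
    """
    good = next((bytes(m) for m in mirrors if checksum(bytes(m)) == expected), None)
    if good is None:
        raise IOError("all copies fail checksum verification")
    for m in mirrors:
        if checksum(bytes(m)) != expected:
            m[:] = good  # overwrite the corrupted copy with the verified one
    return good
```

With one mirror the function can only detect; with two or more it can also repair, which is the distinction drawn in the message above.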
* Re: ext4 features
From: Gabor Gombas @ 2006-07-03 9:44 UTC
To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
> Yes, but what good is identification? We could only return I/O error.

I'm regularly using unison to sync my home directory to an USB drive,
and about once in every 2-3 weeks unison complains that the data on the
USB drive does not match the checksum unison expects. An umount/remount
usually fixes the problem. There are no messages in the kernel log.

It would be really nice if the file system should catch these silent
data corruptions and at least warn me that something is fishy.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: ext4 features
From: Helge Hafting @ 2006-07-03 20:22 UTC
To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
> On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> > Hello,
> >
> > > Checksums are not very useful for themselves. They are useful when we
> > > have other copy of data (think raid mirroring) so data can be
> > > reconstructed from working copy.
> >
> > it would be possible to identify data corruption.
> >
>
> Yes, but what good is identification? We could only return I/O error.
> Ability to fix corruption (like ZFS) is the real killer.

Isn't that what we have RAID-1/5/6 for?

Helge Hafting
* Re: ext4 features
From: Tomasz Torcz @ 2006-07-03 20:55 UTC
To: Helge Hafting
Cc: Thomas Glanzmann, Theodore Ts'o, LKML

On Mon, Jul 03, 2006 at 10:22:19PM +0200, Helge Hafting wrote:
> On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
> > On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> > > Hello,
> > >
> > > > Checksums are not very useful for themselves. They are useful when we
> > > > have other copy of data (think raid mirroring) so data can be
> > > > reconstructed from working copy.
> > >
> > > it would be possible to identify data corruption.
> > >
> >
> > Yes, but what good is identification? We could only return I/O error.
> > Ability to fix corruption (like ZFS) is the real killer.
>
> Isn't that what we have RAID-1/5/6 for?

ZFS was already called ,,blatant layering violation''. ;)
Yes, that what RAID is for. And if we want checksums in filesystem,
that's the best way to utilise them.

--
Tomasz Torcz                 Morality must always be based on practicality.
zdzichu@irc.-nie.spam-.pl    -- Baron Vladimir Harkonnen
* Re: ext4 features
From: Arjan van de Ven @ 2006-07-03 21:01 UTC
To: Tomasz Torcz
Cc: Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML

> ZFS was already called ,,blatant layering violation''. ;)
> Yes, that what RAID is for. And if we want checksums in filesystem,
> that's the best way to utilise them.

Hi,

checksums have a very different purpose than raid.

checksums are great at detecting corruption. And yes, corruption can
happen even if you have raid, for many many reasons. Detecting means
knowing when to not trust something, when to go for the backup tapes...

raid is great for protecting against individual disks or sectors going
bad. But raid, especially high performance implementations, do not
checksum data or detect corruptions.

They're different purpose with almost zero overlap in purpose or even
goal...
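Detection on its own, as described above, amounts to storing a checksum per block at write time and comparing on a later pass. The following is a minimal sketch under stated assumptions: the 4096-byte block size, the choice of SHA-256 and the function names are all invented for illustration and say nothing about what ext4 would or should do.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_checksums(data: bytes) -> list[str]:
    """Compute one SHA-256 digest per BLOCK_SIZE chunk of `data`."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def scrub(data: bytes, stored: list[str]) -> list[int]:
    """Return the indices of blocks whose checksum no longer matches.

    Assumes `data` has the same number of blocks as when `stored` was taken.
    """
    current = block_checksums(data)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]
```

A non-empty result from scrub() is the "go for the backup tapes" signal: it says which blocks cannot be trusted, but gives no way to reconstruct them.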
* Re: ext4 features
From: Jeff V. Merkey @ 2006-07-03 21:46 UTC
To: Arjan van de Ven
Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML

Arjan van de Ven wrote:

>> ZFS was already called ,,blatant layering violation''. ;)
>>Yes, that what RAID is for. And if we want checksums in filesystem,
>>that's the best way to utilise them.
>>
>
>Hi,
>
>checksums have a very different purpose than raid.
>
>checksums are great at detecting corruption. And yes, corruption can
>happen even if you have raid, for many many reasons. Detecting means
>knowing when to not trust something, when to go for the backup tapes...
>
>raid is great for protecting against individual disks or sectors going
>bad. But raid, especially high performance implementations, do not
>checksum data or detect corruptions.
>
>They're different purpose with almost zero overlap in purpose or even
>goal...
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>

Add a salvageable file system to ext4, i.e. when a file is deleted, you
just rename it and move it to a directory called DELETED.SAV and recycle
the files as people allocate new ones. Easy to do (internal "mv" of file
to another directory) and modification of the allocation bitmaps.

Very simple and will pay off big. If you need help designing it, just
ask me.

Jeff
* Re: ext4 features
From: Diego Calleja @ 2006-07-03 21:25 UTC
To: Jeff V. Merkey
Cc: arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

On Mon, 03 Jul 2006 15:46:55 -0600,
"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> wrote:

> Add a salvageable file system to ext4, i.e. when a file is deleted, you
> just rename it and move it to a directory called DELETED.SAV and recycle
> the files as people allocate new ones. Easy to do (internal "mv" of

Easily doable in userspace, why bother with kernel programming
* Re: ext4 features
From: Alan Cox @ 2006-07-03 22:17 UTC
To: Diego Calleja
Cc: Jeff V. Merkey, arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote:
> > Add a salvageable file system to ext4, i.e. when a file is deleted, you
> > just rename it and move it to a directory called DELETED.SAV and recycle
> > the files as people allocate new ones. Easy to do (internal "mv" of
>
>
> Easily doable in userspace, why bother with kernel programming

To get the semantics you need and avoid rewriting all of user space. At
the moment some GNU apps support this type of stuff but it's not in the
core libraries so it isn't generalised.

There are some big problems with "deleted" however and doing it in
kernel space. A lot of programs just overwrite data. You would have to
look for things like O_TRUNC on a file open and ftruncate.

The ftruncate case is particularly ugly because there are programs that
do lots of ftruncate calls as they run and don't necessarily "overwrite"
data but are merely trimming logs or database files. To add to the fun
the 'old' file needs to be the one which ends up with a new inode number
and the like.

Alan
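The two cases named above can be demonstrated from userspace: both discard a file's data without any unlink ever being issued, so a scheme that hooks only unlink would never see them. A small sketch using Python's os module; the helper names are invented for this illustration.

```python
import os


def destroy_by_open_trunc(path: str) -> None:
    """Discard a file's contents by opening it with O_TRUNC; nothing is unlinked."""
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    os.close(fd)


def destroy_by_ftruncate(path: str, keep: int = 0) -> None:
    """Cut a file down to `keep` bytes with ftruncate; again nothing is unlinked."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.ftruncate(fd, keep)
    finally:
        os.close(fd)
```

In both cases the path and the inode number stay the same while the old data is gone, which is why the message above notes that the saved 'old' copy would have to be the one that receives a new inode.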
* Re: ext4 features
From: Jan Engelhardt @ 2006-07-04 14:45 UTC
To: Alan Cox
Cc: Diego Calleja, Jeff V. Merkey, arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

>
>There are some big problems with "deleted" however and doing it in
>kernel space. A lot of programs just overwrite data. You would have to
>look for things like O_TRUNC on a file open and ftruncate.
>

At least I only want deleted files to be saved, not truncated. The way
the MSWIN (the gui parts) do it is enough for most users.

Jan Engelhardt
--
* Re: ext4 features
From: Jeffrey V. Merkey @ 2006-07-04 16:35 UTC
To: Jan Engelhardt
Cc: Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

Jan Engelhardt wrote:

>>There are some big problems with "deleted" however and doing it in
>>kernel space. A lot of programs just overwrite data. You would have to
>>look for things like O_TRUNC on a file open and ftruncate.
>>
>At least I only want deleted files to be saved, not truncated. The way
>the MSWIN (the gui parts) do it is enough for most users.
>
>
>Jan Engelhardt
>

Well,

The old novell model is simple. When someone unlinks a file, don't
delete it, just mv it to another special directory called DELETED.SAV.
Then setup the fs space allocation to reuse these files when the drive
fills up by oldest files first. It's very simple. Then you have a
salvageable file system.

Jeff
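The model described above has two halves: a rename into DELETED.SAV on unlink, and oldest-first reclamation when space is needed. A userspace sketch of both follows. It only illustrates the policy and is not kernel code or NetWare's implementation; the function names, the sequence-number prefix used to record deletion order, and the explicit `root` argument are assumptions made for this example.

```python
import os

SALVAGE_DIR = "DELETED.SAV"  # directory name taken from the proposal


def _entries(target_dir: str) -> list[tuple[int, str]]:
    """Return (sequence, name) pairs for salvaged files, oldest first.

    Assumes every name in the directory was created by salvage_unlink().
    """
    pairs = []
    for name in os.listdir(target_dir):
        seq, _, _ = name.partition("_")
        pairs.append((int(seq), name))
    return sorted(pairs)


def salvage_unlink(path: str, root: str) -> str:
    """Instead of unlinking `path`, rename it into root/DELETED.SAV.

    os.rename only works within one filesystem, so `root` must be on the
    same filesystem as `path`.
    """
    target_dir = os.path.join(root, SALVAGE_DIR)
    os.makedirs(target_dir, exist_ok=True)
    entries = _entries(target_dir)
    seq = entries[-1][0] + 1 if entries else 0
    target = os.path.join(target_dir, f"{seq}_{os.path.basename(path)}")
    os.rename(path, target)
    return target


def reclaim(root: str, bytes_needed: int) -> int:
    """Really delete salvaged files, oldest first, until enough bytes are freed."""
    target_dir = os.path.join(root, SALVAGE_DIR)
    if not os.path.isdir(target_dir):
        return 0
    freed = 0
    for _, name in _entries(target_dir):
        if freed >= bytes_needed:
            break
        victim = os.path.join(target_dir, name)
        freed += os.path.getsize(victim)
        os.remove(victim)
    return freed
```

In this sketch reclaim() has to be called explicitly. The in-kernel variant discussed in the thread would instead trigger it from the block allocator when the drive fills up, which is the part userspace cannot do on its own.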
* Re: ext4 features
From: Jeff Garzik @ 2006-07-04 18:52 UTC
To: Jeffrey V. Merkey
Cc: Jan Engelhardt, Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

Jeffrey V. Merkey wrote:
> The old novell model is simple. When someone unlinks a file, don't
> delete it, just mv it to another special directory called DELETED.SAV.
> Then setup the
> fs space allocation to reuse these files when the drive fills up by
> oldest files first. It's very simple. Then you have a salvageable file
> system.

Such a scheme makes it much more difficult to allocate large, contiguous
runs of free space for storing newly written data.

Jeff
* Re: ext4 features
From: Jeffrey V. Merkey @ 2006-07-04 19:40 UTC
To: Jeff Garzik
Cc: Jan Engelhardt, Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

Jeff Garzik wrote:

> Jeffrey V. Merkey wrote:
>
>> The old novell model is simple. When someone unlinks a file, don't
>> delete it, just mv it to another special directory called
>> DELETED.SAV. Then setup the
>> fs space allocation to reuse these files when the drive fills up by
>> oldest files first. It's very simple. Then you have a salvageable file
>> system.
>
>
> Such a scheme makes it much more difficult to allocate large,
> contiguous runs of free space for storing newly written data.
>
> Jeff

Possibly. Organize the files in DELETED.SAV by disk location and date.
Files don't have to adhere to a strict date recycling process. Make it a
mount option if the user wants strict date recycling. Make the default
to choose between date and file sector locality.

Jeff
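One way to read "choose between date and file sector locality" is as a victim-selection policy: prefer a physically contiguous run of salvaged files that is large enough, and fall back to strict oldest-first otherwise. The sketch below is one possible interpretation, not a description of any existing allocator. It assumes every salvaged file occupies a single extent, and the Extent type, the strict_date flag (standing in for the proposed mount option) and the tie-breaking rule are all invented here.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    start: int       # first block of the salvaged file (single extent assumed)
    length: int      # number of blocks
    deleted_at: int  # deletion time, in any monotonically increasing unit


def adjacent_runs(extents: list[Extent]) -> list[list[Extent]]:
    """Group extents into runs that are physically adjacent on disk."""
    runs: list[list[Extent]] = []
    for ext in sorted(extents):  # NamedTuple ordering sorts by start block first
        if runs and runs[-1][-1].start + runs[-1][-1].length == ext.start:
            runs[-1].append(ext)
        else:
            runs.append([ext])
    return runs


def pick_victims(extents: list[Extent], blocks_needed: int,
                 strict_date: bool = False) -> list[Extent]:
    """Choose which salvaged files to recycle for an allocation request."""
    if not strict_date:
        big_enough = [run for run in adjacent_runs(extents)
                      if sum(e.length for e in run) >= blocks_needed]
        if big_enough:
            # Among contiguous candidates, take the run whose newest member is oldest.
            return min(big_enough, key=lambda run: max(e.deleted_at for e in run))
    # Strict date recycling, also used when no contiguous run is large enough.
    victims, total = [], 0
    for ext in sorted(extents, key=lambda e: e.deleted_at):
        if total >= blocks_needed:
            break
        victims.append(ext)
        total += ext.length
    return victims
```

The default branch returns a whole contiguous run even when it is larger than the request, which wastes salvage history; a real policy would trim the run. That trade-off is left out to keep the sketch short.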
* Re: ext4 features
From: Lew Palm @ 2006-07-05 13:35 UTC
To: Jeffrey V. Merkey
Cc: linux-kernel

Jeffrey V. Merkey wrote:
> The old novell model is simple. When someone unlinks a file, don't
> delete it, just mv it to another special directory called DELETED.SAV.
> Then setup the
> fs space allocation to reuse these files when the drive fills up by
> oldest files first. It's very simple. Then you have a salvageable file
> system.

A complete foolproof car is a car with a maximum speed of 0 mph.

As a user I give commands to my computer, for example an order to delete
a file. And this is what I expect it to do. If I want it to move a file
to another position in the filesystem, I would use another command.
I don't want my operating system to josh me, that's why I use Linux.

Stealthy keeping of deleted files somewhere is a security black hole.

But accidents happen. Hardware perishes, users are making mistakes,
sometimes coffee is pouring... That's why we backup important data
regularly. A not-really-deleting-filesystem wouldn't relieve us of that
duty, but would make a system more insecure and ambiguous.

Lew
* Re: ext4 features
From: Jeff V. Merkey @ 2006-07-03 23:01 UTC
To: Diego Calleja
Cc: arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

Diego Calleja wrote:

>On Mon, 03 Jul 2006 15:46:55 -0600,
>"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> wrote:
>
>>Add a salvageable file system to ext4, i.e. when a file is deleted, you
>>just rename it and move it to a directory called DELETED.SAV and recycle
>>the files as people allocate new ones. Easy to do (internal "mv" of
>>
>
>Easily doable in userspace, why bother with kernel programming
>

Fine, leave it out. More for me that way in additive features for my
products for stuff Linux does not provide.

Jeff
* Re: ext4 features
From: Benny Amorsen @ 2006-07-04 9:14 UTC
To: linux-kernel

>>>>> "DC" == Diego Calleja <diegocg@gmail.com> writes:

DC> On Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
DC> <jmerkey@wolfmountaingroup.com> wrote:

>> Add a salvageable file system to ext4, i.e. when a file is deleted,
>> you just rename it and move it to a directory called DELETED.SAV
>> and recycle the files as people allocate new ones. Easy to do
>> (internal "mv" of

DC> Easily doable in userspace, why bother with kernel programming

In userspace you can't automatically delete the files when the space
becomes needed. The LD_PRELOAD/glibc methods also have the
disadvantage of having to figure out where a file goes when it's
deleted, depending on which device it happens to reside on. Demanding
read access to /proc/mounts just to do rm could cause problems.

Userspace has had 10 years to invent a good solution. If it was so
easy, it would probably have been done.

/Benny
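The "where does the file go" problem above comes from rename working only within one filesystem, so a userspace trash needs a per-filesystem directory. One common approximation that avoids reading /proc/mounts is to walk up the directory tree until the device number changes. A sketch, with invented function names; it is only a heuristic, since for example a bind mount of the same device does not change st_dev.

```python
import os


def mount_root(path: str) -> str:
    """Walk up from `path` until the parent directory is on a different device."""
    current = os.path.realpath(path)
    if not os.path.isdir(current):
        current = os.path.dirname(current)
    dev = os.stat(current).st_dev
    while True:
        parent = os.path.dirname(current)
        # Stop at the filesystem root or where the device number changes.
        if parent == current or os.stat(parent).st_dev != dev:
            return current
        current = parent


def salvage_dir_for(path: str) -> str:
    """Return the DELETED.SAV directory on the filesystem that holds `path`."""
    return os.path.join(mount_root(path), "DELETED.SAV")
```

Even with this lookup the other objection stands: the calling user may have no write permission at the filesystem root, and nothing here frees space automatically when the disk fills.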
* Re: ext4 features
From: Bill Davidsen @ 2006-07-05 4:21 UTC
To: Benny Amorsen, linux-kernel

Benny Amorsen wrote:
>>>>>> "DC" == Diego Calleja <diegocg@gmail.com> writes:
>
> DC> On Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
> DC> <jmerkey@wolfmountaingroup.com> wrote:
>
>>> Add a salvageable file system to ext4, i.e. when a file is deleted,
>>> you just rename it and move it to a directory called DELETED.SAV
>>> and recycle the files as people allocate new ones. Easy to do
>>> (internal "mv" of
>
>
> DC> Easily doable in userspace, why bother with kernel programming
>
> In userspace you can't automatically delete the files when the space
> becomes needed. The LD_PRELOAD/glibc methods also have the
> disadvantage of having to figure out where a file goes when it's
> deleted, depending on which device it happens to reside on. Demanding
> read access to /proc/mounts just to do rm could cause problems.
>
> Userspace has had 10 years to invent a good solution. If it was so
> easy, it would probably have been done.
>

Actually, if it were so important it WOULD have been done. I suspect
that the issue is not lack of a good solution, but lack of a good
problem. The behavior you propose requires a lot of kernel cleverness,
including make the inodes seem to go away, so the count is "right" for
what the user sees.

--
Bill Davidsen <davidsen@tmr.com>
  Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.
* Re: ext4 features
From: H. Peter Anvin @ 2006-07-05 5:13 UTC
To: Bill Davidsen
Cc: Benny Amorsen, linux-kernel

Bill Davidsen wrote:
>>
>> DC> Easily doable in userspace, why bother with kernel programming
>>
>> In userspace you can't automatically delete the files when the space
>> becomes needed. The LD_PRELOAD/glibc methods also have the
>> disadvantage of having to figure out where a file goes when it's
>> deleted, depending on which device it happens to reside on. Demanding
>> read access to /proc/mounts just to do rm could cause problems.
>>
>> Userspace has had 10 years to invent a good solution. If it was so
>> easy, it would probably have been done.
>>
> Actually, if it were so important it WOULD have been done. I suspect
> that the issue is not lack of a good solution, but lack of a good
> problem. The behavior you propose requires a lot of kernel cleverness,
> including make the inodes seem to go away, so the count is "right" for
> what the user sees.
>

The real solution for it is snapshots.

	-hpa
* Re: ext4 features
From: Jeffrey V. Merkey @ 2006-07-05 5:45 UTC
To: H. Peter Anvin
Cc: Bill Davidsen, Benny Amorsen, linux-kernel

H. Peter Anvin wrote:

> Bill Davidsen wrote:
>
>>>
>>> DC> Easily doable in userspace, why bother with kernel programming
>>>
>>> In userspace you can't automatically delete the files when the space
>>> becomes needed. The LD_PRELOAD/glibc methods also have the
>>> disadvantage of having to figure out where a file goes when it's
>>> deleted, depending on which device it happens to reside on. Demanding
>>> read access to /proc/mounts just to do rm could cause problems.
>>>
>>> Userspace has had 10 years to invent a good solution. If it was so
>>> easy, it would probably have been done.
>>>
>> Actually, if it were so important it WOULD have been done. I suspect
>> that the issue is not lack of a good solution, but lack of a good
>> problem. The behavior you propose requires a lot of kernel
>> cleverness, including make the inodes seem to go away, so the count
>> is "right" for what the user sees.
>>
>
> The real solution for it is snapshots.

Peter,

Explain what you are thinking here. What I proposed, I have already
implemented in NetWare, it's very easy to do. Snapshotting is not
complex for FS's but does require a lot of space for meta-data to manage
it. EXT is not architected for something this complex. A simple hidden
mv is much easier to do.

Jeff

>
> -hpa
* Re: ext4 features
From: Pavel Machek @ 2006-07-07 14:12 UTC
To: Jeffrey V. Merkey
Cc: H. Peter Anvin, Bill Davidsen, Benny Amorsen, linux-kernel

Hi!

> >>Actually, if it were so important it WOULD have been
> >>done. I suspect that the issue is not lack of a good
> >>solution, but lack of a good problem. The behavior you
> >>propose requires a lot of kernel cleverness, including
> >>make the inodes seem to go away, so the count is
> >>"right" for what the user sees.
> >>
> >
> >The real solution for it is snapshots.
>
> Peter,
>
> Explain what you are thinking here. What I proposed, I
> have already implemented in NetWare, it's very easy to
> do. Snapshotting is not complex for FS's but does
> require a lot of space for meta-data to manage it. EXT
> is not architected for something this complex. A
> simple hidden mv is much easier to do.

Patch would be nice :-). Hidden mv is indeed simple; reclaiming space on
demand may be trickier.

							Pavel
--
Thanks for all the (sleeping) penguins.
* Re: ext4 features
From: Krzysztof Halasa @ 2006-07-05 10:38 UTC
To: H. Peter Anvin
Cc: Bill Davidsen, Benny Amorsen, linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:

> The real solution for it is snapshots.

Or a continuous log. Since we already use a journal we could possibly
make its contents stay forever (and the admin should be able to define
the "forever").
--
Krzysztof Halasa
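The "continuous log" idea can be pictured as a journal that is never discarded, so that any earlier state can be rebuilt by replaying it up to a chosen moment. The toy model below works on whole-file operations with integer timestamps. That is an assumption made purely for illustration: it is not the ext3 journal format, which records changes at a much lower level, and the tuple layout and function name are invented.

```python
def replay(journal, until):
    """Rebuild the name -> contents mapping as it stood at time `until`.

    `journal` is a time-ordered list of (when, op, name, data) tuples,
    where op is "write" or "unlink".
    """
    state = {}
    for when, op, name, data in journal:
        if when > until:
            break  # everything after this point is in the future of `until`
        if op == "write":
            state[name] = data
        elif op == "unlink":
            state.pop(name, None)
    return state
```

For example, with the journal [(1, "write", "a", b"v1"), (2, "write", "a", b"v2"), (3, "unlink", "a", None)], replaying to time 2 recovers the file as it was just before deletion. The admin-defined "forever" would then correspond to how far back the retained log reaches.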
* Re: ext4 features
From: Pavel Machek @ 2006-07-07 14:10 UTC
To: Bill Davidsen
Cc: Benny Amorsen, linux-kernel

On Wed 05-07-06 00:21:32, Bill Davidsen wrote:
> Benny Amorsen wrote:
> >>>>>>"DC" == Diego Calleja <diegocg@gmail.com> writes:
> >
> >DC> On Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
> >DC> <jmerkey@wolfmountaingroup.com> wrote:
> >
> >>>Add a salvageable file system to ext4, i.e. when a file is deleted,
> >>>you just rename it and move it to a directory called DELETED.SAV
> >>>and recycle the files as people allocate new ones. Easy to do
> >>>(internal "mv" of
> >
> >
> >DC> Easily doable in userspace, why bother with kernel programming
> >
> >In userspace you can't automatically delete the files when the space
> >becomes needed. The LD_PRELOAD/glibc methods also have the
> >disadvantage of having to figure out where a file goes when it's
> >deleted, depending on which device it happens to reside on. Demanding
> >read access to /proc/mounts just to do rm could cause problems.
> >
> >Userspace has had 10 years to invent a good solution. If it was so
> >easy, it would probably have been done.
> >
> Actually, if it were so important it WOULD have been
> done. I suspect that the issue is not lack of a good

It *was* done. mc supports undelete on ext2. Unfortunately ext3 broke
that :-(.

							Pavel
--
Thanks for all the (sleeping) penguins.
* Re: ext4 features
From: Krzysztof Halasa @ 2006-07-07 17:45 UTC
To: Pavel Machek
Cc: Bill Davidsen, Benny Amorsen, linux-kernel

Pavel Machek <pavel@ucw.cz> writes:

> It *was* done. mc supports undelete on ext2.

How does it do that? Directly accessing the device?
--
Krzysztof Halasa
* Re: ext4 features
From: Pavel Machek @ 2006-07-07 21:30 UTC
To: Krzysztof Halasa
Cc: Bill Davidsen, Benny Amorsen, linux-kernel

On Fri 07-07-06 19:45:21, Krzysztof Halasa wrote:
> Pavel Machek <pavel@ucw.cz> writes:
>
> > It *was* done. mc supports undelete on ext2.
>
> How does it do that? Directly accessing the device?

Yes. I used it once or twice, and was not happy when ext3 broke it.

--
Thanks for all the (sleeping) penguins.
* Re: ext4 features
From: Krzysztof Halasa @ 2006-07-08 10:52 UTC
To: Pavel Machek
Cc: Bill Davidsen, Benny Amorsen, linux-kernel

Pavel Machek <pavel@ucw.cz> writes:

>> > It *was* done. mc supports undelete on ext2.
>>
>> How does it do that? Directly accessing the device?
>
> Yes. I used it once or twice, and was not happy when ext3 broke it.

I'd say it had to be broken from the beginning. Doing such things
on live, mounted filesystem...
--
Krzysztof Halasa
* Re: ext4 features 2006-07-08 10:52 ` Krzysztof Halasa @ 2006-07-08 10:55 ` Pavel Machek 2006-07-08 11:19 ` Krzysztof Halasa 0 siblings, 1 reply; 119+ messages in thread From: Pavel Machek @ 2006-07-08 10:55 UTC (permalink / raw) To: Krzysztof Halasa; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel On Sat 2006-07-08 12:52:17, Krzysztof Halasa wrote: > Pavel Machek <pavel@ucw.cz> writes: > > >> > It *was* done. mc supports undelete on ext2. > >> > >> How does it do that? Directly accessing the device? > > > > Yes. I used it once or twice, and was not happy when ext3 broke it. > > I'd say it had to be broken from the beginning. Doing such things > on live, mounted filesystem... Why not? You use libextfs or how is it called to read the file from the disk directly (read-only access), then you write it back using regular calls. Of course, you can end up with "deleted" data being corrupted if kernel reused the area before undelete, or while you were doing undelete... but that's expected. They were _deleted_, right? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-08 10:55 ` Pavel Machek @ 2006-07-08 11:19 ` Krzysztof Halasa 2006-07-08 11:23 ` Pavel Machek 2006-07-08 18:45 ` Avi Kivity 0 siblings, 2 replies; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-08 11:19 UTC (permalink / raw) To: Pavel Machek; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel Pavel Machek <pavel@ucw.cz> writes: > Why not? You use libextfs or how is it called to read the file from > the disk directly (read-only access), then you write it back using > regular calls. > > Of course, you can end up with "deleted" data being corrupted if > kernel reused the area before undelete, or while you were doing > undelete... but that's expected. They were _deleted_, right? What if the "undeleted" file contained /etc/shadow because someone was changing password at the time? :-) -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-08 11:19 ` Krzysztof Halasa @ 2006-07-08 11:23 ` Pavel Machek 2006-07-08 18:45 ` Avi Kivity 1 sibling, 0 replies; 119+ messages in thread From: Pavel Machek @ 2006-07-08 11:23 UTC (permalink / raw) To: Krzysztof Halasa; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel On Sat 2006-07-08 13:19:52, Krzysztof Halasa wrote: > Pavel Machek <pavel@ucw.cz> writes: > > > Why not? You use libextfs or how is it called to read the file from > > the disk directly (read-only access), then you write it back using > > regular calls. > > > > Of course, you can end up with "deleted" data being corrupted if > > kernel reused the area before undelete, or while you were doing > > undelete... but that's expected. They were _deleted_, right? > > What if the "undeleted" file contained /etc/shadow because someone > was changing password at the time? :-) Well, that's okay :-). Pavel ...of course, undelete is root-only operation, and one that should not be taken lightly. You need to verify you got what you wanted at the end. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-08 11:19 ` Krzysztof Halasa 2006-07-08 11:23 ` Pavel Machek @ 2006-07-08 18:45 ` Avi Kivity 2006-07-08 20:24 ` Krzysztof Halasa 1 sibling, 1 reply; 119+ messages in thread From: Avi Kivity @ 2006-07-08 18:45 UTC (permalink / raw) To: Krzysztof Halasa; +Cc: Pavel Machek, Bill Davidsen, Benny Amorsen, linux-kernel Krzysztof Halasa wrote: > > Pavel Machek <pavel@ucw.cz> writes: > > > Why not? You use libextfs or how is it called to read the file from > > the disk directly (read-only access), then you write it back using > > regular calls. > > > > Of course, you can end up with "deleted" data being corrupted if > > kernel reused the area before undelete, or while you were doing > > undelete... but that's expected. They were _deleted_, right? > > What if the "undeleted" file contained /etc/shadow because someone > was changing password at the time? :-) > As the undeleter already had read access to the raw device, /etc/shadow was already compromised. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-08 18:45 ` Avi Kivity @ 2006-07-08 20:24 ` Krzysztof Halasa 0 siblings, 0 replies; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-08 20:24 UTC (permalink / raw) To: Avi Kivity; +Cc: Pavel Machek, Bill Davidsen, Benny Amorsen, linux-kernel Avi Kivity <avi@argo.co.il> writes: >> What if the "undeleted" file contained /etc/shadow because someone >> was changing password at the time? :-) >> > > As the undeleter already had read access to the raw device, > /etc/shadow was already compromised. I understand only root had access, but the file in question might be requested by a user. Of course root should have known the consequences but... -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:25 ` Diego Calleja ` (2 preceding siblings ...) 2006-07-04 9:14 ` Benny Amorsen @ 2006-07-04 9:22 ` Petr Tesarik 2006-07-04 11:35 ` Peter Zijlstra 3 siblings, 1 reply; 119+ messages in thread From: Petr Tesarik @ 2006-07-04 9:22 UTC (permalink / raw) To: Diego Calleja; +Cc: linux-kernel On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote: > El Mon, 03 Jul 2006 15:46:55 -0600, > "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió: > > > Add a salvagable file system to ext4, i.e. when a file is deleted, you > > just rename it and move it to a directory called DELETED.SAV and recycle > > the files as people allocate new ones. Easy to do (internal "mv" of > > > Easily doable in userspace, why bother with kernel programming Yes and no. A simple mv is better done in userspace, but what I'd _really_ appreciate would be a true kernel salvage (similar to the way NetWare does things). That means marking the file as deleted in the directory, marking its blocks as deleted but avoiding the use of those blocks. The kernel would then prefer allocating new blocks from elsewhere but once the filesystem runs out of space, it would start allocating from the deleted files area and marking the blocks as well as the corresponding files purged. Salvaging files would be done with a separate tool. Of course, if you delete more files with the same name in the same directory, you'd need to tell that tool which one of them you want to salvage. Yes, I really mean you'd have more than one deleted file with the same name in the directory. Anyway, I doubt we want such feature for ext4, because to make things efficient, you'd need to provide some kind of pointer from the deleted (but not yet purged) blocks to the corresponding file. Hard links are also problematic and there is a whole lot of other troubles I haven't even thought of. Just my two cents. -- Petr Tesarik ^ permalink raw reply [flat|nested] 119+ messages in thread
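A toy model may make the allocation policy described above easier to reason about. This is only a sketch under stated assumptions, not NetWare's implementation and not a proposed ext4 design: the class name SalvageAllocator and its methods are hypothetical, blocks are plain integers, and files are tracked by a numeric id so that several deleted files with the same name stay distinguishable. New files take never-used blocks first, and the oldest deletions are purged only when free space runs out.

```python
from collections import OrderedDict


class SalvageAllocator:
    """Toy block allocator: deleted blocks stay salvageable until space runs out."""

    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # unused or already purged blocks
        self.live = {}                         # file id -> list of block numbers
        self.deleted = OrderedDict()           # file id -> blocks, oldest deletion first
        self._next_id = 0

    def create(self, nblocks):
        """Allocate nblocks, purging the oldest deleted files only when needed."""
        reclaimable = sum(len(b) for b in self.deleted.values())
        if len(self.free) + reclaimable < nblocks:
            raise OSError("no space left on device")
        while len(self.free) < nblocks:
            # Out of truly free blocks: purge the least recently deleted file.
            _, blocks = self.deleted.popitem(last=False)
            self.free.extend(blocks)
        blocks = [self.free.pop() for _ in range(nblocks)]
        fid = self._next_id
        self._next_id += 1
        self.live[fid] = blocks
        return fid

    def delete(self, fid):
        """Mark a file deleted; its blocks are kept but become reclaimable."""
        self.deleted[fid] = self.live.pop(fid)

    def salvage(self, fid):
        """Restore a deleted file if it has not been purged yet."""
        if fid not in self.deleted:
            return False
        self.live[fid] = self.deleted.pop(fid)
        return True
```

Even this toy shows the cost raised in the message: the allocator can no longer pick blocks purely for locality, because deleted-but-unpurged blocks are off limits until nothing else is left.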
* Re: ext4 features 2006-07-04 9:22 ` Petr Tesarik @ 2006-07-04 11:35 ` Peter Zijlstra 2006-07-04 11:55 ` ext4 features (salvage) Petr Tesarik ` (2 more replies) 0 siblings, 3 replies; 119+ messages in thread From: Peter Zijlstra @ 2006-07-04 11:35 UTC (permalink / raw) To: Petr Tesarik; +Cc: Diego Calleja, linux-kernel On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote: > On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote: > > El Mon, 03 Jul 2006 15:46:55 -0600, > > "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió: > > > > > Add a salvagable file system to ext4, i.e. when a file is deleted, you > > > just rename it and move it to a directory called DELETED.SAV and recycle > > > the files as people allocate new ones. Easy to do (internal "mv" of > > > > > > Easily doable in userspace, why bother with kernel programming > > Yes and no. A simple mv is better done in userspace, but what I'd > _really_ appreciate would be a true kernel salvage (similar to the way > NetWare does things). That means marking the file as deleted in the > directory, marking its blocks as deleted but avoiding the use of those > blocks. The kernel would then prefer allocating new blocks from > elsewhere but once the filesystem runs out of space, it would start > allocating from the deleted files area and marking the blocks as well as > the corresponding files purged. > > Salvaging files would be done with a separate tool. Of course, if you > delete more files with the same name in the same directory, you'd need > to tell that tool which one of them you want to salvage. Yes, I really > mean you'd have more than one deleted file with the same name in the > directory. > > Anyway, I doubt we want such feature for ext4, because to make things > efficient, you'd need to provide some kind of pointer from the deleted > (but not yet purged) blocks to the corresponding file. Hard links are > also problematic and there is a whole lot of other troubles I haven't > even thought of. 
Wouldn't such a scheme interfere with the block allocator algorithms,
and hence increase the risk of fragmentation? Schemes like this really
put my hairs on end,
1) if you don't want to lose your data, make backups;
2) if I mean to delete a file, I want it gone proper. Silently keeping
it about is not unix like;
3) don't aid third parties in recovering your removed data. If I want
them to have it I'll give it to them.
Peter
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (salvage) 2006-07-04 11:35 ` Peter Zijlstra @ 2006-07-04 11:55 ` Petr Tesarik [not found] ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com> 2006-07-04 16:20 ` Matthew Frost 2006-07-04 15:25 ` ext4 features Pavel Machek 2006-07-05 4:10 ` Bill Davidsen 2 siblings, 2 replies; 119+ messages in thread From: Petr Tesarik @ 2006-07-04 11:55 UTC (permalink / raw) To: Peter Zijlstra On Tue, 2006-07-04 at 13:35 +0200, Peter Zijlstra wrote: > On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote: > > Yes and no. A simple mv is better done in userspace, but what I'd > > _really_ appreciate would be a true kernel salvage (similar to the way > > NetWare does things). That means marking the file as deleted in the > > directory, marking its blocks as deleted but avoiding the use of those > > blocks. The kernel would then prefer allocating new blocks from > > elsewhere but once the filesystem runs out of space, it would start > > allocating from the deleted files area and marking the blocks as well as > > the corresponding files purged. > > > > Salvaging files would be done with a separate tool. Of course, if you > > delete more files with the same name in the same directory, you'd need > > to tell that tool which one of them you want to salvage. Yes, I really > > mean you'd have more than one deleted file with the same name in the > > directory. > > Wouldn't such a scheme interfere with the block allocator algorithms, > and hence increase the risk of fragmentation? Schemes like this realy > put my hairs on end, Yes, they would interfere. That's why I'm not proposing to add them to ext4 in the first place. > 1) if you don't want to lose your data, make backups; Generally, I agree. > 2) if I mean to delete a file, I want it gone proper. Silently keeping > it about is not unix like; Yes, this is a problem. 
Although you would of course have a tool for
purging the files unconditionally, some programs may need the assumption
that an unlinked file is gone forever.
Regarding the second clause, well, Linux is not Unix-like in many
respects and we want it like that. That's a weak argument.
> 3) don't aid third parties in recovering your removed data. If I want
> them to have it I'll give it to them.
See 2. Explicit purging is of course possible. (Novell Netware also had
a "purge" command.)
Anyway, it seems that there is some functionality which many users want
but which can't be provided in user space:
- if files are moved to the recycle-bin-or-whatever-you-call-it, their
size is added to disk free space and
- automatically purging least recently deleted files.
Regards,
Petr Tesarik
^ permalink raw reply [flat|nested] 119+ messages in thread
[parent not found: <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>]
* Re: ext4 features (salvage) [not found] ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com> @ 2006-07-04 12:31 ` Petr Tesarik 2006-07-04 12:42 ` Helge Hafting 0 siblings, 1 reply; 119+ messages in thread From: Petr Tesarik @ 2006-07-04 12:31 UTC (permalink / raw) To: Lex Lyamin; +Cc: linux-kernel On Tue, 2006-07-04 at 16:08 +0400, Lex Lyamin wrote: > you mean that blocks are naturaly free, but we cant use them because > someone may made them free by accident, but we cant use them... > > hmm... > great idea! > > wait, its not. > because of we cant use those blocks we cant optimise way we write one > disk , and if we have defragmenter we cant make use of them either. > and if (just if) this is online defragmenter, it cant use them too. Well, the way I saw it done was that you had no guarantee that any deleted file could be salvaged. Sometimes you even could salvage a file but not another one which was deleted later. Users seemed to be content with that, because in most situations it did help them restore files they deleted and within a few seconds realized that they didn't want to. This means that the allocator MAY purge any deleted block at any moment, although it tends to allocate blocks from areas of disk which haven't been used recently. And the benefits? The performance of such a filesystem could be better than snapshots, while allowing to cope with one of the most common human errors. Regards, Petr Tesarik > for what purpose ? > are not we trying play out solution to problem from level 3 on level > 2 ? > does the soulion really belong to this level ? > would people pay with performance for "feature" which *probably* will > help them to restore their files ? ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (salvage) 2006-07-04 12:31 ` Petr Tesarik @ 2006-07-04 12:42 ` Helge Hafting 0 siblings, 0 replies; 119+ messages in thread From: Helge Hafting @ 2006-07-04 12:42 UTC (permalink / raw) To: Petr Tesarik; +Cc: Lex Lyamin, linux-kernel On Tue, Jul 04, 2006 at 02:31:56PM +0200, Petr Tesarik wrote: > On Tue, 2006-07-04 at 16:08 +0400, Lex Lyamin wrote: > > you mean that blocks are naturaly free, but we cant use them because > > someone may made them free by accident, but we cant use them... > > > > hmm... > > great idea! > > > > wait, its not. > > because of we cant use those blocks we cant optimise way we write one > > disk , and if we have defragmenter we cant make use of them either. > > and if (just if) this is online defragmenter, it cant use them too. > > Well, the way I saw it done was that you had no guarantee that any > deleted file could be salvaged. Sometimes you even could salvage a file > but not another one which was deleted later. Users seemed to be content > with that, because in most situations it did help them restore files > they deleted and within a few seconds realized that they didn't want to. > > This means that the allocator MAY purge any deleted block at any moment, > although it tends to allocate blocks from areas of disk which haven't > been used recently. > > And the benefits? The performance of such a filesystem could be better > than snapshots, while allowing to cope with one of the most common human > errors. The most common error? A few years ago I restored a file from backup, because I deleted it in error. I can't even remember the second-last time I had that problem. I'd say this error is among the easiest to avoid. :-) Even a little performance loss won't justify it for me. Now, there may be clumsier users than me, but they tend to be using GUI "file managers" which do implement a "wastebasket" for all internal deletion. Helge Hafting ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (salvage) 2006-07-04 11:55 ` ext4 features (salvage) Petr Tesarik [not found] ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com> @ 2006-07-04 16:20 ` Matthew Frost 1 sibling, 0 replies; 119+ messages in thread From: Matthew Frost @ 2006-07-04 16:20 UTC (permalink / raw) To: linux kernel mailing list (Stupid mailer + user error = not sent to list) Petr Tesarik wrote: > On Tue, 2006-07-04 at 13:35 +0200, Peter Zijlstra wrote: >> On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote: >>> Yes and no. A simple mv is better done in userspace, but what I'd >>> _really_ appreciate would be a true kernel salvage (similar to the way >>> NetWare does things). That means marking the file as deleted in the >>> directory, marking its blocks as deleted but avoiding the use of those >>> blocks. The kernel would then prefer allocating new blocks from >>> elsewhere but once the filesystem runs out of space, it would start >>> allocating from the deleted files area and marking the blocks as well as >>> the corresponding files purged. >>> >>> Salvaging files would be done with a separate tool. Of course, if you >>> delete more files with the same name in the same directory, you'd need >>> to tell that tool which one of them you want to salvage. Yes, I really >>> mean you'd have more than one deleted file with the same name in the >>> directory. >> Wouldn't such a scheme interfere with the block allocator algorithms, >> and hence increase the risk of fragmentation? Schemes like this realy >> put my hairs on end, > > Yes, they would interfere. That's why I'm not proposing to add them to > ext4 in the first place. > >> 1) if you don't want to lose your data, make backups; > > Generally, I agree. > >> 2) if I mean to delete a file, I want it gone proper. Silently keeping >> it about is not unix like; > > Yes, this is a problem. 
Although you would of course have a tool for > purging the files unconditionally, some programs may need the assumption > that an unlinked file is gone forever. > > Regarding the second clause, well, Linux is not Unix-like in many > respects and we want it like that. That's a weak argument. We silently keep files around in many filesystems, at least until whatever reclamation process runs. The delete event doesn't itself generally purge the data from disk. However, this is a matter of simple tools doing simple things. Designing an intentional structure around not actually deleting deleted files, but keeping them around just in case may be lauded as "user-friendly", but it is counter-intuitive. It is cleverness over clarity, good design smothered under feature demand. In the ways in which it counts, in the sensible, useful, elegantly simple ways, the "Do one thing and do it well" ways, Linux tries to be Unix-like. We want stupid programs. A filesystem that decides that it knows better than the user is not desirable. Filesystem programmers that decide that they know better than the user are likewise sub-optimal. Protect my data against accidental failure. Do not protect it against me. If you have to add a "really delete, I mean it" command, you're breaking fundamental assumptions. > >> 3) don't aid third parties in recovering your removed data. If I want >> them to have it I'll give it to them. > > See 2. Explicit purging is of course possible. (Novell Netware also had > a "purge" command.) > > Anyway, it seems that there is some functionality which many users want > but which can't be provided in user space: > > - if files are moved to the recycle-bin-or-whatever-you-call-it, their > size is added to disk free space and Why add non-free space to the free space count, when we're intentionally keeping those files? If you have to be counter-intuitive, why go the second counter of hiding it from the user who "wants us to keep and index his deleted files"? 
> - automatically purging least recently deleted files.
>
> Regards,
> Petr Tesarik
Matt
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 11:35 ` Peter Zijlstra 2006-07-04 11:55 ` ext4 features (salvage) Petr Tesarik @ 2006-07-04 15:25 ` Pavel Machek 2006-07-05 4:10 ` Bill Davidsen 2 siblings, 0 replies; 119+ messages in thread From: Pavel Machek @ 2006-07-04 15:25 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Petr Tesarik, Diego Calleja, linux-kernel Hi! > > > > Add a salvagable file system to ext4, i.e. when a file is deleted, you > > > > just rename it and move it to a directory called DELETED.SAV and recycle > > > > the files as people allocate new ones. Easy to do (internal "mv" of > > > > > > > > > Easily doable in userspace, why bother with kernel programming > > > > Yes and no. A simple mv is better done in userspace, but what I'd > > _really_ appreciate would be a true kernel salvage (similar to the way > > NetWare does things). That means marking the file as deleted in the I have code doing ld_preload tricks to force safe deletion... somewhere. > Wouldn't such a scheme interfere with the block allocator algorithms, > and hence increase the risk of fragmentation? Schemes like this realy > put my hairs on end, > > 1) if you don't want to lose your data, make backups; > 2) if I mean to delete a file, I want it gone proper. Silently keeping > it about is not unix like; Well, mc supports undelete on ext2 for a *long* time. And it works okay... And yes, doing echo > important_file instead of echo >> important file is way too easy with unix shells. Pavel -- Thanks for all the (sleeping) penguins. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 11:35 ` Peter Zijlstra 2006-07-04 11:55 ` ext4 features (salvage) Petr Tesarik 2006-07-04 15:25 ` ext4 features Pavel Machek @ 2006-07-05 4:10 ` Bill Davidsen 2 siblings, 0 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 4:10 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Diego Calleja, linux-kernel Peter Zijlstra wrote: > On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote: >> On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote: >>> El Mon, 03 Jul 2006 15:46:55 -0600, >>> "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió: >>> >>>> Add a salvagable file system to ext4, i.e. when a file is deleted, you >>>> just rename it and move it to a directory called DELETED.SAV and recycle >>>> the files as people allocate new ones. Easy to do (internal "mv" of >>> >>> Easily doable in userspace, why bother with kernel programming >> Yes and no. A simple mv is better done in userspace, but what I'd >> _really_ appreciate would be a true kernel salvage (similar to the way >> NetWare does things). That means marking the file as deleted in the >> directory, marking its blocks as deleted but avoiding the use of those >> blocks. The kernel would then prefer allocating new blocks from >> elsewhere but once the filesystem runs out of space, it would start >> allocating from the deleted files area and marking the blocks as well as >> the corresponding files purged. >> >> Salvaging files would be done with a separate tool. Of course, if you >> delete more files with the same name in the same directory, you'd need >> to tell that tool which one of them you want to salvage. Yes, I really >> mean you'd have more than one deleted file with the same name in the >> directory. >> >> Anyway, I doubt we want such feature for ext4, because to make things >> efficient, you'd need to provide some kind of pointer from the deleted >> (but not yet purged) blocks to the corresponding file. 
Hard links are >> also problematic and there is a whole lot of other troubles I haven't >> even thought of. > > Wouldn't such a scheme interfere with the block allocator algorithms, > and hence increase the risk of fragmentation? Schemes like this realy > put my hairs on end, > > 1) if you don't want to lose your data, make backups; > 2) if I mean to delete a file, I want it gone proper. Silently keeping > it about is not unix like; > 3) don't aid third parties in recovering your removed data. If I want > them to have it I'll give it to them. > > Peter > If you wanted to add a feature which would overwrite the file when removed or truncated I'd be happy. Yes I know about attributes and dban, and I have a version of rm which does that if people use it, but would be nice to have it on the whole filesystem. It's not proof against a TLA, but nice for casual snooping. -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
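The "overwrite on remove" behaviour asked for above can be approximated per file in userspace, roughly along these lines. This is a minimal sketch and not the modified rm mentioned in the message: the function name overwrite_and_unlink and the chunk parameter are hypothetical. It writes zeros over the file through the normal write path, so on journaling or copy-on-write filesystems, or on flash storage, older copies of the data may survive. As the message says, this is for casual snooping only.

```python
import os


def overwrite_and_unlink(path, chunk=1 << 20):
    """Overwrite a regular file's contents with zeros in place, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:          # r+b: write in place without truncating
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())              # ask the kernel to push the zeros to the device
    os.unlink(path)                       # only now remove the name
```

Doing this per filesystem rather than per tool would also have to cover truncation, and files with several hard links, where removing one name should not wipe the data.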
* Re: ext4 features 2006-07-03 21:46 ` Jeff V. Merkey 2006-07-03 21:25 ` Diego Calleja @ 2006-07-03 21:46 ` Valdis.Kletnieks [not found] ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr> 2006-07-04 11:14 ` ext4 features Krzysztof Halasa 2006-07-04 22:35 ` Frank van Maarseveen 3 siblings, 1 reply; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-03 21:46 UTC (permalink / raw) To: Jeff V. Merkey Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML [-- Attachment #1: Type: text/plain, Size: 1006 bytes --] On Mon, 03 Jul 2006 15:46:55 MDT, "Jeff V. Merkey" said: > Add a salvagable file system to ext4, i.e. when a file is deleted, you > just rename it and move it to a directory called DELETED.SAV and recycle > the files as people allocate new ones. Easy to do (internal "mv" of > file to another directory) and modification of the allocation bitmaps. > Very simple and will pay off big. If you need help designing it, just Much better done in userspace - the kernel can't get this right without some user hinting. For starters, it creates a big security hole in all the code that does an open()/unlink(). Also, how do you handle the corner cases? The fact you're adding to the pathname of the file means you might push some long names over the MAXPATHLEN value, and you have to worry about name collisions in the directory, and so on. There's also more subtle leakage issues, such as properly handling the permissions on the files on a multi-user system so users can't rummage each other's trash.... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
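For readers unfamiliar with the open()/unlink() pattern mentioned above, here is a small sketch of it. The helper name scratch_file is hypothetical. The program creates a file and removes its name at once, on the assumption that the data is then reachable only through the open descriptor. One reading of the concern is that a recycler which silently renames the file into DELETED.SAV would leave that data reachable by name, which the program never expected.

```python
import os
import tempfile


def scratch_file(directory):
    """Create a scratch file and unlink its name immediately; return (fd, old path)."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.unlink(path)   # the program now assumes nobody can find the data by name
    return fd, path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    fd, path = scratch_file(workdir)
    os.write(fd, b"secret")
    print(os.path.exists(path))    # the name is gone
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 6))          # the data is still readable through the descriptor
    os.close(fd)                   # the blocks are released only here
```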
[parent not found: <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>]
* Re: Kernel recycler [was: ext4 features]
[not found] ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>
@ 2006-07-04 14:37 ` Jan Engelhardt
0 siblings, 0 replies; 119+ messages in thread
From: Jan Engelhardt @ 2006-07-04 14:37 UTC (permalink / raw)
To: Jeff V. Merkey, Diego Calleja, Valdis.Kletnieks
Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML
Hm, this one did not appear on LKML, so I am resending it.
On Jul 4 2006 00:01, Jan Engelhardt wrote:
>Date: Tue, 4 Jul 2006 00:01:56 +0200 (MEST)
>From: Jan Engelhardt <jengelh@linux01.gwdg.de>
>To: Jeff V. Merkey <jmerkey@wolfmountaingroup.com>,
> Diego Calleja <diegocg@gmail.com>, Valdis.Kletnieks@vt.edu
>Cc: Arjan van de Ven <arjan@infradead.org>, Tomasz Torcz <zdzichu@irc.pl>,
> Helge Hafting <helgehaf@aitel.hist.no>,
> Thomas Glanzmann <sithglan@stud.uni-erlangen.de>,
> Theodore Ts'o <tytso@mit.edu>, LKML <linux-kernel@vger.kernel.org>
>Subject: Kernel recycler [was: ext4 features]
>
>>>
>> Add a salvagable file system to ext4, i.e. when a file is deleted, you just
>> rename it and move it to a directory called DELETED.SAV and recycle the files
>> as people allocate new ones. Easy to do (internal "mv" of file to another
>> directory) and modification of the allocation bitmaps. Very simple and will
>> pay off big. If you need help designing it, just ask me.
>>
>
>Hey, can you help? I had this idea of a kernel-level 'recycler' (FS-independent)
>a while ago (patch file is March 26 according to my `ls -l`) [1], but I have
>suspended it for the moment because it is a tedious task for API-newcomers
>like me. (I currently have to look at a lot of other kernel code to figure
>out what the proper way of doing things is.)
> >And it comes with some problems: > >- recycled files ("deleted" and moved) shall not count into the user's quota > >- rm -Rf bigfatdirectory will keep a lot of files around, therefore we would > need an extra kthread that kills all files in DELETED.SAV after a tunable > period. > >[1] http://jengelh.hopto.org/recycler.diff > > >>From: Diego Calleja <diegocg@gmail.com> >> >>Easily doable in userspace, why bother with kernel programming > >Because not every application will use KDE's trash feature, or will use >/bin/my_rm or or or. I certainly do not have the time to patch any program out >there to use /bin/my_rm or my_unlink() function. What about statically compiled >programs? They call the syscall directly, so there is no way (without >recompiling - if possible at all) to catch it within userspace. > >And what about if knfsd is about to delete a file? Let's assume we cannot trust >the client, so the only choice here is to have a kernel recycler. > >>Much better done in userspace - the kernel can't get this right without >>some user hinting. For starters, it creates a big security hole in all >>the code that does an open()/unlink(). >> >>Also, how do you handle the corner cases? The fact you're adding to the >>pathname of the file means you might push some long names over the MAXPATHLEN >>value, and you have to worry about name collisions in the directory, and >>so on. There's also more subtle leakage issues, such as properly handling >>the permissions on the files on a multi-user system so users can't rummage >>each other's trash.... > >I am aware of these problems, but at least for fun & profit, I would like to >complete the kernel-level recycler. > > >Jan Engelhardt >-- > ^ permalink raw reply [flat|nested] 119+ messages in thread
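The policy described in this message (move on delete, purge after a tunable period, cope with name collisions) can be illustrated in userspace, even though the message argues that only the kernel can enforce it for every program. The sketch below is not the recycler.diff patch; the function names recycle and purge_older_than are hypothetical, only regular files are handled, and the deletion time is stored in the file's mtime, which discards the original mtime.

```python
import os
import time


def recycle(path, trash_dir):
    """Move path into trash_dir under a collision-free name; return the new path."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(path)
    n = 0
    while True:
        # A numeric suffix keeps several deleted files with the same name apart.
        target = os.path.join(trash_dir, f"{base}.{n}")
        if not os.path.exists(target):   # racy check; acceptable for a sketch only
            break
        n += 1
    os.rename(path, target)              # works only within a single filesystem
    os.utime(target, None)               # record the deletion time as mtime
    return target


def purge_older_than(trash_dir, max_age_seconds, now=None):
    """Unlink trashed files deleted more than max_age_seconds ago; return their names."""
    now = time.time() if now is None else now
    purged = []
    for name in sorted(os.listdir(trash_dir)):
        full = os.path.join(trash_dir, name)
        if now - os.stat(full).st_mtime > max_age_seconds:
            os.unlink(full)
            purged.append(name)
    return purged
```

The corner cases raised earlier in the thread show up directly: the suffix lengthens the file name, the rename fails across mount points, and nothing here addresses per-user permissions or quota accounting for the trash directory.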
* Re: ext4 features
2006-07-03 21:46 ` Jeff V. Merkey
2006-07-03 21:25 ` Diego Calleja
2006-07-03 21:46 ` Valdis.Kletnieks
@ 2006-07-04 11:14 ` Krzysztof Halasa
2006-07-04 22:35 ` Frank van Maarseveen
3 siblings, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-04 11:14 UTC (permalink / raw)
To: Jeff V. Merkey
Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML
"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> writes:
> Add a salvagable file system to ext4, i.e. when a file is deleted, you
> just rename it and move it to a directory called DELETED.SAV and
> recycle the files as people allocate new ones.
Given the problems pointed out, what would really be needed is a
filesystem with a full log of operations. Then the fs state (full
contents of all files etc.) at any given time can be restored. It may
not be very efficient, though (people doing databases and transaction
logging probably have something to say).
I'd rather have better backups (so I can restore from them) instead
of such logging.
--
Krzysztof Halasa
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features
2006-07-03 21:46 ` Jeff V. Merkey
` (2 preceding siblings ...)
2006-07-04 11:14 ` ext4 features Krzysztof Halasa
@ 2006-07-04 22:35 ` Frank van Maarseveen
2006-07-04 23:47 ` Claudio Martins
3 siblings, 1 reply; 119+ messages in thread
From: Frank van Maarseveen @ 2006-07-04 22:35 UTC (permalink / raw)
To: Jeff V. Merkey
Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML
On Mon, Jul 03, 2006 at 03:46:55PM -0600, Jeff V. Merkey wrote:
[...]
> Add a salvagable file system to ext4, i.e. when a file is deleted, you
> just rename it and move it to a directory called DELETED.SAV and recycle
> the files as people allocate new ones. Easy to do (internal "mv" of
> file to another directory) and modification of the allocation bitmaps.
> Very simple and will pay off big. If you need help designing it, just
> ask me.
Do you have any idea how to undo the effect of rm -rf /bigtree at
the FS level?
I think such an "undelete" feature should be implemented in userspace.
A filesystem which can travel back in time could be useful however.
--
Frank
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 22:35 ` Frank van Maarseveen @ 2006-07-04 23:47 ` Claudio Martins 0 siblings, 0 replies; 119+ messages in thread From: Claudio Martins @ 2006-07-04 23:47 UTC (permalink / raw) To: Frank van Maarseveen Cc: Jeff V. Merkey, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML On Tuesday 04 July 2006 23:35, Frank van Maarseveen wrote: > > Do you have any idea how to undo the effect of rm -rf /bigtree at > the FS level? > > I think such an "undelete" feature should be implemented in userspace. > A filesystem which can travel back in time could be useful however. Indeed. See: http://lkml.org/lkml/2006/7/1/114 I'm starting to repeat myself, but at least one filesystem of that kind is already being developed, lets try to support them! :-) Regards Cláudio ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:01 ` Arjan van de Ven 2006-07-03 21:46 ` Jeff V. Merkey @ 2006-07-03 22:12 ` Alan Cox 2006-07-03 21:59 ` Arjan van de Ven 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown 1 sibling, 2 replies; 119+ messages in thread From: Alan Cox @ 2006-07-03 22:12 UTC (permalink / raw) To: Arjan van de Ven Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven: > raid is great for protecting against individual disks or sectors going > bad. But raid, especially high performance implementations, do not > checksum data or detect corruptions. > > They're different purpose with almost zero overlap in purpose or even > goal... Same layer though - checksums are really a device mapper type problem rather than an fs type problem. Alan ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 22:12 ` Alan Cox @ 2006-07-03 21:59 ` Arjan van de Ven 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown 1 sibling, 0 replies; 119+ messages in thread From: Arjan van de Ven @ 2006-07-03 21:59 UTC (permalink / raw) To: Alan Cox Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML On Mon, 2006-07-03 at 23:12 +0100, Alan Cox wrote: > Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven: > > raid is great for protecting against individual disks or sectors going > > bad. But raid, especially high performance implementations, do not > > checksum data or detect corruptions. > > > > They're different purpose with almost zero overlap in purpose or even > > goal... > > Same layer though - checksums are really a device mapper type problem > rather than an fs type problem. file payload checksums.. I'd agree filesystem metadata.. there checksums do provide value ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-03 22:12 ` Alan Cox 2006-07-03 21:59 ` Arjan van de Ven @ 2006-07-03 23:31 ` Neil Brown 2006-07-04 1:03 ` Jeff Garzik ` (3 more replies) 1 sibling, 4 replies; 119+ messages in thread From: Neil Brown @ 2006-07-03 23:31 UTC (permalink / raw) To: Alan Cox Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML On Monday July 3, alan@lxorguk.ukuu.org.uk wrote: > Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven: > > raid is great for protecting against individual disks or sectors going > > bad. But raid, especially high performance implementations, do not > > checksum data or detect corruptions. > > > > They're different purpose with almost zero overlap in purpose or even > > goal... > > Same layer though - checksums are really a device mapper type problem > rather than an fs type problem. Can't say I agree with this layering distinction. It's been some years that I've felt that most 'logical volume management' really belongs in the filesystem. Why have a dm that chops devices up in to segments and assembles them to look like a big device, only to have that big device chopped up and presented as files. Seems like double handling to me. With checksums - the filesystem is in a better position to: - be selective about what is checksummed - no point checksumming blocks that aren't part of any file. Some blocks (highlevel metadata) might always be checksummed, while other blocks (regular data) might not if a 'fast' option was chosen. - record the checksum somewhere easily accessible. The dm layer could do little better than store a block of checksums for every 10 blocks of data. A filesystem can store checksums with indexing information, or ensure that checksums for consecutive blocks in a file are stored together, even if the blocks cannot be. 
I think that for a filesystem that makes heavy use of trees to find things, it makes a lot of sense to checksum and replicate the upper levels of the tree, while checksumming and replicating lower levels has a very different cost/benefit tradeoff. These distinctions are easy to make in a filesystem, and hard to make in a block device. To my mind, the only thing you should put between the filesystem and the raw devices is RAID (real-raid - not raid0 or linear). NeilBrown ^ permalink raw reply [flat|nested] 119+ messages in thread
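As an illustration of the selectivity argument above, here is a minimal sketch of a per-block checksum policy. The names `BlockKind`, `should_checksum` and `checksum_block` are hypothetical, and CRC32 from Python's `zlib` is only a stand-in for whatever checksum a real filesystem would use.

```python
import zlib
from enum import Enum


class BlockKind(Enum):
    """Hypothetical block classification known only to the filesystem."""
    UNUSED = "unused"      # not part of any file
    METADATA = "metadata"  # high-level metadata, e.g. upper tree levels
    DATA = "data"          # regular file data


def should_checksum(kind, fast=False):
    """Decide whether a block gets a checksum under the given policy."""
    if kind is BlockKind.UNUSED:
        return False       # no point checksumming blocks outside any file
    if kind is BlockKind.METADATA:
        return True        # always protected, even with the 'fast' option
    return not fast        # regular data is skipped when 'fast' is chosen


def checksum_block(kind, payload, fast=False):
    """Return a CRC32 for the payload, or None if the policy skips it."""
    if not should_checksum(kind, fast):
        return None
    return zlib.crc32(payload) & 0xFFFFFFFF
```

A block device sees only numbered sectors, so it has no equivalent of `BlockKind` to base such a decision on; that is the distinction the message draws.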
* Re: ext4 features (checksums) 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown @ 2006-07-04 1:03 ` Jeff Garzik 2006-07-04 6:09 ` Avi Kivity ` (2 subsequent siblings) 3 siblings, 0 replies; 119+ messages in thread From: Jeff Garzik @ 2006-07-04 1:03 UTC (permalink / raw) To: Neil Brown Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Neil Brown wrote: > Can't say I agree with this layering distinction. > It's been some years that I've felt that most 'logical volume > management' really belongs in the filesystem. > Why have a dm that chops devices up in to segments and assembles them to > look like a big device, only to have that big device chopped up and > presented as files. Seems like double handling to me. Agreed, and allow me to take an even more radical position: I've long felt that things like snapshotting and mirroring made a lot of sense at the filesystem level -- as do layered filesystems, just like we layer block devices. Block device drivers (MD, DM) get ever more complicated, and ultimately become mini-filesystems themselves. The metadata managed by blkdev drivers continues to increase in complexity. What is represented to the upper layer as a contiguous run of bytes is really, under the hood, chunks of data coalesced logically -- just like files in a filesystem. The more complex that blkdev drivers become, the more and more they will look like filesystems. Jeff ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown 2006-07-04 1:03 ` Jeff Garzik @ 2006-07-04 6:09 ` Avi Kivity 2006-07-04 7:02 ` Neil Brown 2006-07-05 12:06 ` Bill Davidsen 2006-07-04 8:17 ` Alan Cox 2006-07-04 11:19 ` Krzysztof Halasa 3 siblings, 2 replies; 119+ messages in thread From: Avi Kivity @ 2006-07-04 6:09 UTC (permalink / raw) To: Neil Brown Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Neil Brown wrote: > > To my mind, the only thing you should put between the filesystem and > the raw devices is RAID (real-raid - not raid0 or linear). > I believe that implementing RAID in the filesystem has many benefits too: - multiple RAID levels: store metadata in triple-mirror RAID 1, random write intensive data in RAID 1, bulk data in RAID 5/6 - improved write throughput - since stripes can be variable size, any large enough write fills a whole stripe -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 119+ messages in thread
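To make the "multiple RAID levels" point concrete, the sketch below maps data classes to redundancy layouts and computes the usable fraction of raw capacity for each. The `POLICY` table and the layout names are assumptions made for illustration; only the capacity arithmetic (one third for a triple mirror, one half for RAID 1, (n-1)/n for RAID 5, (n-2)/n for RAID 6) is standard.

```python
from fractions import Fraction

# Hypothetical policy table: data class -> redundancy layout.
POLICY = {
    "metadata": "mirror3",     # triple-mirror RAID 1
    "random_write": "raid1",   # two-way mirror
    "bulk": "raid5",           # single parity
}


def usable_fraction(layout, n_disks):
    """Fraction of raw capacity left for payload under a given layout."""
    if layout == "mirror3":
        if n_disks < 3:
            raise ValueError("triple mirror needs at least 3 disks")
        return Fraction(1, 3)
    if layout == "raid1":
        if n_disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return Fraction(1, 2)
    if layout == "raid5":
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return Fraction(n_disks - 1, n_disks)
    if layout == "raid6":
        if n_disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return Fraction(n_disks - 2, n_disks)
    raise ValueError("unknown layout: %r" % (layout,))
```

With six disks, `usable_fraction(POLICY["bulk"], 6)` gives 5/6 while metadata costs three times its size, which is why choosing the layout per data class rather than per device is attractive.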
* Re: ext4 features (checksums) 2006-07-04 6:09 ` Avi Kivity @ 2006-07-04 7:02 ` Neil Brown 2006-07-04 8:26 ` Avi Kivity 2006-07-05 12:06 ` Bill Davidsen 1 sibling, 1 reply; 119+ messages in thread From: Neil Brown @ 2006-07-04 7:02 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML On Tuesday July 4, avi@argo.co.il wrote: > Neil Brown wrote: > > > > To my mind, the only thing you should put between the filesystem and > > the raw devices is RAID (real-raid - not raid0 or linear). > > > I believe that implementing RAID in the filesystem has many benefits too: > - multiple RAID levels: store metadata in triple-mirror RAID 1, random > write intensive data in RAID 1, bulk data in RAID 5/6 > - improved write throughput - since stripes can be variable size, any > large enough write fills a whole stripe Maybe.... Now imagine what would be required to rebuild a whole drive onto a spare after a drive failure. I'm sure it is possible, and I believe ZFS does something like that. I find it hard to imagine getting reasonable speed if there is much complexity. And the longer it takes, the longer your data is exposed to multiple-failures. There may well be room there to come up with a really clever idea that makes it both flexible and fast.... Note that 'resync' wouldn't be a problem. Having the filesystem know about the raid means that resync (after unclean shutdown) can be quite trivial (I believe there is a paper related to this at OLS this year). NeilBrown ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-04 7:02 ` Neil Brown @ 2006-07-04 8:26 ` Avi Kivity 2006-07-05 11:56 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Avi Kivity @ 2006-07-04 8:26 UTC (permalink / raw) To: Neil Brown Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Neil Brown wrote: > > On Tuesday July 4, avi@argo.co.il wrote: > > Neil Brown wrote: > > > > > > To my mind, the only thing you should put between the filesystem and > > > the raw devices is RAID (real-raid - not raid0 or linear). > > > > > I believe that implementing RAID in the filesystem has many benefits > too: > > - multiple RAID levels: store metadata in triple-mirror RAID 1, random > > write intensive data in RAID 1, bulk data in RAID 5/6 > > - improved write throughput - since stripes can be variable size, any > > large enough write fills a whole stripe > > Maybe.... > > Now imagine what would be required to rebuild a whole drive onto a > spare after a drive failure. > > I'm sure it is possible, and I believe ZFS does something like that. > I find it hard to imagine getting reasonable speed if there is much > complexity. And the longer it takes, the longer your data is exposed > to multiple-failures. > A company called Isilon does this on a cluster. They claim (IIRC) a one hour rebuild time for a failure. AFAIK they rebuild into cluster free space, so they are not bound by the spare's bandwidth; they can utilize all cluster resources for a rebuild. (You don't need spare disks, just spare free space; so you don't have idle disk heads) In terms of complexity, I imagine one needs a reverse mapping (extent -> (inode, offset)); given that, one can very easily rebuild failed disks, and more features are easy to implement, like evacuation of a drive, or rebalancing data across all drives when new disks are added. The same ideas can be applied to a non-clustered filesystem, of course. 
-- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 119+ messages in thread
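The reverse mapping mentioned above (extent -> (inode, offset)) can be sketched as a plain dictionary. This is a toy model under stated assumptions: `Extent` and `extents_to_rebuild` are hypothetical names, and a real filesystem would keep this map on disk rather than in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """A contiguous run of blocks on one disk (hypothetical layout)."""
    disk: int
    start: int
    length: int


def extents_to_rebuild(reverse_map, failed_disk):
    """List (extent, inode, offset) for every extent on the failed disk.

    reverse_map maps Extent -> (inode, offset). The result is sorted by
    on-disk position so a rebuild can walk it in order.
    """
    hits = [
        (extent, owner[0], owner[1])
        for extent, owner in reverse_map.items()
        if extent.disk == failed_disk
    ]
    return sorted(hits, key=lambda hit: hit[0].start)
```

Given such a list, each lost extent can be reconstructed from its redundancy and written into free space on any surviving disk. The same walk would serve for evacuating a drive or rebalancing after new disks are added.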
* Re: ext4 features (checksums) 2006-07-04 8:26 ` Avi Kivity @ 2006-07-05 11:56 ` Bill Davidsen 0 siblings, 0 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 11:56 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Avi Kivity wrote: > Neil Brown wrote: >> >> On Tuesday July 4, avi@argo.co.il wrote: >> > Neil Brown wrote: >> > > >> > > To my mind, the only thing you should put between the filesystem and >> > > the raw devices is RAID (real-raid - not raid0 or linear). >> > > >> > I believe that implementing RAID in the filesystem has many benefits >> too: >> > - multiple RAID levels: store metadata in triple-mirror RAID 1, random >> > write intensive data in RAID 1, bulk data in RAID 5/6 >> > - improved write throughput - since stripes can be variable size, any >> > large enough write fills a whole stripe >> >> Maybe.... >> >> Now imagine what would be required to rebuild a whole drive onto a >> spare after a drive failure. >> >> I'm sure it is possible, and I believe ZFS does something like that. >> I find it hard to imagine getting reasonable speed if there is much >> complexity. And the longer it takes, the longer your data is exposed >> to multiple-failures. >> > > A company called Isilon does this on a cluster. They claim (IIRC) a one > hour rebuild time for a failure. AFAIK they rebuild into cluster free > space, so they are not bound by the spare's bandwidth; they can utilize > all cluster resources for a rebuild. > > (You don't need spare disks, just spare free space; so you don't have > idle disk heads) > Readers of the RAID list will recognize this description, it matches my comments on RAID5E (distributed hot spare) very well. And I suppose there could be RAID6E as well, although I haven't really thought about it. 
-- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-04 6:09 ` Avi Kivity 2006-07-04 7:02 ` Neil Brown @ 2006-07-05 12:06 ` Bill Davidsen 2006-07-05 12:19 ` Avi Kivity 1 sibling, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 12:06 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Avi Kivity wrote: > Neil Brown wrote: >> >> To my mind, the only thing you should put between the filesystem and >> the raw devices is RAID (real-raid - not raid0 or linear). >> > I believe that implementing RAID in the filesystem has many benefits too: > - multiple RAID levels: store metadata in triple-mirror RAID 1, random > write intensive data in RAID 1, bulk data in RAID 5/6 > - improved write throughput - since stripes can be variable size, any > large enough write fills a whole stripe > I rather like the idea of allowing metadata to be on another device in general, or at least the inodes. That way a very small chunk size can be used for the inodes, to spread head motion, while a larger chunk size is appropriate for data in some cases. Larger max block sizes would be useful as well. Feel free to discuss the actual value of "larger." -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-05 12:06 ` Bill Davidsen @ 2006-07-05 12:19 ` Avi Kivity 2006-07-08 17:54 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Avi Kivity @ 2006-07-05 12:19 UTC (permalink / raw) To: Bill Davidsen Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Bill Davidsen wrote: > > > I believe that implementing RAID in the filesystem has many benefits > too: > > - multiple RAID levels: store metadata in triple-mirror RAID 1, random > > write intensive data in RAID 1, bulk data in RAID 5/6 > > - improved write throughput - since stripes can be variable size, any > > large enough write fills a whole stripe > > > I rather like the idea of allowing metadata to be on another device in > general, or at least the inodes. That way a very small chunk size can be > used for the inodes, to spread head motion, while a larger chunk size is > appropriate for data in some cases. > If your workload is metadata intensive, your data disks are idle; if you're reading data, the inode device is gathering dust. You can run out of inodes before you run out of space and vice-versa. Very suboptimal. A symmetric configuration allows full use of all resources for any workload, at the cost of increased complexity - every extent has its own RAID level and RAID component devices. > Larger max block sizes would be useful as well. Feel free to discuss the > actual value of "larger." > Filesystems should use extents, not blocks, avoiding the block size tradeoff entirely. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-05 12:19 ` Avi Kivity @ 2006-07-08 17:54 ` Bill Davidsen 0 siblings, 0 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-08 17:54 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Avi Kivity wrote: > Bill Davidsen wrote: >> >> > I believe that implementing RAID in the filesystem has many benefits >> too: >> > - multiple RAID levels: store metadata in triple-mirror RAID 1, random >> > write intensive data in RAID 1, bulk data in RAID 5/6 >> > - improved write throughput - since stripes can be variable size, any >> > large enough write fills a whole stripe >> > >> I rather like the idea of allowing metadata to be on another device in >> general, or at least the inodes. That way a very small chunk size can be >> used for the inodes, to spread head motion, while a larger chunk size is >> appropriate for data in some cases. >> > > If your workload is metadata intensive, your data disks are idle; if > you're reading data, the inode device is gathering dust. You can run out > of inodes before you run out of space and vice-versa. Very suboptimal. Using the correct resource for the job is very optimal, no RAID will make big slow cheap drives fast for inodes, no fast drive is practical in cost or heat for moderately large data. > > A symmetric configuration allows full use of all resources for any > workload, at the cost of increased complexity - every extent has its own > RAID level and RAID component devices. Why would you want to use all your resources when only part of them are at all suited to the job? Do consider the price and performance of 15k RPM Ultra320 drives (32GB) vs. 750GB SATA before telling me that it doesn't work better to have metadata on fast storage and application data on cheap drives. You can use 10TB of 300kB avg files in random directories as a model. 
Figure 10% churn every day, delete and create not rewrite, 27 creates/sec and 200-300 open for read/sec. > >> Larger max block sizes would be useful as well. Feel free to discuss the >> actual value of "larger." >> > > Filesystems should use extents, not blocks, avoiding the block size > tradeoff entirely. > -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown 2006-07-04 1:03 ` Jeff Garzik 2006-07-04 6:09 ` Avi Kivity @ 2006-07-04 8:17 ` Alan Cox 2006-07-04 11:08 ` Thomas Glanzmann 2006-07-04 11:19 ` Krzysztof Halasa 3 siblings, 1 reply; 119+ messages in thread From: Alan Cox @ 2006-07-04 8:17 UTC (permalink / raw) To: Neil Brown Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML
Ar Maw, 2006-07-04 am 09:31 +1000, ysgrifennodd Neil Brown:
> It's been some years that I've felt that most 'logical volume
> management' really belongs in the filesystem.
> Why have a dm that chops devices up in to segments and assembles them to
> look like a big device, only to have that big device chopped up and
> presented as files. Seems like double handling to me.
Because the interface model is wrong ? Various people have long said the
model actually should look rather more like

fs to block:
    handle = alloc_extent(near_handle*, info)
    write_extent(handle, buffer, offset, length)
    read_extent(handle, buffer, offset, length)
    free_extent(handle)

(probably with resize_extent)

This makes LVM, remapping, checksumming and the like all naturally slip
out of the fs but not into the block layer.
[Many very good points snipped]
^ permalink raw reply [flat|nested] 119+ messages in thread
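The extent interface proposed above can be modelled in a few lines. What follows is an in-memory toy, not kernel code: the class name `ExtentStore`, the `info` dictionary with a `"length"` key, and the use of CRC32 are all assumptions, and `read_extent` returns bytes instead of filling a caller-supplied buffer.

```python
import zlib


class ExtentStore:
    """Toy extent store sitting between a filesystem and the block layer."""

    def __init__(self):
        self._extents = {}   # handle -> bytearray holding the extent
        self._sums = {}      # handle -> CRC32 of the whole extent
        self._next_handle = 1

    def alloc_extent(self, near_handle, info):
        # near_handle is only a placement hint and is ignored in this toy.
        handle = self._next_handle
        self._next_handle += 1
        self._extents[handle] = bytearray(info["length"])
        self._sums[handle] = zlib.crc32(bytes(self._extents[handle]))
        return handle

    def _check_range(self, handle, offset, length):
        size = len(self._extents[handle])
        if offset < 0 or length < 0 or offset + length > size:
            raise ValueError("range outside extent")

    def write_extent(self, handle, buffer, offset, length):
        self._check_range(handle, offset, length)
        if len(buffer) < length:
            raise ValueError("buffer shorter than length")
        data = self._extents[handle]
        data[offset:offset + length] = buffer[:length]
        self._sums[handle] = zlib.crc32(bytes(data))  # checksum kept here

    def read_extent(self, handle, offset, length):
        self._check_range(handle, offset, length)
        data = self._extents[handle]
        if zlib.crc32(bytes(data)) != self._sums[handle]:
            raise IOError("checksum mismatch in extent %d" % handle)
        return bytes(data[offset:offset + length])

    def free_extent(self, handle):
        del self._extents[handle]
        del self._sums[handle]
```

The caller never sees block numbers or checksums, so remapping and verification can change behind the handle without the filesystem noticing, which seems to be the point of the proposal.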
* Re: ext4 features (checksums) 2006-07-04 8:17 ` Alan Cox @ 2006-07-04 11:08 ` Thomas Glanzmann 0 siblings, 0 replies; 119+ messages in thread From: Thomas Glanzmann @ 2006-07-04 11:08 UTC (permalink / raw) To: Alan Cox Cc: Neil Brown, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Theodore Ts'o, LKML Hello Alan, > This makes LVM, remapping, checksumming and the like all naturally slip > out of the fs but not into the block layer. enhance LVM and have the functionality for all available fs. I think this is the right way to go with checksums and fault tolerance but not with snapshots. Thomas ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-03 23:31 ` ext4 features (checksums) Neil Brown ` (2 preceding siblings ...) 2006-07-04 8:17 ` Alan Cox @ 2006-07-04 11:19 ` Krzysztof Halasa 2006-07-04 12:49 ` Helge Hafting 3 siblings, 1 reply; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-04 11:19 UTC (permalink / raw) To: Neil Brown Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML Neil Brown <neilb@suse.de> writes: > With checksums - the filesystem is in a better position to: > - be selective about what is checksummed - no point checksumming > blocks that aren't part of any file. Some blocks (highlevel > metadata) might always be checksummed, while other blocks > (regular data) might not if a 'fast' option was chosen. The same applies to RAID - for example, why "synchronise" unused area? While fs vs. RAID provides a good layering scheme and is easier, integrating them into one entity (as with ZFS) would certainly be more efficient (and probably harder to maintain). -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-04 11:19 ` Krzysztof Halasa @ 2006-07-04 12:49 ` Helge Hafting 2006-07-05 12:01 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Helge Hafting @ 2006-07-04 12:49 UTC (permalink / raw) To: Krzysztof Halasa Cc: Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML On Tue, Jul 04, 2006 at 01:19:11PM +0200, Krzysztof Halasa wrote: > Neil Brown <neilb@suse.de> writes: > > > With checksums - the filesystem is in a better position to: > > - be selective about what is checksummed - no point checksumming > > blocks that aren't part of any file. Some blocks (highlevel > > metadata) might always be checksummed, while other blocks > > (regular data) might not if a 'fast' option was chosen. > > The same applies to RAID - for example, why "synchronise" unused area? > Indeed. RAID usually avoids checksumming unused areas: it sums on write, and you don't write "unused" stuff. Not syncing unused areas would be possible if there were a way for raid resync to ask the fs which blocks are not in use, i.e. to get the free block list in disk block order. Then raid resync could skip those. Helge Hafting ^ permalink raw reply [flat|nested] 119+ messages in thread
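The resync idea above can be sketched as a small generator, assuming the filesystem can hand over its free block numbers. The function name `ranges_to_resync` and the half-open range convention are choices made for this example, not an existing md or ext interface.

```python
def ranges_to_resync(total_blocks, free_blocks):
    """Yield half-open (start, end) ranges of in-use blocks.

    free_blocks is the list of free block numbers reported by the
    filesystem; it is sorted here so any order is accepted.
    """
    start = 0
    for block in sorted(set(free_blocks)):
        if block < 0 or block >= total_blocks:
            raise ValueError("free block %d outside device" % block)
        if block > start:
            yield (start, block)   # in-use run ending just before a free block
        start = block + 1
    if start < total_blocks:
        yield (start, total_blocks)
```

For a 10-block device with blocks 2, 3 and 7 free, this yields (0, 2), (4, 7) and (8, 10), so the resync touches seven blocks instead of ten.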
* Re: ext4 features (checksums) 2006-07-04 12:49 ` Helge Hafting @ 2006-07-05 12:01 ` Bill Davidsen 2006-07-05 12:10 ` Avi Kivity 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 12:01 UTC (permalink / raw) To: Helge Hafting Cc: Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Helge Hafting wrote: > On Tue, Jul 04, 2006 at 01:19:11PM +0200, Krzysztof Halasa wrote: >> Neil Brown <neilb@suse.de> writes: >> >>> With checksums - the filesystem is in a better position to: >>> - be selective about what is checksummed - no point checksumming >>> blocks that aren't part of any file. Some blocks (highlevel >>> metadata) might always be checksummed, while other blocks >>> (regular data) might not if a 'fast' option was chosen. >> The same applies to RAID - for example, why "synchronise" unused area? >> > Indeed. RAID usually avoid checksumming unused area, it sums on write > and you don't write "unused" stuff. > > Not syncing unused area is possible, if there was a way for raid resync > to ask the fs what blocks are not in use. I.e. get the > free block list in disk block order. Then raid resync could skip those. > Current RAID code supports having a bitmap of dirty stripes, and can just sync those during recovery. I'm sure Neil could explain it better, but this is available without worrying about fs type. Now. Today. -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
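A bitmap of dirty stripes, as described above, can be modelled very simply. This is a simplified sketch and not the md implementation: the class `DirtyStripeBitmap` and its methods are hypothetical, and it only assumes that a stripe is marked dirty before a write starts and marked clean once all members hold the data.

```python
class DirtyStripeBitmap:
    """One bit per stripe; set bits mark stripes that may be out of sync."""

    def __init__(self, n_stripes):
        self.n_stripes = n_stripes
        self.bits = bytearray((n_stripes + 7) // 8)

    def mark_dirty(self, stripe):
        # Set before the write is issued to any member device.
        self.bits[stripe // 8] |= 1 << (stripe % 8)

    def mark_clean(self, stripe):
        # Cleared once every member device has completed the write.
        self.bits[stripe // 8] &= ~(1 << (stripe % 8)) & 0xFF

    def dirty_stripes(self):
        """Stripes that recovery has to resynchronise."""
        return [
            s for s in range(self.n_stripes)
            if (self.bits[s // 8] >> (s % 8)) & 1
        ]
```

After an unclean shutdown only the stripes returned by `dirty_stripes()` need attention, and this works without any knowledge of the filesystem on top.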
* Re: ext4 features (checksums) 2006-07-05 12:01 ` Bill Davidsen @ 2006-07-05 12:10 ` Avi Kivity 2006-07-08 18:02 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Avi Kivity @ 2006-07-05 12:10 UTC (permalink / raw) To: Bill Davidsen Cc: Helge Hafting, Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Bill Davidsen wrote: > > > > Not syncing unused area is possible, if there was a way for raid resync > > to ask the fs what blocks are not in use. I.e. get the > > free block list in disk block order. Then raid resync could skip > those. > > > Current RAID code supports having a bitmap of dirty stripes, and can > just sync those during recovery. I'm sure Neil could explain it better, > but this is available without worrying about fs type. Now. Today. > This is only when you reconstruct a disk that was once part of the RAID. If you are adding a brand new disk, all stripes are dirty. This happens in two scenarios: an unclean RAID shutdown, and when you have a remote mirror which can be disconnected by network problems. If the RAID is integrated in the filesystem (or into an object storage system), you can handle the new disk case too. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features (checksums) 2006-07-05 12:10 ` Avi Kivity @ 2006-07-08 18:02 ` Bill Davidsen 0 siblings, 0 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-08 18:02 UTC (permalink / raw) To: Avi Kivity Cc: Helge Hafting, Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Avi Kivity wrote: > Bill Davidsen wrote: >> >> >> > Not syncing unused area is possible, if there was a way for raid resync >> > to ask the fs what blocks are not in use. I.e. get the >> > free block list in disk block order. Then raid resync could skip >> those. >> > >> Current RAID code supports having a bitmap of dirty stripes, and can >> just sync those during recovery. I'm sure Neil could explain it better, >> but this is available without worrying about fs type. Now. Today. >> > > This is only when the you reconstruct a disk that was once part of the > RAID. If you are adding a brand new disk, all stripes are dirty. I will leave Neil to explain this to you, it appears to be a totally different case for reconfiguration, but I don't pretend to understand the code well enough to clarify it. > > This happens in two scenarios: an unclean RAID shutdown, and when you > have a remote mirror which can be disconnected by network problems. > > If the RAID is integrated in the filesystem (or into an object storage > system), you can handle the new disk case too. > I'm not sure that building the RAID into the filesystem is ever a good idea, it certainly seems likely to either prevent certain RAID devices from being used, or make them perform suboptimally. There are times when being able to move a filesystem to a new device is REALLY useful, and byte copy is more practical than file by file copy. 
-- Bill Davidsen <davidsen@tmr.com> ^ permalink raw reply [flat|nested] 119+ messages in thread
* Blatant layering violations (was Re: ext4 features) 2006-07-03 20:55 ` Tomasz Torcz 2006-07-03 21:01 ` Arjan van de Ven @ 2006-07-06 0:36 ` Valerie Henson 2006-07-06 12:15 ` Xavier Bestel 2006-07-06 20:02 ` Tom Vier 1 sibling, 2 replies; 119+ messages in thread From: Valerie Henson @ 2006-07-06 0:36 UTC (permalink / raw) To: LKML; +Cc: Helge Hafting, Thomas Glanzmann, Theodore Ts'o, Andrew Morton On Mon, Jul 03, 2006 at 10:55:23PM +0200, Tomasz Torcz wrote: > > ZFS was already called ,,blatant layering violation''. ;) I kind of like the phrase "blatant layering violation" - catchy, isn't it? The main reason people think of ZFS as a blatant layering violation is because it has the letters "FS" in the name, but it does a lot more than a file system. ZFS actually includes three distinct layers with well-defined interfaces, none of which directly maps to most people's conception of a "file system." The really painfully short summary of the layers is: SPA - Storage Pool Allocator, disks go into the bottom, virtually addressed, explicitly freed/allocated blocks come out of the top DMU - Data Management Unit, virtually addressed blocks go in the bottom, plain objects come out the top (an object is like a file with no dangly bits like permissions, etc.) ZPL - ZFS POSIX Layer, plain objects go in the bottom, VFS ops come out the top For a really nice, much more detailed ZFS source tour, see: http://www.opensolaris.org/os/community/zfs/source/ -VAL ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Blatant layering violations (was Re: ext4 features) 2006-07-06 0:36 ` Blatant layering violations (was Re: ext4 features) Valerie Henson @ 2006-07-06 12:15 ` Xavier Bestel 2006-07-06 17:06 ` Valdis.Kletnieks 2006-07-06 20:02 ` Tom Vier 1 sibling, 1 reply; 119+ messages in thread From: Xavier Bestel @ 2006-07-06 12:15 UTC (permalink / raw) To: Valerie Henson Cc: LKML, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, Andrew Morton On Thu, 2006-07-06 at 02:36, Valerie Henson wrote: > For a really nice, much more detailed ZFS source tour, see: > > http://www.opensolaris.org/os/community/zfs/source/ Posting an URL with CDDL-licensed sourcecode to LKML seems weird to me. Do you try to pull an SCO ? :) Xav ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Blatant layering violations (was Re: ext4 features) 2006-07-06 12:15 ` Xavier Bestel @ 2006-07-06 17:06 ` Valdis.Kletnieks 0 siblings, 0 replies; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-06 17:06 UTC (permalink / raw) To: Xavier Bestel Cc: Valerie Henson, LKML, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 391 bytes --] On Thu, 06 Jul 2006 14:15:26 +0200, Xavier Bestel said: > On Thu, 2006-07-06 at 02:36, Valerie Henson wrote: > > For a really nice, much more detailed ZFS source tour, see: > > > > http://www.opensolaris.org/os/community/zfs/source/ > > Posting an URL with CDDL-licensed sourcecode to LKML seems weird to me. > Do you try to pull an SCO ? :) "ideas and concepts". We can steal those. :) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Blatant layering violations (was Re: ext4 features) 2006-07-06 0:36 ` Blatant layering violations (was Re: ext4 features) Valerie Henson 2006-07-06 12:15 ` Xavier Bestel @ 2006-07-06 20:02 ` Tom Vier 1 sibling, 0 replies; 119+ messages in thread From: Tom Vier @ 2006-07-06 20:02 UTC (permalink / raw) To: Valerie Henson Cc: LKML, Helge Hafting, Thomas Glanzmann, Theodore Ts'o, Andrew Morton On Wed, Jul 05, 2006 at 05:36:39PM -0700, Valerie Henson wrote: > On Mon, Jul 03, 2006 at 10:55:23PM +0200, Tomasz Torcz wrote: > > > > ZFS was already called ,,blatant layering violation''. ;) It buys you some performance. Someone here already mentioned variable stripe sizes. ZFS doesn't just add a checksum sector after each block (something I've been planning to write an md module for, for a couple of years). It writes the checksum at the end of the tree member, inode, dirent, whatever. So there's no read-modify-write when you write less than one checksum block. One thing I noticed about ZFS that surprised me: it's using indirect blocks, from what I saw. -- Tom Vier <tmv@comcast.net> DSA Key ID 0x15741ECE ^ permalink raw reply [flat|nested] 119+ messages in thread
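The point about keeping the checksum with the referencing metadata, rather than in a separate checksum sector, can be sketched as follows. This is a toy model and not ZFS code: `Disk`, `BlockPointer`, `write_block`, and `read_block` are hypothetical names, and CRC32 merely stands in for whatever checksum a real filesystem would use.

```python
import zlib
from dataclasses import dataclass


@dataclass
class BlockPointer:
    """Hypothetical parent-side record: where the child lives and what it should hash to."""
    address: int
    checksum: int


class Disk:
    """Toy block device: a dict mapping block address to bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, address, data):
        self.blocks[address] = bytes(data)

    def read(self, address):
        return self.blocks[address]


def write_block(disk, address, data):
    """Write a block and return the pointer (address + checksum) the parent would store."""
    disk.write(address, data)
    return BlockPointer(address, zlib.crc32(data))


def read_block(disk, ptr):
    """Read through a pointer and verify the data against the checksum held by the parent."""
    data = disk.read(ptr.address)
    if zlib.crc32(data) != ptr.checksum:
        raise OSError(f"checksum mismatch at block {ptr.address}")
    return data
```

Because the checksum travels with the pointer, writing a block touches only that block and the metadata that references it; no shared checksum sector has to be read, patched, and written back.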
* Re: ext4 features 2006-07-03 20:22 ` Helge Hafting 2006-07-03 20:55 ` Tomasz Torcz @ 2006-07-03 21:34 ` Bill Davidsen 2006-07-03 21:50 ` Valdis.Kletnieks 1 sibling, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-03 21:34 UTC (permalink / raw) To: linux-kernel Helge Hafting wrote: > On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote: >> On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote: >>> Hello, >>> >>>> Checksums are not very useful for themselves. They are useful when we >>>> have other copy of data (think raid mirroring) so data can be >>>> reconstructed from working copy. >>> it would be possible to identify data corruption. >>> >> Yes, but what good is identification? We could only return I/O error. >> Ability to fix corruption (like ZFS) is the real killer. > > Isn't that what we have RAID-1/5/6 for? I think he is talking about another problem. RAID addresses detectable failures at the hardware level. I believe that he wants validation after the data is returned (without error) from the device. While in most cases if what you wrote and what you read don't match it's memory, improving the chances of catching the error is useful, given that non-server often lacks ECC on memory, or people buy cheaper non-parity memory. -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:34 ` ext4 features Bill Davidsen @ 2006-07-03 21:50 ` Valdis.Kletnieks 2006-07-03 22:04 ` Bruce Ferrell ` (2 more replies) 0 siblings, 3 replies; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-03 21:50 UTC (permalink / raw) To: Bill Davidsen; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 586 bytes --] On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said: > I think he is talking about another problem. RAID addresses detectable > failures at the hardware level. I believe that he wants validation after > the data is returned (without error) from the device. While in most > cases if what you wrote and what you read don't match it's memory, > improving the chances of catching the error is useful, given that > non-server often lacks ECC on memory, or people buy cheaper non-parity > memory. There's other issues as well. Why do people run 'tripwire' on boxes that have RAID on them? [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:50 ` Valdis.Kletnieks @ 2006-07-03 22:04 ` Bruce Ferrell 2006-07-04 14:48 ` Valdis.Kletnieks 2006-07-03 23:00 ` Bill Davidsen 2006-07-04 12:52 ` Helge Hafting 2 siblings, 1 reply; 119+ messages in thread From: Bruce Ferrell @ 2006-07-03 22:04 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: Bill Davidsen, linux-kernel Valdis.Kletnieks@vt.edu wrote: > On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said: > >>I think he is talking about another problem. RAID addresses detectable >>failures at the hardware level. I believe that he wants validation after >>the data is returned (without error) from the device. While in most >>cases if what you wrote and what you read don't match it's memory, >>improving the chances of catching the error is useful, given that >>non-server often lacks ECC on memory, or people buy cheaper non-parity >>memory. > > > There's other issues as well. Why do people run 'tripwire' on boxes that > have RAID on them? Because they're looking for malicious changes ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 22:04 ` Bruce Ferrell @ 2006-07-04 14:48 ` Valdis.Kletnieks 0 siblings, 0 replies; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-04 14:48 UTC (permalink / raw) To: Bruce Ferrell; +Cc: Bill Davidsen, linux-kernel [-- Attachment #1: Type: text/plain, Size: 523 bytes --] On Mon, 03 Jul 2006 15:04:54 PDT, Bruce Ferrell said: > Valdis.Kletnieks@vt.edu wrote: > > There's other issues as well. Why do people run 'tripwire' on boxes that > > have RAID on them? > > Because they're looking for malicous changes Close, but no cigar. I've had tripwire detect *accidental* changes as well (including borked patchsets that replaced unrelated files). The reason they run tripwire as well as RAID is to detect changes that are visible only with the assistance of information from the filesystem. [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:50 ` Valdis.Kletnieks 2006-07-03 22:04 ` Bruce Ferrell @ 2006-07-03 23:00 ` Bill Davidsen 2006-07-04 15:01 ` Valdis.Kletnieks 2006-07-04 12:52 ` Helge Hafting 2 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-03 23:00 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: linux-kernel Valdis.Kletnieks@vt.edu wrote: >On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said: > > >>I think he is talking about another problem. RAID addresses detectable >>failures at the hardware level. I believe that he wants validation after >>the data is returned (without error) from the device. While in most >>cases if what you wrote and what you read don't match it's memory, >>improving the chances of catching the error is useful, given that >>non-server often lacks ECC on memory, or people buy cheaper non-parity >>memory. >> >> > >There's other issues as well. Why do people run 'tripwire' on boxes that >have RAID on them? > > What has RAID got to do with detecting hacking? -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 23:00 ` Bill Davidsen @ 2006-07-04 15:01 ` Valdis.Kletnieks 2006-07-05 2:40 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-04 15:01 UTC (permalink / raw) To: Bill Davidsen; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 605 bytes --] On Mon, 03 Jul 2006 19:00:38 EDT, Bill Davidsen said: > Valdis.Kletnieks@vt.edu wrote: > >There's other issues as well. Why do people run 'tripwire' on boxes that > >have RAID on them? > What has RAID got to do with detecting hacking? Actually, I've had tripwire detect more *accidental* changes due to buggy software than I have had it detect actual hacking. Oh, and it's good at catching unintended config changes - I started using tripwire after I fat-fingered a script, and the machine backed up to /dev/null instead of /dev/rmt0. In fact, I've never actually had tripwire detect actual hacking. [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 15:01 ` Valdis.Kletnieks @ 2006-07-05 2:40 ` Bill Davidsen 2006-07-05 2:47 ` Valdis.Kletnieks 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 2:40 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: linux-kernel Valdis.Kletnieks@vt.edu wrote: >On Mon, 03 Jul 2006 19:00:38 EDT, Bill Davidsen said: > > >>Valdis.Kletnieks@vt.edu wrote: >> >> > > > >>>There's other issues as well. Why do people run 'tripwire' on boxes that >>>have RAID on them? >>> >>> >>What has RAID got to do with detecting hacking? >> >> > >Actually, I've had tripwire detect more *accidental* changes due to buggy >software than I have had it detect actual hacking. Oh, and it's good at >catching unintended config changes - I started using tripwire after I >fat-fingered a script, and the machine backed up to /dev/null instead of >/dev/rmt0. > > But it ran faster, right? ;-) >In fact, I've never actually had tripwire detect actual hacking. > > I was using hacking in the general sense, I have a spiffy quote around about being in more danger from incompetence than malice. Patches with side effects, changes which work but reset directory permissions and/or ownership... I think it was Pogo who said "we have met the enemy and he is us." -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 2:40 ` Bill Davidsen @ 2006-07-05 2:47 ` Valdis.Kletnieks 0 siblings, 0 replies; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-05 2:47 UTC (permalink / raw) To: Bill Davidsen; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 677 bytes --] On Tue, 04 Jul 2006 22:40:05 EDT, Bill Davidsen said: > Valdis.Kletnieks@vt.edu wrote: > >catching unintended config changes - I started using tripwire after I > >fat-fingered a script, and the machine backed up to /dev/null instead of > >/dev/rmt0. > But it ran faster, right? ;-) Yeah. The tape ops figured I must have optimized something or gotten it to do better incrementals - it would ask for the tape, and spit it out in 15 minutes instead of the 40-45 it used to take. So it went un-noticed till a full cycle of tapes had gone by... Guess who was totally mystified when we lost a disk, we restored from tape, and the system had time warped itself back 2 months? :) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-03 21:50 ` Valdis.Kletnieks 2006-07-03 22:04 ` Bruce Ferrell 2006-07-03 23:00 ` Bill Davidsen @ 2006-07-04 12:52 ` Helge Hafting 2 siblings, 0 replies; 119+ messages in thread From: Helge Hafting @ 2006-07-04 12:52 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: Bill Davidsen, linux-kernel On Mon, Jul 03, 2006 at 05:50:22PM -0400, Valdis.Kletnieks@vt.edu wrote: > On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said: > > I think he is talking about another problem. RAID addresses detectable > > failures at the hardware level. I believe that he wants validation after > > the data is returned (without error) from the device. While in most > > cases if what you wrote and what you read don't match it's memory, > > improving the chances of catching the error is useful, given that > > non-server often lacks ECC on memory, or people buy cheaper non-parity > > memory. > > There's other issues as well. Why do people run 'tripwire' on boxes that > have RAID on them? To notice hacking. RAID protects against hardware failure, it does _not_ protect against any change that comes through the normal filesystem channels. RAID doesn't help the slightest against viruses and hackers. RAID is _not_ a backup, when a hacker (or a user error) changes an important file, it is changed in all mirrors of a raid-1 set, and raid-5 checksums are updated so the change becomes valid. But tripwire will notice that a protected file changed. Helge Hafting ^ permalink raw reply [flat|nested] 119+ messages in thread
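The distinction drawn here, that RAID faithfully replicates a change while a tripwire-style check notices it, can be sketched as a comparison against a stored baseline. This is only a minimal illustration and not how Tripwire itself is implemented; `file_digest`, `snapshot`, and `changed_files` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import os


def file_digest(path):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(paths):
    """Record a known-good baseline; in practice this would be kept away from the files."""
    return {p: file_digest(p) for p in paths}


def changed_files(baseline):
    """Return the paths whose contents differ from the baseline or that have vanished."""
    return sorted(
        p for p, digest in baseline.items()
        if not os.path.exists(p) or file_digest(p) != digest
    )
```

A change made through the normal filesystem path shows up in `changed_files` no matter how many mirrors dutifully copied it.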
* Re: ext4 features 2006-07-01 18:17 ` Tomasz Torcz 2006-07-03 9:44 ` Gabor Gombas 2006-07-03 20:22 ` Helge Hafting @ 2006-07-06 15:12 ` Ric Wheeler 2006-07-06 17:05 ` Krzysztof Halasa 2 siblings, 1 reply; 119+ messages in thread From: Ric Wheeler @ 2006-07-06 15:12 UTC (permalink / raw) To: Tomasz Torcz; +Cc: Thomas Glanzmann, Theodore Ts'o, LKML Tomasz Torcz wrote: >On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote: > > >>Hello, >> >> >>>Checksums are not very useful for themselves. They are useful when we >>>have other copy of data (think raid mirroring) so data can be >>>reconstructed from working copy. >>> >>> >>it would be possible to identify data corruption. >> >> > > Yes, but what good is identification? We could only return I/O error. >Ability to fix corruption (like ZFS) is the real killer. > > Having a checksum (or even a digital signature on a file) that lets us detect corruption is very useful since, in many cases, it allows us to flag the file as corrupt before it gets used. In some cases, this is a big hint that you should restore it from backup (tape, other disk, etc). I think that it is a generally useful thing even when not on a self correcting device, ric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 15:12 ` Ric Wheeler @ 2006-07-06 17:05 ` Krzysztof Halasa 2006-07-06 17:27 ` Ric Wheeler 0 siblings, 1 reply; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-06 17:05 UTC (permalink / raw) To: ric; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Ric Wheeler <ric@emc.com> writes: > Having a checksum (or even a digital signature on a file) that lets us > detect corruption is very useful since, in many cases, it allows us to > flag the file as corrupt before it gets used. We can't have that. Sector/block/etc. checksums - yes. A checksum, signature, hash etc. of the whole file would require actually reading the whole file. It can be done by tripwire or backup, and even by fsck, but not by the filesystem in normal operation. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 17:05 ` Krzysztof Halasa @ 2006-07-06 17:27 ` Ric Wheeler 2006-07-06 20:52 ` Valdis.Kletnieks 2006-07-07 17:34 ` Krzysztof Halasa 0 siblings, 2 replies; 119+ messages in thread From: Ric Wheeler @ 2006-07-06 17:27 UTC (permalink / raw) To: Krzysztof Halasa; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Krzysztof Halasa wrote: >Ric Wheeler <ric@emc.com> writes: > > > >>Having a checksum (or even a digital signature on a file) that lets us >>detect corruption is very useful since, in many cases, it allows us to >>flag the file as corrupt before it gets used. >> >> > >We can't have that. Sector/block/etc. checksums - yes. > > I certainly don't object to sector and block checksums, but they do require a specially formatted disk or high end array (which my employer would be happy to sell you ;-)). If you record a per sector or FS block level checksum in user space, you have to keep in mind the sheer size of today's commodity disks and the amount of space that would consume - it would be much more efficient to store one such signature per file. Where you put those checksums/signatures and when you look at them/update them/validate them can cause lots of headaches. >A checksum, signature, hash etc. of the whole file would require >actually reading the whole file. It can be done by tripwire or >backup, and even by fsck, but not by the filesystem in normal >operation. > > There was some talk about this at the file system mini-summit. Clearly, you would not want to compute (and continually update) the checksum/signature on an actively written file. It might be useful to compute at close time (or when you set a special attr, etc). We could also special case sequentially written files (storing & updating the partial signature as we go, but that could be a bit iffy). 
The key is to keep the signature/checksum with the file - tripwire and backup programs could do this (and even store it in their own extended attribute), but I think that it is more generically useful than that. If you care enough about the data integrity of a file, having this kind of optional validation on any open would be very useful. ric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 17:27 ` Ric Wheeler @ 2006-07-06 20:52 ` Valdis.Kletnieks 2006-07-07 17:41 ` Krzysztof Halasa 2006-07-07 17:34 ` Krzysztof Halasa 1 sibling, 1 reply; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-06 20:52 UTC (permalink / raw) To: ric Cc: Krzysztof Halasa, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML [-- Attachment #1: Type: text/plain, Size: 852 bytes --] On Thu, 06 Jul 2006 13:27:35 EDT, Ric Wheeler said: > The key is to keep the signature/checksum with the file - tripwire and > backup programs could do this (and even store it their own extended > attribute), but I think that it is more generically useful than that. Backup programs want it stored with the file. Tripwire wants it stored as far away from the file as possible. Remember - for Tripwire, we *don't* want the "current maintained value", we want "the snapshotted value from a known good state". If the filesystem stored a "guaranteed trustable current hash", Tripwire *could* use it to compare against its database rather than having to re-read the file and recompute it. Unfortunately, a useful trustable hash is basically incompatible with any sort of incremental updating (except for the special case of appending to the file). [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 20:52 ` Valdis.Kletnieks @ 2006-07-07 17:41 ` Krzysztof Halasa 0 siblings, 0 replies; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-07 17:41 UTC (permalink / raw) To: Valdis.Kletnieks Cc: ric, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Valdis.Kletnieks@vt.edu writes: > Backup programs want it stored with the file. Not necessarily - backup may want to store the hashes in some central place as well. I'm using such solution and it has only positives. > If the filesystem stored a "guaranteed trustable current hash", Tripwire > *could* use it to compare against its database rather than having to re-read > the file and recompute it. Unfortunately, a useful trustable hash is > basically incompatible with any sort of incremental updating (except for > the special case of appending to the file). Block hashes + master hash could allow something like that. Not sure if we want it in the fs, though. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
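The "block hashes + master hash" suggestion can be sketched as below: each fixed-size block carries its own hash, and the master hash is computed over the list of block hashes, so an in-place write only rehashes the blocks it touches. This is a toy in-memory model; `HashedFile`, `BLOCK_SIZE`, and the use of SHA-256 are assumptions made for illustration, not anything proposed in the thread.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the sketch


class HashedFile:
    """In-memory file that maintains one hash per block plus a master hash."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.block_hashes = [self._hash_block(i) for i in range(self._nblocks())]

    def _nblocks(self):
        return (len(self.data) + BLOCK_SIZE - 1) // BLOCK_SIZE

    def _hash_block(self, index):
        start = index * BLOCK_SIZE
        return hashlib.sha256(self.data[start:start + BLOCK_SIZE]).digest()

    def write(self, offset, payload):
        """Write payload at offset, rehashing only the blocks whose bytes changed."""
        if not payload:
            return
        old_len = len(self.data)
        end = offset + len(payload)
        if end > old_len:
            # Zero-fill any gap, as a sparse write past EOF would.
            self.data.extend(b"\0" * (end - old_len))
        self.data[offset:end] = payload
        # Blocks from the old end of file onward also change when the file grows.
        first = min(offset, old_len) // BLOCK_SIZE
        last = (end - 1) // BLOCK_SIZE
        while len(self.block_hashes) < self._nblocks():
            self.block_hashes.append(b"")
        for i in range(first, last + 1):
            self.block_hashes[i] = self._hash_block(i)

    def master_hash(self):
        """Hash over the block hashes; cheap compared with rereading the file."""
        return hashlib.sha256(b"".join(self.block_hashes)).hexdigest()
```

The master hash still has to be recomputed over all block hashes after each write, but that costs 32 bytes per block rather than a reread of the file data.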
* Re: ext4 features 2006-07-06 17:27 ` Ric Wheeler 2006-07-06 20:52 ` Valdis.Kletnieks @ 2006-07-07 17:34 ` Krzysztof Halasa 1 sibling, 0 replies; 119+ messages in thread From: Krzysztof Halasa @ 2006-07-07 17:34 UTC (permalink / raw) To: ric; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML Ric Wheeler <ric@emc.com> writes: > The key is to keep the signature/checksum with the file - tripwire and > backup programs could do this (and even store it their own extended > attribute), but I think that it is more generically useful than that. I can't still see a sane way to do it. Yes, there might be some way for very special purposes but there is no solution for general use. > If you care enough about the data integrity of a file, having this > kind of optional validation on any open would be very useful. Given we can only do that for very specific purposes, I think the userspace is better suited. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-01 16:33 ext4 features Thomas Glanzmann 2006-07-01 17:07 ` Tomasz Torcz @ 2006-07-04 1:02 ` Theodore Tso 2006-07-04 19:16 ` Thomas Glanzmann ` (3 more replies) 2006-07-04 14:36 ` Andi Kleen 2 siblings, 4 replies; 119+ messages in thread From: Theodore Tso @ 2006-07-04 1:02 UTC (permalink / raw) To: Thomas Glanzmann, LKML On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote: > I would like to know which new features are planed to be incorported by > ext4. So far I only read about supporting bigger filesystems to fit > recent hardware developments. So are there any other big goals for ext4? Some of the ideas which have been tossed about include: * nanosecond timestamps, and support for times beyond 2038 * extents (better performance, faster fsck times) * persistent preallocation (valid bit in the extent) * larger extended attributes * checksums for metadata ... but the list of features is not necessarily fixed; if you have any great ideas, patches are always appreciated. :-) > What I personally would like to see most in ext4 are > > * checksums for data One of the more interesting ways of implementing this is that newer disks will be providing a facility (at the SCSI layer, and presumably eventually for SATA drives as well) where a checksum and some "application" (read: filesystem) data can be stored alongside each sector. The way this works, as I understand it, is that the OS provides the sector-level checksum as part of the write operation, which is then checked by the disk before it is written (to catch corruption at the bus level) and written on the disk. On a read operation, the checksum is read, and the data verified at the disk, as well as being passed back to the OS, so the OS can do end-to-end level checksum checking. More interestingly, there is space for "application level" (read: filesystem) tagged data, which we could use to store information about the inode # and logical block # that a particular data block is associated with. 
This would allow for much better recoverability from the inode table getting trashed. (Of course, the amount of time it would take to recover such a file via this method for future terabyte and petabyte filesystems is such that restoring from backup tapes is almost always going to be faster. So such a scheme would only be used when some Ph.D. student has ten years of thesis research on a disk with no backups and then accidentally runs mkfs on the wrong partition..... of course, one could argue that such a stupid student doesn't *deserve* to get a Ph.D. :-) > * and snapshots on filesystem basis This requires a filesystem that is designed from the get-go to support snapshots. So yes, it's likely not going to happen for ext4. Although, if you have a really clever idea, feel free to post patches or a detailed technical proposal for how to achieve such a goal. :-) - Ted ^ permalink raw reply [flat|nested] 119+ messages in thread
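The recovery idea in this message, tagging each data block with its inode number and logical block number so files can be rebuilt after the inode table is lost, can be sketched as a scan over tagged sectors. This is a toy model under stated assumptions: `TaggedSector`, `make_sector`, and `rebuild_files` are hypothetical names, CRC32 stands in for the per-sector checksum, and the field sizes of any real on-disk or on-wire format are not modeled.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaggedSector:
    data: bytes
    checksum: int       # per-sector checksum supplied by the OS at write time
    inode: int          # hypothetical filesystem tag: owning inode number
    logical_block: int  # hypothetical filesystem tag: position within the file


def make_sector(data, inode, logical_block):
    """Build a sector the way the OS would hand it to the disk on a write."""
    return TaggedSector(data, zlib.crc32(data), inode, logical_block)


def rebuild_files(sectors):
    """Scan every sector, drop the ones that fail their checksum, and
    reassemble file contents per inode in logical block order."""
    files = defaultdict(dict)
    for s in sectors:
        if zlib.crc32(s.data) != s.checksum:
            continue  # corrupted sector: cannot be trusted for recovery
        files[s.inode][s.logical_block] = s.data
    return {
        ino: b"".join(blocks[i] for i in sorted(blocks))
        for ino, blocks in files.items()
    }
```

The scan visits every sector on the device, which is why the message expects restoring from backup to be faster on very large filesystems.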
* Re: ext4 features 2006-07-04 1:02 ` Theodore Tso @ 2006-07-04 19:16 ` Thomas Glanzmann 2006-07-04 19:30 ` Valdis.Kletnieks ` (2 subsequent siblings) 3 siblings, 0 replies; 119+ messages in thread From: Thomas Glanzmann @ 2006-07-04 19:16 UTC (permalink / raw) To: Theodore Tso, LKML Hello, ... wow ... thank you for all the awareness training. I now have a much better idea of what is happening. And who knows, maybe I will submit some patches if ext4 isn't already released in three months. I didn't know about the checksum capability of newer drives. I only knew about the DMA CRC. But it is definitely the right way to go. Thomas ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 1:02 ` Theodore Tso 2006-07-04 19:16 ` Thomas Glanzmann @ 2006-07-04 19:30 ` Valdis.Kletnieks 2006-07-05 12:24 ` Bill Davidsen 2006-07-05 14:04 ` Avi Kivity 3 siblings, 0 replies; 119+ messages in thread From: Valdis.Kletnieks @ 2006-07-04 19:30 UTC (permalink / raw) To: Theodore Tso; +Cc: Thomas Glanzmann, LKML [-- Attachment #1: Type: text/plain, Size: 824 bytes --] On Mon, 03 Jul 2006 21:02:40 EDT, Theodore Tso said: > So such a scheme would only be used when some Ph.D. student has ten > years of thesis research on a disk with no backups and then > accidentally runs mkfs on the wrong partition..... of course, one > could argue that such a stupid student doesnt *deserve* to get a Ph.D. :-) The more common use case is a department hires a grad student to run the department server rather than somebody who knows what they're doing (but costs more than a grad student stipend), and said grad student first sets up a borked backup scheme that looks like it works, but doesn't actually produce restorable backups, and then runs mkfs on /home, nuking all the thesis work of all the students.... (And yes, I've seen that more than once in a quarter century of working at .edu's... ;) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-04 1:02 ` Theodore Tso 2006-07-04 19:16 ` Thomas Glanzmann 2006-07-04 19:30 ` Valdis.Kletnieks @ 2006-07-05 12:24 ` Bill Davidsen 2006-07-05 12:59 ` J. Bruce Fields 2006-07-05 14:04 ` Avi Kivity 3 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 12:24 UTC (permalink / raw) To: Theodore Tso, Thomas Glanzmann, LKML Theodore Tso wrote: > On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote: >> I would like to know which new features are planed to be incorported by >> ext4. So far I only read about supporting bigger filesystems to fit >> recent hardware developments. So are there any other big goals for ext4? > > Some of the ideas which have been tossed about include: > > * nanosecond timestamps, and support for time beyond the 2038 The 2nd one is probably more urgent than the first. I can see a general benefit from timestamp in ms, beyond that seems to be a specialty requirement best provided at the application level rather than the bits of a trillion inodes which need no such thing. One argument against it is that with SMP with *almost* the same time in each CPU, cache everywhere in the i/o process, and various flavors of network filesystems, the atime/mtime become less and less useful for determining with great precision which file is most recently modified or accessed. -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. ^ permalink raw reply [flat|nested] 119+ messages in thread
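The two numbers behind this exchange are easy to check. A signed 32-bit count of seconds since 1970 runs out in January 2038, and a sub-second field costs 10 bits at millisecond resolution versus 30 bits at nanosecond resolution. The helper names below are illustrative only and say nothing about how any filesystem actually lays out its inode fields.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_representable(bits=32):
    """Latest time a signed seconds-since-epoch counter of the given width can hold."""
    return EPOCH + timedelta(seconds=2 ** (bits - 1) - 1)


def bits_for_fraction(units_per_second):
    """Bits needed to store a sub-second value in the range 0 .. units_per_second - 1."""
    return (units_per_second - 1).bit_length()


print(last_representable().isoformat())   # 2038-01-19T03:14:07+00:00
print(bits_for_fraction(1_000))           # 10
print(bits_for_fraction(1_000_000_000))   # 30
```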
* Re: ext4 features 2006-07-05 12:24 ` Bill Davidsen @ 2006-07-05 12:59 ` J. Bruce Fields 2006-07-05 13:17 ` Pádraig Brady ` (2 more replies) 0 siblings, 3 replies; 119+ messages in thread From: J. Bruce Fields @ 2006-07-05 12:59 UTC (permalink / raw) To: Bill Davidsen; +Cc: Theodore Tso, Thomas Glanzmann, LKML On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: > Theodore Tso wrote: > >Some of the ideas which have been tossed about include: > > > > * nanosecond timestamps, and support for time beyond the 2038 > > The 2nd one is probably more urgent than the first. I can see a general > benefit from timestamp in ms, beyond that seems to be a specialty > requirement best provided at the application level rather than the bits > of a trillion inodes which need no such thing. What's urgently needed for NFS (and I suspect for most other applications demanding higher timestamps) isn't really nanosecond precision so much as something that's guaranteed to increase whenever the file changes. Of course, just adding space in the inodes for nanoseconds isn't sufficient. XFS, for example, has nanosecond timestamps, but it's still easy to modify a file twice without seeing the ctime or mtime change. So either we need a timesource guaranteed to tick faster than the kernel can process IO, or we have to be willing to, say, add 1 to the nanoseconds field whenever the time doesn't change between operations. Or we could add an entirely separate attribute that's guaranteed to increase whenever the ctime is updated, and that doesn't necessarily have any connection with time--call it a version number or something. --b. ^ permalink raw reply [flat|nested] 119+ messages in thread
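The two alternatives raised at the end of this message, bumping the nanoseconds field when the clock has not advanced, or keeping a separate counter that increases on every change, can be sketched as follows. `Inode`, `touch_bump`, and `touch_versioned` are hypothetical names for illustration and are not kernel interfaces.

```python
class Inode:
    """Toy inode holding a nanosecond mtime and a separate change counter."""

    def __init__(self):
        self.mtime_ns = 0
        self.version = 0


def touch_bump(inode, now_ns):
    """Option 1: if the clock did not advance past the stored time,
    add 1 to the nanoseconds so the timestamp still changes."""
    inode.mtime_ns = max(now_ns, inode.mtime_ns + 1)


def touch_versioned(inode, now_ns):
    """Option 2: store the clock value as-is and bump a counter that has
    no connection with time but increases on every modification."""
    inode.mtime_ns = now_ns
    inode.version += 1
```

With the first option the stored time can drift slightly ahead of the clock under a burst of writes; with the second the timestamp stays honest and clients compare the counter instead.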
* Re: ext4 features 2006-07-05 12:59 ` J. Bruce Fields @ 2006-07-05 13:17 ` Pádraig Brady 2006-07-05 19:33 ` Trond Myklebust 2006-07-05 21:12 ` Bill Davidsen 2 siblings, 0 replies; 119+ messages in thread From: Pádraig Brady @ 2006-07-05 13:17 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Bill Davidsen, Theodore Tso, Thomas Glanzmann, LKML J. Bruce Fields wrote: > On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: > >>Theodore Tso wrote: >> >>>Some of the ideas which have been tossed about include: >>> >>> * nanosecond timestamps, and support for time beyond the 2038 >> >>The 2nd one is probably more urgent than the first. I can see a general >>benefit from timestamp in ms, beyond that seems to be a specialty >>requirement best provided at the application level rather than the bits >>of a trillion inodes which need no such thing. > > > What's urgently needed for NFS (and I suspect for most other > applications demanding higher timestamps) isn't really nanosecond > precision so much as something that's guaranteed to increase whenever > the file changes. Yes please! http://lkml.org/lkml/2001/10/8/18 Pádraig. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 12:59 ` J. Bruce Fields 2006-07-05 13:17 ` Pádraig Brady @ 2006-07-05 19:33 ` Trond Myklebust 2006-07-05 21:22 ` Bill Davidsen 2006-07-05 21:12 ` Bill Davidsen 2 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-05 19:33 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Bill Davidsen, Theodore Tso, Thomas Glanzmann, LKML On Wed, 2006-07-05 at 08:59 -0400, J. Bruce Fields wrote: > On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: > > Theodore Tso wrote: > > >Some of the ideas which have been tossed about include: > > > > > > * nanosecond timestamps, and support for time beyond the 2038 > > > > The 2nd one is probably more urgent than the first. I can see a general > > benefit from timestamp in ms, beyond that seems to be a specialty > > requirement best provided at the application level rather than the bits > > of a trillion inodes which need no such thing. > > What's urgently needed for NFS (and I suspect for most other > applications demanding higher timestamps) isn't really nanosecond > precision so much as something that's guaranteed to increase whenever > the file changes. NFS doesn't necessarily require monotonicity either. The only real requirement that knfsd has is that the timestamp needs to change every time the file data (mtime+ctime) and/or metadata (ctime only) is changed. Applications like 'make' OTOH, probably would be happier if the timestamps are guaranteed to be monotonic. Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 19:33 ` Trond Myklebust @ 2006-07-05 21:22 ` Bill Davidsen 2006-07-05 21:42 ` Trond Myklebust 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 21:22 UTC (permalink / raw) To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML Trond Myklebust wrote: >On Wed, 2006-07-05 at 08:59 -0400, J. Bruce Fields wrote: > > >>On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: >> >> >>>Theodore Tso wrote: >>> >>> >>>>Some of the ideas which have been tossed about include: >>>> >>>> * nanosecond timestamps, and support for time beyond the 2038 >>>> >>>> >>>The 2nd one is probably more urgent than the first. I can see a general >>>benefit from timestamp in ms, beyond that seems to be a specialty >>>requirement best provided at the application level rather than the bits >>>of a trillion inodes which need no such thing. >>> >>> >>What's urgently needed for NFS (and I suspect for most other >>applications demanding higher timestamps) isn't really nanosecond >>precision so much as something that's guaranteed to increase whenever >>the file changes. >> >> > >NFS doesn't necessarily require monotonicity either. The only real >requirement that knfsd has is that the timestamp needs to change every >time the file data (mtime+ctime) and/or metadata (ctime only) is >changed. > >Applications like 'make' OTOH, probably would be happier if the >timestamps are guaranteed to be monotonic. > Consider the case where the build machine reads source from one network filesystem and write the binary result to another on another machine. If you know that I have the kernel source on a file server, do the compiles on a compute server, and store the binaries on three test machines for evaluation, you might guess this really can happen. Just increasing the timestamp may not solve the problem, unless you have a system call to set timestamp over network f/s, like a high resolution touch. 
It's a problem when there are multiple times involved. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 21:22 ` Bill Davidsen @ 2006-07-05 21:42 ` Trond Myklebust 2006-07-08 21:04 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-05 21:42 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Wed, 2006-07-05 at 17:22 -0400, Bill Davidsen wrote: > Consider the case where the build machine reads source from one network > filesystem and write the binary result to another on another machine. If > you know that I have the kernel source on a file server, do the compiles > on a compute server, and store the binaries on three test machines for > evaluation, you might guess this really can happen. Just increasing the > timestamp may not solve the problem, unless you have a system call to > set timestamp over network f/s, like a high resolution touch. If you are running 'touch' manually on all your files, you can always arrange to set the timestamp to something more recent. You don't normally need a high resolution version of utimes() (and SuSv3 won't provide you with one). Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 21:42 ` Trond Myklebust @ 2006-07-08 21:04 ` Bill Davidsen 2006-07-10 20:08 ` Trond Myklebust 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-08 21:04 UTC (permalink / raw) To: linux-kernel; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML Trond Myklebust wrote: > On Wed, 2006-07-05 at 17:22 -0400, Bill Davidsen wrote: >> Consider the case where the build machine reads source from one network >> filesystem and write the binary result to another on another machine. If >> you know that I have the kernel source on a file server, do the compiles >> on a compute server, and store the binaries on three test machines for >> evaluation, you might guess this really can happen. Just increasing the >> timestamp may not solve the problem, unless you have a system call to >> set timestamp over network f/s, like a high resolution touch. > > If you are running 'touch' manually on all your files, you can always > arrange to set the timestamp to something more recent. You don't > normally need a high resolution version of utimes() (and SuSv3 won't > provide you with one). No, I didn't quite mean a manual touch, but a system call to "close and set time to high resolution" for files where time uniformity is important. Consider that in most cases the inodes times are set by the host machine clock, which I close the change reflects the fileserving host idea of time. If there were a call to close a file and set the times like touch, then that could be used, for both local and network files. Clearly if multiple clients are changing the same file that doesn't work, and I doubt that any solution is going to avoid at least some undesired side effects. -- Bill Davidsen <davidsen@tmr.com> Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a normal user and is setuid root, with the "vi" line edit mode selected, and the character set is "big5," an off-by-one errors occurs during wildcard (glob) expansion. 
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-08 21:04 ` Bill Davidsen @ 2006-07-10 20:08 ` Trond Myklebust 2006-07-10 22:37 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-10 20:08 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote: > No, I didn't quite mean a manual touch, but a system call to "close and > set time to high resolution" for files where time uniformity is > important. Consider that in most cases the inodes times are set by the > host machine clock, which I close the change reflects the fileserving > host idea of time. If there were a call to close a file and set the > times like touch, then that could be used, for both local and network files. Close should never update the time since that would be a violation of POSIX rules. Normally, an NFS client will never need to update the time: RPC calls like WRITE, READ and SETATTR will automatically do it for us whenever necessary. Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-10 20:08 ` Trond Myklebust @ 2006-07-10 22:37 ` Bill Davidsen 2006-07-11 2:36 ` Trond Myklebust 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-10 22:37 UTC (permalink / raw) To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML Trond Myklebust wrote: >On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote: > > >>No, I didn't quite mean a manual touch, but a system call to "close and >>set time to high resolution" for files where time uniformity is >>important. Consider that in most cases the inodes times are set by the >>host machine clock, which I close the change reflects the fileserving >>host idea of time. If there were a call to close a file and set the >>times like touch, then that could be used, for both local and network files. >> >> > >Close should never update the time since that would be a violation of >POSIX rules. Normally, an NFS client will never need to update the time: >RPC calls like WRITE, READ and SETATTR will automatically do it for us >whenever necessary. > > Let me restate this a third time in another way. What I suggest is a system call, NOT NAMED CLOSE, which does a close and touch. This was all blue-sky discussion; new system calls are as valid as nanosecond resolution and synchronization between servers. Since this is a new call it is not specified by POSIX. And Linus has already suggested that he would accept something similar, when I proposed something like "noatime" mounts, which only updated atime and mtime on open and close, to keep metadata relevant but not have the overhead of constant updates. Actually, now that I have more free time I may revisit that idea. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-10 22:37 ` Bill Davidsen @ 2006-07-11 2:36 ` Trond Myklebust 2006-07-21 3:10 ` Bill Davidsen 0 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-11 2:36 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote: > Trond Myklebust wrote: > > >On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote: > > > > > >>No, I didn't quite mean a manual touch, but a system call to "close and > >>set time to high resolution" for files where time uniformity is > >>important. Consider that in most cases the inodes times are set by the > >>host machine clock, which I close the change reflects the fileserving > >>host idea of time. If there were a call to close a file and set the > >>times like touch, then that could be used, for both local and network files. > >> > >> > > > >Close should never update the time since that would be a violation of > >POSIX rules. Normally, an NFS client will never need to update the time: > >RPC calls like WRITE, READ and SETATTR will automatically do it for us > >whenever necessary. > > > > > > Let me restate this a third time in another way. What I suggest is a > system call, NOT NAMED CLOSE, which does a close and touch. This was all > blue sky discussion, new system calls are as valid as nanosecond > resolution and syncronization between servers. Since this is a new call > it is not specified by POSIX. > > And Linus has already suggested that he would accept something similar, > when I proposed something like "noatime" mounts, which only updated > atime and mtime on open and close, to keep metadata relevant but not > have the overhead of constant updates. > > Actually, now that I have more free time I may revisit that idea. Linus might accept it, but I won't. It is totally unnecessary. Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-11 2:36 ` Trond Myklebust @ 2006-07-21 3:10 ` Bill Davidsen 2006-07-21 12:06 ` Trond Myklebust 0 siblings, 1 reply; 119+ messages in thread From: Bill Davidsen @ 2006-07-21 3:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML Trond Myklebust wrote: >On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote: > > >Linus might accept it, but I won't. It is totally unnecessary. > > By "totally unnecessary" you mean "I don't see why it's useful." The reason for using noatime is to avoid generating disk activity while the data is being accessed. It's not usually used to hide the fact that the data has been used and is therefore useful to someone. In a perfect world, where money is no object, all data is on very fast storage which never fails. In my world I would like to identify which data, source or documentation, has been referenced over some period of time. This is useful for moving some data to slower (yes I mean less expensive) storage. It's also useful to identify stuff which no one has used in a very long time and which is a candidate for not being on line at all. By keeping lazy track of access time it's possible to still have that data, with minimal disk access cost. And to some people that can be really useful, such as those of us who have to justify expenditures. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-21 3:10 ` Bill Davidsen @ 2006-07-21 12:06 ` Trond Myklebust 2006-07-21 14:36 ` Theodore Tso 0 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-21 12:06 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Thu, 2006-07-20 at 23:10 -0400, Bill Davidsen wrote: > Trond Myklebust wrote: > > >On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote: > > > > > >Linus might accept it, but I won't. It is totally unnecessary. > > > > > > By "totally unnecessary" you mean "I don't see why it's useful." > > The reason for using noatime is to avoid generating disk activity while > the data is being accessed. It's not usually used to hide the fact that > the data has been used and is therefore useful to someone. In a perfect > world, where money is no object, all data is on very fast storage which > never fails. In my world I would like to identify which data, source or > documentation, has been referenced over some period of time. This is > useful for moving some data to slower (yes I mean less expensive) storage. > > It's also useful to identify stuff which no one has used in a very long > time and which is a candidate for not being on line at all. > > By keeping lazy track of access time it's possible to still have that > data, with minimal disk access cost. And to some people that can be > really useful, such as those of us who have to justify expenditures. What you propose violates both POSIX and SuSv3. close() does not update the atime on a file. I can't see anyone accepting that there is a need for this. If you want to force close to update atime automatically on your system, then you should already be able to hack up libc to do it. There are no discernable advantages to doing it in the kernel. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-21 12:06 ` Trond Myklebust @ 2006-07-21 14:36 ` Theodore Tso 2006-07-21 19:02 ` Trond Myklebust 0 siblings, 1 reply; 119+ messages in thread From: Theodore Tso @ 2006-07-21 14:36 UTC (permalink / raw) To: Trond Myklebust; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML On Fri, Jul 21, 2006 at 08:06:10AM -0400, Trond Myklebust wrote: > > By keeping lazy track of access time it's possible to still have that > > data, with minimal disk access cost. And to some people that can be > > really useful, such as those of us who have to justify expenditures. > > What you propose violates both POSIX and SuSv3. close() does not update > the atime on a file. I can't see anyone accepting that there is a need > for this. Nope, it doesn't violate POSIX/SuSv3. The specifications only control what happens if the system is cleanly shutdown. What happens on an unclean shutdown is explicitly undefined. Hence, lazy atime update where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit, and where "mostly clean" inodes are only flushed out to disk when they become fully dirty and then written out to disk, or when the filesystem is unmounted, or when the filesystem feels like it (i.e., when we need to clear out in-core inodes in response to memory pressure), would in fact be completely POSIX/SuSv3 compliant. - Ted ^ permalink raw reply [flat|nested] 119+ messages in thread
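[Editor's illustration] A minimal sketch of the lazy atime scheme Ted describes above, written in Python purely as a model. The class and method names (`Inode`, `LazyAtimeCache`, `writeback`, `reclaim`, `unmount`) are hypothetical and do not correspond to any kernel code. The sketch assumes an in-core inode carries a "dirty" flag and a separate "atime-dirty" flag, and that periodic writeback only looks at the former.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    ino: int
    atime: float = 0.0
    mtime: float = 0.0
    dirty: bool = False        # data or non-atime metadata changed
    atime_dirty: bool = False  # only atime changed ("mostly clean")


class LazyAtimeCache:
    """Toy in-core inode cache that defers atime-only updates."""

    def __init__(self):
        self.inodes = {}
        self.disk_writes = []  # (ino, reason) for every simulated inode write

    def add(self, ino):
        self.inodes[ino] = Inode(ino)
        return self.inodes[ino]

    def read(self, ino, now):
        # A read updates atime in memory only.
        inode = self.inodes[ino]
        inode.atime = now
        if not inode.dirty:
            inode.atime_dirty = True

    def write(self, ino, now):
        # A write makes the inode fully dirty; the pending atime rides along.
        inode = self.inodes[ino]
        inode.mtime = now
        inode.dirty = True
        inode.atime_dirty = False

    def _flush(self, inode, reason):
        self.disk_writes.append((inode.ino, reason))
        inode.dirty = False
        inode.atime_dirty = False

    def writeback(self):
        # Periodic writeback skips inodes that are merely atime-dirty.
        for inode in self.inodes.values():
            if inode.dirty:
                self._flush(inode, "writeback")

    def reclaim(self, ino):
        # Memory pressure: evict the inode, flushing any pending state first.
        inode = self.inodes.pop(ino)
        if inode.dirty or inode.atime_dirty:
            self._flush(inode, "reclaim")

    def unmount(self):
        # Clean unmount: everything pending reaches disk.
        for inode in self.inodes.values():
            if inode.dirty or inode.atime_dirty:
                self._flush(inode, "unmount")
        self.inodes.clear()
```

In this model any number of reads between two writes costs no inode write at all, and the latest atime still reaches disk on eviction or on a clean unmount. An unclean shutdown would lose the pending atime, which is the case Ted notes the specifications leave undefined.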
* Re: ext4 features 2006-07-21 14:36 ` Theodore Tso @ 2006-07-21 19:02 ` Trond Myklebust 2006-07-22 12:25 ` Theodore Tso 0 siblings, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-21 19:02 UTC (permalink / raw) To: Theodore Tso; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML On Fri, 2006-07-21 at 10:36 -0400, Theodore Tso wrote: > On Fri, Jul 21, 2006 at 08:06:10AM -0400, Trond Myklebust wrote: > > > By keeping lazy track of access time it's possible to still have that > > > data, with minimal disk access cost. And to some people that can be > > > really useful, such as those of us who have to justify expenditures. > > > > What you propose violates both POSIX and SuSv3. close() does not update > > the atime on a file. I can't see anyone accepting that there is a need > > for this. > > Nope, it doesn't violate POSIX/SuSv3. The specifications only control > what happens if the system is cleanly shutdown. What happens on an > unclean shutdown is explicitly undefined. Hence, lazy atime update > where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit, > and where "mostly clean" inodes are only flushed out to disk when they > become fully dirty and then written out to disk, or when the > filesystem is unmounted, or when the filesystem feels like it (i.e., > when we need to clear out in-core inodes in response to memory > pressure), would in fact be completely POSIX/SuSv3 compliant. I agree that POSIX does not place any requirements on caching, but what you propose is impossible to implement on NFS: you may be able to get the atime 'right' (assuming that you are using something like ntp to ensure that client and server are in sync) but the NFS SETATTR primitives do not permit you to set the ctime, so that will be set to the time on the server it processed your SETATTR call (i.e. the close time). That violates POSIX semantics. Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-21 19:02 ` Trond Myklebust @ 2006-07-22 12:25 ` Theodore Tso 0 siblings, 0 replies; 119+ messages in thread From: Theodore Tso @ 2006-07-22 12:25 UTC (permalink / raw) To: Trond Myklebust; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML On Fri, Jul 21, 2006 at 03:02:35PM -0400, Trond Myklebust wrote: > > Nope, it doesn't violate POSIX/SuSv3. The specifications only control > > what happens if the system is cleanly shutdown. What happens on an > > unclean shutdown is explicitly undefined. Hence, lazy atime update > > where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit, > > and where "mostly clean" inodes are only flushed out to disk when they > > become fully dirty and then written out to disk, or when the > > filesystem is unmounted, or when the filesystem feels like it (i.e., > > when we need to clear out in-core inodes in response to memory > > pressure), would in fact be completely POSIX/SuSv3 compliant. > > I agree that POSIX does not place any requirements on caching, but what > you propose is impossible to implement on NFS: you may be able to get > the atime 'right' (assuming that you are using something like ntp to > ensure that client and server are in sync) but the NFS SETATTR > primitives do not permit you to set the ctime, so that will be set to > the time on the server it processed your SETATTR call (i.e. the close > time). That violates POSIX semantics. Sure, this is something that could only be done on local disk filesystems. But those are the ones most likely to be running on battery power on notebook computers. :-) - Ted ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 12:59 ` J. Bruce Fields 2006-07-05 13:17 ` Pádraig Brady 2006-07-05 19:33 ` Trond Myklebust @ 2006-07-05 21:12 ` Bill Davidsen 2006-07-05 21:27 ` linux-os (Dick Johnson) 2006-07-05 21:41 ` J. Bruce Fields 2 siblings, 2 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-05 21:12 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Theodore Tso, Thomas Glanzmann, LKML J. Bruce Fields wrote: >On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: > > >>Theodore Tso wrote: >> >> >>>Some of the ideas which have been tossed about include: >>> >>> * nanosecond timestamps, and support for time beyond the 2038 >>> >>> >>The 2nd one is probably more urgent than the first. I can see a general >>benefit from timestamp in ms, beyond that seems to be a specialty >>requirement best provided at the application level rather than the bits >>of a trillion inodes which need no such thing. >> >> > >What's urgently needed for NFS (and I suspect for most other >applications demanding higher timestamps) isn't really nanosecond >precision so much as something that's guaranteed to increase whenever >the file changes. > >Of course, just adding space in the inodes for nanoseconds isn't >sufficient. XFS, for example, has nanosecond timestamps, but it's still >easy to modify a file twice without seeing the ctime or mtime change. >So either we need a timesource guaranteed to tick faster than the kernel >can process IO, or we have to be willing to, say, add 1 to the >nanoseconds field whenever the time doesn't change between operations. > >Or we could add an entirely separate attribute that's guaranteed to >increase whenever the ctime is updated, and that doesn't necessarily >have any connection with time--call it a version number or something. > > There are versions in both VMS and the ISO filesystem. I have a sneaking suspicion that those of us who ever use them are few and far between. 
The other issue is that unless the field is time, programs like make can't really use it, at least without becoming Linux specific. I'm not sure exactly how a "version" value would be used other than detecting the fact that the file had been changed in some way. Feel free to show me, I seem to come up empty on using this value. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
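[Editor's illustration] One of the options in the quoted text is to "add 1 to the nanoseconds field whenever the time doesn't change between operations". A minimal Python sketch of that rule follows. The function name `next_ctime_ns` is hypothetical, and the choice to also bump when the clock appears to step backwards is an assumption of this sketch, not something stated in the thread.

```python
def next_ctime_ns(prev_ns: int, clock_ns: int) -> int:
    """Return a ctime (in nanoseconds) strictly greater than prev_ns.

    prev_ns  -- the ctime currently stored in the inode
    clock_ns -- what the time source reports right now
    """
    if clock_ns > prev_ns:
        return clock_ns   # the clock advanced: use it as-is
    return prev_ns + 1    # the clock did not advance: bump by one nanosecond


def apply_changes(start_ns: int, clock_readings):
    """Feed a sequence of clock readings through the rule above."""
    ctime = start_ns
    history = []
    for reading in clock_readings:
        ctime = next_ctime_ns(ctime, reading)
        history.append(ctime)
    return history
```

With a coarse time source that returns the same value for several consecutive changes, `apply_changes` still yields a distinct, increasing ctime for each one, at the cost of the stored value drifting slightly ahead of the real clock.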
* Re: ext4 features 2006-07-05 21:12 ` Bill Davidsen @ 2006-07-05 21:27 ` linux-os (Dick Johnson) 2006-07-05 21:41 ` J. Bruce Fields 1 sibling, 0 replies; 119+ messages in thread From: linux-os (Dick Johnson) @ 2006-07-05 21:27 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Wed, 5 Jul 2006, Bill Davidsen wrote: > J. Bruce Fields wrote: > >> On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote: >> >> >>> Theodore Tso wrote: >>> >>> >>>> Some of the ideas which have been tossed about include: >>>> >>>> * nanosecond timestamps, and support for time beyond the 2038 >>>> >>>> >>> The 2nd one is probably more urgent than the first. I can see a general >>> benefit from timestamp in ms, beyond that seems to be a specialty >>> requirement best provided at the application level rather than the bits >>> of a trillion inodes which need no such thing. >>> >>> >> >> What's urgently needed for NFS (and I suspect for most other >> applications demanding higher timestamps) isn't really nanosecond >> precision so much as something that's guaranteed to increase whenever >> the file changes. >> >> Of course, just adding space in the inodes for nanoseconds isn't >> sufficient. XFS, for example, has nanosecond timestamps, but it's still >> easy to modify a file twice without seeing the ctime or mtime change. >> So either we need a timesource guaranteed to tick faster than the kernel >> can process IO, or we have to be willing to, say, add 1 to the >> nanoseconds field whenever the time doesn't change between operations. >> >> Or we could add an entirely separate attribute that's guaranteed to >> increase whenever the ctime is updated, and that doesn't necessarily >> have any connection with time--call it a version number or something. >> >> > There are versions in both VMS and the ISO filesystem. I have a sneaking > suspicion that those of us who ever use them are few and far between. 
> The other issue is that unless the field is time, programs like make > can't really use it, at least without becoming Linux specific. > > I'm not sure exactly how a "version" value would be used other than > detecting the fact that the file had been changed in some way. Feel free > to show me, I seem to come up empty on using this value. > > -- > bill davidsen <davidsen@tmr.com> > CTO TMR Associates, Inc > Doing interesting things with small computers since 1979 Currently a version is just a convention for not deleting on create. Remember VAX/VMS MYFILE.TXT;1, create another one, you have MYFILE.TXT;2. They are not related in any way. Any internal value won't be much use if Unix semantics are retained because multiple directory entries pointing to the same file are just links. And identical names, pointing to different files in the same directory are prevented as well. Maybe the 'version' is the number of times the file has been modified since creation. This might be useful. Cheers, Dick Johnson Penguin : Linux version 2.6.16.4 on an i686 machine (5592.88 BogoMips). New book: http://www.AbominableFirebug.com/ ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-05 21:12 ` Bill Davidsen 2006-07-05 21:27 ` linux-os (Dick Johnson) @ 2006-07-05 21:41 ` J. Bruce Fields 2006-07-06 2:32 ` Bill Davidsen 1 sibling, 1 reply; 119+ messages in thread From: J. Bruce Fields @ 2006-07-05 21:41 UTC (permalink / raw) To: Bill Davidsen; +Cc: Theodore Tso, Thomas Glanzmann, LKML On Wed, Jul 05, 2006 at 05:12:54PM -0400, Bill Davidsen wrote: > J. Bruce Fields wrote: > >Or we could add an entirely separate attribute that's guaranteed to > >increase whenever the ctime is updated, and that doesn't necessarily > >have any connection with time--call it a version number or something. > > > There are versions in both VMS and the ISO filesystem. I have a sneaking > suspicion that those of us who ever use them are few and far between. > The other issue is that unless the field is time, programs like make > can't really use it, at least without becoming Linux specific. Sure. > I'm not sure exactly how a "version" value would be used other than > detecting the fact that the file had been changed in some way. I agree. But "detecting the fact that the file has been changed" is a really important use! I think the challenge would be to come up with applications that really depend on timestamps and that use them for anything *other* than detecting when a file has changed. (OK, so make is a special case--it cares not only about whether a file has changed, but also about whether it has changed more recently than some other file. But I'd think a simple version would useful to any network filesystem, or more generally to anything that caches a view of the filesystem either on another machine or in userspace.) > Feel free to show me, I seem to come up empty on using this value. Betraying my own interests--the NFSv4 protocol (unlike v2 or v3) uses a specialized "change" attribute to maintain cache consistency instead of depending on mtime/ctime. So nfsd would be one immediate in-kernel user. 
Currently we're using ctime, which causes obvious problems. But an improved ctime--one that actually increased whenever the file changed--would also do the job. --b. ^ permalink raw reply [flat|nested] 119+ messages in thread
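[Editor's illustration] A toy Python model of why a cache validator that changes on every modification matters to a caching client. `ServerFile` and `CachingClient` are hypothetical names, not NFS or kernel interfaces. The sketch assumes the server stores ctime with one-second granularity and also keeps a counter that is incremented on every change, standing in for the NFSv4 "change" attribute mentioned above.

```python
class ServerFile:
    """File on the server with a coarse ctime and a change counter."""

    def __init__(self):
        self.data = b""
        self.ctime = 0    # whole seconds only (assumed coarse granularity)
        self.change = 0   # incremented on every modification

    def write(self, data, now):
        self.data = data
        self.ctime = int(now)
        self.change += 1


class CachingClient:
    """Client that revalidates its cached copy against one attribute."""

    def __init__(self, server_file, use_change_attr):
        self.file = server_file
        self.use_change_attr = use_change_attr
        self.cached_data = None
        self.cached_validator = None

    def read(self):
        f = self.file
        validator = f.change if self.use_change_attr else f.ctime
        if validator != self.cached_validator:
            # Validator moved: the cache is stale, refetch the data.
            self.cached_data = f.data
            self.cached_validator = validator
        return self.cached_data


f = ServerFile()
f.write(b"v1", 100.2)
by_ctime = CachingClient(f, use_change_attr=False)
by_change = CachingClient(f, use_change_attr=True)
first = (by_ctime.read(), by_change.read())

f.write(b"v2", 100.7)   # second write lands within the same second
second = (by_ctime.read(), by_change.read())
```

After the second write, the client that validates on ctime still returns the old contents because the stored ctime did not move, while the client that validates on the counter refetches. An "improved ctime" that is guaranteed to change on every modification would behave like the counter in this model.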
* Re: ext4 features 2006-07-05 21:41 ` J. Bruce Fields @ 2006-07-06 2:32 ` Bill Davidsen 2006-07-06 2:42 ` Nigel Cunningham 2006-07-06 12:43 ` Trond Myklebust 0 siblings, 2 replies; 119+ messages in thread From: Bill Davidsen @ 2006-07-06 2:32 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Theodore Tso, Thomas Glanzmann, LKML J. Bruce Fields wrote: >On Wed, Jul 05, 2006 at 05:12:54PM -0400, Bill Davidsen wrote: > > >>J. Bruce Fields wrote: >> >> >>>Or we could add an entirely separate attribute that's guaranteed to >>>increase whenever the ctime is updated, and that doesn't necessarily >>>have any connection with time--call it a version number or something. >>> >>> >>> >>There are versions in both VMS and the ISO filesystem. I have a sneaking >>suspicion that those of us who ever use them are few and far between. >>The other issue is that unless the field is time, programs like make >>can't really use it, at least without becoming Linux specific. >> >> > >Sure. > > > >>I'm not sure exactly how a "version" value would be used other than >>detecting the fact that the file had been changed in some way. >> >> > >I agree. But "detecting the fact that the file has been changed" is a >really important use! I think the challenge would be to come up with >applications that really depend on timestamps and that use them for >anything *other* than detecting when a file has changed. > > But with timestamps I need remember only one number, the time of my last backup. Skipping over the question of "whose idea of time" inherent in network filesystems, I compare all ctimes with the time of the last backup and do incremental on the newer ones. If we use versioning I have to remember the version for each file! In practice I really question if the benefit justifies keeping all that metadata between backups. And if I delete a file and create another by the same name, what is its version? 
I'll say it again: I think ms resolution is readily achieved today, even over network files, and I think greater resolution or versions are going to be more trouble than they are worth. >(OK, so make is a special case--it cares not only about whether a file >has changed, but also about whether it has changed more recently than >some other file. But I'd think a simple version would useful to any >network filesystem, or more generally to anything that caches a view of >the filesystem either on another machine or in userspace.) > > > >>Feel free to show me, I seem to come up empty on using this value. >> >> > >Betraying my own interests--the NFSv4 protocol (unlike v2 or v3) uses a >specialized "change" attribute to maintain cache consistency instead of >depending on mtime/ctime. So nfsd would be one immediate in-kernel >user. Currently we're using ctime, which causes obvious problems. > >But an improved ctime--one that actually increased whenever the file >changed--would also do the job. > No comment, I would have to see a state table to be sure I saw the races or that there were none. With a single writer and a simple dirty bit there is no issue, it behaves like an elevator, more or less. With multiple writers I bet changes are written in the order submitted rather than the order done, but multiple writers without locks are a train wreck waiting to happen anyway. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 2:32 ` Bill Davidsen @ 2006-07-06 2:42 ` Nigel Cunningham 2006-07-06 12:43 ` Trond Myklebust 1 sibling, 0 replies; 119+ messages in thread From: Nigel Cunningham @ 2006-07-06 2:42 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML [-- Attachment #1: Type: text/plain, Size: 1547 bytes --] Hi. There are so many points in this conversation where I could jump in and make the comment I want to provide (below). Sorry if I haven't picked the best one. On Thursday 06 July 2006 12:32, Bill Davidsen wrote: > No comment, I would have to see a state table to be sure I saw the races > or that there were none. With a single writer and a sinple dirty bit > there is no issue, it behaves like an elevator, more or less. With > multiple writers I bet changes are written in the order submitted rather > than the order done, but multiple writers without locks are a train > wreck waiting to happen anyway. One application I can see for this careful checking is checkpointing. IIRC, Linus recently said he'd like to see suspending to disk treated as a special case of checkpointing, and I can see good sense in that. But the support is just not there at the moment. An important part of implementing that would be having a filesystem where we could know exactly what the state of the filesystem was at the last checkpoint, and roll back to it if necessary. Of course this would need to be tied to tracking changes in memory and to writing the memory state to storage, but they're separate problems. Ext3 has a history of being the best filesystem to use in developing and testing suspend to disk. It would be great if ext4 was the basis for implementing serious checkpointing support. Regards, Nigel -- Nigel, Michelle and Alisdair Cunningham 5 Mitchell Street Cobden 3266 Victoria, Australia [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features 2006-07-06 2:32 ` Bill Davidsen 2006-07-06 2:42 ` Nigel Cunningham @ 2006-07-06 12:43 ` Trond Myklebust 2006-07-07 2:15 ` Bill Davidsen 1 sibling, 1 reply; 119+ messages in thread From: Trond Myklebust @ 2006-07-06 12:43 UTC (permalink / raw) To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML On Wed, 2006-07-05 at 22:32 -0400, Bill Davidsen wrote: > But with timestamps I need remember only one number, the time of my last > backup. Skipping over the question of "who's idea of time" inherent in > network filesystems. I compare all ctimes with the time of the last > backup and do incremental on the newer ones. If we use versioning I have > to remember the version for each file! In practice I really question if > the benefit justified keeping all that metadata between backups. And if > I delete a file and create another by the same name, what is it's version? You are completely missing the point. Our background is that all NFS clients are required to use the mtime and ctime timestamps in order to figure out if their cached data is valid. They need to do this extremely frequently (in fact, every time you open() the file). Nobody gives a rats arse about backups: those are infrequent and can/should use more sophisticated techniques such as checksumming. Cheers, Trond ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: ext4 features
From: Bill Davidsen @ 2006-07-07 2:15 UTC
To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>Nobody gives a rat's arse about backups: those are infrequent and
>can/should use more sophisticated techniques such as checksumming.
>
Actually, those of us who do run production servers care vastly about
backups. And besides being utterly unscalable (checksum 20 TB of files
four times a day to find what changed???), you would have to remember
the checksums for all those files.

--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: ext4 features
From: Trond Myklebust @ 2006-07-07 2:30 UTC
To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 22:15 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
>
> >Nobody gives a rat's arse about backups: those are infrequent and
> >can/should use more sophisticated techniques such as checksumming.
> >
> Actually, those of us who do run production servers care vastly about
> backups. And besides being utterly unscalable (checksum 20 TB of files
> four times a day to find what changed???), you would have to remember
> the checksums for all those files.

It is trivial to check if your last backup of the file was started
within 1 second or so of the last change made to the file, in which case
your backup program needs to perform a more thorough check. That sort of
thing is possible when you are talking about a daily (or even hourly)
backup.

Cheers,
Trond
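The check Trond describes can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `backup_action`, the three return values, and the default `granularity` of one second are hypothetical and do not come from any real backup tool.

```python
def backup_action(mtime: float, last_backup_start: float,
                  granularity: float = 1.0) -> str:
    """Decide what an incremental backup should do with one file.

    mtime             -- the file's last-change timestamp, in seconds
    last_backup_start -- when the previous backup of this file started
    granularity       -- assumed timestamp resolution of the filesystem
    """
    # The change and the backup start are closer together than the
    # timestamp resolution, so ordering cannot be trusted: fall back to
    # a more thorough check (for example, comparing contents).
    if abs(mtime - last_backup_start) <= granularity:
        return "verify"
    # Clearly modified after the last backup started.
    if mtime > last_backup_start:
        return "backup"
    # Clearly unchanged since the last backup.
    return "skip"
```

With daily or hourly backups, only files changed within about a second of the previous backup's start land in the "verify" case, which is why coarse timestamps are workable for this use.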
* Re: ext4 features
From: Ric Wheeler @ 2006-07-07 2:42 UTC
To: Bill Davidsen
Cc: Trond Myklebust, J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Bill Davidsen wrote:

> Trond Myklebust wrote:
>
>> Nobody gives a rat's arse about backups: those are infrequent and
>> can/should use more sophisticated techniques such as checksumming.
>>
> Actually, those of us who do run production servers care vastly about
> backups. And besides being utterly unscalable (checksum 20 TB of files
> four times a day to find what changed???), you would have to remember
> the checksums for all those files.
>
The point of using checksums (or digital signatures on files) is to be
able to detect when the on-disk file has been corrupted - not to look
for updates. With normal disks, even writes that are flagged as correct
will occasionally actually end up corrupt on disk. The rate at which you
need to validate the checksums is not a four-times-a-day rate.

Buying a nice, high-end array can make this much less of a concern, but
those of us who get stuck using commodity disks should always have a way
of detecting corruption and a backup (either on tape or on another box).

ric
* Re: ext4 features
From: Trond Myklebust @ 2006-07-07 2:46 UTC
To: Ric Wheeler
Cc: Bill Davidsen, J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 22:42 -0400, Ric Wheeler wrote:
> The point of using checksums (or digital signatures on files) is to be
> able to detect when the on-disk file has been corrupted - not to look
> for updates. With normal disks, even writes that are flagged as correct
> will occasionally actually end up corrupt on disk. The rate at which you
> need to validate the checksums is not a four-times-a-day rate.
>
> Buying a nice, high-end array can make this much less of a concern, but
> those of us who get stuck using commodity disks should always have a way
> of detecting corruption and a backup (either on tape or on another box).

I repeat: you do _not_ need high-res ctime/mtime updates in order to
figure out whether or not you need to do a daily backup on your file.
You do need them in order to figure out if the page you just read in from
your NFS server 2 microseconds ago is still valid.

Cheers,
Trond
* Re: ext4 features
From: Bill Davidsen @ 2006-07-07 3:16 UTC
To: Trond Myklebust
Cc: Ric Wheeler, J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>On Thu, 2006-07-06 at 22:42 -0400, Ric Wheeler wrote:
>
>>The point of using checksums (or digital signatures on files) is to be
>>able to detect when the on-disk file has been corrupted - not to look
>>for updates. With normal disks, even writes that are flagged as correct
>>will occasionally actually end up corrupt on disk. The rate at which you
>>need to validate the checksums is not a four-times-a-day rate.
>>
>>Buying a nice, high-end array can make this much less of a concern, but
>>those of us who get stuck using commodity disks should always have a way
>>of detecting corruption and a backup (either on tape or on another box).
>>
>
>I repeat: you do _not_ need high-res ctime/mtime updates in order to
>figure out whether or not you need to do a daily backup on your file.
>You do need them in order to figure out if the page you just read in from
>your NFS server 2 microseconds ago is still valid.
>
In most cases you don't care and would be using locking if you did. The
old value was valid when you read it, the new value is valid, and if
data is changing in 2us and the change matters, you can't process the
data before it changes again (or at least may change).

--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: ext4 features
From: Bernd Petrovitsch @ 2006-07-07 8:09 UTC
To: Bill Davidsen
Cc: Trond Myklebust, Ric Wheeler, J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 23:16 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
[...]
> >I repeat: you do _not_ need high res ctime/mtime updates in order to
> >figure out whether or not you need to do a daily backup on your file.
> >You do need it in order to figure out if the page you just read in from
> >your NFS server 2 microseconds ago is still valid.
> >
> In most cases you don't care and would be using locking if you did. The
> old value was valid when you read it, the new value is valid, and if
> data is changing in 2us and the change matters, you can't process the
> data before it changes again (or at least may change).

Do you never use `make` on NFS-mounted filesystems (for e.g. kernel
compilation)?

Bernd
--
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services
* Re: ext4 features
From: Trond Myklebust @ 2006-07-07 14:56 UTC
To: Bill Davidsen
Cc: Ric Wheeler, J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 23:16 -0400, Bill Davidsen wrote:
> In most cases you don't care and would be using locking if you did. The
> old value was valid when you read it, the new value is valid, and if
> data is changing in 2us and the change matters, you can't process the
> data before it changes again (or at least may change).

Wrong! The NFS cache consistency model (close-to-open cache consistency)
requires you to be able to revalidate the cache on open() whether or not
you are using POSIX locking. In fact, most alternatives to POSIX locking
(for instance dotlocking, which is frequently used by email
applications) rely heavily on this.

See for instance http://nfs.sourceforge.net/#faq_a8

Trond
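Why timestamp resolution matters for revalidation on open() can be shown with a toy model. This is a sketch under stated assumptions, not the Linux NFS client: `Attrs`, `CachedFile`, `truncate` and `second_open_sees` are hypothetical names, and the cache here compares only mtime and ctime, as described in the thread.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Attrs:
    """File attributes as reported by the (hypothetical) server, in ns."""
    mtime: int
    ctime: int


class CachedFile:
    """Toy client cache that revalidates on every open()."""

    def __init__(self) -> None:
        self.attrs: Optional[Attrs] = None
        self.data: Optional[bytes] = None

    def open(self, server_attrs: Attrs, fetch: Callable[[], bytes]) -> bytes:
        # Cached data is trusted only if the timestamps are unchanged.
        if self.attrs != server_attrs:
            self.data = fetch()
            self.attrs = server_attrs
        assert self.data is not None
        return self.data


def truncate(ns: int, granularity_ns: int) -> int:
    """Round a timestamp down to the filesystem's timestamp resolution."""
    return ns - ns % granularity_ns


def second_open_sees(granularity_ns: int) -> bytes:
    """Two writes 2 microseconds apart; what does the second open() return?"""
    second = 10**9
    t1, t2 = 5 * second + 500, 5 * second + 2500
    cache = CachedFile()
    a1 = Attrs(truncate(t1, granularity_ns), truncate(t1, granularity_ns))
    a2 = Attrs(truncate(t2, granularity_ns), truncate(t2, granularity_ns))
    cache.open(a1, lambda: b"old")
    return cache.open(a2, lambda: b"new")
```

In this model, `second_open_sees(10**9)` (one-second timestamps) returns the stale `b"old"` because both writes round to the same timestamp, while `second_open_sees(1)` (nanosecond timestamps) returns `b"new"`.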
* Re: ext4 features
From: Theodore Tso @ 2006-07-07 19:52 UTC
To: Bill Davidsen; +Cc: Trond Myklebust, J. Bruce Fields, Thomas Glanzmann, LKML

On Thu, Jul 06, 2006 at 10:15:42PM -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
>
> >Nobody gives a rat's arse about backups: those are infrequent and
> >can/should use more sophisticated techniques such as checksumming.
> >
> Actually, those of us who do run production servers care vastly about
> backups. And besides being utterly unscalable (checksum 20 TB of files
> four times a day to find what changed???), you would have to remember
> the checksums for all those files.

Not four times a day, but probably once a month or two it would be a
*very* good idea to do periodic sweeps of files to make sure the hard
drive hasn't corrupted the files out from under you. If you have 20+
TB of data, the probability of silent data corruption starts going up.

That would be justification for storing the checksum in the inode or
in the EA of the file, with the kernel automatically clearing it if
the file was *deliberately* changed. The goal is to detect the disk
silently changing the data for you, free of charge....

						- Ted
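The periodic sweep Ted describes can be approximated in user space. The sketch below is an illustration under stated assumptions: it keeps checksums in an in-memory dictionary (`manifest`, a hypothetical name) rather than in the inode or an extended attribute, and it treats a changed mtime or size as the sign of a deliberate change, standing in for the kernel clearing the stored checksum.

```python
import hashlib
import os


def file_digest(path: str) -> str:
    """SHA-256 of the file contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record(path: str, manifest: dict) -> None:
    """Store (mtime, size, checksum) for a file."""
    st = os.stat(path)
    manifest[path] = (st.st_mtime_ns, st.st_size, file_digest(path))


def sweep(manifest: dict) -> list:
    """Return paths whose contents changed while mtime and size did not."""
    corrupted = []
    for path, (mtime_ns, size, digest) in list(manifest.items()):
        st = os.stat(path)
        if (st.st_mtime_ns, st.st_size) != (mtime_ns, size):
            # Deliberate change: the old checksum no longer applies,
            # so refresh it instead of reporting corruption.
            record(path, manifest)
        elif file_digest(path) != digest:
            # Same metadata, different bytes: silent corruption.
            corrupted.append(path)
    return corrupted
```

A sweep like this reads every byte, so it fits the once-a-month cadence suggested above rather than a several-times-a-day backup scan.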
* Re: ext4 features
From: Avi Kivity @ 2006-07-05 14:04 UTC
To: Theodore Tso; +Cc: Thomas Glanzmann, LKML

Theodore Tso wrote:
>
> could argue that such a stupid student doesn't *deserve* to get a
> Ph.D. :-)
>
> > * and snapshots on filesystem basis
>
> This requires a filesystem that is designed from the get-go to support
> snapshots. So yes, it's likely not going to happen for ext4.
> Although, if you have a really clever idea, feel free to post patches
> or a detailed technical proposal for how to achieve such a goal. :-)
>

To take a snapshot of a file, copy its inode to a free inode (call it a
frozen inode, or finode). The inode is at the head of a linked list of
finodes, each older than its predecessor. Finodes have the same content
as the inode they were cloned from except the extent map. A new finode's
extent map contains a single extent the size of the entire file with a
flag that means "look in the nearest future finode (or inode)".

When writing to a file, first look at the nearest finode's mapping for
that range. If it has a normal extent, go ahead and write. If it has a
future extent for that range, first transfer that extent to the finode
(replacing the future extent), then write the data to newly allocated
extents. Of course this process can break up extents. One can choose
whether to transfer the block pointers or just the data; a tradeoff of
additional data copying vs. fragmentation avoidance.

When reading from a finode, if you're reading a normal extent, proceed
normally. If you encounter a future extent, keep searching for the range
in newer finodes until you encounter a normal extent or the base inode.

To snapshot the entire filesystem, have a snapshot generation count in
the superblock and in each inode. Incrementing the superblock generation
count snapshots the filesystem. Whenever you write to a file, if its
generation number lags the filesystem generation number, you take a file
snapshot as outlined above. Directories are handled in the same way as
files, although special care is necessary for inode reference counts.

Deleting a snapshot means merging the preceding and next finodes' extent
maps and freeing blocks. We'd need a linked list of all finodes belonging
to a snapshot generation.

--
error compiling committee.c: too many arguments to function
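The per-file part of the scheme Avi outlines can be modelled in a few lines. This is a toy sketch under simplifying assumptions, not ext4 code: the file has a fixed number of blocks, each "extent" is a single block, the finode list is kept oldest first in a Python list, and the names `FUTURE`, `File`, `snapshot`, `write` and `read_snapshot` are hypothetical. Filesystem-wide generation counts, directories and snapshot deletion are left out.

```python
# Marker meaning "look in the nearest newer finode (or the live inode)".
FUTURE = object()


class File:
    def __init__(self, nblocks: int) -> None:
        self.blocks = [b""] * nblocks   # live inode's block map
        self.finodes = []               # frozen inodes, oldest first

    def snapshot(self) -> int:
        """Freeze the file: a new finode whose whole map is 'future'."""
        self.finodes.append([FUTURE] * len(self.blocks))
        return len(self.finodes) - 1    # snapshot id

    def write(self, blk: int, data: bytes) -> None:
        if self.finodes:
            nearest = self.finodes[-1]
            if nearest[blk] is FUTURE:
                # Transfer the old block to the nearest finode before
                # the live inode gets a newly allocated block.
                nearest[blk] = self.blocks[blk]
        self.blocks[blk] = data

    def read_snapshot(self, snap: int, blk: int) -> bytes:
        # Walk from the requested finode towards newer ones until a
        # normal (non-future) block is found, else use the live inode.
        for finode in self.finodes[snap:]:
            if finode[blk] is not FUTURE:
                return finode[blk]
        return self.blocks[blk]
```

For example, writing `b"a"` to block 0, taking a snapshot, and then writing `b"b"` leaves `read_snapshot(0, 0)` returning `b"a"` while the live file holds `b"b"`. Blocks never written after a snapshot stay shared with the live inode, which is where the space saving over block-device snapshots would come from.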
* Re: ext4 features
From: Andi Kleen @ 2006-07-04 14:36 UTC
To: Thomas Glanzmann; +Cc: linux-kernel

Thomas Glanzmann <sithglan@stud.uni-erlangen.de> writes:
>
> What I personally would like to see most in ext4 are
>
> * checksums for data

Sounds good. When can we expect the initial patch submission?

-Andi
* Re: ext4 features
From: Thomas Glanzmann @ 2006-07-04 14:43 UTC
To: Andi Kleen; +Cc: linux-kernel

Hello,
the initial question was: Is there anything besides 64-bit migration
going into ext4?

> > checksum support for ext4

> Sounds good. When can we expect the initial patch submission?

this was actually a question (for which I haven't got an answer so far,
even though 34 people replied). I didn't want to start a stupid debate on
principles. However, I am more interested in snapshots anyway. And if I
were to provide patches, I would provide snapshot patches. Which I
don't, because I am god damn busy at the moment.

	Thomas
Thread overview: 119+ messages (newest: 2006-07-22 12:26 UTC)
2006-07-01 16:33 ext4 features Thomas Glanzmann
2006-07-01 17:07 ` Tomasz Torcz
2006-07-01 17:47 ` Thomas Glanzmann
2006-07-01 18:09 ` Claudio Martins
2006-07-01 18:59 ` Thomas Glanzmann
2006-07-01 18:17 ` Tomasz Torcz
2006-07-03 9:44 ` Gabor Gombas
2006-07-03 20:22 ` Helge Hafting
2006-07-03 20:55 ` Tomasz Torcz
2006-07-03 21:01 ` Arjan van de Ven
2006-07-03 21:46 ` Jeff V. Merkey
2006-07-03 21:25 ` Diego Calleja
2006-07-03 22:17 ` Alan Cox
2006-07-04 14:45 ` Jan Engelhardt
2006-07-04 16:35 ` Jeffrey V. Merkey
2006-07-04 18:52 ` Jeff Garzik
2006-07-04 19:40 ` Jeffrey V. Merkey
2006-07-05 13:35 ` Lew Palm
2006-07-03 23:01 ` Jeff V. Merkey
2006-07-04 9:14 ` Benny Amorsen
2006-07-05 4:21 ` Bill Davidsen
2006-07-05 5:13 ` H. Peter Anvin
2006-07-05 5:45 ` Jeffrey V. Merkey
2006-07-07 14:12 ` Pavel Machek
2006-07-05 10:38 ` Krzysztof Halasa
2006-07-07 14:10 ` Pavel Machek
2006-07-07 17:45 ` Krzysztof Halasa
2006-07-07 21:30 ` Pavel Machek
2006-07-08 10:52 ` Krzysztof Halasa
2006-07-08 10:55 ` Pavel Machek
2006-07-08 11:19 ` Krzysztof Halasa
2006-07-08 11:23 ` Pavel Machek
2006-07-08 18:45 ` Avi Kivity
2006-07-08 20:24 ` Krzysztof Halasa
2006-07-04 9:22 ` Petr Tesarik
2006-07-04 11:35 ` Peter Zijlstra
2006-07-04 11:55 ` ext4 features (salvage) Petr Tesarik
[not found] ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>
2006-07-04 12:31 ` Petr Tesarik
2006-07-04 12:42 ` Helge Hafting
2006-07-04 16:20 ` Matthew Frost
2006-07-04 15:25 ` ext4 features Pavel Machek
2006-07-05 4:10 ` Bill Davidsen
2006-07-03 21:46 ` Valdis.Kletnieks
[not found] ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>
2006-07-04 14:37 ` Kernel recycler [was: ext4 features] Jan Engelhardt
2006-07-04 11:14 ` ext4 features Krzysztof Halasa
2006-07-04 22:35 ` Frank van Maarseveen
2006-07-04 23:47 ` Claudio Martins
2006-07-03 22:12 ` Alan Cox
2006-07-03 21:59 ` Arjan van de Ven
2006-07-03 23:31 ` ext4 features (checksums) Neil Brown
2006-07-04 1:03 ` Jeff Garzik
2006-07-04 6:09 ` Avi Kivity
2006-07-04 7:02 ` Neil Brown
2006-07-04 8:26 ` Avi Kivity
2006-07-05 11:56 ` Bill Davidsen
2006-07-05 12:06 ` Bill Davidsen
2006-07-05 12:19 ` Avi Kivity
2006-07-08 17:54 ` Bill Davidsen
2006-07-04 8:17 ` Alan Cox
2006-07-04 11:08 ` Thomas Glanzmann
2006-07-04 11:19 ` Krzysztof Halasa
2006-07-04 12:49 ` Helge Hafting
2006-07-05 12:01 ` Bill Davidsen
2006-07-05 12:10 ` Avi Kivity
2006-07-08 18:02 ` Bill Davidsen
2006-07-06 0:36 ` Blatant layering violations (was Re: ext4 features) Valerie Henson
2006-07-06 12:15 ` Xavier Bestel
2006-07-06 17:06 ` Valdis.Kletnieks
2006-07-06 20:02 ` Tom Vier
2006-07-03 21:34 ` ext4 features Bill Davidsen
2006-07-03 21:50 ` Valdis.Kletnieks
2006-07-03 22:04 ` Bruce Ferrell
2006-07-04 14:48 ` Valdis.Kletnieks
2006-07-03 23:00 ` Bill Davidsen
2006-07-04 15:01 ` Valdis.Kletnieks
2006-07-05 2:40 ` Bill Davidsen
2006-07-05 2:47 ` Valdis.Kletnieks
2006-07-04 12:52 ` Helge Hafting
2006-07-06 15:12 ` Ric Wheeler
2006-07-06 17:05 ` Krzysztof Halasa
2006-07-06 17:27 ` Ric Wheeler
2006-07-06 20:52 ` Valdis.Kletnieks
2006-07-07 17:41 ` Krzysztof Halasa
2006-07-07 17:34 ` Krzysztof Halasa
2006-07-04 1:02 ` Theodore Tso
2006-07-04 19:16 ` Thomas Glanzmann
2006-07-04 19:30 ` Valdis.Kletnieks
2006-07-05 12:24 ` Bill Davidsen
2006-07-05 12:59 ` J. Bruce Fields
2006-07-05 13:17 ` Pádraig Brady
2006-07-05 19:33 ` Trond Myklebust
2006-07-05 21:22 ` Bill Davidsen
2006-07-05 21:42 ` Trond Myklebust
2006-07-08 21:04 ` Bill Davidsen
2006-07-10 20:08 ` Trond Myklebust
2006-07-10 22:37 ` Bill Davidsen
2006-07-11 2:36 ` Trond Myklebust
2006-07-21 3:10 ` Bill Davidsen
2006-07-21 12:06 ` Trond Myklebust
2006-07-21 14:36 ` Theodore Tso
2006-07-21 19:02 ` Trond Myklebust
2006-07-22 12:25 ` Theodore Tso
2006-07-05 21:12 ` Bill Davidsen
2006-07-05 21:27 ` linux-os (Dick Johnson)
2006-07-05 21:41 ` J. Bruce Fields
2006-07-06 2:32 ` Bill Davidsen
2006-07-06 2:42 ` Nigel Cunningham
2006-07-06 12:43 ` Trond Myklebust
2006-07-07 2:15 ` Bill Davidsen
2006-07-07 2:30 ` Trond Myklebust
2006-07-07 2:42 ` Ric Wheeler
2006-07-07 2:46 ` Trond Myklebust
2006-07-07 3:16 ` Bill Davidsen
2006-07-07 8:09 ` Bernd Petrovitsch
2006-07-07 14:56 ` Trond Myklebust
2006-07-07 19:52 ` Theodore Tso
2006-07-05 14:04 ` Avi Kivity
2006-07-04 14:36 ` Andi Kleen
2006-07-04 14:43 ` Thomas Glanzmann