ext4 features

public inbox for linux-kernel@vger.kernel.org

* ext4 features
@ 2006-07-01 16:33 Thomas Glanzmann
  2006-07-01 17:07 ` Tomasz Torcz
                   ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-01 16:33 UTC (permalink / raw)
  To: Theodore Ts'o, LKML

Hello,
I would like to know which new features are planned to be incorporated into
ext4. So far I only read about supporting bigger filesystems to fit
recent hardware developments. So are there any other big goals for ext4?

What I personally would like to see most in ext4 are

        * checksums for data
        * and snapshots on filesystem basis

But I guess that this is way out of scope for ext4.

        Thomas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-01 16:33 ext4 features Thomas Glanzmann
@ 2006-07-01 17:07 ` Tomasz Torcz
  2006-07-01 17:47   ` Thomas Glanzmann
  2006-07-04  1:02 ` Theodore Tso
  2006-07-04 14:36 ` Andi Kleen
  2 siblings, 1 reply; 119+ messages in thread
From: Tomasz Torcz @ 2006-07-01 17:07 UTC (permalink / raw)
  To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote:
> Hello,
> I would like to know which new features are planed to be incorported by
> ext4. So far I only read about supporting bigger filesystems to fit
> recent hardware developments. So are there any other big goals for ext4?
> 
> What I personally would like to see most in ext4 are
> 
>         * checksums for data

  Checksums are not very useful by themselves. They are useful when we
have another copy of the data (think RAID mirroring), so the data can be
reconstructed from the working copy.

>         * and snapshots on filesystem basis

  What's wrong with DM snapshots?

-- 
Tomasz Torcz            There exists no separation between gods and men:
zdzichu@irc.-nie.spam-.pl   one blends softly casual into the other.



* Re: ext4 features
  2006-07-01 17:07 ` Tomasz Torcz
@ 2006-07-01 17:47   ` Thomas Glanzmann
  2006-07-01 18:09     ` Claudio Martins
  2006-07-01 18:17     ` Tomasz Torcz
  0 siblings, 2 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-01 17:47 UTC (permalink / raw)
  To: Theodore Ts'o, LKML

Hello,

> Checksums are not very useful for themselves. They are useful when we
> have other copy of data (think raid mirroring) so data can be
> reconstructed from working copy.

it would be possible to identify data corruption.

>   What's wrong with DM snapshots?

they're inefficient in terms of disk space consumption because they
know nothing about the filesystems that sit on top of them.

        Thomas
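
The space-consumption point above can be illustrated with a toy model.
This is only a sketch under stated assumptions: `BlockDevice` and
`free_blocks` are hypothetical names invented here, and the model is far
simpler than the real device-mapper snapshot target. It shows a
block-level copy-on-write snapshot preserving the old contents of a block
that the filesystem already considers free.

```python
class BlockDevice:
    """Toy block device with a block-level copy-on-write snapshot."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.snapshot = None  # maps block number -> preserved old contents

    def take_snapshot(self):
        self.snapshot = {}

    def write(self, n, data):
        # The block layer cannot tell live data from blocks the filesystem
        # has already freed, so it preserves the old contents regardless.
        if self.snapshot is not None and n not in self.snapshot:
            self.snapshot[n] = self.blocks[n]
        self.blocks[n] = data


dev = BlockDevice(8)
dev.write(0, b"live file")
dev.write(1, b"stale data of a deleted file")
free_blocks = {1}  # hypothetical: what only the filesystem knows is unused

dev.take_snapshot()
dev.write(0, b"live file, v2")  # old contents genuinely need preserving
dev.write(1, b"new file")       # old contents were garbage, copied anyway

copied = len(dev.snapshot)
needed = len([n for n in dev.snapshot if n not in free_blocks])
```

Here `copied` counts the blocks the snapshot had to store and `needed`
counts those a filesystem-aware snapshot would have had to keep.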


* Re: ext4 features
  2006-07-01 17:47   ` Thomas Glanzmann
@ 2006-07-01 18:09     ` Claudio Martins
  2006-07-01 18:59       ` Thomas Glanzmann
  2006-07-01 18:17     ` Tomasz Torcz
  1 sibling, 1 reply; 119+ messages in thread
From: Claudio Martins @ 2006-07-01 18:09 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Theodore Ts'o, LKML


On Saturday 01 July 2006 18:47, Thomas Glanzmann wrote:
> Hello,
>
> > Checksums are not very useful for themselves. They are useful when we
> > have other copy of data (think raid mirroring) so data can be
> > reconstructed from working copy.
>
> it would be possible to identify data corruption.
>
> >   What's wrong with DM snapshots?
>
> they're inefficient in matter of disk space consumption because they
> don't have a clue of the filesystems that are on top of them.
>

 May I recommend that you have a look at NILFS?

 http://nilfs.org/en/

 The design is built from the ground up to support an almost arbitrary number 
of snapshots, and also has other advantages. And it works already.

Regards

Cláudio



* Re: ext4 features
  2006-07-01 18:09     ` Claudio Martins
@ 2006-07-01 18:59       ` Thomas Glanzmann
  0 siblings, 0 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-01 18:59 UTC (permalink / raw)
  To: Claudio Martins; +Cc: Theodore Ts'o, LKML

Hello Cláudio,

>  May I recommend that you have a look at NILFS?

thanks a lot for the heads-up. Indeed I was unaware of NILFS. It sounds
very interesting. I'll give it a snapshot. :-)

        Thomas


* Re: ext4 features
  2006-07-01 17:47   ` Thomas Glanzmann
  2006-07-01 18:09     ` Claudio Martins
@ 2006-07-01 18:17     ` Tomasz Torcz
  2006-07-03  9:44       ` Gabor Gombas
                         ` (2 more replies)
  1 sibling, 3 replies; 119+ messages in thread
From: Tomasz Torcz @ 2006-07-01 18:17 UTC (permalink / raw)
  To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> Hello,
> 
> > Checksums are not very useful for themselves. They are useful when we
> > have other copy of data (think raid mirroring) so data can be
> > reconstructed from working copy.
> 
> it would be possible to identify data corruption.
> 

  Yes, but what good is identification? We could only return an I/O error.
The ability to fix corruption (as ZFS does) is the real killer feature.

-- 
Tomasz Torcz            There exists no separation between gods and men:
zdzichu@irc.-nie.spam-.pl   one blends softly casual into the other.



* Re: ext4 features
  2006-07-01 18:17     ` Tomasz Torcz
@ 2006-07-03  9:44       ` Gabor Gombas
  2006-07-03 20:22       ` Helge Hafting
  2006-07-06 15:12       ` Ric Wheeler
  2 siblings, 0 replies; 119+ messages in thread
From: Gabor Gombas @ 2006-07-03  9:44 UTC (permalink / raw)
  To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:

>   Yes, but what good is identification? We could only return I/O error.

I'm regularly using unison to sync my home directory to a USB drive,
and about once every 2-3 weeks unison complains that the data on the
USB drive does not match the checksum unison expects. An umount/remount
usually fixes the problem. There are no messages in the kernel log.

It would be really nice if the file system could catch these silent
data corruptions and at least warn me that something is fishy.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------


* Re: ext4 features
  2006-07-01 18:17     ` Tomasz Torcz
  2006-07-03  9:44       ` Gabor Gombas
@ 2006-07-03 20:22       ` Helge Hafting
  2006-07-03 20:55         ` Tomasz Torcz
  2006-07-03 21:34         ` ext4 features Bill Davidsen
  2006-07-06 15:12       ` Ric Wheeler
  2 siblings, 2 replies; 119+ messages in thread
From: Helge Hafting @ 2006-07-03 20:22 UTC (permalink / raw)
  To: Thomas Glanzmann, Theodore Ts'o, LKML

On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
> On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> > Hello,
> > 
> > > Checksums are not very useful for themselves. They are useful when we
> > > have other copy of data (think raid mirroring) so data can be
> > > reconstructed from working copy.
> > 
> > it would be possible to identify data corruption.
> > 
> 
>   Yes, but what good is identification? We could only return I/O error.
> Ability to fix corruption (like ZFS) is the real killer.

Isn't that what we have RAID-1/5/6 for?  

Helge Hafting



* Re: ext4 features
  2006-07-03 20:22       ` Helge Hafting
@ 2006-07-03 20:55         ` Tomasz Torcz
  2006-07-03 21:01           ` Arjan van de Ven
  2006-07-06  0:36           ` Blatant layering violations (was Re: ext4 features) Valerie Henson
  2006-07-03 21:34         ` ext4 features Bill Davidsen
  1 sibling, 2 replies; 119+ messages in thread
From: Tomasz Torcz @ 2006-07-03 20:55 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Thomas Glanzmann, Theodore Ts'o, LKML

On Mon, Jul 03, 2006 at 10:22:19PM +0200, Helge Hafting wrote:
> On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
> > On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
> > > Hello,
> > > 
> > > > Checksums are not very useful for themselves. They are useful when we
> > > > have other copy of data (think raid mirroring) so data can be
> > > > reconstructed from working copy.
> > > 
> > > it would be possible to identify data corruption.
> > > 
> > 
> >   Yes, but what good is identification? We could only return I/O error.
> > Ability to fix corruption (like ZFS) is the real killer.
> 
> Isn't that what we have RAID-1/5/6 for?  

  ZFS was already called a "blatant layering violation". ;)
Yes, that's what RAID is for. And if we want checksums in the filesystem,
that's the best way to utilise them.

-- 
Tomasz Torcz                 Morality must always be based on practicality.
zdzichu@irc.-nie.spam-.pl                -- Baron Vladimir Harkonnen



* Re: ext4 features
  2006-07-03 20:55         ` Tomasz Torcz
@ 2006-07-03 21:01           ` Arjan van de Ven
  2006-07-03 21:46             ` Jeff V. Merkey
  2006-07-03 22:12             ` Alan Cox
  2006-07-06  0:36           ` Blatant layering violations (was Re: ext4 features) Valerie Henson
  1 sibling, 2 replies; 119+ messages in thread
From: Arjan van de Ven @ 2006-07-03 21:01 UTC (permalink / raw)
  To: Tomasz Torcz; +Cc: Helge Hafting, Thomas Glanzmann, Theodore Ts'o, LKML

>   ZFS was already called ,,blatant layering violation''. ;)
> Yes,that what RAID is for. And if we want checksums in filesystem,
> that's the best way to utilise them.

Hi,

checksums have a very different purpose than RAID.

checksums are great at detecting corruption. And yes, corruption can
happen even if you have RAID, for many, many reasons. Detecting means
knowing when not to trust something, when to go for the backup tapes...

RAID is great for protecting against individual disks or sectors going
bad. But RAID, especially in high-performance implementations, does not
checksum data or detect corruption.

The two serve different purposes, with almost zero overlap in purpose or
even goal...
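
The division of labour described above (the checksum detects, the
redundant copy repairs) can be sketched as follows. This is a minimal
illustration, not how any particular filesystem or RAID driver does it;
`write_block` and `read_mirrored` are hypothetical helpers, and CRC-32
from Python's `zlib` stands in for whatever checksum a filesystem would
really use.

```python
import zlib


def write_block(data):
    """Return (data, checksum) as a filesystem might store them."""
    return data, zlib.crc32(data)


def read_mirrored(copies, checksum):
    """Return the first copy whose CRC matches, or raise if none does."""
    for data in copies:
        if zlib.crc32(data) == checksum:
            return data
    raise IOError("all copies fail checksum verification")


data, csum = write_block(b"important payload")
good = data
bad = b"importent payload"  # simulated silent corruption on one mirror

# Plain mirroring cannot say which copy is wrong; the checksum can.
result = read_mirrored([bad, good], csum)
```

With only the corrupted copy available, `read_mirrored` can do no better
than raise an error, which is the "only return an I/O error" case raised
earlier in the thread.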


* Re: ext4 features
  2006-07-03 21:01           ` Arjan van de Ven
@ 2006-07-03 21:46             ` Jeff V. Merkey
  2006-07-03 21:25               ` Diego Calleja
                                 ` (3 more replies)
  2006-07-03 22:12             ` Alan Cox
  1 sibling, 4 replies; 119+ messages in thread
From: Jeff V. Merkey @ 2006-07-03 21:46 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o,
	LKML

Arjan van de Ven wrote:

>>  ZFS was already called ,,blatant layering violation''. ;)
>>Yes,that what RAID is for. And if we want checksums in filesystem,
>>that's the best way to utilise them.
>>    
>>
>
>
>Hi,
>
>checksums have a very different purpose than raid.
>
>checksums are great at detecting corruption. And yes, corruption can
>happen even if you have raid, for many many reasons. Detecting means
>knowing when to not trust something, when to go for the backup tapes...
>
>raid is great for protecting against individual disks or sectors going
>bad. But raid, especially high performance implementations, do not
>checksum data or detect corruptions. 
>
>They're different purpose with almost zero overlap in purpose or even
>goal...
>
Add a salvageable file system to ext4, i.e. when a file is deleted, you
just rename it and move it to a directory called DELETED.SAV and recycle
the files as people allocate new ones.  It is easy to do (an internal "mv"
of the file to another directory, plus modification of the allocation
bitmaps).  Very simple, and it will pay off big.  If you need help
designing it, just ask me.

Jeff


* Re: ext4 features
  2006-07-03 21:46             ` Jeff V. Merkey
@ 2006-07-03 21:25               ` Diego Calleja
  2006-07-03 22:17                 ` Alan Cox
                                   ` (3 more replies)
  2006-07-03 21:46               ` Valdis.Kletnieks
                                 ` (2 subsequent siblings)
  3 siblings, 4 replies; 119+ messages in thread
From: Diego Calleja @ 2006-07-03 21:25 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

On Mon, 03 Jul 2006 15:46:55 -0600,
"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> wrote:

> Add a salvagable file system to ext4, i.e. when a file is deleted, you 
> just rename it and move it to a directory called DELETED.SAV and recycle 
> the files as people allocate new ones.  Easy to do (internal "mv" of 


Easily doable in userspace, why bother with kernel programming?


* Re: ext4 features
  2006-07-03 21:25               ` Diego Calleja
@ 2006-07-03 22:17                 ` Alan Cox
  2006-07-04 14:45                   ` Jan Engelhardt
  2006-07-03 23:01                 ` Jeff V. Merkey
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 119+ messages in thread
From: Alan Cox @ 2006-07-03 22:17 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Jeff V. Merkey, arjan, zdzichu, helgehaf, sithglan, tytso,
	linux-kernel

On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote:
> > Add a salvagable file system to ext4, i.e. when a file is deleted, you 
> > just rename it and move it to a directory called DELETED.SAV and recycle 
> > the files as people allocate new ones.  Easy to do (internal "mv" of 
> 
> 
> Easily doable in userspace, why bother with kernel programming

To get the semantics you need and avoid rewriting all of user space. At
the moment some GNU apps support this type of stuff, but it's not in the
core libraries, so it isn't generalised.

There are some big problems with "deleted", however, and with doing it in
kernel space. A lot of programs just overwrite data. You would have to
look for things like O_TRUNC on a file open and ftruncate.

The ftruncate case is particularly ugly because there are programs that
do lots of ftruncate calls as they run and don't necessarily
"overwrite" data but are merely trimming logs or database files.

To add to the fun, the 'old' file needs to be the one which ends up with
a new inode number and the like.

Alan
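
The O_TRUNC and ftruncate cases above are easy to reproduce from
userspace. The following is a small sketch (the file name and contents
are made up for illustration) showing that both paths discard a file's
data while the name and the inode stay in place, so a scheme that only
hooks unlink would never see them.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "report.txt")
with open(path, "w") as f:
    f.write("valuable contents")
inode_before = os.stat(path).st_ino

# Case 1: reopening with O_TRUNC discards the data; nothing is unlinked.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
os.close(fd)
size_after_open = os.stat(path).st_size

# Case 2: ftruncate on an already-open descriptor does the same.
with open(path, "a") as f:
    f.write("second version")
    f.flush()
    os.ftruncate(f.fileno(), 0)
size_after_ftruncate = os.stat(path).st_size

inode_after = os.stat(path).st_ino
still_exists = os.path.exists(path)
```

In both cases the file ends up empty with its original inode number,
which is why the message above notes that the "old" data would have to
be moved to a new inode to be salvaged.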


* Re: ext4 features
  2006-07-03 22:17                 ` Alan Cox
@ 2006-07-04 14:45                   ` Jan Engelhardt
  2006-07-04 16:35                     ` Jeffrey V. Merkey
  0 siblings, 1 reply; 119+ messages in thread
From: Jan Engelhardt @ 2006-07-04 14:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: Diego Calleja, Jeff V. Merkey, arjan, zdzichu, helgehaf, sithglan,
	tytso, linux-kernel

>
>There are some big problems with "deleted" however and doing it in
>kernel space. A lot of programs just overwrite data. You would have to
>look for things like O_TRUNC on a file open and ftruncate.
>
At least I only want deleted files to be saved, not truncated ones. The way
MS Windows (the GUI parts) does it is enough for most users.


Jan Engelhardt
-- 


* Re: ext4 features
  2006-07-04 14:45                   ` Jan Engelhardt
@ 2006-07-04 16:35                     ` Jeffrey V. Merkey
  2006-07-04 18:52                       ` Jeff Garzik
  2006-07-05 13:35                       ` Lew Palm
  0 siblings, 2 replies; 119+ messages in thread
From: Jeffrey V. Merkey @ 2006-07-04 16:35 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf, sithglan,
	tytso, linux-kernel

Jan Engelhardt wrote:

>>There are some big problems with "deleted" however and doing it in
>>kernel space. A lot of programs just overwrite data. You would have to
>>look for things like O_TRUNC on a file open and ftruncate.
>>
>>    
>>
>At least I only want deleted files to be saved, not truncated. The way 
>the MSWIN (the gui parts) do it is enough for most users.
>
>
>Jan Engelhardt
>  
>
Well,

The old Novell model is simple. When someone unlinks a file, don't
delete it, just mv it to another special directory called DELETED.SAV.
Then set up the fs space allocation to reuse these files, oldest first,
when the drive fills up. It's very simple. Then you have a salvageable
file system.

Jeff
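
The scheme described above can be mocked up in userspace to make the two
steps concrete. This is only a toy sketch, not NetWare's or any kernel's
implementation: the `SalvageVolume` class, its methods, and the
sequence-number naming are all assumptions made for illustration.
Deleting becomes a rename into DELETED.SAV, and reclaiming purges the
oldest salvaged files first.

```python
import os
import tempfile


class SalvageVolume:
    """Userspace toy of the scheme: deleted files go to DELETED.SAV."""

    def __init__(self, root):
        self.root = root
        self.salvage = os.path.join(root, "DELETED.SAV")
        os.makedirs(self.salvage, exist_ok=True)
        self._seq = 0

    def delete(self, name):
        # "Deleting" is a rename within one filesystem, so no data is copied.
        self._seq += 1
        saved = os.path.join(self.salvage, f"{self._seq:08d}.{name}")
        os.rename(os.path.join(self.root, name), saved)
        return saved

    def reclaim(self, bytes_needed):
        # Purge salvaged files, oldest first, until enough space is freed.
        freed = 0
        for entry in sorted(os.listdir(self.salvage)):
            if freed >= bytes_needed:
                break
            victim = os.path.join(self.salvage, entry)
            freed += os.path.getsize(victim)
            os.unlink(victim)
        return freed


root = tempfile.mkdtemp()
vol = SalvageVolume(root)
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(root, name), "w") as f:
        f.write("x" * 100)
    vol.delete(name)

freed = vol.reclaim(150)
remaining = sorted(os.listdir(vol.salvage))
```

The zero-padded sequence prefix is what makes a plain sort yield deletion
order; an in-kernel version would presumably track this in metadata
instead.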


* Re: ext4 features
  2006-07-04 16:35                     ` Jeffrey V. Merkey
@ 2006-07-04 18:52                       ` Jeff Garzik
  2006-07-04 19:40                         ` Jeffrey V. Merkey
  2006-07-05 13:35                       ` Lew Palm
  1 sibling, 1 reply; 119+ messages in thread
From: Jeff Garzik @ 2006-07-04 18:52 UTC (permalink / raw)
  To: Jeffrey V. Merkey
  Cc: Jan Engelhardt, Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf,
	sithglan, tytso, linux-kernel

Jeffrey V. Merkey wrote:
> The old novell model is simple. When someone unlinks a file, don't 
> delete it, just mv it to another special directory called DELETED.SAV. 
> Then setup the
> fs space allocation to reuse these files when the drive fills up by 
> oldest files first. It's very simple. Then you have a salvagable file 
> system.

Such a scheme makes it much more difficult to allocate large, contiguous 
runs of free space for storing newly written data.

	Jeff




* Re: ext4 features
  2006-07-04 18:52                       ` Jeff Garzik
@ 2006-07-04 19:40                         ` Jeffrey V. Merkey
  0 siblings, 0 replies; 119+ messages in thread
From: Jeffrey V. Merkey @ 2006-07-04 19:40 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jan Engelhardt, Alan Cox, Diego Calleja, arjan, zdzichu, helgehaf,
	sithglan, tytso, linux-kernel

Jeff Garzik wrote:

> Jeffrey V. Merkey wrote:
>
>> The old novell model is simple. When someone unlinks a file, don't 
>> delete it, just mv it to another special directory called 
>> DELETED.SAV. Then setup the
>> fs space allocation to reuse these files when the drive fills up by 
>> oldest files first. It's very simple. Then you have a salvagable file 
>> system.
>
>
> Such a scheme makes it much more difficult to allocate large, 
> contiguous runs of free space for storing newly written data. 
>
>     Jeff


Possibly.  Organize the files in DELETED.SAV by disk location and
date.  Files don't have to adhere to a strict date recycling
process.  Make it a mount option if the user wants strict date
recycling.  Make the default choose between date and file sector locality.

Jeff
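
The trade-off between date and locality mentioned above can be sketched
with two selection policies. Everything here is hypothetical: the
`Salvaged` record, the single-extent-per-file simplification, and both
functions are invented for illustration and are not taken from any real
allocator. One policy recycles strictly by age; the other looks for the
oldest run of adjacent extents that is large enough.

```python
from typing import NamedTuple


class Salvaged(NamedTuple):
    name: str
    deleted_at: int  # larger means more recently deleted
    start: int       # first block of the file's (single) extent
    length: int      # number of blocks


def pick_by_date(files, blocks_needed):
    """Strict date recycling: oldest files first, wherever they live."""
    chosen, total = [], 0
    for f in sorted(files, key=lambda f: f.deleted_at):
        if total >= blocks_needed:
            break
        chosen.append(f)
        total += f.length
    return chosen


def pick_by_locality(files, blocks_needed):
    """Prefer the oldest run of adjacent extents that is large enough."""
    by_start = sorted(files, key=lambda f: f.start)
    best = None
    for i in range(len(by_start)):
        run, total = [], 0
        for f in by_start[i:]:
            if run and f.start != run[-1].start + run[-1].length:
                break  # gap: the run is no longer contiguous
            run.append(f)
            total += f.length
            if total >= blocks_needed:
                age = max(x.deleted_at for x in run)
                if best is None or age < best[0]:
                    best = (age, run)
                break
    # Fall back to date order when no contiguous run is big enough.
    return best[1] if best else pick_by_date(files, blocks_needed)


files = [
    Salvaged("old1", 1, 0, 4),
    Salvaged("new", 9, 4, 4),
    Salvaged("old2", 2, 100, 4),
    Salvaged("mid1", 5, 200, 4),
    Salvaged("mid2", 6, 204, 4),
]
by_date = [f.name for f in pick_by_date(files, 8)]
by_locality = [f.name for f in pick_by_locality(files, 8)]
```

Strict date order frees two scattered 4-block extents, while the locality
policy gives up slightly newer files to free one contiguous 8-block run,
which is the fragmentation concern raised in the quoted reply.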


* Re: ext4 features
  2006-07-04 16:35                     ` Jeffrey V. Merkey
  2006-07-04 18:52                       ` Jeff Garzik
@ 2006-07-05 13:35                       ` Lew Palm
  1 sibling, 0 replies; 119+ messages in thread
From: Lew Palm @ 2006-07-05 13:35 UTC (permalink / raw)
  To: Jeffrey V. Merkey; +Cc: linux-kernel

Jeffrey V. Merkey wrote:
> The old novell model is simple. When someone unlinks a file, don't
> delete it, just mv it to another special directory called DELETED.SAV.
> Then setup the
> fs space allocation to reuse these files when the drive fills up by
> oldest files first. It's very simple. Then you have a salvagable file
> system.

A completely foolproof car is a car with a maximum speed of 0 mph.
As a user I give commands to my computer, for example an order to delete a
file. And this is what I expect it to do.
If I wanted it to move a file to another position in the filesystem, I would
use another command. I don't want my operating system to josh me; that's why
I use Linux.
Stealthily keeping deleted files somewhere is a security black hole.

But accidents happen. Hardware perishes, users make mistakes, sometimes
coffee gets spilled...
That's why we back up important data regularly.
A not-really-deleting filesystem wouldn't relieve us of that duty, but would
make a system more insecure and ambiguous.

Lew


* Re: ext4 features
  2006-07-03 21:25               ` Diego Calleja
  2006-07-03 22:17                 ` Alan Cox
@ 2006-07-03 23:01                 ` Jeff V. Merkey
  2006-07-04  9:14                 ` Benny Amorsen
  2006-07-04  9:22                 ` Petr Tesarik
  3 siblings, 0 replies; 119+ messages in thread
From: Jeff V. Merkey @ 2006-07-03 23:01 UTC (permalink / raw)
  To: Diego Calleja; +Cc: arjan, zdzichu, helgehaf, sithglan, tytso, linux-kernel

Diego Calleja wrote:

>El Mon, 03 Jul 2006 15:46:55 -0600,
>"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió:
>
>  
>
>>Add a salvagable file system to ext4, i.e. when a file is deleted, you 
>>just rename it and move it to a directory called DELETED.SAV and recycle 
>>the files as people allocate new ones.  Easy to do (internal "mv" of 
>>    
>>
>
>
>Easily doable in userspace, why bother with kernel programming
>
>  
>
Fine, leave it out.  That leaves more for me to offer as added features in
my products, for stuff Linux does not provide.

Jeff


* Re: ext4 features
  2006-07-03 21:25               ` Diego Calleja
  2006-07-03 22:17                 ` Alan Cox
  2006-07-03 23:01                 ` Jeff V. Merkey
@ 2006-07-04  9:14                 ` Benny Amorsen
  2006-07-05  4:21                   ` Bill Davidsen
  2006-07-04  9:22                 ` Petr Tesarik
  3 siblings, 1 reply; 119+ messages in thread
From: Benny Amorsen @ 2006-07-04  9:14 UTC (permalink / raw)
  To: linux-kernel

>>>>> "DC" == Diego Calleja <diegocg@gmail.com> writes:

DC> El Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
DC> <jmerkey@wolfmountaingroup.com> escribió:

>> Add a salvagable file system to ext4, i.e. when a file is deleted,
>> you just rename it and move it to a directory called DELETED.SAV
>> and recycle the files as people allocate new ones. Easy to do
>> (internal "mv" of

DC> Easily doable in userspace, why bother with kernel programming

In userspace you can't automatically delete the files when the space
becomes needed. The LD_PRELOAD/glibc methods also have the
disadvantage of having to figure out where a file goes when it's
deleted, depending on which device it happens to reside on. Demanding
read access to /proc/mounts just to do rm could cause problems.

Userspace has had 10 years to invent a good solution. If it was so
easy, it would probably have been done.

/Benny
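
The per-device problem described above comes from rename() only working
within one filesystem, so a userspace trash needs a salvage directory on
the same device as the file. As a rough sketch, one way to locate it
without reading /proc/mounts is to walk up the path until a mount point
is reached. `find_mount_point` and `trash_dir_for` are hypothetical
helpers, and the one-DELETED.SAV-per-filesystem layout is an assumption,
not something proposed in the thread.

```python
import os


def find_mount_point(path):
    """Walk up from path until reaching the root of its filesystem."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def trash_dir_for(path):
    # Hypothetical layout: one DELETED.SAV per filesystem, at its mount
    # point, so that moving a file there is always a same-device rename.
    return os.path.join(find_mount_point(path), "DELETED.SAV")


mount_point = find_mount_point(os.getcwd())
target = trash_dir_for(os.getcwd())
```

This does not address the other objections in the message: an ordinary
user may not be allowed to create a directory at the mount point, and
nothing in userspace frees the salvaged files automatically when space
runs out.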


* Re: ext4 features
  2006-07-04  9:14                 ` Benny Amorsen
@ 2006-07-05  4:21                   ` Bill Davidsen
  2006-07-05  5:13                     ` H. Peter Anvin
  2006-07-07 14:10                     ` Pavel Machek
  0 siblings, 2 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05  4:21 UTC (permalink / raw)
  To: Benny Amorsen, linux-kernel

Benny Amorsen wrote:
>>>>>> "DC" == Diego Calleja <diegocg@gmail.com> writes:
> 
> DC> El Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
> DC> <jmerkey@wolfmountaingroup.com> escribió:
> 
>>> Add a salvagable file system to ext4, i.e. when a file is deleted,
>>> you just rename it and move it to a directory called DELETED.SAV
>>> and recycle the files as people allocate new ones. Easy to do
>>> (internal "mv" of
> 
> 
> DC> Easily doable in userspace, why bother with kernel programming
> 
> In userspace you can't automatically delete the files when the space
> becomes needed. The LD_PRELOAD/glibc methods also have the
> disadvantage of having to figure out where a file goes when it's
> deleted, depending on which device it happens to reside on. Demanding
> read access to /proc/mounts just to do rm could cause problems.
> 
> Userspace has had 10 years to invent a good solution. If it was so
> easy, it would probably have been done.
> 
Actually, if it were so important it WOULD have been done. I suspect 
that the issue is not lack of a good solution, but lack of a good 
problem. The behavior you propose requires a lot of kernel cleverness, 
including making the inodes seem to go away, so the count is "right" for 
what the user sees.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one error occurs during
wildcard (glob) expansion.



* Re: ext4 features
  2006-07-05  4:21                   ` Bill Davidsen
@ 2006-07-05  5:13                     ` H. Peter Anvin
  2006-07-05  5:45                       ` Jeffrey V. Merkey
  2006-07-05 10:38                       ` Krzysztof Halasa
  2006-07-07 14:10                     ` Pavel Machek
  1 sibling, 2 replies; 119+ messages in thread
From: H. Peter Anvin @ 2006-07-05  5:13 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Benny Amorsen, linux-kernel

Bill Davidsen wrote:
>>
>> DC> Easily doable in userspace, why bother with kernel programming
>>
>> In userspace you can't automatically delete the files when the space
>> becomes needed. The LD_PRELOAD/glibc methods also have the
>> disadvantage of having to figure out where a file goes when it's
>> deleted, depending on which device it happens to reside on. Demanding
>> read access to /proc/mounts just to do rm could cause problems.
>>
>> Userspace has had 10 years to invent a good solution. If it was so
>> easy, it would probably have been done.
>>
> Actually, if it were so important it WOULD have been done. I suspect 
> that the issue is not lack of a good solution, but lack of a good 
> problem. The behavior you propose requires a lot of kernel cleverness, 
> including make the inodes seem to go away, so the count is "right" for 
> what the user sees.
> 

The real solution for it is snapshots.

	-hpa


* Re: ext4 features
  2006-07-05  5:13                     ` H. Peter Anvin
@ 2006-07-05  5:45                       ` Jeffrey V. Merkey
  2006-07-07 14:12                         ` Pavel Machek
  2006-07-05 10:38                       ` Krzysztof Halasa
  1 sibling, 1 reply; 119+ messages in thread
From: Jeffrey V. Merkey @ 2006-07-05  5:45 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

H. Peter Anvin wrote:

> Bill Davidsen wrote:
>
>>>
>>> DC> Easily doable in userspace, why bother with kernel programming
>>>
>>> In userspace you can't automatically delete the files when the space
>>> becomes needed. The LD_PRELOAD/glibc methods also have the
>>> disadvantage of having to figure out where a file goes when it's
>>> deleted, depending on which device it happens to reside on. Demanding
>>> read access to /proc/mounts just to do rm could cause problems.
>>>
>>> Userspace has had 10 years to invent a good solution. If it was so
>>> easy, it would probably have been done.
>>>
>> Actually, if it were so important it WOULD have been done. I suspect 
>> that the issue is not lack of a good solution, but lack of a good 
>> problem. The behavior you propose requires a lot of kernel 
>> cleverness, including make the inodes seem to go away, so the count 
>> is "right" for what the user sees.
>>
>
> The real solution for it is snapshots.


Peter,

Explain what you are thinking here.  What I proposed, I have already
implemented in NetWare; it's very easy to do.  Snapshotting is not
complex for FS's but does require a lot of space for meta-data to manage
it.  EXT is not architected for something this complex.  A simple
hidden mv is much easier to do.

Jeff


* Re: ext4 features
  2006-07-05  5:45                       ` Jeffrey V. Merkey
@ 2006-07-07 14:12                         ` Pavel Machek
  0 siblings, 0 replies; 119+ messages in thread
From: Pavel Machek @ 2006-07-07 14:12 UTC (permalink / raw)
  To: Jeffrey V. Merkey
  Cc: H. Peter Anvin, Bill Davidsen, Benny Amorsen, linux-kernel

Hi!

> >>Actually, if it were so important it WOULD have been 
> >>done. I suspect that the issue is not lack of a good 
> >>solution, but lack of a good problem. The behavior you 
> >>propose requires a lot of kernel cleverness, including 
> >>make the inodes seem to go away, so the count is 
> >>"right" for what the user sees.
> >>
> >
> >The real solution for it is snapshots.
> 
> 
> Peter,
> 
> Explain what you are thinking here.  What I proposed, I 
> have already implemented in NetWare, it's very easy to 
> do.  Snapshotting is not complex for FS's but does 
> require a lot of space for meta-data to manage it.  EXT 
> is not architecteced for something this complex.  A 
> simple hidden mv is much easier to do.

Patch would be nice :-).

Hidden mv is indeed simple; reclaiming space on demand may be
trickier.
							Pavel
-- 
Thanks for all the (sleeping) penguins.


* Re: ext4 features
  2006-07-05  5:13                     ` H. Peter Anvin
  2006-07-05  5:45                       ` Jeffrey V. Merkey
@ 2006-07-05 10:38                       ` Krzysztof Halasa
  1 sibling, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-05 10:38 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:

> The real solution for it is snapshots.

Or a continuous log. Since we already use a journal we could possibly
make its contents stay forever (and the admin should be able to define
the "forever").
-- 
Krzysztof Halasa
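
The "journal that stays forever" idea above can be pictured as an
append-only log that is replayed up to a chosen point. This is purely a
conceptual sketch: `RetainedJournal` is a made-up class, and it records
named operations, whereas a real filesystem journal such as ext3's logs
block images and is recycled once transactions are checkpointed.

```python
class RetainedJournal:
    """Toy append-only journal that is never truncated."""

    def __init__(self):
        self.records = []  # (sequence number, operation, name, data)

    def log(self, op, name, data=None):
        self.records.append((len(self.records), op, name, data))

    def state_at(self, seq):
        """Replay records up to and including seq to rebuild the namespace."""
        files = {}
        for n, op, name, data in self.records:
            if n > seq:
                break
            if op == "write":
                files[name] = data
            elif op == "unlink":
                files.pop(name, None)
        return files


j = RetainedJournal()
j.log("write", "notes.txt", b"draft")   # seq 0
j.log("write", "notes.txt", b"final")   # seq 1
j.log("unlink", "notes.txt")            # seq 2

now = j.state_at(2)
before_delete = j.state_at(1)
```

Replaying to just before the unlink brings the file back, and the
admin-defined "forever" would correspond to how many old records are
retained before the log is trimmed.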


* Re: ext4 features
  2006-07-05  4:21                   ` Bill Davidsen
  2006-07-05  5:13                     ` H. Peter Anvin
@ 2006-07-07 14:10                     ` Pavel Machek
  2006-07-07 17:45                       ` Krzysztof Halasa
  1 sibling, 1 reply; 119+ messages in thread
From: Pavel Machek @ 2006-07-07 14:10 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Benny Amorsen, linux-kernel

On Wed 05-07-06 00:21:32, Bill Davidsen wrote:
> Benny Amorsen wrote:
> >>>>>>"DC" == Diego Calleja <diegocg@gmail.com> writes:
> >
> >DC> El Mon, 03 Jul 2006 15:46:55 -0600, "Jeff V. Merkey"
> >DC> <jmerkey@wolfmountaingroup.com> escribió:
> >
> >>>Add a salvagable file system to ext4, i.e. when a file is deleted,
> >>>you just rename it and move it to a directory called DELETED.SAV
> >>>and recycle the files as people allocate new ones. Easy to do
> >>>(internal "mv" of
> >
> >
> >DC> Easily doable in userspace, why bother with kernel programming
> >
> >In userspace you can't automatically delete the files when the space
> >becomes needed. The LD_PRELOAD/glibc methods also have the
> >disadvantage of having to figure out where a file goes when it's
> >deleted, depending on which device it happens to reside on. Demanding
> >read access to /proc/mounts just to do rm could cause problems.
> >
> >Userspace has had 10 years to invent a good solution. If it was so
> >easy, it would probably have been done.
> >
> Actually, if it were so important it WOULD have been 
> done. I suspect that the issue is not lack of a good 

It *was* done. mc supports undelete on ext2. Unfortunately ext3 broke
that :-(.
							Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07 14:10                     ` Pavel Machek
@ 2006-07-07 17:45                       ` Krzysztof Halasa
  2006-07-07 21:30                         ` Pavel Machek
  0 siblings, 1 reply; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-07 17:45 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

Pavel Machek <pavel@ucw.cz> writes:

> It *was* done. mc supports undelete on ext2.

How does it do that? Directly accessing the device?
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07 17:45                       ` Krzysztof Halasa
@ 2006-07-07 21:30                         ` Pavel Machek
  2006-07-08 10:52                           ` Krzysztof Halasa
  0 siblings, 1 reply; 119+ messages in thread
From: Pavel Machek @ 2006-07-07 21:30 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

On Fri 07-07-06 19:45:21, Krzysztof Halasa wrote:
> Pavel Machek <pavel@ucw.cz> writes:
> 
> > It *was* done. mc supports undelete on ext2.
> 
> How does it do that? Directly accessing the device?

Yes. I used it once or twice, and was not happy when ext3 broke it.

-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07 21:30                         ` Pavel Machek
@ 2006-07-08 10:52                           ` Krzysztof Halasa
  2006-07-08 10:55                             ` Pavel Machek
  0 siblings, 1 reply; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-08 10:52 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

Pavel Machek <pavel@ucw.cz> writes:

>> > It *was* done. mc supports undelete on ext2.
>> 
>> How does it do that? Directly accessing the device?
>
> Yes. I used it once or twice, and was not happy when ext3 broke it.

I'd say it had to be broken from the beginning. Doing such things
on a live, mounted filesystem...
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 10:52                           ` Krzysztof Halasa
@ 2006-07-08 10:55                             ` Pavel Machek
  2006-07-08 11:19                               ` Krzysztof Halasa
  0 siblings, 1 reply; 119+ messages in thread
From: Pavel Machek @ 2006-07-08 10:55 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

On Sat 2006-07-08 12:52:17, Krzysztof Halasa wrote:
> Pavel Machek <pavel@ucw.cz> writes:
> 
> >> > It *was* done. mc supports undelete on ext2.
> >> 
> >> How does it do that? Directly accessing the device?
> >
> > Yes. I used it once or twice, and was not happy when ext3 broke it.
> 
> I'd say it had to be broken from the beginning. Doing such things
> on a live, mounted filesystem...

Why not? You use libext2fs (or whatever it is called) to read the file
from the disk directly (read-only access), then you write it back using
regular calls.

Of course, you can end up with "deleted" data being corrupted if the
kernel reused the area before the undelete, or while you were doing the
undelete... but that's expected. They were _deleted_, right?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 10:55                             ` Pavel Machek
@ 2006-07-08 11:19                               ` Krzysztof Halasa
  2006-07-08 11:23                                 ` Pavel Machek
  2006-07-08 18:45                                 ` Avi Kivity
  0 siblings, 2 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-08 11:19 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

Pavel Machek <pavel@ucw.cz> writes:

> Why not? You use libext2fs (or whatever it is called) to read the file
> from the disk directly (read-only access), then you write it back using
> regular calls.
>
> Of course, you can end up with "deleted" data being corrupted if the
> kernel reused the area before the undelete, or while you were doing the
> undelete... but that's expected. They were _deleted_, right?

What if the "undeleted" file contained /etc/shadow because someone
was changing password at the time? :-)
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 11:19                               ` Krzysztof Halasa
@ 2006-07-08 11:23                                 ` Pavel Machek
  2006-07-08 18:45                                 ` Avi Kivity
  1 sibling, 0 replies; 119+ messages in thread
From: Pavel Machek @ 2006-07-08 11:23 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: Bill Davidsen, Benny Amorsen, linux-kernel

On Sat 2006-07-08 13:19:52, Krzysztof Halasa wrote:
> Pavel Machek <pavel@ucw.cz> writes:
> 
> > Why not? You use libext2fs (or whatever it is called) to read the file
> > from the disk directly (read-only access), then you write it back using
> > regular calls.
> >
> > Of course, you can end up with "deleted" data being corrupted if the
> > kernel reused the area before the undelete, or while you were doing the
> > undelete... but that's expected. They were _deleted_, right?
> 
> What if the "undeleted" file contained /etc/shadow because someone
> was changing password at the time? :-)

Well, that's okay :-).
								Pavel

...of course, undelete is a root-only operation, and one that should not
be taken lightly. You need to verify you got what you wanted at the end.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 11:19                               ` Krzysztof Halasa
  2006-07-08 11:23                                 ` Pavel Machek
@ 2006-07-08 18:45                                 ` Avi Kivity
  2006-07-08 20:24                                   ` Krzysztof Halasa
  1 sibling, 1 reply; 119+ messages in thread
From: Avi Kivity @ 2006-07-08 18:45 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: Pavel Machek, Bill Davidsen, Benny Amorsen, linux-kernel

Krzysztof Halasa wrote:
>
> Pavel Machek <pavel@ucw.cz> writes:
>
> > Why not? You use libext2fs (or whatever it is called) to read the file
> > from the disk directly (read-only access), then you write it back using
> > regular calls.
> >
> > Of course, you can end up with "deleted" data being corrupted if the
> > kernel reused the area before the undelete, or while you were doing the
> > undelete... but that's expected. They were _deleted_, right?
>
> What if the "undeleted" file contained /etc/shadow because someone
> was changing password at the time? :-)
>

As the undeleter already had read access to the raw device, /etc/shadow 
was already compromised.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 18:45                                 ` Avi Kivity
@ 2006-07-08 20:24                                   ` Krzysztof Halasa
  0 siblings, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-08 20:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, Bill Davidsen, Benny Amorsen, linux-kernel

Avi Kivity <avi@argo.co.il> writes:

>> What if the "undeleted" file contained /etc/shadow because someone
>> was changing password at the time? :-)
>>
>
> As the undeleter already had read access to the raw device,
> /etc/shadow was already compromised.

I understand only root had access, but the file in question might be
requested by a user. Of course root should have known the consequences
but...
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 21:25               ` Diego Calleja
                                   ` (2 preceding siblings ...)
  2006-07-04  9:14                 ` Benny Amorsen
@ 2006-07-04  9:22                 ` Petr Tesarik
  2006-07-04 11:35                   ` Peter Zijlstra
  3 siblings, 1 reply; 119+ messages in thread
From: Petr Tesarik @ 2006-07-04  9:22 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel

On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote:
> El Mon, 03 Jul 2006 15:46:55 -0600,
> "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió:
> 
> > Add a salvagable file system to ext4, i.e. when a file is deleted, you 
> > just rename it and move it to a directory called DELETED.SAV and recycle 
> > the files as people allocate new ones.  Easy to do (internal "mv" of 
> 
> 
> Easily doable in userspace, why bother with kernel programming

Yes and no. A simple mv is better done in userspace, but what I'd
_really_ appreciate would be a true kernel salvage (similar to the way
NetWare does things). That means marking the file as deleted in the
directory, marking its blocks as deleted but avoiding the use of those
blocks. The kernel would then prefer allocating new blocks from
elsewhere but once the filesystem runs out of space, it would start
allocating from the deleted files area and marking the blocks as well as
the corresponding files purged.
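
To make the scheme above concrete, here is a minimal sketch of the
allocation policy in Python. It is a toy model and not ext4 code: the
class and method names are invented for illustration, blocks are plain
integers, and "purge the oldest deleted file first" is an assumed order.

```python
from collections import OrderedDict


class SalvageAllocator:
    """Toy allocator: deleted files keep their blocks until space runs out."""

    def __init__(self, nblocks):
        self.free = list(range(nblocks))   # blocks nobody has a claim on
        self.deleted = OrderedDict()       # file id -> blocks, oldest first

    def delete(self, file_id, blocks):
        """Mark a file as deleted but leave its blocks salvageable."""
        self.deleted[file_id] = list(blocks)

    def purge(self, file_id):
        """Give up on a deleted file and hand its blocks back."""
        self.free.extend(self.deleted.pop(file_id))

    def allocate(self):
        """Prefer truly free blocks; otherwise purge deleted files."""
        while not self.free:
            if not self.deleted:
                raise OSError("no space left on device")
            self.purge(next(iter(self.deleted)))   # oldest deletion goes first
        return self.free.pop(0)

    def salvage(self, file_id):
        """Take a not-yet-purged file back out of the deleted area."""
        return self.deleted.pop(file_id)

    def free_space(self):
        """Deleted-but-unpurged blocks are still reported as free."""
        return len(self.free) + sum(len(b) for b in self.deleted.values())


# Usage: fill a 4-block disk, delete both files, then allocate again.
fs = SalvageAllocator(4)
old = [fs.allocate(), fs.allocate()]
new = [fs.allocate(), fs.allocate()]
fs.delete("old.txt", old)
fs.delete("new.txt", new)
space_reported = fs.free_space()   # deleted files still count as free
reused = fs.allocate()             # disk is full, so "old.txt" is purged
salvaged = fs.salvage("new.txt")   # the newer file is still intact
```

Note that the sketch purges a whole file as soon as one of its blocks
is needed, which matches "no guarantee that any deleted file can be
salvaged".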

Salvaging files would be done with a separate tool. Of course, if you
delete more files with the same name in the same directory, you'd need
to tell that tool which one of them you want to salvage. Yes, I really
mean you'd have more than one deleted file with the same name in the
directory.

Anyway, I doubt we want such a feature for ext4, because to make things
efficient, you'd need to provide some kind of pointer from the deleted
(but not yet purged) blocks to the corresponding file. Hard links are
also problematic and there are a whole lot of other troubles I haven't
even thought of.

Just my two cents.

--
Petr Tesarik

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04  9:22                 ` Petr Tesarik
@ 2006-07-04 11:35                   ` Peter Zijlstra
  2006-07-04 11:55                     ` ext4 features (salvage) Petr Tesarik
                                       ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Peter Zijlstra @ 2006-07-04 11:35 UTC (permalink / raw)
  To: Petr Tesarik; +Cc: Diego Calleja, linux-kernel

On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote:
> On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote:
> > El Mon, 03 Jul 2006 15:46:55 -0600,
> > "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió:
> > 
> > > Add a salvagable file system to ext4, i.e. when a file is deleted, you 
> > > just rename it and move it to a directory called DELETED.SAV and recycle 
> > > the files as people allocate new ones.  Easy to do (internal "mv" of 
> > 
> > 
> > Easily doable in userspace, why bother with kernel programming
> 
> Yes and no. A simple mv is better done in userspace, but what I'd
> _really_ appreciate would be a true kernel salvage (similar to the way
> NetWare does things). That means marking the file as deleted in the
> directory, marking its blocks as deleted but avoiding the use of those
> blocks. The kernel would then prefer allocating new blocks from
> elsewhere but once the filesystem runs out of space, it would start
> allocating from the deleted files area and marking the blocks as well as
> the corresponding files purged.
> 
> Salvaging files would be done with a separate tool. Of course, if you
> delete more files with the same name in the same directory, you'd need
> to tell that tool which one of them you want to salvage. Yes, I really
> mean you'd have more than one deleted file with the same name in the
> directory.
> 
> Anyway, I doubt we want such feature for ext4, because to make things
> efficient, you'd need to provide some kind of pointer from the deleted
> (but not yet purged) blocks to the corresponding file. Hard links are
> also problematic and there is a whole lot of other troubles I haven't
> even thought of.

Wouldn't such a scheme interfere with the block allocator algorithms,
and hence increase the risk of fragmentation? Schemes like this really
make my hair stand on end:

  1) if you don't want to lose your data, make backups; 
  2) if I mean to delete a file, I want it gone proper. Silently keeping
     it about is not unix like;
  3) don't aid third parties in recovering your removed data. If I want
     them to have it I'll give it to them.

Peter


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (salvage)
  2006-07-04 11:35                   ` Peter Zijlstra
@ 2006-07-04 11:55                     ` Petr Tesarik
       [not found]                       ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>
  2006-07-04 16:20                       ` Matthew Frost
  2006-07-04 15:25                     ` ext4 features Pavel Machek
  2006-07-05  4:10                     ` Bill Davidsen
  2 siblings, 2 replies; 119+ messages in thread
From: Petr Tesarik @ 2006-07-04 11:55 UTC (permalink / raw)
  To: Peter Zijlstra

On Tue, 2006-07-04 at 13:35 +0200, Peter Zijlstra wrote:
> On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote:
> > Yes and no. A simple mv is better done in userspace, but what I'd
> > _really_ appreciate would be a true kernel salvage (similar to the way
> > NetWare does things). That means marking the file as deleted in the
> > directory, marking its blocks as deleted but avoiding the use of those
> > blocks. The kernel would then prefer allocating new blocks from
> > elsewhere but once the filesystem runs out of space, it would start
> > allocating from the deleted files area and marking the blocks as well as
> > the corresponding files purged.
> > 
> > Salvaging files would be done with a separate tool. Of course, if you
> > delete more files with the same name in the same directory, you'd need
> > to tell that tool which one of them you want to salvage. Yes, I really
> > mean you'd have more than one deleted file with the same name in the
> > directory.
> 
> Wouldn't such a scheme interfere with the block allocator algorithms,
> and hence increase the risk of fragmentation? Schemes like this really
> make my hair stand on end:

Yes, they would interfere. That's why I'm not proposing to add them to
ext4 in the first place.

>   1) if you don't want to lose your data, make backups; 

Generally, I agree.

>   2) if I mean to delete a file, I want it gone proper. Silently keeping
>      it about is not unix like;

Yes, this is a problem. Although you would of course have a tool for
purging the files unconditionally, some programs may need the assumption
that an unlinked file is gone forever.

Regarding the second clause, well, Linux is not Unix-like in many
respects and we want it like that. That's a weak argument.

>   3) don't aid third parties in recovering your removed data. If I want
>      them to have it I'll give it to them.

See 2. Explicit purging is of course possible. (Novell Netware also had
a "purge" command.)

Anyway, it seems that there is some functionality which many users want
but which can't be provided in user space:

  - files moved to the recycle-bin-or-whatever-you-call-it still count
    as free disk space, and
  - the least recently deleted files are purged automatically.

Regards,
Petr Tesarik

^ permalink raw reply	[flat|nested] 119+ messages in thread

[parent not found: <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>]

* Re: ext4 features (salvage)
       [not found]                       ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>
@ 2006-07-04 12:31                         ` Petr Tesarik
  2006-07-04 12:42                           ` Helge Hafting
  0 siblings, 1 reply; 119+ messages in thread
From: Petr Tesarik @ 2006-07-04 12:31 UTC (permalink / raw)
  To: Lex Lyamin; +Cc: linux-kernel

On Tue, 2006-07-04 at 16:08 +0400, Lex Lyamin wrote:
> you mean the blocks are effectively free, but we can't use them because
> someone may have freed them by accident...
> 
> hmm...
> great idea!
> 
> wait, it's not.
> because we can't use those blocks, we can't optimise the way we write to
> the disk, and if we have a defragmenter it can't make use of them either.
> and if (just if) this is an online defragmenter, it can't use them too.

Well, the way I saw it done was that you had no guarantee that any
deleted file could be salvaged. Sometimes you even could salvage a file
but not another one which was deleted later. Users seemed to be content
with that, because in most situations it did help them restore files
they deleted and within a few seconds realized that they didn't want to.

This means that the allocator MAY purge any deleted block at any moment,
although it tends to allocate blocks from areas of disk which haven't
been used recently.

And the benefits? The performance of such a filesystem could be better
than snapshots, while still letting users recover from one of the most
common human errors.

Regards,
Petr Tesarik

> for what purpose?
> aren't we trying to solve a level 3 problem at level 2?
> does the solution really belong at this level?
> would people pay with performance for a "feature" which *probably* will
> help them to restore their files?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (salvage)
  2006-07-04 12:31                         ` Petr Tesarik
@ 2006-07-04 12:42                           ` Helge Hafting
  0 siblings, 0 replies; 119+ messages in thread
From: Helge Hafting @ 2006-07-04 12:42 UTC (permalink / raw)
  To: Petr Tesarik; +Cc: Lex Lyamin, linux-kernel

On Tue, Jul 04, 2006 at 02:31:56PM +0200, Petr Tesarik wrote:
> On Tue, 2006-07-04 at 16:08 +0400, Lex Lyamin wrote:
> > you mean the blocks are effectively free, but we can't use them because
> > someone may have freed them by accident...
> > 
> > hmm...
> > great idea!
> > 
> > wait, it's not.
> > because we can't use those blocks, we can't optimise the way we write to
> > the disk, and if we have a defragmenter it can't make use of them either.
> > and if (just if) this is an online defragmenter, it can't use them too.
> 
> Well, the way I saw it done was that you had no guarantee that any
> deleted file could be salvaged. Sometimes you even could salvage a file
> but not another one which was deleted later. Users seemed to be content
> with that, because in most situations it did help them restore files
> they deleted and within a few seconds realized that they didn't want to.
> 
> This means that the allocator MAY purge any deleted block at any moment,
> although it tends to allocate blocks from areas of disk which haven't
> been used recently.
> 
> And the benefits? The performance of such a filesystem could be better
> than snapshots, while allowing to cope with one of the most common human
> errors.

The most common error?  A few years ago I restored a file from
backup, because I deleted it in error.  I can't even remember
the second-last time I had that problem.

I'd say this error is among the easiest to avoid. :-)
For me it wouldn't justify even a small performance loss.

Now, there may be clumsier users than me, but they tend to
be using GUI "file managers" which do implement a "wastebasket"
for all internal deletion.

Helge Hafting



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (salvage)
  2006-07-04 11:55                     ` ext4 features (salvage) Petr Tesarik
       [not found]                       ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>
@ 2006-07-04 16:20                       ` Matthew Frost
  1 sibling, 0 replies; 119+ messages in thread
From: Matthew Frost @ 2006-07-04 16:20 UTC (permalink / raw)
  To: linux kernel mailing list

(Stupid mailer + user error = not sent to list)

Petr Tesarik wrote:
 > On Tue, 2006-07-04 at 13:35 +0200, Peter Zijlstra wrote:
 >> On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote:
 >>> Yes and no. A simple mv is better done in userspace, but what I'd
 >>> _really_ appreciate would be a true kernel salvage (similar to the way
 >>> NetWare does things). That means marking the file as deleted in the
 >>> directory, marking its blocks as deleted but avoiding the use of those
 >>> blocks. The kernel would then prefer allocating new blocks from
 >>> elsewhere but once the filesystem runs out of space, it would start
 >>> allocating from the deleted files area and marking the blocks as well as
 >>> the corresponding files purged.
 >>>
 >>> Salvaging files would be done with a separate tool. Of course, if you
 >>> delete more files with the same name in the same directory, you'd need
 >>> to tell that tool which one of them you want to salvage. Yes, I really
 >>> mean you'd have more than one deleted file with the same name in the
 >>> directory.
 >> Wouldn't such a scheme interfere with the block allocator algorithms,
 >> and hence increase the risk of fragmentation? Schemes like this really
 >> make my hair stand on end:
 >
 > Yes, they would interfere. That's why I'm not proposing to add them to
 > ext4 in the first place.
 >
 >>   1) if you don't want to lose your data, make backups;
 >
 > Generally, I agree.
 >
 >>   2) if I mean to delete a file, I want it gone proper. Silently keeping
 >>      it about is not unix like;
 >
 > Yes, this is a problem. Although you would of course have a tool for
 > purging the files unconditionally, some programs may need the assumption
 > that an unlinked file is gone forever.
 >
 > Regarding the second clause, well, Linux is not Unix-like in many
 > respects and we want it like that. That's a weak argument.

We silently keep files around in many filesystems, at least until
whatever reclamation process runs.  The delete event doesn't itself
generally purge the data from disk.  However, this is a matter of simple
tools doing simple things.  Designing an intentional structure around
not actually deleting deleted files, but keeping them around just in
case may be lauded as "user-friendly", but it is counter-intuitive.  It
is cleverness over clarity, good design smothered under feature demand.

In the ways in which it counts, in the sensible, useful, elegantly
simple ways, the "Do one thing and do it well" ways, Linux tries to be
Unix-like.  We want stupid programs.  A filesystem that decides that it
knows better than the user is not desirable.  Filesystem programmers
that decide that they know better than the user are likewise sub-optimal.

Protect my data against accidental failure.  Do not protect it against me.

If you have to add a "really delete, I mean it" command, you're breaking
fundamental assumptions.

 >
 >>   3) don't aid third parties in recovering your removed data. If I want
 >>      them to have it I'll give it to them.
 >
 > See 2. Explicit purging is of course possible. (Novell Netware also had
 > a "purge" command.)
 >
 > Anyway, it seems that there is some functionality which many users want
 > but which can't be provided in user space:
 >
 >   - if files are moved to the recycle-bin-or-whatever-you-call-it, their
 > size is added to disk free space and

Why add non-free space to the free space count, when we're intentionally
keeping those files?  If you have to be counter-intuitive, why go the
second counter of hiding it from the user who "wants us to keep and
index his deleted files"?

 >   - automatically purging least recently deleted files.
 >
 > Regards,
 > Petr Tesarik

Matt




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04 11:35                   ` Peter Zijlstra
  2006-07-04 11:55                     ` ext4 features (salvage) Petr Tesarik
@ 2006-07-04 15:25                     ` Pavel Machek
  2006-07-05  4:10                     ` Bill Davidsen
  2 siblings, 0 replies; 119+ messages in thread
From: Pavel Machek @ 2006-07-04 15:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Petr Tesarik, Diego Calleja, linux-kernel

Hi!

> > > > Add a salvagable file system to ext4, i.e. when a file is deleted, you 
> > > > just rename it and move it to a directory called DELETED.SAV and recycle 
> > > > the files as people allocate new ones.  Easy to do (internal "mv" of 
> > > 
> > > 
> > > Easily doable in userspace, why bother with kernel programming
> > 
> > Yes and no. A simple mv is better done in userspace, but what I'd
> > _really_ appreciate would be a true kernel salvage (similar to the way
> > NetWare does things). That means marking the file as deleted in the

I have code doing LD_PRELOAD tricks to force safe deletion... somewhere.

> Wouldn't such a scheme interfere with the block allocator algorithms,
> and hence increase the risk of fragmentation? Schemes like this really
> make my hair stand on end:
> 
>   1) if you don't want to lose your data, make backups; 
>   2) if I mean to delete a file, I want it gone proper. Silently keeping
>      it about is not unix like;

Well, mc has supported undelete on ext2 for a *long* time. And it works
okay...

And yes, doing echo > important_file instead of echo >> important_file
is way too easy with unix shells.
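
For anyone who has not been bitten yet, the same one-character slip
can be reproduced outside the shell. A minimal sketch in Python,
assuming a scratch file in a temporary directory (the file name is only
an example): mode "a" behaves like `>>`, mode "w" behaves like `>`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "important_file")

with open(path, "w") as f:      # create the file with some contents
    f.write("precious data\n")

with open(path, "a") as f:      # like `echo text >> important_file`
    f.write("one more line\n")
with open(path) as f:
    after_append = f.read()     # old contents are still there

with open(path, "w") as f:      # like `echo text > important_file`
    f.write("one more line\n")
with open(path) as f:
    after_truncate = f.read()   # old contents were truncated away
```

The truncation happens at open time, before a single byte is written,
so there is nothing left for a userspace "trash" to catch.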

						Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04 11:35                   ` Peter Zijlstra
  2006-07-04 11:55                     ` ext4 features (salvage) Petr Tesarik
  2006-07-04 15:25                     ` ext4 features Pavel Machek
@ 2006-07-05  4:10                     ` Bill Davidsen
  2 siblings, 0 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05  4:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Diego Calleja, linux-kernel

Peter Zijlstra wrote:
> On Tue, 2006-07-04 at 11:22 +0200, Petr Tesarik wrote:
>> On Mon, 2006-07-03 at 23:25 +0200, Diego Calleja wrote:
>>> El Mon, 03 Jul 2006 15:46:55 -0600,
>>> "Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> escribió:
>>>
>>>> Add a salvagable file system to ext4, i.e. when a file is deleted, you 
>>>> just rename it and move it to a directory called DELETED.SAV and recycle 
>>>> the files as people allocate new ones.  Easy to do (internal "mv" of 
>>>
>>> Easily doable in userspace, why bother with kernel programming
>> Yes and no. A simple mv is better done in userspace, but what I'd
>> _really_ appreciate would be a true kernel salvage (similar to the way
>> NetWare does things). That means marking the file as deleted in the
>> directory, marking its blocks as deleted but avoiding the use of those
>> blocks. The kernel would then prefer allocating new blocks from
>> elsewhere but once the filesystem runs out of space, it would start
>> allocating from the deleted files area and marking the blocks as well as
>> the corresponding files purged.
>>
>> Salvaging files would be done with a separate tool. Of course, if you
>> delete more files with the same name in the same directory, you'd need
>> to tell that tool which one of them you want to salvage. Yes, I really
>> mean you'd have more than one deleted file with the same name in the
>> directory.
>>
>> Anyway, I doubt we want such feature for ext4, because to make things
>> efficient, you'd need to provide some kind of pointer from the deleted
>> (but not yet purged) blocks to the corresponding file. Hard links are
>> also problematic and there is a whole lot of other troubles I haven't
>> even thought of.
> 
> Wouldn't such a scheme interfere with the block allocator algorithms,
> and hence increase the risk of fragmentation? Schemes like this really
> make my hair stand on end:
> 
>   1) if you don't want to lose your data, make backups; 
>   2) if I mean to delete a file, I want it gone proper. Silently keeping
>      it about is not unix like;
>   3) don't aid third parties in recovering your removed data. If I want
>      them to have it I'll give it to them.
> 
> Peter
> 
If you wanted to add a feature which would overwrite the file when 
removed or truncated I'd be happy. Yes, I know about attributes and dban, 
and I have a version of rm which does that if people use it, but it would 
be nice to have it on the whole filesystem. It's not proof against a 
TLA, but it is nice against casual snooping.
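
Roughly what such an overwriting rm does, as a minimal Python sketch.
The function name is made up for illustration, and the sketch assumes
the filesystem rewrites blocks in place; with data journaling,
copy-on-write or flash remapping, old copies may survive elsewhere.

```python
import os
import tempfile


def overwrite_and_unlink(path, chunk=1 << 16):
    """Overwrite a regular file with zeros in place, then unlink it."""
    remaining = os.path.getsize(path)
    with open(path, "r+b") as f:          # r+b: write without truncating
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())              # push the zeros to the device
    os.unlink(path)


# Usage: keep a descriptor open so the contents can be inspected afterwards.
fd, path = tempfile.mkstemp()
os.write(fd, b"secret" * 1000)
os.close(fd)
peek = open(path, "rb")
overwrite_and_unlink(path)
leftover = peek.read()                    # what a casual snooper would find
peek.close()
```

This only covers the rm case; doing the same on truncate is what would
need filesystem support.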

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 21:46             ` Jeff V. Merkey
  2006-07-03 21:25               ` Diego Calleja
@ 2006-07-03 21:46               ` Valdis.Kletnieks
       [not found]                 ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>
  2006-07-04 11:14               ` ext4 features Krzysztof Halasa
  2006-07-04 22:35               ` Frank van Maarseveen
  3 siblings, 1 reply; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-03 21:46 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]

On Mon, 03 Jul 2006 15:46:55 MDT, "Jeff V. Merkey" said:
> Add a salvagable file system to ext4, i.e. when a file is deleted, you
> just rename it and move it to a directory called DELETED.SAV and recycle
> the files as people allocate new ones.  Easy to do (internal "mv" of
> file to another directory) and modification of the allocation bitmaps.  
> Very simple and will pay off big.  If you need help designing it, just

Much better done in userspace - the kernel can't get this right without
some user hinting.  For starters, it creates a big security hole in all
the code that does an open()/unlink().
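
For reference, the open()/unlink() idiom looks like this; a minimal
Python sketch for a POSIX system, using a scratch file from mkstemp.
Programs written this way presumably assume that nobody can reach the
data by name once unlink() returns, which a hidden rename into
DELETED.SAV would break.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.unlink(path)                       # the name is gone immediately...
gone = not os.path.exists(path)

os.write(fd, b"scratch data")         # ...but the open descriptor still works
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)
os.close(fd)                          # storage is released on the last close
```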

Also, how do you handle the corner cases?  The fact you're adding to the
pathname of the file means you might push some long names over the MAXPATHLEN
value, and you have to worry about name collisions in the directory, and
so on.  There's also more subtle leakage issues, such as properly handling
the permissions on the files on a multi-user system so users can't rummage
each other's trash....
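
A minimal sketch of just the name-collision corner case, in Python. The
helper name and the ".N" suffix scheme are invented for illustration,
and it only checks the 255-byte per-component limit typical of Linux
filesystems, not the full path length or the permission problem.

```python
import os

NAME_MAX = 255   # usual per-component limit, in bytes


def trash_name(name, taken):
    """Pick a name for `name` in the trash directory that is not in `taken`."""
    candidate, n = name, 0
    while candidate in taken:
        n += 1
        candidate = f"{name}.{n}"     # foo, foo.1, foo.2, ...
    if len(os.fsencode(candidate)) > NAME_MAX:
        raise OSError("trash name too long")
    return candidate
```

Even this much shows the problem: a file whose name is already at the
limit cannot be renamed with a suffix, so the delete would have to fail
or fall back to a real unlink.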

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 119+ messages in thread

[parent not found: <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>]

* Re: Kernel recycler [was: ext4 features]
       [not found]                 ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>
@ 2006-07-04 14:37                   ` Jan Engelhardt
  0 siblings, 0 replies; 119+ messages in thread
From: Jan Engelhardt @ 2006-07-04 14:37 UTC (permalink / raw)
  To: Jeff V. Merkey, Diego Calleja, Valdis.Kletnieks
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

Hm, this one did not appear on LKML, so I am resending it.

On Jul 4 2006 00:01, Jan Engelhardt wrote:

>Date: Tue, 4 Jul 2006 00:01:56 +0200 (MEST)
>From: Jan Engelhardt <jengelh@linux01.gwdg.de>
>To: Jeff V. Merkey <jmerkey@wolfmountaingroup.com>,
>    Diego Calleja <diegocg@gmail.com>, Valdis.Kletnieks@vt.edu
>Cc: Arjan van de Ven <arjan@infradead.org>, Tomasz Torcz <zdzichu@irc.pl>,
>    Helge Hafting <helgehaf@aitel.hist.no>,
>    Thomas Glanzmann <sithglan@stud.uni-erlangen.de>,
>    Theodore Ts'o <tytso@mit.edu>, LKML <linux-kernel@vger.kernel.org>
>Subject: Kernel recycler [was: ext4 features]
>
>>>
>> Add a salvageable file system to ext4, i.e. when a file is deleted, you just
>> rename it and move it to a directory called DELETED.SAV and recycle the files
>> as people allocate new ones.  Easy to do (internal "mv" of file to another
>> directory) and modification of the allocation bitmaps.  Very simple and will
>> pay off big.  If you need help designing it, just ask me.
>>
>
>Hey, can you help? I had this idea of a kernel-level 'recycler' (FS-independent)
>a while ago (patch file is dated March 26 according to my `ls -l`) [1], but I have
>suspended it for the moment because it is a tedious task for API newcomers
>like me. (I currently have to look at a lot of other kernel code to figure
>out what the proper way of doing things is.)
>
>And it comes with some problems:
>
>- recycled files ("deleted" and moved) should not count against the user's quota
>
>- rm -Rf bigfatdirectory will keep a lot of files around, therefore we would
>  need an extra kthread that kills all files in DELETED.SAV after a tunable
>  period.
>
>[1] http://jengelh.hopto.org/recycler.diff
>
>
>>From: Diego Calleja <diegocg@gmail.com>
>>
>>Easily doable in userspace, why bother with kernel programming
>
>Because not every application will use KDE's trash feature, or /bin/my_rm,
>or anything similar. I certainly do not have the time to patch every program out
>there to use /bin/my_rm or a my_unlink() function. What about statically compiled
>programs? They call the syscall directly, so there is no way (without
>recompiling - if that is possible at all) to catch it within userspace.
>
>And what if knfsd is about to delete a file? Let's assume we cannot trust
>the client, so the only choice here is to have a kernel recycler.
>
>>Much better done in userspace - the kernel can't get this right without
>>some user hinting.  For starters, it creates a big security hole in all
>>the code that does an open()/unlink().
>>
>>Also, how do you handle the corner cases?  The fact you're adding to the
>>pathname of the file means you might push some long names over the MAXPATHLEN
>>value, and you have to worry about name collisions in the directory, and
>>so on.  There are also more subtle leakage issues, such as properly handling
>>the permissions on the files on a multi-user system so users can't rummage
>>each other's trash....
>
>I am aware of these problems, but at least for fun & profit, I would like to
>complete the kernel-level recycler.
>
>
>Jan Engelhardt
>-- 
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 21:46             ` Jeff V. Merkey
  2006-07-03 21:25               ` Diego Calleja
  2006-07-03 21:46               ` Valdis.Kletnieks
@ 2006-07-04 11:14               ` Krzysztof Halasa
  2006-07-04 22:35               ` Frank van Maarseveen
  3 siblings, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-04 11:14 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

"Jeff V. Merkey" <jmerkey@wolfmountaingroup.com> writes:

> Add a salvageable file system to ext4, i.e. when a file is deleted, you
> just rename it and move it to a directory called DELETED.SAV and
> recycle the files as people allocate new ones.

Given the problems pointed out, what would really be needed is a filesystem
with a full log of operations. Then the fs state (full contents of all
files etc.) at any given time could be restored.

It may not be very efficient, though (people doing databases and
transaction logging probably have something to say about that). I'd rather
have better backups (so I can restore from them) instead of such logging.
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 21:46             ` Jeff V. Merkey
                                 ` (2 preceding siblings ...)
  2006-07-04 11:14               ` ext4 features Krzysztof Halasa
@ 2006-07-04 22:35               ` Frank van Maarseveen
  2006-07-04 23:47                 ` Claudio Martins
  3 siblings, 1 reply; 119+ messages in thread
From: Frank van Maarseveen @ 2006-07-04 22:35 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

On Mon, Jul 03, 2006 at 03:46:55PM -0600, Jeff V. Merkey wrote:
[...]
> Add a salvageable file system to ext4, i.e. when a file is deleted, you 
> just rename it and move it to a directory called DELETED.SAV and recycle 
> the files as people allocate new ones.  Easy to do (internal "mv" of 
> file to another directory) and modification of the allocation bitmaps.  
> Very simple and will pay off big.  If you need help designing it, just 
> ask me.

Do you have any idea how to undo the effect of rm -rf /bigtree at
the FS level?

I think such an "undelete" feature should be implemented in userspace.
A filesystem which can travel back in time could be useful however.

-- 
Frank

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04 22:35               ` Frank van Maarseveen
@ 2006-07-04 23:47                 ` Claudio Martins
  0 siblings, 0 replies; 119+ messages in thread
From: Claudio Martins @ 2006-07-04 23:47 UTC (permalink / raw)
  To: Frank van Maarseveen
  Cc: Jeff V. Merkey, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML


On Tuesday 04 July 2006 23:35, Frank van Maarseveen wrote:
>
> Do you have any idea how to undo the effect of rm -rf /bigtree at
> the FS level?
>
> I think such an "undelete" feature should be implemented in userspace.
> A filesystem which can travel back in time could be useful however.

 Indeed.

 See:

http://lkml.org/lkml/2006/7/1/114

 I'm starting to repeat myself, but at least one filesystem of that kind is 
already being developed; let's try to support it! :-)

Regards

Cláudio


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 21:01           ` Arjan van de Ven
  2006-07-03 21:46             ` Jeff V. Merkey
@ 2006-07-03 22:12             ` Alan Cox
  2006-07-03 21:59               ` Arjan van de Ven
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
  1 sibling, 2 replies; 119+ messages in thread
From: Alan Cox @ 2006-07-03 22:12 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o,
	LKML

Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven:
> raid is great for protecting against individual disks or sectors going
> bad. But raid, especially high performance implementations, do not
> checksum data or detect corruptions. 
> 
> They're different purpose with almost zero overlap in purpose or even
> goal...

Same layer though - checksums are really a device mapper type problem
rather than an fs type problem.

Alan


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-03 22:12             ` Alan Cox
@ 2006-07-03 21:59               ` Arjan van de Ven
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
  1 sibling, 0 replies; 119+ messages in thread
From: Arjan van de Ven @ 2006-07-03 21:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: Tomasz Torcz, Helge Hafting, Thomas Glanzmann, Theodore Ts'o,
	LKML

On Mon, 2006-07-03 at 23:12 +0100, Alan Cox wrote:
> Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven:
> > raid is great for protecting against individual disks or sectors going
> > bad. But raid, especially high performance implementations, do not
> > checksum data or detect corruptions. 
> > 
> > They're different purpose with almost zero overlap in purpose or even
> > goal...
> 
> Same layer though - checksums are really a device mapper type problem
> rather than an fs type problem.

File payload checksums: I'd agree.
Filesystem metadata: there, checksums do provide value.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-03 22:12             ` Alan Cox
  2006-07-03 21:59               ` Arjan van de Ven
@ 2006-07-03 23:31               ` Neil Brown
  2006-07-04  1:03                 ` Jeff Garzik
                                   ` (3 more replies)
  1 sibling, 4 replies; 119+ messages in thread
From: Neil Brown @ 2006-07-03 23:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

On Monday July 3, alan@lxorguk.ukuu.org.uk wrote:
> Ar Llu, 2006-07-03 am 23:01 +0200, ysgrifennodd Arjan van de Ven:
> > raid is great for protecting against individual disks or sectors going
> > bad. But raid, especially high performance implementations, do not
> > checksum data or detect corruptions. 
> > 
> > They're different purpose with almost zero overlap in purpose or even
> > goal...
> 
> Same layer though - checksums are really a device mapper type problem
> rather than an fs type problem.

Can't say I agree with this layering distinction.
I've felt for some years that most 'logical volume
management' really belongs in the filesystem.
Why have a dm that chops devices up into segments and assembles them to
look like a big device, only to have that big device chopped up and
presented as files?  Seems like double handling to me.

With checksums - the filesystem is in a better position to:
 - be selective about what is checksummed - no point checksumming
   blocks that aren't part of any file.  Some blocks (high-level
   metadata) might always be checksummed, while other blocks
   (regular data) might not if a 'fast' option was chosen.
 - record the checksum somewhere easily accessible.  The dm layer
   could do little better than store a block of checksums for every 10
   blocks of data.  A filesystem can store checksums with indexing
   information, or ensure that checksums for consecutive blocks in a
   file are stored together, even if the blocks cannot be.

I think that for a filesystem that makes heavy use of trees to find
things, it makes a lot of sense to checksum and replicate the upper
levels of the tree, while checksumming and replicating lower levels
has a very different cost/benefit tradeoff.   These distinctions are
easy to make in a filesystem, and hard to make in a block device.
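
A rough sketch in C of what "storing checksums with indexing information"
and being "selective" could look like. Everything here is an assumption
made for illustration: the struct layout, the function names, and the
choice of CRC-32 as the checksum are not taken from any existing filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t blk_crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical index entry: the checksum sits next to the block address,
 * so reading the index also fetches the checksums of the blocks it names. */
struct block_ptr {
    uint64_t blkno;
    uint32_t csum;
    int      has_csum;   /* 0 when the 'fast' option skipped this block */
};

enum blk_kind { BLK_METADATA, BLK_DATA };

/* Metadata is always checksummed; regular data only if 'fast' is off. */
void ptr_seal(struct block_ptr *bp, uint64_t blkno, const uint8_t *data,
              size_t len, enum blk_kind kind, int fast)
{
    assert(bp != NULL && data != NULL);
    bp->blkno = blkno;
    bp->has_csum = (kind == BLK_METADATA) || !fast;
    bp->csum = bp->has_csum ? blk_crc32(data, len) : 0;
}

/* Returns 1 if the block matches its checksum or carries none, else 0. */
int ptr_verify(const struct block_ptr *bp, const uint8_t *data, size_t len)
{
    assert(bp != NULL && data != NULL);
    if (!bp->has_csum)
        return 1;
    return blk_crc32(data, len) == bp->csum;
}
```

A block device sees only block numbers, so it has no way to tell
BLK_METADATA from BLK_DATA; that distinction is exactly what the filesystem
knows and the lower layer does not.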

To my mind, the only thing you should put between the filesystem and
the raw devices is RAID (real-raid - not raid0 or linear).

NeilBrown

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
@ 2006-07-04  1:03                 ` Jeff Garzik
  2006-07-04  6:09                 ` Avi Kivity
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 119+ messages in thread
From: Jeff Garzik @ 2006-07-04  1:03 UTC (permalink / raw)
  To: Neil Brown
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Neil Brown wrote:
> Can't say I agree with this layering distinction.
> It's been some years that I've felt that most 'logical volume
> management' really belongs in the filesystem.
> Why have a dm that chops devices up in to segments and assembles them to
> look like a big device, only to have that big device chopped up and
> presented as files.  Seems like double handling to me.

Agreed, and allow me to take an even more radical position:

I've long felt that things like snapshotting and mirroring made a lot of 
sense at the filesystem level -- as do layered filesystems, just like we 
layer block devices.

Block device drivers (MD, DM) get ever more complicated, and ultimately 
become mini-filesystems themselves.  The metadata managed by blkdev 
drivers continues to increase in complexity.  What is represented to the 
upper layer as a contiguous run of bytes is really, under the hood, 
chunks of data coalesced logically -- just like files in a filesystem.

The more complex that blkdev drivers become, the more and more they will 
look like filesystems.

	Jeff

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
  2006-07-04  1:03                 ` Jeff Garzik
@ 2006-07-04  6:09                 ` Avi Kivity
  2006-07-04  7:02                   ` Neil Brown
  2006-07-05 12:06                   ` Bill Davidsen
  2006-07-04  8:17                 ` Alan Cox
  2006-07-04 11:19                 ` Krzysztof Halasa
  3 siblings, 2 replies; 119+ messages in thread
From: Avi Kivity @ 2006-07-04  6:09 UTC (permalink / raw)
  To: Neil Brown
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Neil Brown wrote:
>
> To my mind, the only thing you should put between the filesystem and
> the raw devices is RAID (real-raid - not raid0 or linear).
>
I believe that implementing RAID in the filesystem has many benefits too:
 - multiple RAID levels: store metadata in triple-mirror RAID 1, random 
write intensive data in RAID 1, bulk data in RAID 5/6
 - improved write throughput - since stripes can be variable size, any 
large enough write fills a whole stripe

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04  6:09                 ` Avi Kivity
@ 2006-07-04  7:02                   ` Neil Brown
  2006-07-04  8:26                     ` Avi Kivity
  2006-07-05 12:06                   ` Bill Davidsen
  1 sibling, 1 reply; 119+ messages in thread
From: Neil Brown @ 2006-07-04  7:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

On Tuesday July 4, avi@argo.co.il wrote:
> Neil Brown wrote:
> >
> > To my mind, the only thing you should put between the filesystem and
> > the raw devices is RAID (real-raid - not raid0 or linear).
> >
> I believe that implementing RAID in the filesystem has many benefits too:
>  - multiple RAID levels: store metadata in triple-mirror RAID 1, random 
> write intensive data in RAID 1, bulk data in RAID 5/6
>  - improved write throughput - since stripes can be variable size, any 
> large enough write fills a whole stripe

Maybe....

Now imagine what would be required to rebuild a whole drive onto a
spare after a drive failure.

I'm sure it is possible, and I believe ZFS does something like that.
I find it hard to imagine getting reasonable speed if there is much
complexity.  And the longer it takes, the longer your data is exposed
to multiple failures.

There may well be room there to come up with a really clever idea that
makes it both flexible and fast....

Note that 'resync' wouldn't be a problem.  Having the filesystem know
about the raid means that resync (after unclean shutdown) can be quite
trivial (I believe there is a paper related to this at OLS this year).

NeilBrown

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04  7:02                   ` Neil Brown
@ 2006-07-04  8:26                     ` Avi Kivity
  2006-07-05 11:56                       ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Avi Kivity @ 2006-07-04  8:26 UTC (permalink / raw)
  To: Neil Brown
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Neil Brown wrote:
>
> On Tuesday July 4, avi@argo.co.il wrote:
> > Neil Brown wrote:
> > >
> > > To my mind, the only thing you should put between the filesystem and
> > > the raw devices is RAID (real-raid - not raid0 or linear).
> > >
> > I believe that implementing RAID in the filesystem has many benefits 
> too:
> >  - multiple RAID levels: store metadata in triple-mirror RAID 1, random
> > write intensive data in RAID 1, bulk data in RAID 5/6
> >  - improved write throughput - since stripes can be variable size, any
> > large enough write fills a whole stripe
>
> Maybe....
>
> Now imagine what would be required to rebuild a whole drive onto a
> spare after a drive failure.
>
> I'm sure it is possible, and I believe ZFS does something like that.
> I find it hard to imagine getting reasonable speed if there is much
> complexity.  And the longer it takes, the longer your data is exposed
> to multiple-failures.
>

A company called Isilon does this on a cluster.  They claim (IIRC) a one 
hour rebuild time for a failure.  AFAIK they rebuild into cluster free 
space, so they are not bound by the spare's bandwidth; they can utilize 
all cluster resources for a rebuild.

(You don't need spare disks, just spare free space; so you don't have 
idle disk heads)

In terms of complexity, I imagine one needs a reverse mapping (extent -> 
(inode, offset)); given that, one can very easily rebuild failed disks, 
and more features are easy to implement, like evacuation of a drive, or 
rebalancing data across all drives when new disks are added.

The same ideas can be applied to a non-clustered filesystem, of course.
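
As an illustration of the reverse mapping idea, here is a small C sketch.
The table layout and function names are invented for this example and do
not describe Isilon's or any other real implementation. Given a failed
disk, the filesystem walks the reverse map to find which (inode, offset)
ranges lived there and therefore need to be rebuilt into free space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-map record: physical extent -> (inode, offset). */
struct rmap_entry {
    int      disk;       /* device holding the extent */
    uint64_t start;      /* first block on that device */
    uint64_t len;        /* length in blocks */
    uint64_t inode;      /* owning file */
    uint64_t file_off;   /* offset within the file, in blocks */
};

/* Which file owns a given physical block?  NULL if the block is free. */
const struct rmap_entry *rmap_lookup(const struct rmap_entry *map, size_t n,
                                     int disk, uint64_t block)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        if (map[i].disk == disk && block >= map[i].start &&
            block - map[i].start < map[i].len)
            return &map[i];
    }
    return NULL;
}

/* Collect up to 'cap' extents that lived on the failed disk.
 * Returns how many were stored in 'out'; free space is never visited. */
size_t rmap_on_disk(const struct rmap_entry *map, size_t n, int failed_disk,
                    const struct rmap_entry **out, size_t cap)
{
    size_t found = 0;

    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < n && found < cap; i++) {
        if (map[i].disk == failed_disk)
            out[found++] = &map[i];
    }
    return found;
}
```

The same walk would serve for evacuating a drive or rebalancing after new
disks are added: only the destination policy changes. A linear scan is used
here for brevity; a real design would presumably index the map by device
and block.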

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04  8:26                     ` Avi Kivity
@ 2006-07-05 11:56                       ` Bill Davidsen
  0 siblings, 0 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 11:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Avi Kivity wrote:
> Neil Brown wrote:
>>
>> On Tuesday July 4, avi@argo.co.il wrote:
>> > Neil Brown wrote:
>> > >
>> > > To my mind, the only thing you should put between the filesystem and
>> > > the raw devices is RAID (real-raid - not raid0 or linear).
>> > >
>> > I believe that implementing RAID in the filesystem has many benefits 
>> too:
>> >  - multiple RAID levels: store metadata in triple-mirror RAID 1, random
>> > write intensive data in RAID 1, bulk data in RAID 5/6
>> >  - improved write throughput - since stripes can be variable size, any
>> > large enough write fills a whole stripe
>>
>> Maybe....
>>
>> Now imagine what would be required to rebuild a whole drive onto a
>> spare after a drive failure.
>>
>> I'm sure it is possible, and I believe ZFS does something like that.
>> I find it hard to imagine getting reasonable speed if there is much
>> complexity.  And the longer it takes, the longer your data is exposed
>> to multiple-failures.
>>
> 
> A company called Isilon does this on a cluster.  They claim (IIRC) a one 
> hour rebuild time for a failure.  AFAIK they rebuild into cluster free 
> space, so they are not bound by the spare's bandwidth; they can utilize 
> all cluster resources for a rebuild.
> 
> (You don't need spare disks, just spare free space; so you don't have 
> idle disk heads)
> 
Readers of the RAID list will recognize this description; it matches my 
comments on RAID5E (distributed hot spare) very well. And I suppose 
there could be a RAID6E as well, although I haven't really thought about it.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04  6:09                 ` Avi Kivity
  2006-07-04  7:02                   ` Neil Brown
@ 2006-07-05 12:06                   ` Bill Davidsen
  2006-07-05 12:19                     ` Avi Kivity
  1 sibling, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 12:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Avi Kivity wrote:
> Neil Brown wrote:
>>
>> To my mind, the only thing you should put between the filesystem and
>> the raw devices is RAID (real-raid - not raid0 or linear).
>>
> I believe that implementing RAID in the filesystem has many benefits too:
> - multiple RAID levels: store metadata in triple-mirror RAID 1, random 
> write intensive data in RAID 1, bulk data in RAID 5/6
> - improved write throughput - since stripes can be variable size, any 
> large enough write fills a whole stripe
> 
I rather like the idea of allowing metadata to be on another device in 
general, or at least the inodes. That way a very small chunk size can be 
used for the inodes, to spread head motion, while a larger chunk size is 
appropriate for data in some cases.

Larger max block sizes would be useful as well. Feel free to discuss the 
actual value of "larger."

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-05 12:06                   ` Bill Davidsen
@ 2006-07-05 12:19                     ` Avi Kivity
  2006-07-08 17:54                       ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Avi Kivity @ 2006-07-05 12:19 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Bill Davidsen wrote:
>
> > I believe that implementing RAID in the filesystem has many benefits 
> too:
> > - multiple RAID levels: store metadata in triple-mirror RAID 1, random
> > write intensive data in RAID 1, bulk data in RAID 5/6
> > - improved write throughput - since stripes can be variable size, any
> > large enough write fills a whole stripe
> >
> I rather like the idea of allowing metadata to be on another device in
> general, or at least the inodes. That way a very small chunk size can be
> used for the inodes, to spread head motion, while a larger chunk size is
> appropriate for data in some cases.
>

If your workload is metadata intensive, your data disks are idle; if 
you're reading data, the inode device is gathering dust. You can run out 
of inodes before you run out of space and vice-versa. Very suboptimal.

A symmetric configuration allows full use of all resources for any 
workload, at the cost of increased complexity - every extent has its own 
RAID level and RAID component devices.

> Larger max block sizes would be useful as well. Feel free to discuss the
> actual value of "larger."
>

Filesystems should use extents, not blocks, avoiding the block size 
tradeoff entirely.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-05 12:19                     ` Avi Kivity
@ 2006-07-08 17:54                       ` Bill Davidsen
  0 siblings, 0 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-08 17:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Avi Kivity wrote:
> Bill Davidsen wrote:
>>
>> > I believe that implementing RAID in the filesystem has many benefits 
>> too:
>> > - multiple RAID levels: store metadata in triple-mirror RAID 1, random
>> > write intensive data in RAID 1, bulk data in RAID 5/6
>> > - improved write throughput - since stripes can be variable size, any
>> > large enough write fills a whole stripe
>> >
>> I rather like the idea of allowing metadata to be on another device in
>> general, or at least the inodes. That way a very small chunk size can be
>> used for the inodes, to spread head motion, while a larger chunk size is
>> appropriate for data in some cases.
>>
> 
> If your workload is metadata intensive, your data disks are idle; if 
> you're reading data, the inode device is gathering dust. You can run out 
> of inodes before you run out of space and vice-versa. Very suboptimal.

Using the correct resource for the job is very much optimal: no RAID will 
make big, slow, cheap drives fast for inodes, and no fast drive is practical 
in cost or heat for moderately large data.
> 
> A symmetric configuration allows full use of all resources for any 
> workload, at the cost of increased complexity - every extent has its own 
> RAID level and RAID component devices.

Why would you want to use all your resources when only part of them is 
at all suited to the job?

Do consider the price and performance of 15k RPM Ultra320 drives (32GB) 
vs. 750GB SATA before telling me that it doesn't work better to have 
metadata on fast storage and application data on cheap drives. You can 
use 10TB of 300kB avg files in random directories as a model. Figure 10% 
churn every day, delete and create, not rewrite, 27 creates/sec and 
200-300 open for read/sec.
> 
>> Larger max block sizes would be useful as well. Feel free to discuss the
>> actual value of "larger."
>>
> 
> Filesystems should use extents, not blocks, avoiding the block size 
> tradeoff entirely.
> 

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
  2006-07-04  1:03                 ` Jeff Garzik
  2006-07-04  6:09                 ` Avi Kivity
@ 2006-07-04  8:17                 ` Alan Cox
  2006-07-04 11:08                   ` Thomas Glanzmann
  2006-07-04 11:19                 ` Krzysztof Halasa
  3 siblings, 1 reply; 119+ messages in thread
From: Alan Cox @ 2006-07-04  8:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Arjan van de Ven, Tomasz Torcz, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, LKML

Ar Maw, 2006-07-04 am 09:31 +1000, ysgrifennodd Neil Brown:
> It's been some years that I've felt that most 'logical volume
> management' really belongs in the filesystem.
> Why have a dm that chops devices up in to segments and assembles them to
> look like a big device, only to have that big device chopped up and
> presented as files.  Seems like double handling to me.

Because the interface model is wrong ?

Various people have long said the model actually should look rather more
like

fs to block:
	handle = alloc_extent(near_handle*, info)
	write_extent(handle, buffer, offset, length)
	read_extent(handle, buffer, offset, length)
	free_extent(handle)

(probably with resize_extent)

This makes LVM, remapping, checksumming and the like all naturally slip
out of the fs but not into the block layer.
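
To make the shape of that interface concrete, here is a toy in-memory
version in C. It is only a sketch under several assumptions: the handle is
a plain table index, "info" is reduced to a length in bytes, the placement
hint is ignored, and the storage is malloc'd memory rather than a device.
None of this exists in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_EXTENTS 16

typedef int extent_handle;           /* table index; -1 means invalid */

struct extent {
    unsigned char *data;
    size_t         len;
    int            used;
};

static struct extent extent_table[MAX_EXTENTS];

static struct extent *extent_lookup(extent_handle h)
{
    if (h < 0 || h >= MAX_EXTENTS || !extent_table[h].used)
        return NULL;
    return &extent_table[h];
}

/* 'near' is a placement hint; this toy version ignores it. */
extent_handle alloc_extent(const extent_handle *near, size_t len)
{
    (void)near;
    for (int i = 0; i < MAX_EXTENTS; i++) {
        if (!extent_table[i].used) {
            extent_table[i].data = calloc(1, len ? len : 1);
            if (!extent_table[i].data)
                return -1;
            extent_table[i].len = len;
            extent_table[i].used = 1;
            return i;
        }
    }
    return -1;
}

int write_extent(extent_handle h, const void *buf, size_t off, size_t len)
{
    struct extent *e = extent_lookup(h);

    assert(buf != NULL);
    if (!e || off > e->len || len > e->len - off)
        return -1;                   /* bad handle or out of bounds */
    memcpy(e->data + off, buf, len);
    return 0;
}

int read_extent(extent_handle h, void *buf, size_t off, size_t len)
{
    struct extent *e = extent_lookup(h);

    assert(buf != NULL);
    if (!e || off > e->len || len > e->len - off)
        return -1;
    memcpy(buf, e->data + off, len);
    return 0;
}

void free_extent(extent_handle h)
{
    struct extent *e = extent_lookup(h);

    if (e) {
        free(e->data);
        e->data = NULL;
        e->len = 0;
        e->used = 0;
    }
}
```

Because the caller never sees block numbers, whatever sits behind the
handle is free to remap, mirror, or checksum the extent without the
filesystem knowing, which seems to be the point of the proposal.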


[Many very good points snipped]

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04  8:17                 ` Alan Cox
@ 2006-07-04 11:08                   ` Thomas Glanzmann
  0 siblings, 0 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-04 11:08 UTC (permalink / raw)
  To: Alan Cox
  Cc: Neil Brown, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Theodore Ts'o, LKML

Hello Alan,

> This makes LVM, remapping, checksumming and the like all naturally slip
> out of the fs but not into the block layer.

Enhance LVM and you have the functionality for all available filesystems.
I think this is the right way to go for checksums and fault tolerance, but
not for snapshots.

        Thomas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
                                   ` (2 preceding siblings ...)
  2006-07-04  8:17                 ` Alan Cox
@ 2006-07-04 11:19                 ` Krzysztof Halasa
  2006-07-04 12:49                   ` Helge Hafting
  3 siblings, 1 reply; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-04 11:19 UTC (permalink / raw)
  To: Neil Brown
  Cc: Alan Cox, Arjan van de Ven, Tomasz Torcz, Helge Hafting,
	Thomas Glanzmann, Theodore Ts'o, LKML

Neil Brown <neilb@suse.de> writes:

> With checksums - the filesystem is in a better position to:
>  - be selective about what is checksummed - no point checksumming
>    blocks that aren't part of any file.  Some blocks (highlevel
>    metadata) might always be checksummed, while other blocks
>    (regular data) might not if a 'fast' option was chosen.

The same applies to RAID - for example, why "synchronise" unused area?

While fs vs. RAID provides a good layering scheme and is easier,
integrating them into one entity (as with ZFS) would certainly be
more efficient (and probably harder to maintain).
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04 11:19                 ` Krzysztof Halasa
@ 2006-07-04 12:49                   ` Helge Hafting
  2006-07-05 12:01                     ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Helge Hafting @ 2006-07-04 12:49 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz,
	Thomas Glanzmann, Theodore Ts'o, LKML

On Tue, Jul 04, 2006 at 01:19:11PM +0200, Krzysztof Halasa wrote:
> Neil Brown <neilb@suse.de> writes:
> 
> > With checksums - the filesystem is in a better position to:
> >  - be selective about what is checksummed - no point checksumming
> >    blocks that aren't part of any file.  Some blocks (highlevel
> >    metadata) might always be checksummed, while other blocks
> >    (regular data) might not if a 'fast' option was chosen.
> 
> The same applies to RAID - for example, why "synchronise" unused area?
> 
Indeed.  RAID usually avoids checksumming unused areas: it sums on write,
and you don't write "unused" stuff.

Not syncing unused areas would be possible if there were a way for raid resync
to ask the fs which blocks are not in use, i.e. to get the
free block list in disk block order.  Then raid resync could skip those.

Helge Hafting

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features (checksums)
  2006-07-04 12:49                   ` Helge Hafting
@ 2006-07-05 12:01                     ` Bill Davidsen
  2006-07-05 12:10                       ` Avi Kivity
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 12:01 UTC (permalink / raw)
  To: Helge Hafting
  Cc: Neil Brown, Alan Cox, Arjan van de Ven, Tomasz Torcz,
	Thomas Glanzmann, Theodore Ts'o, LKML

Helge Hafting wrote:
> On Tue, Jul 04, 2006 at 01:19:11PM +0200, Krzysztof Halasa wrote:
>> Neil Brown <neilb@suse.de> writes:
>>
>>> With checksums - the filesystem is in a better position to:
>>>  - be selective about what is checksummed - no point checksumming
>>>    blocks that aren't part of any file.  Some blocks (highlevel
>>>    metadata) might always be checksummed, while other blocks
>>>    (regular data) might not if a 'fast' option was chosen.
>> The same applies to RAID - for example, why "synchronise" unused area?
>>
> Indeed.  RAID usually avoid checksumming unused area, it sums on write
> and you don't write "unused" stuff.  
> 
> Not syncing unused area is possible, if there was a way for raid resync
> to ask the fs what blocks are not in use.  I.e. get the
> free block list in disk block order.  Then raid resync could skip those.
> 
Current RAID code supports having a bitmap of dirty stripes, and can 
just sync those during recovery. I'm sure Neil could explain it better, 
but this is available without worrying about fs type. Now. Today.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.



* Re: ext4 features (checksums)
  2006-07-05 12:01                     ` Bill Davidsen
@ 2006-07-05 12:10                       ` Avi Kivity
  2006-07-08 18:02                         ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Avi Kivity @ 2006-07-05 12:10 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Helge Hafting, Neil Brown, Alan Cox, Arjan van de Ven,
	Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Bill Davidsen wrote:
>
>
> > Not syncing unused area is possible, if there was a way for raid resync
> > to ask the fs what blocks are not in use.  I.e. get the
> > free block list in disk block order.  Then raid resync could skip 
> those.
> >
> Current RAID code supports having a bitmap of dirty stripes, and can
> just sync those during recovery. I'm sure Neil could explain it better,
> but this is available without worrying about fs type. Now. Today.
>

This only helps when you reconstruct a disk that was once part of the
RAID.  If you are adding a brand new disk, all stripes are dirty.

Reconstructing a former member happens in two scenarios: an unclean RAID
shutdown, and when you have a remote mirror which can be disconnected by
network problems.

If the RAID is integrated in the filesystem (or into an object storage 
system), you can handle the new disk case too.
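
The distinction drawn in this exchange can be sketched with a toy model. This is only an illustration, not the md driver's logic: the `Mirror` class, its method names and the stripe size are all invented here. It shows a dirty-stripe bitmap limiting resync for a returning member, a brand-new member needing a full copy, and how a list of in-use blocks from the filesystem (an assumed interface) would shorten that copy.

```python
# Toy model of a write-intent bitmap on a two-way mirror.
# All names and sizes are illustrative assumptions, not md internals.

STRIPE_SIZE = 4  # blocks per stripe (arbitrary)


class Mirror:
    def __init__(self, nblocks):
        self.primary = [0] * nblocks
        self.secondary = [0] * nblocks
        self.secondary_online = True
        nstripes = (nblocks + STRIPE_SIZE - 1) // STRIPE_SIZE
        self.dirty = [False] * nstripes  # one flag per stripe

    def fail_secondary(self):
        """Simulate the second member dropping out (e.g. network loss)."""
        self.secondary_online = False

    def write(self, block, value):
        self.primary[block] = value
        if self.secondary_online:
            self.secondary[block] = value
        else:
            # Remember which stripe diverged while the member was away.
            self.dirty[block // STRIPE_SIZE] = True

    def resync(self):
        """Returning member: copy only dirty stripes. Returns stripes copied."""
        copied = 0
        for stripe, is_dirty in enumerate(self.dirty):
            if not is_dirty:
                continue
            start = stripe * STRIPE_SIZE
            end = min(start + STRIPE_SIZE, len(self.primary))
            self.secondary[start:end] = self.primary[start:end]
            self.dirty[stripe] = False
            copied += 1
        self.secondary_online = True
        return copied

    def rebuild_new_disk(self, used_blocks=None):
        """Brand-new member: the bitmap cannot help, everything is stale.

        If the filesystem could report which blocks are in use (a
        hypothetical interface), only those would need copying.
        Returns the number of blocks copied.
        """
        nblocks = len(self.primary)
        self.secondary = [0] * nblocks
        todo = range(nblocks) if used_blocks is None else sorted(used_blocks)
        for block in todo:
            self.secondary[block] = self.primary[block]
        self.dirty = [False] * len(self.dirty)
        self.secondary_online = True
        return len(todo)
```

In this model a member that was briefly away costs one stripe copy per dirty stripe, while a new member costs a copy of every block unless something above the RAID layer supplies the in-use list.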

-- 
error compiling committee.c: too many arguments to function



* Re: ext4 features (checksums)
  2006-07-05 12:10                       ` Avi Kivity
@ 2006-07-08 18:02                         ` Bill Davidsen
  0 siblings, 0 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-08 18:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Helge Hafting, Neil Brown, Alan Cox, Arjan van de Ven,
	Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Avi Kivity wrote:
> Bill Davidsen wrote:
>>
>>
>> > Not syncing unused area is possible, if there was a way for raid resync
>> > to ask the fs what blocks are not in use.  I.e. get the
>> > free block list in disk block order.  Then raid resync could skip 
>> those.
>> >
>> Current RAID code supports having a bitmap of dirty stripes, and can
>> just sync those during recovery. I'm sure Neil could explain it better,
>> but this is available without worrying about fs type. Now. Today.
>>
> 
> This is only when the you reconstruct a disk that was once part of the 
> RAID.  If you are adding a brand new disk, all stripes are dirty.

I will leave Neil to explain this to you, it appears to be a totally 
different case for reconfiguration, but I don't pretend to understand 
the code well enough to clarify it.
> 
> This happens in two scenarios: an unclean RAID shutdown, and when you 
> have a remote mirror which can be disconnected by network problems.
> 
> If the RAID is integrated in the filesystem (or into an object storage 
> system), you can handle the new disk case too.
> 
I'm not sure that building the RAID into the filesystem is ever a good
idea, it certainly seems likely to either prevent certain RAID devices
from being used, or make them perform suboptimally. There are times when
being able to move a filesystem to a new device is REALLY useful, and
byte copy is more practical than file by file copy.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.


* Blatant layering violations (was Re: ext4 features)
  2006-07-03 20:55         ` Tomasz Torcz
  2006-07-03 21:01           ` Arjan van de Ven
@ 2006-07-06  0:36           ` Valerie Henson
  2006-07-06 12:15             ` Xavier Bestel
  2006-07-06 20:02             ` Tom Vier
  1 sibling, 2 replies; 119+ messages in thread
From: Valerie Henson @ 2006-07-06  0:36 UTC (permalink / raw)
  To: LKML; +Cc: Helge Hafting, Thomas Glanzmann, Theodore Ts'o, Andrew Morton

On Mon, Jul 03, 2006 at 10:55:23PM +0200, Tomasz Torcz wrote:
> 
>   ZFS was already called ,,blatant layering violation''. ;)

I kind of like the phrase "blatant layering violation" - catchy, isn't
it?  The main reason people think of ZFS as a blatant layering
violation is because it has the letters "FS" in the name, but it does
a lot more than a file system.  ZFS actually includes three distinct
layers with well-defined interfaces, none of which directly maps to
most people's conception of a "file system."

The really painfully short summary of the layers is:

SPA - Storage Pool Allocator, disks go into the bottom, virtually
addressed, explicitly freed/allocated blocks come out of the top

DMU - Data Management Unit, virtually addressed blocks go in the
bottom, plain objects come out the top (an object is like a file with
no dangly bits like permissions, etc.)

ZPL - ZFS POSIX Layer, plain objects go in the bottom, VFS ops come
out the top
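
As a rough picture of that layering, here is a small sketch of three stacked layers in which each one talks only to the layer directly below. It is not ZFS code and does not reflect the real SPA, DMU or ZPL interfaces; every class and method name below is invented for illustration, and the block size is deliberately tiny.

```python
# Toy three-layer stack; the class and method names are invented for
# illustration and do not correspond to real ZFS interfaces.

BLOCK_SIZE = 8  # bytes per block, deliberately tiny


class StoragePool:
    """'SPA'-like layer: hands out virtually addressed blocks."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def alloc(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = bytes(data)
        return addr

    def free(self, addr):
        del self.blocks[addr]

    def read(self, addr):
        return self.blocks[addr]


class ObjectStore:
    """'DMU'-like layer: plain objects (byte strings) built from blocks."""

    def __init__(self, pool):
        self.pool = pool
        self.objects = {}  # object id -> list of block addresses
        self.next_id = 0

    def create(self, data):
        addrs = [self.pool.alloc(data[i:i + BLOCK_SIZE])
                 for i in range(0, len(data), BLOCK_SIZE)]
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = addrs
        return oid

    def read(self, oid):
        return b"".join(self.pool.read(a) for a in self.objects[oid])

    def destroy(self, oid):
        for addr in self.objects.pop(oid):
            self.pool.free(addr)


class PosixLayer:
    """'ZPL'-like layer: adds names and permissions on top of objects."""

    def __init__(self, store):
        self.store = store
        self.names = {}  # path -> (object id, mode)

    def write_file(self, path, data, mode=0o644):
        if path in self.names:
            self.store.destroy(self.names[path][0])
        self.names[path] = (self.store.create(data), mode)

    def read_file(self, path):
        return self.store.read(self.names[path][0])

    def unlink(self, path):
        oid, _mode = self.names.pop(path)
        self.store.destroy(oid)
```

Only `PosixLayer` knows about paths and modes, and only `StoragePool` knows where blocks live, which is the separation the summary above describes.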

For a really nice, much more detailed ZFS source tour, see:

http://www.opensolaris.org/os/community/zfs/source/

-VAL


* Re: Blatant layering violations (was Re: ext4 features)
  2006-07-06  0:36           ` Blatant layering violations (was Re: ext4 features) Valerie Henson
@ 2006-07-06 12:15             ` Xavier Bestel
  2006-07-06 17:06               ` Valdis.Kletnieks
  2006-07-06 20:02             ` Tom Vier
  1 sibling, 1 reply; 119+ messages in thread
From: Xavier Bestel @ 2006-07-06 12:15 UTC (permalink / raw)
  To: Valerie Henson
  Cc: LKML, Helge Hafting, Thomas Glanzmann, Theodore Ts'o,
	Andrew Morton

On Thu, 2006-07-06 at 02:36, Valerie Henson wrote:
> For a really nice, much more detailed ZFS source tour, see:
> 
> http://www.opensolaris.org/os/community/zfs/source/

Posting a URL to CDDL-licensed source code on LKML seems weird to me.
Are you trying to pull an SCO? :)

	Xav



* Re: Blatant layering violations (was Re: ext4 features)
  2006-07-06 12:15             ` Xavier Bestel
@ 2006-07-06 17:06               ` Valdis.Kletnieks
  0 siblings, 0 replies; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-06 17:06 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Valerie Henson, LKML, Helge Hafting, Thomas Glanzmann,
	Theodore Ts'o, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 391 bytes --]

On Thu, 06 Jul 2006 14:15:26 +0200, Xavier Bestel said:
> On Thu, 2006-07-06 at 02:36, Valerie Henson wrote:
> > For a really nice, much more detailed ZFS source tour, see:
> > 
> > http://www.opensolaris.org/os/community/zfs/source/
> 
> Posting an URL with CDDL-licensed sourcecode to LKML seems weird to me.
> Do you try to pull an SCO ? :)

"ideas and concepts".  We can steal those. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: Blatant layering violations (was Re: ext4 features)
  2006-07-06  0:36           ` Blatant layering violations (was Re: ext4 features) Valerie Henson
  2006-07-06 12:15             ` Xavier Bestel
@ 2006-07-06 20:02             ` Tom Vier
  1 sibling, 0 replies; 119+ messages in thread
From: Tom Vier @ 2006-07-06 20:02 UTC (permalink / raw)
  To: Valerie Henson
  Cc: LKML, Helge Hafting, Thomas Glanzmann, Theodore Ts'o,
	Andrew Morton

On Wed, Jul 05, 2006 at 05:36:39PM -0700, Valerie Henson wrote:
> On Mon, Jul 03, 2006 at 10:55:23PM +0200, Tomasz Torcz wrote:
> > 
> >   ZFS was already called ,,blatant layering violation''. ;)

It buys you some performance. Someone here already mentioned variable stripe
sizes. ZFS doesn't just add a checksum sector after each block (something
I've been planning to write an md module for, for a couple of years). It writes
the checksum at the end of the tree member, inode, dirent, whatever. So
there's no read-modify-write when you write < 1 checksum block size.

One thing I noticed about ZFS that surprised me: it's using indirect blocks,
from what I saw.

-- 
Tom Vier <tmv@comcast.net>
DSA Key ID 0x15741ECE


* Re: ext4 features
  2006-07-03 20:22       ` Helge Hafting
  2006-07-03 20:55         ` Tomasz Torcz
@ 2006-07-03 21:34         ` Bill Davidsen
  2006-07-03 21:50           ` Valdis.Kletnieks
  1 sibling, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-03 21:34 UTC (permalink / raw)
  To: linux-kernel

Helge Hafting wrote:
> On Sat, Jul 01, 2006 at 08:17:02PM +0200, Tomasz Torcz wrote:
>> On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
>>> Hello,
>>>
>>>> Checksums are not very useful for themselves. They are useful when we
>>>> have other copy of data (think raid mirroring) so data can be
>>>> reconstructed from working copy.
>>> it would be possible to identify data corruption.
>>>
>>   Yes, but what good is identification? We could only return I/O error.
>> Ability to fix corruption (like ZFS) is the real killer.
> 
> Isn't that what we have RAID-1/5/6 for?  

I think he is talking about another problem. RAID addresses detectable
failures at the hardware level. I believe that he wants validation after
the data is returned (without error) from the device. In most cases where
what you wrote and what you read don't match, the cause is memory, but
improving the chances of catching the error is still useful, given that
non-server hardware often lacks ECC memory, or people buy cheaper
non-parity memory.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.



* Re: ext4 features
  2006-07-03 21:34         ` ext4 features Bill Davidsen
@ 2006-07-03 21:50           ` Valdis.Kletnieks
  2006-07-03 22:04             ` Bruce Ferrell
                               ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-03 21:50 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said:
> I think he is talking about another problem. RAID addresses detectable
> failures at the hardware level. I believe that he wants validation after
> the data is returned (without error) from the device. While in most
> cases if what you wrote and what you read don't match it's memory,
> improving the chances of catching the error is useful, given that
> non-server often lacks ECC on memory, or people buy cheaper non-parity
> memory.

There's other issues as well.  Why do people run 'tripwire' on boxes that
have RAID on them?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: ext4 features
  2006-07-03 21:50           ` Valdis.Kletnieks
@ 2006-07-03 22:04             ` Bruce Ferrell
  2006-07-04 14:48               ` Valdis.Kletnieks
  2006-07-03 23:00             ` Bill Davidsen
  2006-07-04 12:52             ` Helge Hafting
  2 siblings, 1 reply; 119+ messages in thread
From: Bruce Ferrell @ 2006-07-03 22:04 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Bill Davidsen, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said:
> 
>>I think he is talking about another problem. RAID addresses detectable
>>failures at the hardware level. I believe that he wants validation after
>>the data is returned (without error) from the device. While in most
>>cases if what you wrote and what you read don't match it's memory,
>>improving the chances of catching the error is useful, given that
>>non-server often lacks ECC on memory, or people buy cheaper non-parity
>>memory.
> 
> 
> There's other issues as well.  Why do people run 'tripwire' on boxes that
> have RAID on them?

Because they're looking for malicious changes.


* Re: ext4 features
  2006-07-03 22:04             ` Bruce Ferrell
@ 2006-07-04 14:48               ` Valdis.Kletnieks
  0 siblings, 0 replies; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-04 14:48 UTC (permalink / raw)
  To: Bruce Ferrell; +Cc: Bill Davidsen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

On Mon, 03 Jul 2006 15:04:54 PDT, Bruce Ferrell said:
> Valdis.Kletnieks@vt.edu wrote:

> > There's other issues as well.  Why do people run 'tripwire' on boxes that
> > have RAID on them?
> 
> Because they're looking for malicous changes

Close, but no cigar.

I've had tripwire detect *accidental* changes as well (including borked
patchsets that replaced unrelated files).  The reason they run tripwire
as well as RAID is to detect changes that are visible only with the assistance
of information from the filesystem.  

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: ext4 features
  2006-07-03 21:50           ` Valdis.Kletnieks
  2006-07-03 22:04             ` Bruce Ferrell
@ 2006-07-03 23:00             ` Bill Davidsen
  2006-07-04 15:01               ` Valdis.Kletnieks
  2006-07-04 12:52             ` Helge Hafting
  2 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-03 23:00 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

Valdis.Kletnieks@vt.edu wrote:

>On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said:
>  
>
>>I think he is talking about another problem. RAID addresses detectable
>>failures at the hardware level. I believe that he wants validation after
>>the data is returned (without error) from the device. While in most
>>cases if what you wrote and what you read don't match it's memory,
>>improving the chances of catching the error is useful, given that
>>non-server often lacks ECC on memory, or people buy cheaper non-parity
>>memory.
>>    
>>
>
>There's other issues as well.  Why do people run 'tripwire' on boxes that
>have RAID on them?
>  
>
What has RAID got to do with detecting hacking?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: ext4 features
  2006-07-03 23:00             ` Bill Davidsen
@ 2006-07-04 15:01               ` Valdis.Kletnieks
  2006-07-05  2:40                 ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-04 15:01 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 605 bytes --]

On Mon, 03 Jul 2006 19:00:38 EDT, Bill Davidsen said:
> Valdis.Kletnieks@vt.edu wrote:

> >There's other issues as well.  Why do people run 'tripwire' on boxes that
> >have RAID on them?
> What has RAID got to do with detecting hacking?

Actually, I've had tripwire detect more *accidental* changes due to buggy
software than I have had it detect actual hacking.  Oh, and it's good at
catching unintended config changes - I started using tripwire after I
fat-fingered a script, and the machine backed up to /dev/null instead of
/dev/rmt0.

In fact, I've never actually had tripwire detect actual hacking.

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: ext4 features
  2006-07-04 15:01               ` Valdis.Kletnieks
@ 2006-07-05  2:40                 ` Bill Davidsen
  2006-07-05  2:47                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05  2:40 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

Valdis.Kletnieks@vt.edu wrote:

>On Mon, 03 Jul 2006 19:00:38 EDT, Bill Davidsen said:
>  
>
>>Valdis.Kletnieks@vt.edu wrote:
>>    
>>
>
>  
>
>>>There's other issues as well.  Why do people run 'tripwire' on boxes that
>>>have RAID on them?
>>>      
>>>
>>What has RAID got to do with detecting hacking?
>>    
>>
>
>Actually, I've had tripwire detect more *accidental* changes due to buggy
>software than I have had it detect actual hacking.  Oh, and it's good at
>catching unintended config changes - I started using tripwire after I
>fat-fingered a script, and the machine backed up to /dev/null instead of
>/dev/rmt0.
>  
>
But it ran faster, right? ;-)

>In fact, I've never actually had tripwire detect actual hacking.
>  
>
I was using "hacking" in the general sense; I have a spiffy quote around
somewhere about being in more danger from incompetence than malice. Patches
with side effects, changes which work but reset directory permissions and/or
ownership... I think it was Pogo who said "we have met the enemy and he
is us."

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: ext4 features
  2006-07-05  2:40                 ` Bill Davidsen
@ 2006-07-05  2:47                   ` Valdis.Kletnieks
  0 siblings, 0 replies; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-05  2:47 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 677 bytes --]

On Tue, 04 Jul 2006 22:40:05 EDT, Bill Davidsen said:
> Valdis.Kletnieks@vt.edu wrote:
> >catching unintended config changes - I started using tripwire after I
> >fat-fingered a script, and the machine backed up to /dev/null instead of
> >/dev/rmt0.
> But it ran faster, right? ;-)

Yeah.  The tape ops figured I must have optimized something or gotten it
to do better incrementals - it would ask for the tape, and spit it out in
15 minutes instead of the 40-45 it used to take.  So it went un-noticed till
a full cycle of tapes had gone by...

Guess who was totally mystified when we lost a disk, we restored from tape,
and the system had time warped itself back 2 months? :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: ext4 features
  2006-07-03 21:50           ` Valdis.Kletnieks
  2006-07-03 22:04             ` Bruce Ferrell
  2006-07-03 23:00             ` Bill Davidsen
@ 2006-07-04 12:52             ` Helge Hafting
  2 siblings, 0 replies; 119+ messages in thread
From: Helge Hafting @ 2006-07-04 12:52 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Bill Davidsen, linux-kernel

On Mon, Jul 03, 2006 at 05:50:22PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 03 Jul 2006 17:34:18 EDT, Bill Davidsen said:
> > I think he is talking about another problem. RAID addresses detectable
> > failures at the hardware level. I believe that he wants validation after
> > the data is returned (without error) from the device. While in most
> > cases if what you wrote and what you read don't match it's memory,
> > improving the chances of catching the error is useful, given that
> > non-server often lacks ECC on memory, or people buy cheaper non-parity
> > memory.
> 
> There's other issues as well.  Why do people run 'tripwire' on boxes that
> have RAID on them?

To notice hacking.  RAID protects against hardware failure; it does
_not_ protect against any change that comes through the normal
filesystem channels.  RAID doesn't help in the slightest against
viruses and hackers.  RAID is _not_ a backup: when a hacker
(or a user error) changes an important file, it is changed
in all mirrors of a raid-1 set, and raid-5 checksums are updated
so the change becomes valid.  But tripwire will notice that
a protected file changed.
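
A small sketch of that point: a write that arrives through the filesystem is replicated faithfully to both halves of a mirror, so the array stays consistent, while a digest recorded earlier from a known-good state no longer matches. The dictionaries standing in for mirror halves and the helper names are invented for illustration; this is not how tripwire or md is implemented.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).hexdigest()


# Two mirror halves holding the same "file".
mirror_a = {"/etc/passwd": b"root:x:0:0\n"}
mirror_b = dict(mirror_a)

# Baseline taken from a known-good state and kept away from the array.
baseline = {path: digest(data) for path, data in mirror_a.items()}


def fs_write(path, data):
    """A write through the filesystem lands on every mirror half."""
    mirror_a[path] = data
    mirror_b[path] = data


def raid_consistent():
    """The array's own view: do the halves agree with each other?"""
    return mirror_a == mirror_b


def changed_files():
    """The integrity checker's view: what differs from the baseline?"""
    return sorted(p for p, d in mirror_a.items() if digest(d) != baseline.get(p))


# An unwanted change made through normal filesystem channels.
fs_write("/etc/passwd", b"root:x:0:0\nevil:x:0:0\n")
```

After the write, `raid_consistent()` is still true, yet `changed_files()` reports the modified path, because only the baseline carries information about what the file used to contain.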

Helge Hafting


* Re: ext4 features
  2006-07-01 18:17     ` Tomasz Torcz
  2006-07-03  9:44       ` Gabor Gombas
  2006-07-03 20:22       ` Helge Hafting
@ 2006-07-06 15:12       ` Ric Wheeler
  2006-07-06 17:05         ` Krzysztof Halasa
  2 siblings, 1 reply; 119+ messages in thread
From: Ric Wheeler @ 2006-07-06 15:12 UTC (permalink / raw)
  To: Tomasz Torcz; +Cc: Thomas Glanzmann, Theodore Ts'o, LKML

Tomasz Torcz wrote:

>On Sat, Jul 01, 2006 at 07:47:16PM +0200, Thomas Glanzmann wrote:
>  
>
>>Hello,
>>    
>>
>>>Checksums are not very useful for themselves. They are useful when we
>>>have other copy of data (think raid mirroring) so data can be
>>>reconstructed from working copy.
>>>      
>>>
>>it would be possible to identify data corruption.
>>    
>>
>
>  Yes, but what good is identification? We could only return I/O error.
>Ability to fix corruption (like ZFS) is the real killer.
>  
>

Having a checksum (or even a digital signature on a file) that lets us 
detect corruption is very useful since, in many cases, it allows us to 
flag the file as corrupt before it gets used.

In some cases, this is a big hint that you should restore it from backup 
(tape, other disk, etc).

I think that it is a generally useful thing even when not on a self 
correcting device,

ric



* Re: ext4 features
  2006-07-06 15:12       ` Ric Wheeler
@ 2006-07-06 17:05         ` Krzysztof Halasa
  2006-07-06 17:27           ` Ric Wheeler
  0 siblings, 1 reply; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-06 17:05 UTC (permalink / raw)
  To: ric; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Ric Wheeler <ric@emc.com> writes:

> Having a checksum (or even a digital signature on a file) that lets us
> detect corruption is very useful since, in many cases, it allows us to
> flag the file as corrupt before it gets used.

We can't have that. Sector/block/etc. checksums - yes.

A checksum, signature, hash etc. of the whole file would require
actually reading the whole file. It can be done by tripwire or
backup, and even by fsck, but not by the filesystem in normal
operation.
-- 
Krzysztof Halasa


* Re: ext4 features
  2006-07-06 17:05         ` Krzysztof Halasa
@ 2006-07-06 17:27           ` Ric Wheeler
  2006-07-06 20:52             ` Valdis.Kletnieks
  2006-07-07 17:34             ` Krzysztof Halasa
  0 siblings, 2 replies; 119+ messages in thread
From: Ric Wheeler @ 2006-07-06 17:27 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Krzysztof Halasa wrote:

>Ric Wheeler <ric@emc.com> writes:
>
>  
>
>>Having a checksum (or even a digital signature on a file) that lets us
>>detect corruption is very useful since, in many cases, it allows us to
>>flag the file as corrupt before it gets used.
>>    
>>
>
>We can't have that. Sector/block/etc. checksums - yes.
>  
>
I certainly don't object to sector and block checksums, but they do 
require a specially formatted disk or high end array (which my employer 
would be happy to sell you ;-)).

If you record a per sector or FS block level checksum in user space, you 
have to keep in mind the sheer size of today's commodity disks and the 
amount of space that would consume - it would be much more efficient to 
store one such signature per file. Where you put those 
checksums/signatures and when you look at them/update them/validate them 
can cause lots of headaches.

>A checksum, signature, hash etc. of the whole file would require
>actually reading the whole file. It can be done by tripwire or
>backup, and even by fsck, but not by the filesystem in normal
>operation.
>  
>
There was some talk about this at the file system mini-summit.
Clearly, you would not want to compute (and continually update) the
checksum/signature on an actively written file.

It might be useful to compute at close time (or when you set a special 
attr, etc). We could also special case sequentially written files 
(storing & updating the partial signature as we go, but that could be a 
bit iffy).

The key is to keep the signature/checksum with the file - tripwire and
backup programs could do this (and even store it in their own extended
attribute), but I think that it is more generically useful than that.

If you care enough about the data integrity of a file, having this kind 
of optional validation on any open would be very useful.

ric


* Re: ext4 features
  2006-07-06 17:27           ` Ric Wheeler
@ 2006-07-06 20:52             ` Valdis.Kletnieks
  2006-07-07 17:41               ` Krzysztof Halasa
  2006-07-07 17:34             ` Krzysztof Halasa
  1 sibling, 1 reply; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-06 20:52 UTC (permalink / raw)
  To: ric
  Cc: Krzysztof Halasa, Tomasz Torcz, Thomas Glanzmann,
	Theodore Ts'o, LKML

[-- Attachment #1: Type: text/plain, Size: 852 bytes --]

On Thu, 06 Jul 2006 13:27:35 EDT, Ric Wheeler said:

> The key is to keep the signature/checksum with the file - tripwire and 
> backup programs could do this (and even store it their own extended 
> attribute), but I think that it is more generically useful than that. 

Backup programs want it stored with the file.  Tripwire wants it stored
as far away from the file as possible.  Remember - for Tripwire, we *don't*
want the "current maintained value", we want "the snapshotted value from
a known good state".

If the filesystem stored a "guaranteed trustable current hash", Tripwire
*could* use it to compare against its database rather than having to re-read
the file and recompute it.  Unfortunately, a useful trustable hash is
basically incompatible with any sort of incremental updating (except for
the special case of appending to the file).
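
The append-only special case can be illustrated with a running hash. This is a minimal sketch using Python's `hashlib`; the `AppendHash` class is a made-up name, and nothing here says how a filesystem would store or protect such a value.

```python
import hashlib


class AppendHash:
    """Running whole-file hash for a file that is only ever appended to."""

    def __init__(self):
        self._h = hashlib.sha256()

    def append(self, chunk):
        # Appending only needs the saved hash state plus the new bytes.
        self._h.update(chunk)

    def current(self):
        return self._h.hexdigest()


def full_rehash(data):
    """What an in-place overwrite forces: hash every byte again."""
    return hashlib.sha256(data).hexdigest()
```

Appends cost work proportional to the new data only. An overwrite in the middle of the file leaves no way to patch the running state, so the whole file has to go through `full_rehash` again.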

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: ext4 features
  2006-07-06 20:52             ` Valdis.Kletnieks
@ 2006-07-07 17:41               ` Krzysztof Halasa
  0 siblings, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-07 17:41 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: ric, Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Valdis.Kletnieks@vt.edu writes:

> Backup programs want it stored with the file.

Not necessarily - a backup program may want to store the hashes in some
central place as well. I'm using such a solution and it has only upsides.

> If the filesystem stored a "guaranteed trustable current hash", Tripwire
> *could* use it to compare against its database rather than having to re-read
> the file and recompute it.  Unfortunately, a useful trustable hash is
> basically incompatible with any sort of incremental updating (except for
> the special case of appending to the file).

Block hashes + master hash could allow something like that. Not sure
if we want it in the fs, though.
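
One way to read "block hashes + master hash" is sketched below: keep one hash per fixed-size block and derive the master hash from the list of block hashes, so an overwrite only rehashes the blocks it touches. This is an assumption about what was meant, not a description of any existing filesystem; `BlockHashedFile` and the 4096-byte block size are invented for illustration.

```python
import hashlib

BLOCK = 4096  # bytes per hashed block (arbitrary choice)


class BlockHashedFile:
    """In-memory file with per-block hashes and a derived master hash."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        nblocks = (len(self.data) + BLOCK - 1) // BLOCK
        self.block_hashes = [self._hash_block(i) for i in range(nblocks)]

    def _hash_block(self, index):
        start = index * BLOCK
        return hashlib.sha256(bytes(self.data[start:start + BLOCK])).digest()

    def write(self, offset, payload):
        """Write payload at offset; return how many blocks were rehashed."""
        if not payload:
            return 0
        old_len = len(self.data)
        end = offset + len(payload)
        if end > old_len:
            # Growing the file: zero-fill any gap, as a sparse write would.
            self.data.extend(b"\0" * (end - old_len))
        self.data[offset:end] = payload
        # Every block from the first changed byte onward needs a new hash.
        first = min(offset, old_len) // BLOCK
        last = (end - 1) // BLOCK
        nblocks = (len(self.data) + BLOCK - 1) // BLOCK
        self.block_hashes.extend([b""] * (nblocks - len(self.block_hashes)))
        for index in range(first, last + 1):
            self.block_hashes[index] = self._hash_block(index)
        return last - first + 1

    def master(self):
        """Master hash computed over the block hashes, not the file data."""
        return hashlib.sha256(b"".join(self.block_hashes)).hexdigest()
```

The master hash here is a hash of hashes, so it differs from a plain hash of the file contents; a checker would need to compute it the same way to compare against a stored value.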
-- 
Krzysztof Halasa


* Re: ext4 features
  2006-07-06 17:27           ` Ric Wheeler
  2006-07-06 20:52             ` Valdis.Kletnieks
@ 2006-07-07 17:34             ` Krzysztof Halasa
  1 sibling, 0 replies; 119+ messages in thread
From: Krzysztof Halasa @ 2006-07-07 17:34 UTC (permalink / raw)
  To: ric; +Cc: Tomasz Torcz, Thomas Glanzmann, Theodore Ts'o, LKML

Ric Wheeler <ric@emc.com> writes:

> The key is to keep the signature/checksum with the file - tripwire and
> backup programs could do this (and even store it their own extended
> attribute), but I think that it is more generically useful than that.

I still can't see a sane way to do it. Yes, there might be some way
for very special purposes but there is no solution for general use.

> If you care enough about the data integrity of a file, having this
> kind of optional validation on any open would be very useful.

Given we can only do that for very specific purposes, I think the
userspace is better suited.
-- 
Krzysztof Halasa


* Re: ext4 features
  2006-07-01 16:33 ext4 features Thomas Glanzmann
  2006-07-01 17:07 ` Tomasz Torcz
@ 2006-07-04  1:02 ` Theodore Tso
  2006-07-04 19:16   ` Thomas Glanzmann
                     ` (3 more replies)
  2006-07-04 14:36 ` Andi Kleen
  2 siblings, 4 replies; 119+ messages in thread
From: Theodore Tso @ 2006-07-04  1:02 UTC (permalink / raw)
  To: Thomas Glanzmann, LKML

On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote:
> I would like to know which new features are planed to be incorported by
> ext4. So far I only read about supporting bigger filesystems to fit
> recent hardware developments. So are there any other big goals for ext4?

Some of the ideas which have been tossed about include:

	* nanosecond timestamps, and support for times beyond 2038
	* extents (better performance, faster fsck times)
	* persistent preallocation (valid bit in the extent)
	* larger extended attributes
	* checksums for metadata

... but the list of features is not necessarily fixed; if you have any
great ideas, patches are always appreciated.  :-)

> What I personally would like to see most in ext4 are
> 
>         * checksums for data

One of the more interesting ways of implementing this is that newer
disks will be providing a facility (at the SCSI layer, and presumably
eventually for SATA drives as well) where a checksum and some
"application" (read: filesystem) data can be stored with each sector.
The way this works, as I
understand it, is that the OS provides the sector-level checksum as
part of the write operation, which is then checked by the disk before
it is written (to catch corruption at the bus level) and written on
the disk.  On a read operation, the checksum is read, and the data
verified at the disk, as well as being passed back to the OS, so the
OS can do end-to-end level checksum checking.  More interestingly,
there is space for "application level" (read: filesystem) tagged data,
which we could use to store information about the inode # and logical
block # that a particular data block is associated with.  This would
allow for much better recoverability from the inode table getting
trashed.
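
To make the recovery idea concrete, here is a toy sketch of sectors that carry a checksum plus an (inode #, logical block #) tag, and a scan that rebuilds per-inode block maps from the tags alone. The tag layout, sizes and function names are invented for illustration and are not the format any real drive or standard uses.

```python
import struct
import zlib

SECTOR = 512
# Hypothetical per-sector tag: CRC32 of the data, inode number, logical block.
TAG = struct.Struct("<IQQ")


def make_sector(data, inode, lblock):
    """What the OS would hand to the disk on write: data plus its tag."""
    if len(data) != SECTOR:
        raise ValueError("data must be exactly one sector")
    return data + TAG.pack(zlib.crc32(data), inode, lblock)


def check_sector(raw):
    """Verify the checksum on read; return (data, inode, lblock)."""
    data, tag = raw[:SECTOR], raw[SECTOR:]
    crc, inode, lblock = TAG.unpack(tag)
    if zlib.crc32(data) != crc:
        raise ValueError("checksum mismatch")
    return data, inode, lblock


def rebuild_block_maps(disk):
    """Scan every sector; return {inode: {logical block: physical sector}}.

    This is the slow whole-disk pass one would fall back on after the
    inode table is lost. Sectors failing their checksum are skipped.
    """
    maps = {}
    for phys, raw in enumerate(disk):
        try:
            _data, inode, lblock = check_sector(raw)
        except ValueError:
            continue
        maps.setdefault(inode, {})[lblock] = phys
    return maps
```

The scan touches every sector on the device, which is why it only makes sense as a last resort.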

(Of course, the amount of time it would take to recover such a file
via this method for future terabyte and petabyte filesystems is such
that restoring from backup tapes is almost always going to be faster.
So such a scheme would only be used when some Ph.D. student has ten
years of thesis research on a disk with no backups and then
accidentally runs mkfs on the wrong partition.....  of course, one
could argue that such a stupid student doesn't *deserve* to get a Ph.D.  :-)

>         * and snapshots on filesystem basis

This requires a filesystem that is designed from the get-go to support
snapshots.  So yes, it's likely not going to happen for ext4.
Although, if you have a really clever idea, feel free to post patches
or a detailed technical proposal for how to achieve such a goal.  :-)

						- Ted


* Re: ext4 features
  2006-07-04  1:02 ` Theodore Tso
@ 2006-07-04 19:16   ` Thomas Glanzmann
  2006-07-04 19:30   ` Valdis.Kletnieks
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-04 19:16 UTC (permalink / raw)
  To: Theodore Tso, LKML

Hello,
... wow ... thank you for all the awareness training. I now have a much
better idea of what is happening. And who knows, maybe I am going to submit
some patches if ext4 isn't already released in three months. I didn't
know about the checksum capability of newer drives. I only knew about the
DMA CRC. But it is definitely the right way to go.

        Thomas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04  1:02 ` Theodore Tso
  2006-07-04 19:16   ` Thomas Glanzmann
@ 2006-07-04 19:30   ` Valdis.Kletnieks
  2006-07-05 12:24   ` Bill Davidsen
  2006-07-05 14:04   ` Avi Kivity
  3 siblings, 0 replies; 119+ messages in thread
From: Valdis.Kletnieks @ 2006-07-04 19:30 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Thomas Glanzmann, LKML

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

On Mon, 03 Jul 2006 21:02:40 EDT, Theodore Tso said:

> So such a scheme would only be used when some Ph.D. student has ten
> years of thesis research on a disk with no backups and then
> accidentally runs mkfs on the wrong partition.....  of course, one
> could argue that such a stupid student doesnt *deserve* to get a Ph.D.  :-)

The more common use case is a department hires a grad student to run the
department server rather than somebody who knows what they're doing (but
costs more than a grad student stipend), and said grad student first sets
up a borked backup scheme that looks like it works, but doesn't actually
produce restorable backups, and then runs mkfs on /home, nuking all the
thesis work of all the students....

(And yes, I've seen that more than once in a quarter century of working
at .edu's... ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04  1:02 ` Theodore Tso
  2006-07-04 19:16   ` Thomas Glanzmann
  2006-07-04 19:30   ` Valdis.Kletnieks
@ 2006-07-05 12:24   ` Bill Davidsen
  2006-07-05 12:59     ` J. Bruce Fields
  2006-07-05 14:04   ` Avi Kivity
  3 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 12:24 UTC (permalink / raw)
  To: Theodore Tso, Thomas Glanzmann, LKML

Theodore Tso wrote:
> On Sat, Jul 01, 2006 at 06:33:01PM +0200, Thomas Glanzmann wrote:
>> I would like to know which new features are planed to be incorported by
>> ext4. So far I only read about supporting bigger filesystems to fit
>> recent hardware developments. So are there any other big goals for ext4?
> 
> Some of the ideas which have been tossed about include:
> 
> 	* nanosecond timestamps, and support for time beyond the 2038

The 2nd one is probably more urgent than the first. I can see a general 
benefit from timestamp in ms, beyond that seems to be a specialty 
requirement best provided at the application level rather than the bits 
of a trillion inodes which need no such thing.

One argument against it is that with SMP with *almost* the same time in 
each CPU, cache everywhere in the i/o process, and various flavors of 
network filesystems, the atime/mtime become less and less useful for 
determining with great precision which file is most recently modified or 
accessed.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 12:24   ` Bill Davidsen
@ 2006-07-05 12:59     ` J. Bruce Fields
  2006-07-05 13:17       ` Pádraig Brady
                         ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: J. Bruce Fields @ 2006-07-05 12:59 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Theodore Tso, Thomas Glanzmann, LKML

On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
> Theodore Tso wrote:
> >Some of the ideas which have been tossed about include:
> >
> >	* nanosecond timestamps, and support for time beyond the 2038
> 
> The 2nd one is probably more urgent than the first. I can see a general 
> benefit from timestamp in ms, beyond that seems to be a specialty 
> requirement best provided at the application level rather than the bits 
> of a trillion inodes which need no such thing.

What's urgently needed for NFS (and I suspect for most other
applications demanding higher-resolution timestamps) isn't really
nanosecond precision so much as something that's guaranteed to increase
whenever the file changes.

Of course, just adding space in the inodes for nanoseconds isn't
sufficient.  XFS, for example, has nanosecond timestamps, but it's still
easy to modify a file twice without seeing the ctime or mtime change.
So either we need a timesource guaranteed to tick faster than the kernel
can process IO, or we have to be willing to, say, add 1 to the
nanoseconds field whenever the time doesn't change between operations.

Or we could add an entirely separate attribute that's guaranteed to
increase whenever the ctime is updated, and that doesn't necessarily
have any connection with time--call it a version number or something.
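
The two alternatives above can be sketched in a few lines of Python.
This is only an illustration: the ChangeStamp class, its field names,
and the injectable clock are all made up for the example and do not
correspond to any existing kernel structure.

```python
import time


class ChangeStamp:
    """Hypothetical per-inode change tracking, for illustration only."""

    def __init__(self, clock=time.time_ns):
        self.clock = clock
        self.ctime_ns = clock()
        self.version = 0

    def touch(self):
        """Record one change to the file."""
        now = self.clock()
        # Option 1: if the clock has not moved past the stored value,
        # add 1 ns so that the timestamp still changes.
        self.ctime_ns = now if now > self.ctime_ns else self.ctime_ns + 1
        # Option 2: a counter with no connection to wall-clock time.
        self.version += 1
```

With option 1 the stored ctime can run slightly ahead of the real clock
under a burst of changes; option 2 avoids that at the cost of an extra
field in the inode.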

--b.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 12:59     ` J. Bruce Fields
@ 2006-07-05 13:17       ` Pádraig Brady
  2006-07-05 19:33       ` Trond Myklebust
  2006-07-05 21:12       ` Bill Davidsen
  2 siblings, 0 replies; 119+ messages in thread
From: Pádraig Brady @ 2006-07-05 13:17 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Bill Davidsen, Theodore Tso, Thomas Glanzmann, LKML

J. Bruce Fields wrote:
> On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
> 
>>Theodore Tso wrote:
>>
>>>Some of the ideas which have been tossed about include:
>>>
>>>	* nanosecond timestamps, and support for time beyond the 2038
>>
>>The 2nd one is probably more urgent than the first. I can see a general 
>>benefit from timestamp in ms, beyond that seems to be a specialty 
>>requirement best provided at the application level rather than the bits 
>>of a trillion inodes which need no such thing.
> 
> 
> What's urgently needed for NFS (and I suspect for most other
> applications demanding higher timestamps) isn't really nanosecond
> precision so much as something that's guaranteed to increase whenever
> the file changes.

Yes please!
http://lkml.org/lkml/2001/10/8/18

Pádraig.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 12:59     ` J. Bruce Fields
  2006-07-05 13:17       ` Pádraig Brady
@ 2006-07-05 19:33       ` Trond Myklebust
  2006-07-05 21:22         ` Bill Davidsen
  2006-07-05 21:12       ` Bill Davidsen
  2 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-05 19:33 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Bill Davidsen, Theodore Tso, Thomas Glanzmann, LKML

On Wed, 2006-07-05 at 08:59 -0400, J. Bruce Fields wrote:
> On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
> > Theodore Tso wrote:
> > >Some of the ideas which have been tossed about include:
> > >
> > >	* nanosecond timestamps, and support for time beyond the 2038
> > 
> > The 2nd one is probably more urgent than the first. I can see a general 
> > benefit from timestamp in ms, beyond that seems to be a specialty 
> > requirement best provided at the application level rather than the bits 
> > of a trillion inodes which need no such thing.
> 
> What's urgently needed for NFS (and I suspect for most other
> applications demanding higher timestamps) isn't really nanosecond
> precision so much as something that's guaranteed to increase whenever
> the file changes.

NFS doesn't necessarily require monotonicity either. The only real
requirement that knfsd has is that the timestamp needs to change every
time the file data (mtime+ctime) and/or metadata (ctime only) is
changed.

Applications like 'make' OTOH, probably would be happier if the
timestamps are guaranteed to be monotonic.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 19:33       ` Trond Myklebust
@ 2006-07-05 21:22         ` Bill Davidsen
  2006-07-05 21:42           ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 21:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>On Wed, 2006-07-05 at 08:59 -0400, J. Bruce Fields wrote:
>  
>
>>On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
>>    
>>
>>>Theodore Tso wrote:
>>>      
>>>
>>>>Some of the ideas which have been tossed about include:
>>>>
>>>>	* nanosecond timestamps, and support for time beyond the 2038
>>>>        
>>>>
>>>The 2nd one is probably more urgent than the first. I can see a general 
>>>benefit from timestamp in ms, beyond that seems to be a specialty 
>>>requirement best provided at the application level rather than the bits 
>>>of a trillion inodes which need no such thing.
>>>      
>>>
>>What's urgently needed for NFS (and I suspect for most other
>>applications demanding higher timestamps) isn't really nanosecond
>>precision so much as something that's guaranteed to increase whenever
>>the file changes.
>>    
>>
>
>NFS doesn't necessarily require monotonicity either. The only real
>requirement that knfsd has is that the timestamp needs to change every
>time the file data (mtime+ctime) and/or metadata (ctime only) is
>changed.
>
>Applications like 'make' OTOH, probably would be happier if the
>timestamps are guaranteed to be monotonic.
>

Consider the case where the build machine reads source from one network 
filesystem and writes the binary result to another on another machine. If 
you know that I have the kernel source on a file server, do the compiles 
on a compute server, and store the binaries on three test machines for 
evaluation, you might guess this really can happen. Just increasing the 
timestamp may not solve the problem, unless you have a system call to 
set timestamp over network f/s, like a high resolution touch.

It's a problem when there are multiple times involved.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 21:22         ` Bill Davidsen
@ 2006-07-05 21:42           ` Trond Myklebust
  2006-07-08 21:04             ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-05 21:42 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Wed, 2006-07-05 at 17:22 -0400, Bill Davidsen wrote:
> Consider the case where the build machine reads source from one network 
> filesystem and write the binary result to another on another machine. If 
> you know that I have the kernel source on a file server, do the compiles 
> on a compute server, and store the binaries on three test machines for 
> evaluation, you might guess this really can happen. Just increasing the 
> timestamp may not solve the problem, unless you have a system call to 
> set timestamp over network f/s, like a high resolution touch.

If you are running 'touch' manually on all your files, you can always
arrange to set the timestamp to something more recent. You don't
normally need a high resolution version of utimes() (and SuSv3 won't
provide you with one).
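
A minimal sketch of "set the timestamp to something more recent", using
Python's os.stat() and os.utime().  The helper name make_newer_than and
the one-second default margin are assumptions chosen for the example.

```python
import os


def make_newer_than(target, reference, margin_ns=1_000_000_000):
    """Ensure target's mtime is later than reference's mtime.

    margin_ns defaults to a full second so that the ordering survives on
    filesystems that only store whole-second timestamps.
    Returns True if the timestamp was changed.
    """
    ref_mtime = os.stat(reference).st_mtime_ns
    st = os.stat(target)
    if st.st_mtime_ns > ref_mtime:
        return False  # already newer, leave it alone
    os.utime(target, ns=(st.st_atime_ns, ref_mtime + margin_ns))
    return True
```

Because the new mtime is derived from the reference file rather than
from the local clock, the result does not depend on the two machines
agreeing about the current time.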

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 21:42           ` Trond Myklebust
@ 2006-07-08 21:04             ` Bill Davidsen
  2006-07-10 20:08               ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-08 21:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:
> On Wed, 2006-07-05 at 17:22 -0400, Bill Davidsen wrote:
>> Consider the case where the build machine reads source from one network 
>> filesystem and write the binary result to another on another machine. If 
>> you know that I have the kernel source on a file server, do the compiles 
>> on a compute server, and store the binaries on three test machines for 
>> evaluation, you might guess this really can happen. Just increasing the 
>> timestamp may not solve the problem, unless you have a system call to 
>> set timestamp over network f/s, like a high resolution touch.
> 
> If you are running 'touch' manually on all your files, you can always
> arrange to set the timestamp to something more recent. You don't
> normally need a high resolution version of utimes() (and SuSv3 won't
> provide you with one).

No, I didn't quite mean a manual touch, but a system call to "close and 
set time to high resolution" for files where time uniformity is 
important. Consider that in most cases the inode times are set by the 
host machine clock, so when I close a file the change reflects the 
file-serving host's idea of time. If there were a call to close a file 
and set the times like touch, then that could be used for both local 
and network files.

Clearly if multiple clients are changing the same file that doesn't 
work, and I doubt that any solution is going to avoid at least some 
undesired side effects.

-- 
Bill Davidsen <davidsen@tmr.com>
   Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-08 21:04             ` Bill Davidsen
@ 2006-07-10 20:08               ` Trond Myklebust
  2006-07-10 22:37                 ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-10 20:08 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote:
> No, I didn't quite mean a manual touch, but a system call to "close and 
> set time to high resolution" for files where time uniformity is 
> important. Consider that in most cases the inodes times are set by the 
> host machine clock, which I close the change reflects the fileserving 
> host idea of time. If there were a call to close a file and set the 
> times like touch, then that could be used, for both local and network files.

Close should never update the time since that would be a violation of
POSIX rules. Normally, an NFS client will never need to update the time:
RPC calls like WRITE, READ and SETATTR will automatically do it for us
whenever necessary.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-10 20:08               ` Trond Myklebust
@ 2006-07-10 22:37                 ` Bill Davidsen
  2006-07-11  2:36                   ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-10 22:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote:
>  
>
>>No, I didn't quite mean a manual touch, but a system call to "close and 
>>set time to high resolution" for files where time uniformity is 
>>important. Consider that in most cases the inodes times are set by the 
>>host machine clock, which I close the change reflects the fileserving 
>>host idea of time. If there were a call to close a file and set the 
>>times like touch, then that could be used, for both local and network files.
>>    
>>
>
>Close should never update the time since that would be a violation of
>POSIX rules. Normally, an NFS client will never need to update the time:
>RPC calls like WRITE, READ and SETATTR will automatically do it for us
>whenever necessary.
>  
>

Let me restate this a third time in another way. What I suggest is a 
system call, NOT NAMED CLOSE, which does a close and touch. This was all 
blue sky discussion; new system calls are as valid as nanosecond 
resolution and synchronization between servers. Since this is a new call 
it is not specified by POSIX.

And Linus has already suggested that he would accept something similar, 
when I proposed something like "noatime" mounts, which only updated 
atime and mtime on open and close, to keep metadata relevant but not 
have the overhead of constant updates.

Actually, now that I have more free time I may revisit that idea.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-10 22:37                 ` Bill Davidsen
@ 2006-07-11  2:36                   ` Trond Myklebust
  2006-07-21  3:10                     ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-11  2:36 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
> 
> >On Sat, 2006-07-08 at 17:04 -0400, Bill Davidsen wrote:
> >  
> >
> >>No, I didn't quite mean a manual touch, but a system call to "close and 
> >>set time to high resolution" for files where time uniformity is 
> >>important. Consider that in most cases the inodes times are set by the 
> >>host machine clock, which I close the change reflects the fileserving 
> >>host idea of time. If there were a call to close a file and set the 
> >>times like touch, then that could be used, for both local and network files.
> >>    
> >>
> >
> >Close should never update the time since that would be a violation of
> >POSIX rules. Normally, an NFS client will never need to update the time:
> >RPC calls like WRITE, READ and SETATTR will automatically do it for us
> >whenever necessary.
> >  
> >
> 
> Let me restate this a third time in another way. What I suggest is a 
> system call, NOT NAMED CLOSE, which does a close and touch. This was all 
> blue sky discussion, new system calls are as valid as nanosecond 
> resolution and syncronization between servers. Since this is a new call 
> it is not specified by POSIX.
> 
> And Linus has already suggested that he would accept something similar, 
> when I proposed something like "noatime" mounts, which only updated 
> atime and mtime on open and close, to keep metadata relevant but not 
> have the overhead of constant updates.
> 
> Actually, now that I have more free time I may revisit that idea.

Linus might accept it, but I won't. It is totally unnecessary.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-11  2:36                   ` Trond Myklebust
@ 2006-07-21  3:10                     ` Bill Davidsen
  2006-07-21 12:06                       ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Bill Davidsen @ 2006-07-21  3:10 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote:
>  
>
>Linus might accept it, but I won't. It is totally unnecessary.
>  
>

By "totally unnecessary" you mean "I don't see why it's useful."

The reason for using noatime is to avoid generating disk activity while 
the data is being accessed. It's not usually used to hide the fact that 
the data has been used and is therefore useful to someone. In a perfect 
world, where money is no object, all data is on very fast storage which 
never fails. In my world I would like to identify which data, source or 
documentation, has been referenced over some period of time. This is 
useful for moving some data to slower (yes I mean less expensive) storage.

It's also useful to identify stuff which no one has used in a very long 
time and which is a candidate for not being on line at all.

By keeping lazy track of access time it's possible to still have that 
data, with minimal disk access cost. And to some people that can be 
really useful, such as those of us who have to justify expenditures.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-21  3:10                     ` Bill Davidsen
@ 2006-07-21 12:06                       ` Trond Myklebust
  2006-07-21 14:36                         ` Theodore Tso
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-21 12:06 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-20 at 23:10 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
> 
> >On Mon, 2006-07-10 at 18:37 -0400, Bill Davidsen wrote:
> >  
> >
> >Linus might accept it, but I won't. It is totally unnecessary.
> >  
> >
> 
> By "totally unnecessary" you mean "I don't see why it's useful."
> 
> The reason for using noatime is to avoid generating disk activity while 
> the data is being accessed. It's not usually used to hide the fact that 
> the data has been used and is therefore useful to someone. In a perfect 
> world, where money is no object, all data is on very fast storage which 
> never fails. In my world I would like to identify which data, source or 
> documentation, has been referenced over some period of time. This is 
> useful for moving some data to slower (yes I mean less expensive) storage.
> 
> It's also useful to identify stuff which no one has used in a very long 
> time and which is a candidate for not being on line at all.
> 
> By keeping lazy track of access time it's possible to still have that 
> data, with minimal disk access cost. And to some people that can be 
> really useful, such as those of us who have to justify expenditures.

What you propose violates both POSIX and SuSv3. close() does not update
the atime on a file. I can't see anyone accepting that there is a need
for this.
If you want to force close to update atime automatically on your system,
then you should already be able to hack up libc to do it. There are no
discernable advantages to doing it in the kernel.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-21 12:06                       ` Trond Myklebust
@ 2006-07-21 14:36                         ` Theodore Tso
  2006-07-21 19:02                           ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Theodore Tso @ 2006-07-21 14:36 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML

On Fri, Jul 21, 2006 at 08:06:10AM -0400, Trond Myklebust wrote:
> > By keeping lazy track of access time it's possible to still have that 
> > data, with minimal disk access cost. And to some people that can be 
> > really useful, such as those of us who have to justify expenditures.
> 
> What you propose violates both POSIX and SuSv3. close() does not update
> the atime on a file. I can't see anyone accepting that there is a need
> for this.

Nope, it doesn't violate POSIX/SuSv3.  The specifications only control
what happens if the system is cleanly shut down.  What happens on an
unclean shutdown is explicitly undefined.  Hence, lazy atime update
where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit,
and where "mostly clean" inodes are only flushed out to disk when they
become fully dirty and then written out to disk, or when the
filesystem is unmounted, or when the filesystem feels like it (i.e.,
when we need to clear out in-core inodes in response to memory
pressure), would in fact be completely POSIX/SuSv3 compliant.
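
The two-state scheme described above can be sketched as follows, in
Python for brevity.  The class and method names are hypothetical, and
the disk_writes list merely stands in for real inode writeback; this is
an illustration of the policy, not ext3/ext4 code.

```python
class Inode:
    """Hypothetical in-core inode with two separate dirty states."""

    def __init__(self, number):
        self.number = number
        self.atime = 0
        self.dirty = False        # fully dirty: must be written back
        self.atime_dirty = False  # "mostly clean": only atime is stale on disk


class LazyAtimeCache:
    def __init__(self):
        self.inodes = {}
        self.disk_writes = []  # stands in for real inode writeback

    def get(self, number):
        return self.inodes.setdefault(number, Inode(number))

    def read(self, number, now):
        inode = self.get(number)
        inode.atime = now
        inode.atime_dirty = True  # a plain read queues no disk write

    def modify(self, number):
        self.get(number).dirty = True  # data or metadata changed

    def writeback(self):
        """Periodic writeback: only fully dirty inodes go to disk."""
        for inode in self.inodes.values():
            if inode.dirty:
                self._write(inode)

    def flush_all(self):
        """Unmount or memory pressure: atime-dirty inodes go out as well."""
        for inode in self.inodes.values():
            if inode.dirty or inode.atime_dirty:
                self._write(inode)

    def _write(self, inode):
        # The current atime rides along with any write of the inode.
        self.disk_writes.append((inode.number, inode.atime))
        inode.dirty = inode.atime_dirty = False
```

In this sketch an inode that is only being read generates no writes
until flush_all() runs, which is where the disk activity is saved.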

						- Ted

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-21 14:36                         ` Theodore Tso
@ 2006-07-21 19:02                           ` Trond Myklebust
  2006-07-22 12:25                             ` Theodore Tso
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-21 19:02 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML

On Fri, 2006-07-21 at 10:36 -0400, Theodore Tso wrote:
> On Fri, Jul 21, 2006 at 08:06:10AM -0400, Trond Myklebust wrote:
> > > By keeping lazy track of access time it's possible to still have that 
> > > data, with minimal disk access cost. And to some people that can be 
> > > really useful, such as those of us who have to justify expenditures.
> > 
> > What you propose violates both POSIX and SuSv3. close() does not update
> > the atime on a file. I can't see anyone accepting that there is a need
> > for this.
> 
> Nope, it doesn't violate POSIX/SuSv3.  The specifications only control
> what happens if the system is cleanly shutdown.  What happens on an
> unclean shutdown is explicitly undefined.  Hence, lazy atime update
> where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit,
> and where "mostly clean" inodes are only flushed out to disk when they
> become fully dirty and then written out to disk, or when the
> filesystem is unmounted, or when the filesystem feels like it (i.e.,
> when we need to clear out in-core inodes in response to memory
> pressure), would in fact be completely POSIX/SuSv3 compliant.

I agree that POSIX does not place any requirements on caching, but what
you propose is impossible to implement on NFS: you may be able to get
the atime 'right' (assuming that you are using something like ntp to
ensure that client and server are in sync) but the NFS SETATTR
primitives do not permit you to set the ctime, so that will be set to
the time at which the server processed your SETATTR call (i.e. the
close time). That violates POSIX semantics.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-21 19:02                           ` Trond Myklebust
@ 2006-07-22 12:25                             ` Theodore Tso
  0 siblings, 0 replies; 119+ messages in thread
From: Theodore Tso @ 2006-07-22 12:25 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Bill Davidsen, J. Bruce Fields, Thomas Glanzmann, LKML

On Fri, Jul 21, 2006 at 03:02:35PM -0400, Trond Myklebust wrote:
> > Nope, it doesn't violate POSIX/SuSv3.  The specifications only control
> > what happens if the system is cleanly shutdown.  What happens on an
> > unclean shutdown is explicitly undefined.  Hence, lazy atime update
> > where there is a "dirty" and "mostly clean" (i.e., atime-dirty) bit,
> > and where "mostly clean" inodes are only flushed out to disk when they
> > become fully dirty and then written out to disk, or when the
> > filesystem is unmounted, or when the filesystem feels like it (i.e.,
> > when we need to clear out in-core inodes in response to memory
> > pressure), would in fact be completely POSIX/SuSv3 compliant.
> 
> I agree that POSIX does not place any requirements on caching, but what
> you propose is impossible to implement on NFS: you may be able to get
> the atime 'right' (assuming that you are using something like ntp to
> ensure that client and server are in sync) but the NFS SETATTR
> primitives do not permit you to set the ctime, so that will be set to
> the time on the server it processed your SETATTR call (i.e. the close
> time). That violates POSIX semantics.

Sure, this is something that could only be done on local disk
filesystems.  But those are the ones most likely to be running on
battery power on notebook computers.  :-)

						- Ted

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 12:59     ` J. Bruce Fields
  2006-07-05 13:17       ` Pádraig Brady
  2006-07-05 19:33       ` Trond Myklebust
@ 2006-07-05 21:12       ` Bill Davidsen
  2006-07-05 21:27         ` linux-os (Dick Johnson)
  2006-07-05 21:41         ` J. Bruce Fields
  2 siblings, 2 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-05 21:12 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Theodore Tso, Thomas Glanzmann, LKML

J. Bruce Fields wrote:

>On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
>  
>
>>Theodore Tso wrote:
>>    
>>
>>>Some of the ideas which have been tossed about include:
>>>
>>>	* nanosecond timestamps, and support for time beyond the 2038
>>>      
>>>
>>The 2nd one is probably more urgent than the first. I can see a general 
>>benefit from timestamp in ms, beyond that seems to be a specialty 
>>requirement best provided at the application level rather than the bits 
>>of a trillion inodes which need no such thing.
>>    
>>
>
>What's urgently needed for NFS (and I suspect for most other
>applications demanding higher timestamps) isn't really nanosecond
>precision so much as something that's guaranteed to increase whenever
>the file changes.
>
>Of course, just adding space in the inodes for nanoseconds isn't
>sufficient.  XFS, for example, has nanosecond timestamps, but it's still
>easy to modify a file twice without seeing the ctime or mtime change.
>So either we need a timesource guaranteed to tick faster than the kernel
>can process IO, or we have to be willing to, say, add 1 to the
>nanoseconds field whenever the time doesn't change between operations.
>
>Or we could add an entirely separate attribute that's guaranteed to
>increase whenever the ctime is updated, and that doesn't necessarily
>have any connection with time--call it a version number or something.
>  
>
There are versions in both VMS and the ISO filesystem. I have a sneaking 
suspicion that those of us who ever use them are few and far between. 
The other issue is that unless the field is time, programs like make 
can't really use it, at least without becoming Linux specific.

I'm not sure exactly how a "version" value would be used other than 
detecting the fact that the file had been changed in some way. Feel free 
to show me, I seem to come up empty on using this value.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 21:12       ` Bill Davidsen
@ 2006-07-05 21:27         ` linux-os (Dick Johnson)
  2006-07-05 21:41         ` J. Bruce Fields
  1 sibling, 0 replies; 119+ messages in thread
From: linux-os (Dick Johnson) @ 2006-07-05 21:27 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Wed, 5 Jul 2006, Bill Davidsen wrote:

> J. Bruce Fields wrote:
>
>> On Wed, Jul 05, 2006 at 08:24:29AM -0400, Bill Davidsen wrote:
>>
>>
>>> Theodore Tso wrote:
>>>
>>>
>>>> Some of the ideas which have been tossed about include:
>>>>
>>>> 	* nanosecond timestamps, and support for time beyond the 2038
>>>>
>>>>
>>> The 2nd one is probably more urgent than the first. I can see a general
>>> benefit from timestamp in ms, beyond that seems to be a specialty
>>> requirement best provided at the application level rather than the bits
>>> of a trillion inodes which need no such thing.
>>>
>>>
>>
>> What's urgently needed for NFS (and I suspect for most other
>> applications demanding higher timestamps) isn't really nanosecond
>> precision so much as something that's guaranteed to increase whenever
>> the file changes.
>>
>> Of course, just adding space in the inodes for nanoseconds isn't
>> sufficient.  XFS, for example, has nanosecond timestamps, but it's still
>> easy to modify a file twice without seeing the ctime or mtime change.
>> So either we need a timesource guaranteed to tick faster than the kernel
>> can process IO, or we have to be willing to, say, add 1 to the
>> nanoseconds field whenever the time doesn't change between operations.
>>
>> Or we could add an entirely separate attribute that's guaranteed to
>> increase whenever the ctime is updated, and that doesn't necessarily
>> have any connection with time--call it a version number or something.
>>
>>
> There are versions in both VMS and the ISO filesystem. I have a sneaking
> suspicion that those of us who ever use them are few and far between.
> The other issue is that unless the field is time, programs like make
> can't really use it, at least without becoming Linux specific.
>
> I'm not sure exactly how a "version" value would be used other than
> detecting the fact that the file had been changed in some way. Feel free
> to show me, I seem to come up empty on using this value.
>
> --
> bill davidsen <davidsen@tmr.com>
>  CTO TMR Associates, Inc
>  Doing interesting things with small computers since 1979

Currently a version is just a convention for not deleting on create.
Remember VAX/VMS  MYFILE.TXT;1, create another one, you have
MYFILE.TXT;2. They are not related in any way. Any internal
value won't be much use if Unix semantics are retained because
multiple directory entries pointing to the same file are just
links. And identical names, pointing to different files in
the same directory are prevented as well.

Maybe the 'version' is the number of times the file has been
modified since creation. This might be useful.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.88 BogoMips).
New book: http://www.AbominableFirebug.com/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 21:12       ` Bill Davidsen
  2006-07-05 21:27         ` linux-os (Dick Johnson)
@ 2006-07-05 21:41         ` J. Bruce Fields
  2006-07-06  2:32           ` Bill Davidsen
  1 sibling, 1 reply; 119+ messages in thread
From: J. Bruce Fields @ 2006-07-05 21:41 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Theodore Tso, Thomas Glanzmann, LKML

On Wed, Jul 05, 2006 at 05:12:54PM -0400, Bill Davidsen wrote:
> J. Bruce Fields wrote:
> >Or we could add an entirely separate attribute that's guaranteed to
> >increase whenever the ctime is updated, and that doesn't necessarily
> >have any connection with time--call it a version number or something.
> >
> There are versions in both VMS and the ISO filesystem. I have a sneaking 
> suspicion that those of us who ever use them are few and far between. 
> The other issue is that unless the field is time, programs like make 
> can't really use it, at least without becoming Linux specific.

Sure.

> I'm not sure exactly how a "version" value would be used other than 
> detecting the fact that the file had been changed in some way.

I agree.  But "detecting the fact that the file has been changed" is a
really important use!  I think the challenge would be to come up with
applications that really depend on timestamps and that use them for
anything *other* than detecting when a file has changed.

(OK, so make is a special case--it cares not only about whether a file
has changed, but also about whether it has changed more recently than
some other file.  But I'd think a simple version would be useful to any
network filesystem, or more generally to anything that caches a view of
the filesystem either on another machine or in userspace.)

> Feel free to show me, I seem to come up empty on using this value.

Betraying my own interests--the NFSv4 protocol (unlike v2 or v3) uses a
specialized "change" attribute to maintain cache consistency instead of
depending on mtime/ctime.  So nfsd would be one immediate in-kernel
user.  Currently we're using ctime, which causes obvious problems.

But an improved ctime--one that actually increased whenever the file
changed--would also do the job.
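
A minimal sketch of the "add 1 whenever the time doesn't change" idea,
in Python rather than kernel code. The class name ChangeClock, the
per-inode dictionary and the injected now_ns callable are all
hypothetical, chosen only to make the example self-contained; nothing
here is taken from an actual filesystem implementation.

```python
class ChangeClock:
    """Hands out per-file stamps (in ns) that strictly increase on every change."""

    def __init__(self, now_ns):
        self.now_ns = now_ns   # callable returning the current time in nanoseconds
        self.last = {}         # hypothetical inode number -> last stamp handed out

    def stamp(self, ino):
        t = self.now_ns()
        prev = self.last.get(ino)
        # If the time source has not advanced since the previous change,
        # bump the stamp by one so two changes never share a value.
        if prev is not None and t <= prev:
            t = prev + 1
        self.last[ino] = t
        return t
```

The resulting value still looks like a timestamp while the clock keeps
up, and degrades into a plain counter when it does not, which is all a
cache needs in order to notice a change.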

--b.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-05 21:41         ` J. Bruce Fields
@ 2006-07-06  2:32           ` Bill Davidsen
  2006-07-06  2:42             ` Nigel Cunningham
  2006-07-06 12:43             ` Trond Myklebust
  0 siblings, 2 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-06  2:32 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Theodore Tso, Thomas Glanzmann, LKML

J. Bruce Fields wrote:

>On Wed, Jul 05, 2006 at 05:12:54PM -0400, Bill Davidsen wrote:
>  
>
>>J. Bruce Fields wrote:
>>    
>>
>>>Or we could add an entirely separate attribute that's guaranteed to
>>>increase whenever the ctime is updated, and that doesn't necessarily
>>>have any connection with time--call it a version number or something.
>>>
>>>      
>>>
>>There are versions in both VMS and the ISO filesystem. I have a sneaking 
>>suspicion that those of us who ever use them are few and far between. 
>>The other issue is that unless the field is time, programs like make 
>>can't really use it, at least without becoming Linux specific.
>>    
>>
>
>Sure.
>
>  
>
>>I'm not sure exactly how a "version" value would be used other than 
>>detecting the fact that the file had been changed in some way.
>>    
>>
>
>I agree.  But "detecting the fact that the file has been changed" is a
>really important use!  I think the challenge would be to come up with
>applications that really depend on timestamps and that use them for
>anything *other* than detecting when a file has changed.
>  
>
But with timestamps I need to remember only one number, the time of my last 
backup. Skipping over the question of "whose idea of time" inherent in 
network filesystems, I compare all ctimes with the time of the last 
backup and do incremental on the newer ones. If we use versioning I have 
to remember the version for each file! In practice I really question if 
the benefit justifies keeping all that metadata between backups. And if 
I delete a file and create another by the same name, what is its version?
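
The single-number scheme described above can be sketched roughly as
follows. This is an illustration in Python, not an existing backup
tool; the function name changed_since and its parameters are made up
for the example, and last_backup is assumed to be in seconds since the
epoch on the same clock as the filesystem.

```python
import os


def changed_since(root, last_backup):
    """Yield files under root whose ctime is newer than last_backup (epoch seconds)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished between listing and stat
            if st.st_ctime > last_backup:
                yield path
```

Only last_backup has to be kept between runs; no per-file state is
needed, which is the property being argued for.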

I'll say it again, I think ms resolution is readily achieved today, even 
over network files, I think greater resolution or versions are going to 
be more trouble than they are worth.

>(OK, so make is a special case--it cares not only about whether a file
>has changed, but also about whether it has changed more recently than
>some other file.  But I'd think a simple version would be useful to any
>network filesystem, or more generally to anything that caches a view of
>the filesystem either on another machine or in userspace.)
>
>  
>
>>Feel free to show me, I seem to come up empty on using this value.
>>    
>>
>
>Betraying my own interests--the NFSv4 protocol (unlike v2 or v3) uses a
>specialized "change" attribute to maintain cache consistency instead of
>depending on mtime/ctime.  So nfsd would be one immediate in-kernel
>user.  Currently we're using ctime, which causes obvious problems.
>
>But an improved ctime--one that actually increased whenever the file
>changed--would also do the job.
>

No comment, I would have to see a state table to be sure I saw the races 
or that there were none. With a single writer and a simple dirty bit 
there is no issue, it behaves like an elevator, more or less. With 
multiple writers I bet changes are written in the order submitted rather 
than the order done, but multiple writers without locks are a train 
wreck waiting to happen anyway.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-06  2:32           ` Bill Davidsen
@ 2006-07-06  2:42             ` Nigel Cunningham
  2006-07-06 12:43             ` Trond Myklebust
  1 sibling, 0 replies; 119+ messages in thread
From: Nigel Cunningham @ 2006-07-06  2:42 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Hi.

There are so many points in this conversation where I could jump in and make 
the comment I want to provide (below). Sorry if I haven't picked the best 
one.

On Thursday 06 July 2006 12:32, Bill Davidsen wrote:
> No comment, I would have to see a state table to be sure I saw the races
> or that there were none. With a single writer and a simple dirty bit
> there is no issue, it behaves like an elevator, more or less. With
> multiple writers I bet changes are written in the order submitted rather
> than the order done, but multiple writers without locks are a train
> wreck waiting to happen anyway.

One application I can see for this careful checking is checkpointing. IIRC, 
Linus recently said he'd like to see suspending to disk treated as a special 
case of checkpointing, and I can see good sense in that. But the support is 
just not there at the moment. An important part of implementing that would be 
having a filesystem where we could know exactly what the state of the 
filesystem was at the last checkpoint, and roll back to it if necessary.

Of course this would need to be tied to tracking changes in memory and to 
writing the memory state to storage, but they're separate problems.

Ext3 has a history of being the best filesystem to use in developing and 
testing suspend to disk. It would be great if ext4 was the basis for 
implementing serious checkpointing support.

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-06  2:32           ` Bill Davidsen
  2006-07-06  2:42             ` Nigel Cunningham
@ 2006-07-06 12:43             ` Trond Myklebust
  2006-07-07  2:15               ` Bill Davidsen
  1 sibling, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-06 12:43 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Wed, 2006-07-05 at 22:32 -0400, Bill Davidsen wrote:

> But with timestamps I need to remember only one number, the time of my last 
> backup. Skipping over the question of "whose idea of time" inherent in 
> network filesystems, I compare all ctimes with the time of the last 
> backup and do incremental on the newer ones. If we use versioning I have 
> to remember the version for each file! In practice I really question if 
> the benefit justifies keeping all that metadata between backups. And if 
> I delete a file and create another by the same name, what is its version?

You are completely missing the point. Our background is that all NFS
clients are required to use the mtime and ctime timestamps in order to
figure out if their cached data is valid. They need to do this extremely
frequently (in fact, every time you open() the file).

Nobody gives a rats arse about backups: those are infrequent and
can/should use more sophisticated techniques such as checksumming.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-06 12:43             ` Trond Myklebust
@ 2006-07-07  2:15               ` Bill Davidsen
  2006-07-07  2:30                 ` Trond Myklebust
                                   ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-07  2:15 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

Trond Myklebust wrote:

>Nobody gives a rats arse about backups: those are infrequent and
>can/should use more sophisticated techniques such as checksumming.
>
Actually, those of us who do run production servers care vastly about 
backups. And beside being utterly unscalable (checksum 20 TB of files 
four times a day to find what changed???), you would have to remember 
the checksums for all those files.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  2:15               ` Bill Davidsen
@ 2006-07-07  2:30                 ` Trond Myklebust
  2006-07-07  2:42                 ` Ric Wheeler
  2006-07-07 19:52                 ` Theodore Tso
  2 siblings, 0 replies; 119+ messages in thread
From: Trond Myklebust @ 2006-07-07  2:30 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: J. Bruce Fields, Theodore Tso, Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 22:15 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
> 
> >Nobody gives a rats arse about backups: those are infrequent and
> >can/should use more sophisticated techniques such as checksumming.
> >
> Actually, those of us who do run production servers care vastly about 
> backups. And beside being utterly unscalable (checksum 20 TB of files 
> four times a day to find what changed???), you would have to remember 
> the checksums for all those files.

It is trivial to check if your last backup of the file was started
within 1 second or so of the last change made to the file, in which case
your backup program needs to perform a more thorough check. That sort of
thing is possible when you are talking about a daily (or even hourly)
backup.
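
As a rough sketch of that check in Python: the function name classify
and the slack parameter are hypothetical, and the one-second default
simply mirrors the "1 second or so" figure above. Both times are
assumed to be in seconds since the epoch.

```python
def classify(file_ctime, backup_start, slack=1.0):
    """Decide whether a coarse ctime is enough to trust the last backup of a file."""
    if file_ctime > backup_start + slack:
        return "changed"      # clearly modified after the backup started
    if file_ctime < backup_start - slack:
        return "unchanged"    # clearly older than the backup
    # Change and backup fall within the timestamp granularity of each
    # other, so fall back to a more thorough check (e.g. a checksum).
    return "ambiguous"
```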

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  2:15               ` Bill Davidsen
  2006-07-07  2:30                 ` Trond Myklebust
@ 2006-07-07  2:42                 ` Ric Wheeler
  2006-07-07  2:46                   ` Trond Myklebust
  2006-07-07 19:52                 ` Theodore Tso
  2 siblings, 1 reply; 119+ messages in thread
From: Ric Wheeler @ 2006-07-07  2:42 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Trond Myklebust, J. Bruce Fields, Theodore Tso, Thomas Glanzmann,
	LKML



Bill Davidsen wrote:

> Trond Myklebust wrote:
>
>> Nobody gives a rats arse about backups: those are infrequent and
>> can/should use more sophisticated techniques such as checksumming.
>>
> Actually, those of us who do run production servers care vastly about 
> backups. And beside being utterly unscalable (checksum 20 TB of files 
> four times a day to find what changed???), you would have to remember 
> the checksums for all those files.
>
The point of using checksums (or digital signatures on files) is to be 
able to detect when the on disk file has been corrupted - not to look 
for updates.  With normal disks, even writes that are flagged as correct 
will occasionally actually end up corrupt on disk.  The rate at which you 
need to validate the checksums is nowhere near four times a day.

Buying a nice, high-end array can make this much less of a concern, but 
those of us who get stuck using commodity disks should always have a way 
of detecting corruption and a backup (either on tape or on another box).

ric




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  2:42                 ` Ric Wheeler
@ 2006-07-07  2:46                   ` Trond Myklebust
  2006-07-07  3:16                     ` Bill Davidsen
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2006-07-07  2:46 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Bill Davidsen, J. Bruce Fields, Theodore Tso, Thomas Glanzmann,
	LKML

On Thu, 2006-07-06 at 22:42 -0400, Ric Wheeler wrote:

> The point of using checksums (or digital signatures on files) is to be 
> able to detect when the on disk file has been corrupted - not to look 
> for updates.  With normal disks, even writes that are flagged as correct 
> will occasionally actually end up corrupt on disk.  The rate at which you 
> need to validate the checksums is nowhere near four times a day.
> 
> Buying a nice, high-end array can make this much less of a concern, but 
> those of us who get stuck using commodity disks should always have a way 
> of detecting corruption and a backup (either on tape or on another box).

I repeat: you do _not_ need high res ctime/mtime updates in order to
figure out whether or not you need to do a daily backup on your file.
You do need it in order to figure out if the page you just read in from
your NFS server 2 microseconds ago is still valid.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  2:46                   ` Trond Myklebust
@ 2006-07-07  3:16                     ` Bill Davidsen
  2006-07-07  8:09                       ` Bernd Petrovitsch
  2006-07-07 14:56                       ` Trond Myklebust
  0 siblings, 2 replies; 119+ messages in thread
From: Bill Davidsen @ 2006-07-07  3:16 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Ric Wheeler, J. Bruce Fields, Theodore Tso, Thomas Glanzmann,
	LKML

Trond Myklebust wrote:

>On Thu, 2006-07-06 at 22:42 -0400, Ric Wheeler wrote:
>
>  
>
>>The point of using checksums (or digital signatures on files) is to be 
>>able to detect when the on disk file has been corrupted - not to look 
>>for updates.  With normal disks, even writes that are flagged as correct 
>>will occasionally actually end up corrupt on disk.  The rate at which you 
>>need to validate the checksums is nowhere near four times a day.
>>
>>Buying a nice, high-end array can make this much less of a concern, but 
>>those of us who get stuck using commodity disks should always have a way 
>>of detecting corruption and a backup (either on tape or on another box).
>>    
>>
>
>I repeat: you do _not_ need high res ctime/mtime updates in order to
>figure out whether or not you need to do a daily backup on your file.
>You do need it in order to figure out if the page you just read in from
>your NFS server 2 microseconds ago is still valid.
>
In most cases you don't care and would be using locking if you did. The 
old value was valid when you read it, the new value is valid, and if 
data is changing in 2us and the change matters, you can't process the 
data before it changes again (or at least may change).

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  3:16                     ` Bill Davidsen
@ 2006-07-07  8:09                       ` Bernd Petrovitsch
  2006-07-07 14:56                       ` Trond Myklebust
  1 sibling, 0 replies; 119+ messages in thread
From: Bernd Petrovitsch @ 2006-07-07  8:09 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Trond Myklebust, Ric Wheeler, J. Bruce Fields, Theodore Tso,
	Thomas Glanzmann, LKML

On Thu, 2006-07-06 at 23:16 -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
[...]
> >I repeat: you do _not_ need high res ctime/mtime updates in order to
> >figure out whether or not you need to do a daily backup on your file.
> >You do need it in order to figure out if the page you just read in from
> >your NFS server 2 microseconds ago is still valid.
> >
> In most cases you don't care and would be using locking if you did. The 
> old value was valid when you read it, the new value is valid, and if 
> data is changing in 2us and the change matters, you can't process the 
> data before it changes again (or at least may change).

Do you never use `make` on NFS-mounted filesystems (for e.g. kernel
compilation)?

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  3:16                     ` Bill Davidsen
  2006-07-07  8:09                       ` Bernd Petrovitsch
@ 2006-07-07 14:56                       ` Trond Myklebust
  1 sibling, 0 replies; 119+ messages in thread
From: Trond Myklebust @ 2006-07-07 14:56 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Ric Wheeler, J. Bruce Fields, Theodore Tso, Thomas Glanzmann,
	LKML

On Thu, 2006-07-06 at 23:16 -0400, Bill Davidsen wrote:

> In most cases you don't care and would be using locking if you did. The 
> old value was valid when you read it, the new value is valid, and if 
> data is changing in 2us and the change matters, you can't process the 
> data before it changes again (or at least may change).

Wrong! The NFS cache consistency model (close-to-open cache consistency)
requires you to be able to revalidate the cache on open() whether or not
you are using posix locking. In fact, most alternatives to posix locking
(for instance dotlocking, which is frequently used by email
applications) rely heavily on this.

See for instance http://nfs.sourceforge.net/#faq_a8
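
A toy model of revalidate-on-open, in Python. It is not the Linux NFS
client; the Server and Client classes, the next_attr hook and the
open_and_read method are all invented for illustration. The point is
only that the client compares a change attribute on every open, so an
attribute that fails to move when the file changes leaves stale data
in the cache.

```python
class Server:
    """Holds files plus a per-file change attribute (counter or timestamp)."""

    def __init__(self, next_attr=lambda old: old + 1):
        self.files = {}             # name -> (attr, data)
        self.next_attr = next_attr  # how the attribute moves on each write

    def write(self, name, data):
        old = self.files.get(name, (0, b""))[0]
        self.files[name] = (self.next_attr(old), data)

    def getattr(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name][1]


class Client:
    """Caches file data and revalidates it against the server on every open."""

    def __init__(self, server):
        self.server = server
        self.cache = {}             # name -> (attr seen at fetch time, data)

    def open_and_read(self, name):
        current = self.server.getattr(name)
        cached = self.cache.get(name)
        if cached is None or cached[0] != current:
            # Attribute moved (or nothing cached): refetch from the server.
            cached = (current, self.server.read(name))
            self.cache[name] = cached
        return cached[1]
```

With the default counter every write is noticed. Passing something
like next_attr=lambda old: 42, standing in for a timestamp too coarse
to tick between two writes, makes the second open return the old data.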

  Trond


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-07  2:15               ` Bill Davidsen
  2006-07-07  2:30                 ` Trond Myklebust
  2006-07-07  2:42                 ` Ric Wheeler
@ 2006-07-07 19:52                 ` Theodore Tso
  2 siblings, 0 replies; 119+ messages in thread
From: Theodore Tso @ 2006-07-07 19:52 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Trond Myklebust, J. Bruce Fields, Thomas Glanzmann, LKML

On Thu, Jul 06, 2006 at 10:15:42PM -0400, Bill Davidsen wrote:
> Trond Myklebust wrote:
> 
> >Nobody gives a rats arse about backups: those are infrequent and
> >can/should use more sophisticated techniques such as checksumming.
> >
> Actually, those of us who do run production servers care vastly about 
> backups. And beside being utterly unscalable (checksum 20 TB of files 
> four times a day to find what changed???), you would have to remember 
> the checksums for all those files.

Not four times a day, but probably once a month or two it would be a
*very* good idea to do periodic sweeps of files to make sure the hard
drive hasn't corrupted the files out from under you.  If you have 20+
TB of data, the probability of silent data corruption starts going up.
That would be justification for storing the checksum in the inode or
in the EA of the file, with the kernel automatically clearing it if
the file was *deliberately* changed.  The goal is to detect the disk
silently changing the data for you, free of charge....
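
A userspace approximation of such a sweep, sketched in Python. Two
assumptions are worth stating loudly: a plain dictionary stands in for
the inode or EA storage, and an unchanged mtime stands in for "the
kernel did not clear the checksum", i.e. nobody changed the file on
purpose. The names file_digest and sweep are made up for the example.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def sweep(paths, store):
    """Return files whose contents changed although their mtime did not.

    store maps path -> (mtime_ns, digest) and is updated in place for
    files that are new or were deliberately modified since the last sweep.
    """
    suspects = []
    for path in paths:
        mtime = os.stat(path).st_mtime_ns
        digest = file_digest(path)
        recorded = store.get(path)
        if recorded is not None and recorded[0] == mtime and recorded[1] != digest:
            suspects.append(path)           # same mtime, different bytes
        else:
            store[path] = (mtime, digest)   # first sighting or deliberate change
    return suspects
```

Doing this in the kernel would avoid relying on mtime, which is exactly
the kind of coarse signal the rest of this thread complains about.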

						- Ted

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04  1:02 ` Theodore Tso
                     ` (2 preceding siblings ...)
  2006-07-05 12:24   ` Bill Davidsen
@ 2006-07-05 14:04   ` Avi Kivity
  3 siblings, 0 replies; 119+ messages in thread
From: Avi Kivity @ 2006-07-05 14:04 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Thomas Glanzmann, LKML

Theodore Tso wrote:
>
> could argue that such a stupid student doesn't *deserve* to get a 
> Ph.D.  :-)
>
> >         * and snapshots on filesystem basis
>
> This requires a filesystem that is designed from the get-go to support
> snapshots.  So yes, it's likely not going to happen for ext4.
> Although, if you have a really clever idea, feel free to post patches
> or a detailed technical proposal for how to achieve such a goal.  :-)
>

To take a snapshot of a file, copy its inode to a free inode (call it a 
frozen inode, or finode). The inode is at the head of a linked list of 
finodes, each older than its predecessor.

Finodes have the same content as the inode they were cloned from except 
the extent map. A new finode's extent map contains a single extent the 
size of the entire file with a flag that means "look in the nearest 
future finode (or inode)".

When writing to a file, first look at the nearest finode's mapping for 
that range. If it has a normal extent, go ahead and write. If it has a 
future extent for that range, first transfer that extent to the finode 
(replacing the future extent), then write the data to newly allocated 
extents. Of course this process can break up extents. One can choose 
whether to transfer the block pointers or just the data; a tradeoff of 
additional data copying vs. fragmentation avoidance.

When reading from a finode, if you're reading a normal extent, proceed 
normally. If you encounter a future extent, keep searching for the range 
in newer finodes until you encounter a normal extent or the base inode.

To snapshot the entire filesystem, have a snapshot generation count in 
the superblock and in each inode. Incrementing the superblock generation 
count snapshots the filesystem. Whenever you write to a file, if its 
generation number lags the filesystem generation number, you take a file 
snapshot as outlined above.

Directories are handled in the same way as files, although special care 
is necessary for inode reference counts.

Deleting a snapshot means merging the preceding and next finodes' 
extent maps and freeing blocks. We'd need a linked list of all finodes 
belonging to a snapshot generation.
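
The per-file part of the scheme above can be modelled in a few lines of
Python. This is a simplified sketch, not a design for ext4: it tracks
single blocks instead of extents, keeps the file size fixed, and leaves
out snapshot deletion, directories and the filesystem-wide generation
count. The names File, FUTURE and read_snapshot are invented for the
example.

```python
FUTURE = object()  # marker: "look in the nearest newer finode (or the inode)"


class File:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks  # block map of the live inode
        self.finodes = []              # frozen inodes, oldest first

    def snapshot(self):
        """Freeze the file: the new finode starts out as all-FUTURE."""
        self.finodes.append([FUTURE] * len(self.blocks))
        return len(self.finodes) - 1

    def write(self, i, data):
        if self.finodes:
            nearest = self.finodes[-1]
            if nearest[i] is FUTURE:
                # Transfer the old block to the newest finode before
                # the live inode gets a newly allocated one.
                nearest[i] = self.blocks[i]
        self.blocks[i] = data

    def read(self, i):
        return self.blocks[i]

    def read_snapshot(self, snap, i):
        # Walk towards newer finodes until a normal block is found,
        # falling back to the live inode.
        for finode in self.finodes[snap:]:
            if finode[i] is not FUTURE:
                return finode[i]
        return self.blocks[i]
```

Taking a snapshot costs one small map, and data is only moved on the
first write to a block after each snapshot.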

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-01 16:33 ext4 features Thomas Glanzmann
  2006-07-01 17:07 ` Tomasz Torcz
  2006-07-04  1:02 ` Theodore Tso
@ 2006-07-04 14:36 ` Andi Kleen
  2006-07-04 14:43   ` Thomas Glanzmann
  2 siblings, 1 reply; 119+ messages in thread
From: Andi Kleen @ 2006-07-04 14:36 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: linux-kernel

Thomas Glanzmann <sithglan@stud.uni-erlangen.de> writes:
> 
> What I personally would like to see most in ext4 are
> 
>         * checksums for data

Sounds good. When can we expect the initial patch submission? 

-Andi

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: ext4 features
  2006-07-04 14:36 ` Andi Kleen
@ 2006-07-04 14:43   ` Thomas Glanzmann
  0 siblings, 0 replies; 119+ messages in thread
From: Thomas Glanzmann @ 2006-07-04 14:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Hello,

initial question was: Is there anything besides 64 bit migration going
into ext4?

> > checksum support for ext4

> Sounds good. When can we expect the initial patch submission? 

this was actually a question (for which I haven't gotten an answer so far,
even though 34 people replied). I didn't want to start a stupid debate on
principles. However, I am more interested in snapshots anyway. And if I
were to provide patches, I would provide snapshot patches. Which I
don't, because I am god damn busy at the moment.

        Thomas

^ permalink raw reply	[flat|nested] 119+ messages in thread

end of thread, other threads:[~2006-07-22 12:26 UTC | newest]

Thread overview: 119+ messages
2006-07-01 16:33 ext4 features Thomas Glanzmann
2006-07-01 17:07 ` Tomasz Torcz
2006-07-01 17:47   ` Thomas Glanzmann
2006-07-01 18:09     ` Claudio Martins
2006-07-01 18:59       ` Thomas Glanzmann
2006-07-01 18:17     ` Tomasz Torcz
2006-07-03  9:44       ` Gabor Gombas
2006-07-03 20:22       ` Helge Hafting
2006-07-03 20:55         ` Tomasz Torcz
2006-07-03 21:01           ` Arjan van de Ven
2006-07-03 21:46             ` Jeff V. Merkey
2006-07-03 21:25               ` Diego Calleja
2006-07-03 22:17                 ` Alan Cox
2006-07-04 14:45                   ` Jan Engelhardt
2006-07-04 16:35                     ` Jeffrey V. Merkey
2006-07-04 18:52                       ` Jeff Garzik
2006-07-04 19:40                         ` Jeffrey V. Merkey
2006-07-05 13:35                       ` Lew Palm
2006-07-03 23:01                 ` Jeff V. Merkey
2006-07-04  9:14                 ` Benny Amorsen
2006-07-05  4:21                   ` Bill Davidsen
2006-07-05  5:13                     ` H. Peter Anvin
2006-07-05  5:45                       ` Jeffrey V. Merkey
2006-07-07 14:12                         ` Pavel Machek
2006-07-05 10:38                       ` Krzysztof Halasa
2006-07-07 14:10                     ` Pavel Machek
2006-07-07 17:45                       ` Krzysztof Halasa
2006-07-07 21:30                         ` Pavel Machek
2006-07-08 10:52                           ` Krzysztof Halasa
2006-07-08 10:55                             ` Pavel Machek
2006-07-08 11:19                               ` Krzysztof Halasa
2006-07-08 11:23                                 ` Pavel Machek
2006-07-08 18:45                                 ` Avi Kivity
2006-07-08 20:24                                   ` Krzysztof Halasa
2006-07-04  9:22                 ` Petr Tesarik
2006-07-04 11:35                   ` Peter Zijlstra
2006-07-04 11:55                     ` ext4 features (salvage) Petr Tesarik
     [not found]                       ` <80294dc60607040508l1022d164ybe0ba10858e54f0c@mail.gmail.com>
2006-07-04 12:31                         ` Petr Tesarik
2006-07-04 12:42                           ` Helge Hafting
2006-07-04 16:20                       ` Matthew Frost
2006-07-04 15:25                     ` ext4 features Pavel Machek
2006-07-05  4:10                     ` Bill Davidsen
2006-07-03 21:46               ` Valdis.Kletnieks
     [not found]                 ` <Pine.LNX.4.61.0607032354170.31747@yvahk01.tjqt.qr>
2006-07-04 14:37                   ` Kernel recycler [was: ext4 features] Jan Engelhardt
2006-07-04 11:14               ` ext4 features Krzysztof Halasa
2006-07-04 22:35               ` Frank van Maarseveen
2006-07-04 23:47                 ` Claudio Martins
2006-07-03 22:12             ` Alan Cox
2006-07-03 21:59               ` Arjan van de Ven
2006-07-03 23:31               ` ext4 features (checksums) Neil Brown
2006-07-04  1:03                 ` Jeff Garzik
2006-07-04  6:09                 ` Avi Kivity
2006-07-04  7:02                   ` Neil Brown
2006-07-04  8:26                     ` Avi Kivity
2006-07-05 11:56                       ` Bill Davidsen
2006-07-05 12:06                   ` Bill Davidsen
2006-07-05 12:19                     ` Avi Kivity
2006-07-08 17:54                       ` Bill Davidsen
2006-07-04  8:17                 ` Alan Cox
2006-07-04 11:08                   ` Thomas Glanzmann
2006-07-04 11:19                 ` Krzysztof Halasa
2006-07-04 12:49                   ` Helge Hafting
2006-07-05 12:01                     ` Bill Davidsen
2006-07-05 12:10                       ` Avi Kivity
2006-07-08 18:02                         ` Bill Davidsen
2006-07-06  0:36           ` Blatant layering violations (was Re: ext4 features) Valerie Henson
2006-07-06 12:15             ` Xavier Bestel
2006-07-06 17:06               ` Valdis.Kletnieks
2006-07-06 20:02             ` Tom Vier
2006-07-03 21:34         ` ext4 features Bill Davidsen
2006-07-03 21:50           ` Valdis.Kletnieks
2006-07-03 22:04             ` Bruce Ferrell
2006-07-04 14:48               ` Valdis.Kletnieks
2006-07-03 23:00             ` Bill Davidsen
2006-07-04 15:01               ` Valdis.Kletnieks
2006-07-05  2:40                 ` Bill Davidsen
2006-07-05  2:47                   ` Valdis.Kletnieks
2006-07-04 12:52             ` Helge Hafting
2006-07-06 15:12       ` Ric Wheeler
2006-07-06 17:05         ` Krzysztof Halasa
2006-07-06 17:27           ` Ric Wheeler
2006-07-06 20:52             ` Valdis.Kletnieks
2006-07-07 17:41               ` Krzysztof Halasa
2006-07-07 17:34             ` Krzysztof Halasa
2006-07-04  1:02 ` Theodore Tso
2006-07-04 19:16   ` Thomas Glanzmann
2006-07-04 19:30   ` Valdis.Kletnieks
2006-07-05 12:24   ` Bill Davidsen
2006-07-05 12:59     ` J. Bruce Fields
2006-07-05 13:17       ` Pádraig Brady
2006-07-05 19:33       ` Trond Myklebust
2006-07-05 21:22         ` Bill Davidsen
2006-07-05 21:42           ` Trond Myklebust
2006-07-08 21:04             ` Bill Davidsen
2006-07-10 20:08               ` Trond Myklebust
2006-07-10 22:37                 ` Bill Davidsen
2006-07-11  2:36                   ` Trond Myklebust
2006-07-21  3:10                     ` Bill Davidsen
2006-07-21 12:06                       ` Trond Myklebust
2006-07-21 14:36                         ` Theodore Tso
2006-07-21 19:02                           ` Trond Myklebust
2006-07-22 12:25                             ` Theodore Tso
2006-07-05 21:12       ` Bill Davidsen
2006-07-05 21:27         ` linux-os (Dick Johnson)
2006-07-05 21:41         ` J. Bruce Fields
2006-07-06  2:32           ` Bill Davidsen
2006-07-06  2:42             ` Nigel Cunningham
2006-07-06 12:43             ` Trond Myklebust
2006-07-07  2:15               ` Bill Davidsen
2006-07-07  2:30                 ` Trond Myklebust
2006-07-07  2:42                 ` Ric Wheeler
2006-07-07  2:46                   ` Trond Myklebust
2006-07-07  3:16                     ` Bill Davidsen
2006-07-07  8:09                       ` Bernd Petrovitsch
2006-07-07 14:56                       ` Trond Myklebust
2006-07-07 19:52                 ` Theodore Tso
2006-07-05 14:04   ` Avi Kivity
2006-07-04 14:36 ` Andi Kleen
2006-07-04 14:43   ` Thomas Glanzmann
