Starting a grad project that may change kernel VFS. Early research

* Starting a grad project that may change kernel VFS. Early research
@ 2009-08-24 23:54 Jeff Shanab
  2009-08-25  0:59 ` Bryan Donlan
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jeff Shanab @ 2009-08-24 23:54 UTC (permalink / raw)
  To: linux-kernel

Title: "Pay it forward patch set"
Goal: Desire to change the dentry and inode functionality so commands
like du -s appear to have greatly improved performance.
How: TBD? A two-phase update walking up the tree to the root.

   Prior to actually starting my Grad Project in Computer Science, I am
taking one semester to do research for it at the recommendation of my
advisor.  I need, of course, to make sure it doesn't already exist.  It
may be that all the changes end up in a file system and the kernel is
left alone; that is one of the things I want help determining.

1) First question: where to put this functionality?
    I originally thought to put my functionality in the VFS so that all
mounted file systems could share it, but after reading fs.h and
inode.c, it looks like the VFS is purely an abstract interface, and
functionality at that level may not be wanted. Also, I guess certain
file systems may not have the on-disk structures needed to save the
info (e.g. VFAT, NFS, etc.)

2) Second question: the two-part idea.
    I was thinking that a good way to handle this is to start with a
file change in a directory. The directory entry already contains a sum
for itself and all its subdirs, and an adjustment is made to that
immediately; it should be in the cache. Then we queue up the change to
be sent to the parent(s?). These queued events should be handled at low
priority, on a more human time scale like 1 second. If a large number of
changes come to a directory, multiple adjustments hit the queue with the
same key (directory name, inode #?) and the early ones are thrown out.
So the levels above would see at most one low-priority update per second.

    So when you issue a 'du -sh' or use anything that uses stat, like
filelight, it can get the size of all the subdirs without actually
recursing through them, because the totals have been built up over time.
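
A rough sketch of the two-phase idea above, in Python rather than kernel
code, may help pin down what is being proposed. Everything in it is an
assumption made for illustration: `DirNode`, `SizePropagator`,
`file_changed`, and `flush` are hypothetical names, the tree lives only
in memory, and the timer that would call `flush` about once a second is
left out. One detail differs from the wording above: queued size deltas
for the same parent are merged rather than thrown out, because
discarding an early delta would lose bytes (discarding would only be
safe if absolute totals were queued instead).

```python
from collections import defaultdict


class DirNode:
    """In-memory stand-in for a directory that caches its subtree size."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_bytes = 0


class SizePropagator:
    """Phase 1 adjusts the directory at once; phase 2 pushes merged deltas up later."""

    def __init__(self):
        # parent node -> net delta that has not been applied to it yet
        self.pending = defaultdict(int)

    def file_changed(self, directory, delta):
        # Phase 1: the directory holding the file is updated immediately.
        directory.subtree_bytes += delta
        if directory.parent is not None:
            # Coalesce: many changes in one directory become one queued entry.
            self.pending[directory.parent] += delta

    def flush(self):
        """Apply queued deltas one level up; return how many entries were applied."""
        batch, self.pending = self.pending, defaultdict(int)
        for node, delta in batch.items():
            node.subtree_bytes += delta
            if node.parent is not None:
                self.pending[node.parent] += delta
        return len(batch)


# Hypothetical three-level tree: / -> home -> docs
root = DirNode("/")
home = DirNode("home", root)
docs = DirNode("docs", home)

propagator = SizePropagator()
for _ in range(1000):
    propagator.file_changed(docs, 10)   # 1000 small writes in one directory

flushes = 0
while propagator.flush():               # stands in for the once-a-second timer
    flushes += 1

print(docs.subtree_bytes, home.subtree_bytes, root.subtree_bytes, flushes)
```

Each `flush` moves pending deltas up by one level, so a change needs as
many flushes as its depth before the root sees it. The sketch says
nothing about crashes or hard links.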

    I have a second set of changes I am considering that I think would
fit more completely in a file system, but I bring them up here in case
they influence the above.
Title: "User Metadata" aka "pet peeve reduction"
    I would like to maintain a few classifications of metadata, most
optional and configurable.

        1) OriginalFileName: Default on. The original filename is
retained. A warning is issued on an attempt to save the same file again
in the same directory. This is primarily for all those darn
auto-generated YouTube and PDF filenames.

       2) UserClassification: User optional: User-defined classifications
can be applied. Most examples I can think of can usually be handled by
directories or file types, but users are surprising. Maybe
classifications like personal, business, job, or school can span
directory structures.

       3) KeyWords: Auto-generated or user-defined: Allow Google-type
searches of files. It is obviously faster to keep these in a central
location, but we leave that up to the application. We need them attached
to the file so an index can be rebuilt or so they move with the file.

       4) Description: User optional: A user-friendly description can be
applied to any file. File managers in a GUI can display the original
filename and this description on mouse hover.

       5) Extension-Specific Metadata: Configurable. An index page into
metadata specific to a file type. For example, security video may be
broken into many segments and may have motion events, alarms, and
analytic information. An index to the frames containing these for the
file type may be useful.

Hopefully the cost of these would be relatively small, and most users
would only choose a few of them per file, so not all would be in use for
every file, but all would be available.

Sorry for the length of this. If you have read this far, thank you!


* Re: Starting a grad project that may change kernel VFS. Early  research
  2009-08-24 23:54 Starting a grad project that may change kernel VFS. Early research Jeff Shanab
@ 2009-08-25  0:59 ` Bryan Donlan
  2009-08-25  1:26 ` Theodore Tso
  2009-08-25 12:13 ` Pavel Machek
  2 siblings, 0 replies; 8+ messages in thread
From: Bryan Donlan @ 2009-08-25  0:59 UTC (permalink / raw)
  To: Jeff Shanab; +Cc: linux-kernel

On Mon, Aug 24, 2009 at 7:54 PM, Jeff Shanab<jshanab@earthlink.net> wrote:
> Title: "Pay it forward patch set"
> Goal: Desire to change the dentry and inode functionality so commands
> like du -s appear to have greatly improved performance.
> How: TBD? 2 phase ubdate walking up the tree to root.
>
>   Prior to actually starting my Grad Project in Computer science, I am
> taking 1 semester to do research for it at the recommendation of my
> advisory.  I need to of course make sure it doesn't already exist.  It
> may be that all the changes end up in a file system and the kernel will
> be left alone, just one of the things I want help determining.
>
> 1) First question, where to put this functionality?
>    I originally thought to put my functionality in the VFS so that all
> mounted file systems could share it, but after reading fs.h, and
> inode.c, it looks like the VFS is purely an abstract interface and
> functionality at that level may not be wanted? Also I guess certain file
> systems may not have needed on disk structures to save the info (ie
> VFAT,NFS, etc)

VFS has a lot of generic functionality that filesystems can opt into -
but see below about your specific proposals...

> 2) Second Question. The two part idea.
>    I was thinking that a good way to handle this is that it starts with
> a file change in a directory. The directory entry contains a sum already
> for itself and all the subdirs and an adjustment is made immediately to
> that, it should be in the cache. Then we queue up the change to be sent
> to the parent(s?). These queued up events should be a low priority at a
> more human time like 1 second. If a large number of changes come to a
> directory, multiple adjustments hit the queue with the same (directory
> name, inode #?) and early ones are thrown out. So levels above would see
> at most a 1 per second low priority update.

As I understand it, you want to tag each directory with the total size
of its contents. There are a few problems with this:
1) A metadata change is required for a filesystem to use this. It
would be prohibitively expensive to cache all directories in memory to
remember their sizes, and we can't just traverse a directory and all
of its contents to find its disk space usage just because someone
touched it. So the size has to be remembered on disk.
2) Hard links break this scheme rather badly. Consider if /foo/x is
hardlinked to /bar/x. Then something modifies /bar/x. The kernel
cannot find all other hardlinks to /bar/x, so /foo's disk usage
estimate is not updated. Moreover /'s disk space usage would have
twice the actual size used by /{foo,bar}/x.

You can't just call it a rough estimate to get around 2), as the error
can build up without bounds, until you have directories apparently
taking 10x the size of your actual hard disk. That said, for
filesystems without hardlinks this is doable, but most Linux
filesystems support hardlinks. Heck, even NTFS supports hardlinks. So
it's unlikely to be useful in Linux...
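
The hard-link problem in 2) can be reproduced in a few lines of Python.
This is only a sketch under stated assumptions: the `cache` dictionary
is a hypothetical stand-in for the per-directory totals the proposal
would keep, `hardlink_staleness` is an invented helper name, and the
numbers are apparent file sizes, not disk blocks.

```python
import os
import tempfile


def hardlink_staleness(extra=b"y" * 50):
    """Return (cached foo total, real foo total, naive root total) after a write via bar/x."""
    with tempfile.TemporaryDirectory() as top:
        foo = os.path.join(top, "foo")
        bar = os.path.join(top, "bar")
        os.mkdir(foo)
        os.mkdir(bar)
        with open(os.path.join(foo, "x"), "wb") as f:
            f.write(b"x" * 100)
        # Second name for the same inode, in a different directory.
        os.link(os.path.join(foo, "x"), os.path.join(bar, "x"))

        # Hypothetical cached totals, one per directory.
        cache = {foo: 100, bar: 100}

        # The writer reaches the file through bar; only that path is known.
        with open(os.path.join(bar, "x"), "ab") as f:
            f.write(extra)
        cache[bar] += len(extra)

        real_foo = os.stat(os.path.join(foo, "x")).st_size
        naive_root = cache[foo] + cache[bar]
        return cache[foo], real_foo, naive_root


cached_foo, real_foo, naive_root = hardlink_staleness()
print(cached_foo, real_foo, naive_root)
```

The cached total for foo stays at its old value while the file it names
has grown, and summing the two directories counts the same data twice.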

>    I have a second set of changes I am considering and I think would
> fit more completely in a file system, but I bring them up here in case
> it influences the above.
> title: "User Metadata" aka "pet peeve reduction"
>    I would like to maintain a few classifications of metadata, most
> optional and configurable.
[snip details]

This is already supported through user xattrs. It just needs more
application support (good luck getting flash to use them for temp
files though ;)


* Re: Starting a grad project that may change kernel VFS. Early research
  2009-08-24 23:54 Starting a grad project that may change kernel VFS. Early research Jeff Shanab
  2009-08-25  0:59 ` Bryan Donlan
@ 2009-08-25  1:26 ` Theodore Tso
  2009-08-25 12:13 ` Pavel Machek
  2 siblings, 0 replies; 8+ messages in thread
From: Theodore Tso @ 2009-08-25  1:26 UTC (permalink / raw)
  To: Jeff Shanab; +Cc: linux-kernel

On Mon, Aug 24, 2009 at 04:54:52PM -0700, Jeff Shanab wrote:
>     I was thinking that a good way to handle this is that it starts with
> a file change in a directory. The directory entry contains a sum already
> for itself and all the subdirs and an adjustment is made immediately to
> that, it should be in the cache. Then we queue up the change to be sent
> to the parent(s?). These queued up events should be a low priority at a
> more human time like 1 second. If a large number of changes come to a
> directory, multiple adjustments hit the queue with the same (directory
> name, inode #?) and early ones are thrown out. So levels above would see
> at most a 1 per second low priority update.

Is this something that you want to be stored in the file system, or
just cached in memory?  If it is going to be stored on disk, which
seems to be implied by your description, and it is only going to be
updated once a second, what happens if there is a system crash?  Over
time, the values will go out of date.  Fsck could fix this, sure, but
that means you have to do the equivalent of running "du -s" on the root
directory of the filesystem after an unclean shutdown.

You could write the size changes in a journal, but that blows up the
size of information that would need to be stored in a journal.  It
also slows down the very common operation of writing to a file, all for
the sake of speeding up the relatively uncommon "du -s" operation.
It's not at all clear it's a worthwhile tradeoff.

In addition, how will you handle hard links?  An inode can have
multiple hard links in different directories, and there is no way to
find all of the directories which might contain a hard link to a
particular inode, short of doing a brute force search.  Hence if you
have a file living in src/linux/v2.6.29/README, and it is a hard link
to ~/hacker/linux/README, and a program appends data to the file
~/hacker/linux/README, this would also change the result of running du
-s src/linux/v2.6.29; however, there's no way for your extension to
know that.

> title: "User Metadata" aka "pet peeve reduction"
>     I would like to maintain a few classifications of metadata, most
> optional and configurable.

Most Linux filesystems already have extended attributes that can be
used to store your proposed metadata.  Changing user application
programs to store the keywords, etc., is an exercise in
application-level programming; the kernel-side support is already
there.
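
As an illustration of the application-level side, here is a minimal
Python sketch that stores and reads one piece of the proposed metadata
as a user extended attribute. The helper names `tag_file` and
`read_tag` and the attribute name `original_filename` are assumptions
made up for this example. `os.setxattr` and `os.getxattr` are available
only on Linux, and the filesystem must support user xattrs, so both
helpers report failure instead of raising.

```python
import os
import tempfile


def tag_file(path, name, value):
    """Store value under the user.* xattr namespace; return False if unsupported."""
    try:
        os.setxattr(path, f"user.{name}", value.encode("utf-8"))
        return True
    except (OSError, AttributeError):
        # AttributeError covers platforms where os.setxattr does not exist.
        return False


def read_tag(path, name):
    """Return the stored string, or None if the attribute is absent or unsupported."""
    try:
        return os.getxattr(path, f"user.{name}").decode("utf-8")
    except (OSError, AttributeError):
        return None


with tempfile.NamedTemporaryFile() as tmp:
    stored = tag_file(tmp.name, "original_filename", "videoplayback.flv")
    value = read_tag(tmp.name, "original_filename")

print(stored, value)
```

Whether the attribute survives a copy or an upload depends on the tools
involved, which is the application-support gap mentioned above.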

						- Ted


* Re: Starting a grad project that may change kernel VFS. Early research
@ 2009-08-25  2:05 Jeff Shanab
  2009-08-25  3:18 ` Bryan Donlan
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Shanab @ 2009-08-25  2:05 UTC (permalink / raw)
  To: linux-kernel

>
> On Mon, Aug 24, 2009 at 04:54:52PM -0700, Jeff Shanab wrote:
>   
>> >     I was thinking that a good way to handle this is that it starts with
>> > a file change in a directory. The directory entry contains a sum already
>> > for itself and all the subdirs and an adjustment is made immediately to
>> > that, it should be in the cache. Then we queue up the change to be sent
>> > to the parent(s?). These queued up events should be a low priority at a
>> > more human time like 1 second. If a large number of changes come to a
>> > directory, multiple adjustments hit the queue with the same (directory
>> > name, inode #?) and early ones are thrown out. So levels above would see
>> > at most a 1 per second low priority update.
>>     
>
> Is this something that you want to be stored in the file system, or
> just cached in memory?  If it is going to be stored on disk, which
> seems to be implied by your description, and it is only going to be
> updated once a second, what happens if there is a system crash?  Over
> time, the values will go out of date.  Fsck could fix this, sure, but
> that means you have to do the equivant of running "du -s" on the root
> directory of the filesystem after an unclean shutdown.

Could this be done at low priority in the background, long after fsck
and the boot process are done?
There will probably be a cutoff point where running du -s after a
command is better than updating file by file, like when we recursively
move a directory. But I was going to run tests and see how that went. mv
may actually be easier than cp; it is a tree grafting.

> You could write the size changes in a journal, but that blows up the
> size of information that would need to be stored in a journal.  It
> also slows down the very common operaton of writing to a file, all for
> the sake of speeding up the relatively uncommon "du -s" operation.
> It's not at all clear it's worthwhile tradeoff.
>   
Yeah, fsck is an interesting scenario.
Databases have had to deal with this, and maybe there are hints there,
like two-phase commit and a WAL just for the size updates.
Maybe we set a flag in the directory entry when we update it, because we
are writing this update to disk anyway.
Then when the update completes at the parent, the flag is cleared. Now
this makes two writes for each directory, but the process is resumable
during fsck.
I need to look at the caching and how we handle changes already.  Do we
write things immediately all the time? Then why must I "sync" before
unmount? Hmm.
> In addition, how will you handle hard links?  An inode can have
> multiple hard links in different directories, and there is no way to
> find all of the directories which might contain a hard link to a
> particular inode, short of doing a brute force search.  Hence if you
> have a file living in src/linux/v2.6.29/README, and it is a hard link
> to ~/hacker/linux/README, and a program appends data to the file
> ~/hacker/linux/README, this would also change the result of running du
> -s src/linux/v2.6.29; however, there's no way for your extension to
> know that.
>
>   
>> > title: "User Metadata" aka "pet peeve reduction"
>> >     I would like to maintain a few classifications of metadata, most
>> > optional and configurable.
>>     
>
> Most Linux filesystems already have extended attributes that can be
> used to store your proposed metadata.  Changing user application
> programs to store the keywords, etc., is an exercise in
> application-level programming; the kernel-side support is already
> there.
>
> 						- Ted
>
>   
Cool, a project for next summer


* Re: Starting a grad project that may change kernel VFS. Early  research
  2009-08-25  2:05 Jeff Shanab
@ 2009-08-25  3:18 ` Bryan Donlan
  2009-08-25  4:23   ` Jeff Shanab
  0 siblings, 1 reply; 8+ messages in thread
From: Bryan Donlan @ 2009-08-25  3:18 UTC (permalink / raw)
  To: Jeff Shanab; +Cc: linux-kernel, Theodore Tso

On Mon, Aug 24, 2009 at 10:05 PM, Jeff Shanab<jshanab@earthlink.net> wrote:
>>
>> On Mon, Aug 24, 2009 at 04:54:52PM -0700, Jeff Shanab wrote:
>>
>>> >     I was thinking that a good way to handle this is that it starts with
>>> > a file change in a directory. The directory entry contains a sum already
>>> > for itself and all the subdirs and an adjustment is made immediately to
>>> > that, it should be in the cache. Then we queue up the change to be sent
>>> > to the parent(s?). These queued up events should be a low priority at a
>>> > more human time like 1 second. If a large number of changes come to a
>>> > directory, multiple adjustments hit the queue with the same (directory
>>> > name, inode #?) and early ones are thrown out. So levels above would see
>>> > at most a 1 per second low priority update.
>>>
>>
>> Is this something that you want to be stored in the file system, or
>> just cached in memory?  If it is going to be stored on disk, which
>> seems to be implied by your description, and it is only going to be
>> updated once a second, what happens if there is a system crash?  Over
>> time, the values will go out of date.  Fsck could fix this, sure, but
>> that means you have to do the equivant of running "du -s" on the root
>> directory of the filesystem after an unclean shutdown.
>
> Could this could be done low priority in the background long after fsck and the boot process is done?
> There will probably be a cutoff point where du -s after a command is better than the file by file, like when we recursively move a directory But I was gonna run tests and see how that went. Mv may be actually easier than cp, it is a tree grafting.

cp is easier than mv - in that it requires no explicit support from
your layer. 'cp' really just loops doing read() and write() - there
are some experimental copy-on-write ioctls for btrfs, I think, but
nothing standard there yet.

Also, directories aren't 'recursively moved' - if you're moving within
a mount, you just rename() the directory, and it's moved in what is on
most filesystems an O(1) operation. If you're moving between mounts,
the kernel gives you no help whatsoever - it's up to the 'mv' program
to copy the directory, then delete the old one.
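
A small Python sketch of what a 'mv'-like program does, assuming
nothing beyond the standard library: try rename() first, and fall back
to copy-then-delete only when the kernel reports EXDEV (source and
destination on different mounts). `move_tree` is a name invented for
this example.

```python
import errno
import os
import shutil
import tempfile


def move_tree(src, dst):
    """Move a directory; return 'rename' or 'copy+delete' to show which path ran."""
    try:
        os.rename(src, dst)          # same mount: one directory-entry update
        return "rename"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different mount: the kernel refuses, so userspace copies and deletes.
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
        return "copy+delete"


with tempfile.TemporaryDirectory() as top:
    src = os.path.join(top, "old")
    os.mkdir(src)
    with open(os.path.join(src, "f"), "w") as f:
        f.write("data")
    how = move_tree(src, os.path.join(top, "new"))
    moved = os.path.exists(os.path.join(top, "new", "f"))
    gone = not os.path.exists(src)

print(how, moved, gone)
```

For a cached-size layer, the rename case is a single subtract at the old
parent and a single add at the new one, while the cross-mount case looks
like ordinary file creation and deletion. Python's own `shutil.move`
wraps the same fallback.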

>> You could write the size changes in a journal, but that blows up the
>> size of information that would need to be stored in a journal.  It
>> also slows down the very common operaton of writing to a file, all for
>> the sake of speeding up the relatively uncommon "du -s" operation.
>> It's not at all clear it's worthwhile tradeoff.
>>
> Yeah fsck is an interesting scenario.
> Databases have had to deal with this and maybe there are hints like the
> two phase commit and
> the WAL just for the size updates.
> Maybe we set a flag in the directory entry when we update it, cause we
> are writing this update to disk anyway.
> Then when update completes at the parent, the flag is cleared. Now this
> makes two writes for each directory but the process is resumable during fsk

No. Updating the size at the same time as the main inode write is far
cheaper than opening a second transaction just for the size update -
unless computing the new size is an expensive operation as well.

> I need to look at the cashing and how we handle changes already.  Do we
> write things immediately all the time? Then why must I "sync" before
> unmount. hummmm

You don't need to sync before umount. umount automatically syncs the
filesystem it's applied on after it's removed from the namespace, but
before the umount completes. Additionally, dirty buffers and pages are
written back automatically based on memory pressure and timeouts - see
/proc/sys/vm/dirty_* for the knobs for this.
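
To see the current writeback settings, a short Python sketch can read
every dirty_* file under /proc/sys/vm. `read_dirty_knobs` is a helper
name made up for this example, and the set of files found depends on
the running kernel; on a non-Linux system the result is simply empty.

```python
import glob
import os


def read_dirty_knobs(base="/proc/sys/vm"):
    """Return {knob name: value string} for every readable dirty_* file under base."""
    knobs = {}
    for path in sorted(glob.glob(os.path.join(base, "dirty_*"))):
        try:
            with open(path) as f:
                knobs[os.path.basename(path)] = f.read().strip()
        except OSError:
            continue  # a knob may be unreadable in a restricted environment
    return knobs


knobs = read_dirty_knobs()
for name, value in knobs.items():
    print(f"{name} = {value}")
```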

>> In addition, how will you handle hard links?  An inode can have
>> multiple hard links in different directories, and there is no way to
>> find all of the directories which might contain a hard link to a
>> particular inode, short of doing a brute force search.  Hence if you
>> have a file living in src/linux/v2.6.29/README, and it is a hard link
>> to ~/hacker/linux/README, and a program appends data to the file
>> ~/hacker/linux/README, this would also change the result of running du
>> -s src/linux/v2.6.29; however, there's no way for your extension to
>> know that.

^^^ don't skip this part, it's absolutely critical, the biggest
problem with your proposal, and you can't just handwave it away.

One thing you may want to look into is the new fanotify API[1] - it
allows a userspace program to monitor and/or block certain filesystem
events of interest. You may be able to implement a prototype of your
space-usage-caching system in userspace this way without needing to
modify the kernel. Or implement it as a FUSE layered filesystem. In
the latter case you may be able to make a reverse index of sorts for
hardlink handling - but this carries with it quite a bit of overhead.
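
One possible shape for such a reverse index, sketched in userspace
Python. It is built by exactly the brute-force walk described earlier,
done once up front, and it would then have to be kept current on every
link() and unlink(), which is where the overhead comes from.
`build_reverse_index` and `directories_affected_by` are hypothetical
names, and only regular files with more than one link are indexed.

```python
import os
import stat
import tempfile
from collections import defaultdict


def build_reverse_index(top):
    """Map (st_dev, st_ino) -> set of directories holding a name for that inode."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                index[(st.st_dev, st.st_ino)].add(dirpath)
    return index


def directories_affected_by(path, index):
    """Every directory whose cached total must change when path is modified."""
    st = os.lstat(path)
    return index.get((st.st_dev, st.st_ino), {os.path.dirname(path)})


with tempfile.TemporaryDirectory() as top:
    foo = os.path.join(top, "foo")
    bar = os.path.join(top, "bar")
    os.mkdir(foo)
    os.mkdir(bar)
    with open(os.path.join(foo, "x"), "w") as f:
        f.write("data")
    os.link(os.path.join(foo, "x"), os.path.join(bar, "x"))

    index = build_reverse_index(top)
    affected = directories_affected_by(os.path.join(bar, "x"), index)
    expected = {foo, bar}

print(sorted(affected))
```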

PS - it's normal to keep all CCs when replying to messages on lkml
(that is, use reply to all), as some people may not be subscribed, or
may prefer to get extra copies in their inbox. I personally don't mind
either way, but there are some who are very adamant about this point.

[1] - http://lwn.net/Articles/339399/


* Re: Starting a grad project that may change kernel VFS. Early  research
  2009-08-25  3:18 ` Bryan Donlan
@ 2009-08-25  4:23   ` Jeff Shanab
  2009-08-25 14:37     ` Bryan Donlan
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Shanab @ 2009-08-25  4:23 UTC (permalink / raw)
  To: Bryan Donlan; +Cc: linux-kernel, tytso

Bryan Donlan wrote:
> On Mon, Aug 24, 2009 at 10:05 PM, Jeff Shanab<jshanab@earthlink.net> wrote:
>   
>>> On Mon, Aug 24, 2009 at 04:54:52PM -0700, Jeff Shanab wrote:
>>>
>>>       
>>>>>     I was thinking that a good way to handle this is that it starts with
>>>>> a file change in a directory. The directory entry contains a sum already
>>>>> for itself and all the subdirs and an adjustment is made immediately to
>>>>> that, it should be in the cache. Then we queue up the change to be sent
>>>>> to the parent(s?). These queued up events should be a low priority at a
>>>>> more human time like 1 second. If a large number of changes come to a
>>>>> directory, multiple adjustments hit the queue with the same (directory
>>>>> name, inode #?) and early ones are thrown out. So levels above would see
>>>>> at most a 1 per second low priority update.
>>>>>           
>>> Is this something that you want to be stored in the file system, or
>>> just cached in memory?  If it is going to be stored on disk, which
>>> seems to be implied by your description, and it is only going to be
>>> updated once a second, what happens if there is a system crash?  Over
>>> time, the values will go out of date.  Fsck could fix this, sure, but
>>> that means you have to do the equivant of running "du -s" on the root
>>> directory of the filesystem after an unclean shutdown.
>>>       
>> Could this could be done low priority in the background long after fsck and the boot process is done?
>> There will probably be a cutoff point where du -s after a command is better than the file by file, like when we recursively move a directory But I was gonna run tests and see how that went. Mv may be actually easier than cp, it is a tree grafting.
>>     
>
> cp is easier than mv - in that it requires no explicit support from
> your layer. 'cp' really just loops doing read() and write() - there
> are some experimental copy-on-write ioctls for btrfs, I think, but
> nothing standard there yet.
>   
Easier was a bad choice of words. I really meant move is less expensive.
> Also, directories aren't 'recursively moved' - if you're moving within
> a mount, you just rename() the directory, and it's moved in what is on
> most filesystems an O(1) operation.
I should have been clearer; that is what I meant by tree grafting :-)
>  If you're moving between mounts,
> the kernel gives you no help whatsoever - it's up to the 'mv' program
> to copy the directory, then delete the old one.
>   
Now that is interesting. I am sure I would have realized that eventually;
I have certainly seen it in action. I just hadn't thought of it this
time. Thanks.

So does mv essentially become copy when between mounts?
>   
>>> You could write the size changes in a journal, but that blows up the
>>> size of information that would need to be stored in a journal.  It
>>> also slows down the very common operaton of writing to a file, all for
>>> the sake of speeding up the relatively uncommon "du -s" operation.
>>> It's not at all clear it's worthwhile tradeoff.
>>>
>>>       
>> Yeah fsck is an interesting scenario.
>> Databases have had to deal with this and maybe there are hints like the
>> two phase commit and
>> the WAL just for the size updates.
>> Maybe we set a flag in the directory entry when we update it, cause we
>> are writing this update to disk anyway.
>> Then when update completes at the parent, the flag is cleared. Now this
>> makes two writes for each directory but the process is resumable during fsk
>>     
>
> No. Updating the size at the same time as the main inode write is far
> cheaper than opening a second transaction just for the size update -
> unless computing the new size is an expensive operation as well.
>   
But the size of a subdirectory is not stored in the inode in this
scenario; it is stored in the directory entry.
Or is it? There is an inode for the directory file, so maybe just adjust
the inode and return the subdir size if the type is a directory.
Maybe this is controlled by a flag, and the directory can look like this ...
...
-rw-r--r--   1 root root          14347 Jan 24  2009 thickbox.js~
-rw-r--r--   1 root root          18545 Jun 10 18:56 unofficalTranscript.txt
-rw-r--r--   1 root root      322183635 Aug 11 20:20 uw_mm_inflamm_ipodv.m4v
drwxr-xr-x   2 root root     440(56093) Nov 23  2007 varicaddemo
drwxr-xr-x   2 root root     144(10298) Oct 23  2007 varicaddemos

                   TOTAL 322217111 (322282918) 

where the number in parentheses is the subdir total.

The total at the end of the dir command, just like du or anything using
the stat command, is now practical. (Ever used filelight?)



>   
>> I need to look at the cashing and how we handle changes already.  Do we
>> write things immediately all the time? Then why must I "sync" before
>> unmount. hummmm
>>     
>
> You don't need to sync before umount. umount automatically syncs the
> filesystem it's applied on after it's removed from the namespace, but
> before the umount completes. Additionally, dirty buffers and pages are
> written back automatically based on memory pressure and timeouts - see
> /proc/sys/vm/dirty_* for the knobs for this.
>   
I know it now does the sync for you, but the fact a sync must be done
indicates there are buffers not written, correct?
>>> In addition, how will you handle hard links?  An inode can have
>>> multiple hard links in different directories, and there is no way to
>>> find all of the directories which might contain a hard link to a
>>> particular inode, short of doing a brute force search.  Hence if you
>>> have a file living in src/linux/v2.6.29/README, and it is a hard link
>>> to ~/hacker/linux/README, and a program appends data to the file
>>> ~/hacker/linux/README, this would also change the result of running du
>>> -s src/linux/v2.6.29; however, there's no way for your extension to
>>> know that.
>>>       
>
> ^^^ don't skip this part, it's absolutely critical, the biggest
> problem with your proposal, and you can't just handwave it away.
>   
I will sleep on the hard link issue. There must be an answer, as du must
handle this.
I can see the problem if I can't distinguish which is the hard link
and which is not, because they are implemented the same way.

The first thing is to run an experiment in the morning:

    test/foo/bar/file
    test/bar/foo/file
    where file is the same file, close to the disk block size.
    Does 'du -s in foo' + 'du -s in bar' = 'du -s' in test?

> One thing you may want to look into is the new fanotify API[1] - it
> allows a userspace program to monitor and/or block certain filesystem
> events of interest. You may be able to implement a prototype of your
> space-usage-caching system in userspace this way without needing to
> modify the kernel. Or implement it as a FUSE layered filesystem. In
> the latter case you may be able to make a reverse index of sorts for
> hardlink handling - but this carries with it quite a bit of overhead.
>   
FUSE is an option I was keeping open.
Since I can dedicate a mountpoint to a file system, mount and umount
it, and load and unload a kernel module, FUSE seemed like extra work
with little benefit.
That does sound like a lot of overhead.
> PS - it's normal to keep all CCs when replying to messages on lkml
> (that is, use reply to all), as some people may not be subscribed, or
> may prefer to get extra copies in their inbox. I personally don't mind
> either way, but there are some who are very adamant about this point.
>   
OK, the other lists I am on insist that I send only to the list
address.
> [1] - http://lwn.net/Articles/339399/
>
>   



* Re: Starting a grad project that may change kernel VFS. Early research
  2009-08-24 23:54 Starting a grad project that may change kernel VFS. Early research Jeff Shanab
  2009-08-25  0:59 ` Bryan Donlan
  2009-08-25  1:26 ` Theodore Tso
@ 2009-08-25 12:13 ` Pavel Machek
  2 siblings, 0 replies; 8+ messages in thread
From: Pavel Machek @ 2009-08-25 12:13 UTC (permalink / raw)
  To: Jeff Shanab; +Cc: linux-kernel, jack


> 2) Second Question. The two part idea.
>     I was thinking that a good way to handle this is that it starts with
> a file change in a directory. The directory entry contains a sum already
> for itself and all the subdirs and an adjustment is made immediately to
> that, it should be in the cache. Then we queue up the change to be sent
> to the parent(s?). These queued up events should be a low priority at a
> more human time like 1 second. If a large number of changes come to a
> directory, multiple adjustments hit the queue with the same (directory
> name, inode #?) and early ones are thrown out. So levels above would see
> at most a 1 per second low priority update.
> 
>     So when you issue a 'du -sh' or use anything that uses stat like
> filelight, it can get the size of all the subdirs without actually
> recursing through them, they have been built up over time.

I'd suggest you look at jack's recursive mtime idea, and then
implement your features on top of that, in userland.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Starting a grad project that may change kernel VFS. Early  research
  2009-08-25  4:23   ` Jeff Shanab
@ 2009-08-25 14:37     ` Bryan Donlan
  0 siblings, 0 replies; 8+ messages in thread
From: Bryan Donlan @ 2009-08-25 14:37 UTC (permalink / raw)
  To: Jeff Shanab; +Cc: linux-kernel, tytso

On Tue, Aug 25, 2009 at 12:23 AM, Jeff Shanab<jshanab@earthlink.net> wrote:
> So does mv essentially become copy when between mounts?

Yes, essentially.

>>
>>> I need to look at the cashing and how we handle changes already.  Do we
>>> write things immediately all the time? Then why must I "sync" before
>>> unmount. hummmm
>>>
>>
>> You don't need to sync before umount. umount automatically syncs the
>> filesystem it's applied on after it's removed from the namespace, but
>> before the umount completes. Additionally, dirty buffers and pages are
>> written back automatically based on memory pressure and timeouts - see
>> /proc/sys/vm/dirty_* for the knobs for this.
>>
> I know it now does the sync for you, but the fact a sync must be done
> indicates there are buffers not written, correct?

Generally speaking, the umount will actually make some buffers dirty
when, e.g., setting a 'filesystem is clean' flag. There may also be
dirty buffers left over from prior activity.

>>>> In addition, how will you handle hard links?  An inode can have
>>>> multiple hard links in different directories, and there is no way to
>>>> find all of the directories which might contain a hard link to a
>>>> particular inode, short of doing a brute force search.  Hence if you
>>>> have a file living in src/linux/v2.6.29/README, and it is a hard link
>>>> to ~/hacker/linux/README, and a program appends data to the file
>>>> ~/hacker/linux/README, this would also change the result of running du
>>>> -s src/linux/v2.6.29; however, there's no way for your extension to
>>>> know that.
>>>>
>>
>> ^^^ don't skip this part, it's absolutely critical, the biggest
>> problem with your proposal, and you can't just handwave it away.
>>
> I will sleep on the hard link issue. There must be an answer as DU must
> handle this.
> I can see where if I can't distinquish between which is the hard link
> and which is not becasue they are implemented the same.
>
> First think is to run an experiment in the morning
>
>    test/foo/bar/file
>    test/bar/foo/file
>    where file is the same file close to the disk block size.
>    does 'du -s in foo' + 'du -s in bar'  = 'du -s' in test?

No. du -s in test will count 'file' only once, unless -l is passed.
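
The experiment can also be scripted. The Python sketch below is a
simplified stand-in for du, not a reimplementation: `du_bytes` is a
hypothetical helper that sums apparent file sizes only (real du counts
disk blocks and the directories themselves), and its `count_links`
parameter plays the role of du's -l option.

```python
import os
import tempfile


def du_bytes(top, count_links=False):
    """Sum apparent file sizes under top; count each inode once unless count_links."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if not count_links and key in seen:
                continue  # another name for an inode that was already counted
            seen.add(key)
            total += st.st_size
    return total


with tempfile.TemporaryDirectory() as test:
    a = os.path.join(test, "foo", "bar")
    b = os.path.join(test, "bar", "foo")
    os.makedirs(a)
    os.makedirs(b)
    with open(os.path.join(a, "file"), "wb") as f:
        f.write(b"\0" * 4096)
    os.link(os.path.join(a, "file"), os.path.join(b, "file"))

    foo_plus_bar = du_bytes(os.path.join(test, "foo")) + du_bytes(os.path.join(test, "bar"))
    whole = du_bytes(test)
    whole_with_links = du_bytes(test, count_links=True)

print(foo_plus_bar, whole, whole_with_links)
```

The per-directory sums add up to twice the size of the whole tree, so
per-directory totals cannot simply be added together at the parent.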

>
>> One thing you may want to look into is the new fanotify API[1] - it
>> allows a userspace program to monitor and/or block certain filesystem
>> events of interest. You may be able to implement a prototype of your
>> space-usage-caching system in userspace this way without needing to
>> modify the kernel. Or implement it as a FUSE layered filesystem. In
>> the latter case you may be able to make a reverse index of sorts for
>> hardlink handling - but this carries with it quite a bit of overhead.
>>
> FUSE is an option I was keeping open.
> Since I can dedicate a mountpoint to a file system and mount and umount
> it and load and unload a kernel module FUSE, seemed like extra work with
> little benefit.
> That does sound like a lot of overhead.

It is additional overhead, but writing code for userspace is a lot
easier as you do not need to deal with kernel locking and low-memory
deadlock issues, and can use any userspace libraries you want. You
also won't have to worry about crashing the system and having to
reboot if you make a mistake. It's a good way to prove the concept is
sound before proposing it in a more concrete form to filesystem
developers.

