* Re: Starting a grad project that may change kernel VFS. Early research
From: Jeff Shanab @ 2009-08-25 1:46 UTC (permalink / raw)
To: linux-kernel
Thanks for the prompt replies!
> On Mon, Aug 24, 2009 at 7:54 PM, Jeff Shanab<jshanab@earthlink.net> wrote:
>
>> > Title: "Pay it forward patch set"
>> > Goal: Desire to change the dentry and inode functionality so commands
>> > like du -s appear to have greatly improved performance.
>> > How: TBD? A 2-phase update walking up the tree to root.
>> >
>> > Prior to actually starting my grad project in computer science, I am
>> > taking one semester to do research for it at the recommendation of my
>> > advisor. I of course need to make sure it doesn't already exist. It
>> > may be that all the changes end up in a file system and the kernel will
>> > be left alone; that is just one of the things I want help determining.
>> >
>> > 1) First question, where to put this functionality?
>> > I originally thought to put my functionality in the VFS so that all
>> > mounted file systems could share it, but after reading fs.h, and
>> > inode.c, it looks like the VFS is purely an abstract interface and
>> > functionality at that level may not be wanted? Also I guess certain file
>> > systems may not have needed on disk structures to save the info (ie
>> > VFAT,NFS, etc)
>>
>
> VFS has a lot of generic functionality that filesystems can opt into -
> but see below about your specific proposals...
>
>
>> > 2) Second Question. The two part idea.
>> > I was thinking that a good way to handle this is that it starts with
>> > a file change in a directory. The directory entry contains a sum already
>> > for itself and all the subdirs and an adjustment is made immediately to
>> > that, it should be in the cache. Then we queue up the change to be sent
>> > to the parent(s?). These queued up events should be a low priority at a
>> > more human time like 1 second. If a large number of changes come to a
>> > directory, multiple adjustments hit the queue with the same (directory
>> > name, inode #?) and early ones are thrown out. So levels above would see
>> > at most a 1 per second low priority update.
>>
>
> As I understand it, you want to tag each directory with the total size
> of its contents. There are a few problems with this:
> 1) A metadata change is required for a filesystem to use this. It
> would be prohibitively expensive to cache all directories in memory to
> remember their sizes, and we can't just traverse a directory and all
> of its contents to find its disk space usage just because someone
> touched it. So the size has to be remembered on disk.
>
Agreed, and I planned to save it on disk.
You never need to look further than the directory you are already in.
> 2) Hard links break this scheme rather badly. Consider if /foo/x is
> hardlinked to /bar/x. Then something modifies /bar/x. The kernel
> cannot find all other hardlinks to /bar/x, so /foo's disk usage
> estimate is not updated. Moreover /'s disk space usage would have
> twice the actual size used by /{foo,bar}/x.
>
Updates start at the file and only work upwards back to root. How does
the hardlink get traversed?
> You can't just call it a rough estimate to get around 2), as the error
> can build up without bounds, until you have directories apparently
> taking 10x the size of your actual hard disk. That said, for
> filesystems without hardlinks this is doable, but most Linux
> filesystems support hardlinks. Heck, even NTFS supports hardlinks. So
> it's unlikely to be useful in Linux...
>
I need to look closely at how hard links are done; I think they could
count as zero if they can be distinguished.
My thought was that a directory foo has size zero, then a file is added
and the directory size is adjusted by the file size.
Then a subdir is added and no adjustment is made.
Then a file is placed in the subdir: its directory gets updated, and a
message to add X to the subdir's entry is queued and sent to the parent.
The parent directory adjusts the single entry and its subtotal, and
queues its own message to its parent.
When the parent is root the process stops.
I think the system will have just cached all the inodes for the current
directory you are in, and for that matter all the directories you
traversed getting to where you are.
Since we only update up the tree to root, don't the hard links get ignored?
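A minimal user-space sketch of the scheme described above (the names and structure are my own invention for illustration, not kernel code): each directory keeps a running subtotal, a change adjusts its containing directory immediately, and coalesced deltas propagate toward root in a later low-priority pass.

```python
from collections import OrderedDict

class Dir:
    """Stand-in for a directory inode carrying a subtotal field."""
    def __init__(self, name, parent=None):
        self.name, self.parent, self.subtotal = name, parent, 0

# pending[dir] accumulates deltas; repeated updates to the same
# directory coalesce, so ancestors see at most one adjustment per pass.
pending = OrderedDict()

def file_changed(d, delta):
    d.subtotal += delta            # immediate update to the containing dir
    if d.parent is not None:
        pending[d.parent] = pending.get(d.parent, 0) + delta

def flush():
    """The low-priority pass, e.g. run once per second."""
    while pending:
        d, delta = pending.popitem(last=False)
        d.subtotal += delta
        if d.parent is not None:
            pending[d.parent] = pending.get(d.parent, 0) + delta

root = Dir("/")
sub = Dir("sub", root)
file_changed(sub, 4096)
file_changed(sub, 4096)            # coalesces with the first delta
flush()
print(root.subtotal, sub.subtotal)  # 8192 8192
```

Note this toy inherits exactly the hardlink problem discussed in the thread: a second link to the same data would be counted twice on the way up.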
>
>> > I have a second set of changes I am considering and I think would
>> > fit more completely in a file system, but I bring them up here in case
>> > it influences the above.
>> > title: "User Metadata" aka "pet peeve reduction"
>> > I would like to maintain a few classifications of metadata, most
>> > optional and configurable.
>>
> [snip details]
>
> This is already supported through user xattrs. It just needs more
> application support (good luck getting flash to use them for temp
> files though ;)
>
>
Cool, separate project then.
* Re: Starting a grad project that may change kernel VFS. Early research
From: Rik van Riel @ 2009-08-25 17:28 UTC (permalink / raw)
To: Jeff Shanab; +Cc: linux-kernel
Jeff Shanab wrote:
> Updates start at the file and only work upwards back to root. How does
> the hardlink get traversed?
A hard link is just a second directory entry linking to
the same inode. There are no back links from inodes to
the directory entries that point to them and they are
essentially indistinguishable from files that are not
hardlinked.
Hard links have been done like this in pretty much every
Unix filesystem since the 1970's.
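Rik's point is easy to observe from user space on any POSIX filesystem; a small Python sketch (my own illustration, not from the thread):

```python
import os, tempfile

d = tempfile.mkdtemp()
a, b = os.path.join(d, "a"), os.path.join(d, "b")
with open(a, "w") as f:
    f.write("x")
os.link(a, b)                    # second directory entry, same inode

sa, sb = os.stat(a), os.stat(b)
print(sa.st_ino == sb.st_ino)    # True: one inode, two names
print(sa.st_nlink)               # 2: the inode knows how many links exist,
                                 #    but not where they are
```

The inode stores only the link count (`st_nlink`); there is no way to walk from the inode back to the directory entries that reference it.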
--
All rights reversed.
* Re: Starting a grad project that may change kernel VFS. Early research
From: Jeff Shanab @ 2009-08-25 18:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel
Rik van Riel wrote:
> Jeff Shanab wrote:
>
>> Updates start at the file and only work upwards back to root. How does
>> the hardlink get traversed?
>
> A hard link is just a second directory entry linking to
> the same inode. There are no back links from inodes to
> the directory entries that point to them and they are
> essentially indistinguishable from files that are not
> hardlinked.
>
> Hard links have been done like this in pretty much every
> Unix filesystem since the 1970's.
>
Right :-( I need a workaround, but I am not sure what the semantics
should be.
In my example we have a directory structure like so, where file is the
same file (hardlinked):
test/foo/bar/file
test/bar/foo/file
My example would result in the size showing up in each directory at the
bar/file and foo/file level.
Then the two numbers are passed up to test, and the test directory size
would be wrong.
There is no real way to handle this because the two files look the same
to the filesystem.
I can even remove the initial file without affecting the other. Is there
a refcount I can take advantage of in the inode?
Even then, I can't decide what the semantics should be.
For example, I could add a pie chart that changes with navigation in my
GUI file manager; this system would make that practical.
How would I display this information usefully to the user? I suppose if
they navigate into either subdir they see the file size, but somehow,
when they navigate up to the directory that holds the subdirs that
indirectly contain both of them, only one can be counted. The problem is
that the top-level directory will not be able to see the inodes of the
individual files; that is lost in the aggregate.
Still thinking....
* Re: Starting a grad project that may change kernel VFS. Early research
From: Theodore Tso @ 2009-08-26 1:30 UTC (permalink / raw)
To: Jeff Shanab; +Cc: Rik van Riel, linux-kernel
Something which I urge you to think about is whether optimizing du -s
is really worth it; this is not at all obvious. If it's going to cost
performance; if it's going to require non-backwards compatible changes
to filesystems; if it means introducing changes to the semantics of
hard links --- please *seriously* consider whether or not it's worth
it.
In what workload or use case is "du -s" something that gets done
frequently? And is it really worth adding code complexity and slowing
down much more common operations, such as writing files?
If the goal is do a project that gets you a master's degree or a
Ph.D., ok, fine; there are plenty of academic degrees and honors which
are given for ideas that are completely impractical and/or useless in
the real world.
If your goal is to create a feature that will be accepted into
mainline, then you need to provide a lot more justification that this
is actually a good and worthwhile thing to do, and that the benefits
outweigh the costs (in code complexity, long-term maintenance,
performance regressions, etc.)
- Ted
* Re: Starting a grad project that may change kernel VFS. Early research
From: Stefan Richter @ 2009-08-26 12:18 UTC (permalink / raw)
To: Jeff Shanab; +Cc: Rik van Riel, linux-kernel
Jeff Shanab wrote:
> Rik van Riel wrote:
>> Hard links have been done like this in pretty much every
>> Unix filesystem since the 1970's.
[...]
> There is no real way to handle this because the two files look the same
> to the filesystem.
See above; it has been state of the art since the 70's. With hardlinks
or symlinks, a graph which represents the FS happens to contain loops.
No rocket science involved.
[...]
> How would I display this information usefully to the user.
Certain file managers display the overall size and the size on disk.
The latter differs from the former due to
- on-disk format overhead,
- on-disk compression,
- hardlinks and symlinks.
If the user copies the directory to another filesystem, the size on disk
will generally change.
--
Stefan Richter
-=====-==--= =--- ==-=-
http://arcgraph.de/sr/
* Re: Starting a grad project that may change kernel VFS. Early research
From: Jeff Shanab @ 2009-08-26 14:42 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-kernel
Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 09:52:46PM -0700, Jeff Shanab wrote:
>
>> True or false:
>> When we write a file we already write into its inode and the
>> directory's inode.
>>
>
> False; when we write into a file we need to update the inode's mtime
> field, but we don't have to update the directory's inode. So updating
> the directory inode when you write into a file is already increasing
> the overhead in terms of the file system metadata that will have to be
> updated on an *extremely* common filesystem operation.
>
Oops, my bad. When I said "write a file" I meant "create the file",
i.e. for the first time.
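Ted's distinction is visible from user space too. In this sketch (my own illustration): creating a file modifies the directory, but writing to an existing file leaves the directory inode's timestamps alone:

```python
import os, tempfile, time

d = tempfile.mkdtemp()
p = os.path.join(d, "f")
with open(p, "w") as f:          # creating the file updates the directory
    f.write("hello")
dir_mtime = os.stat(d).st_mtime_ns

time.sleep(0.05)
with open(p, "a") as f:          # writing to an existing file does not
    f.write(" world")

print(os.stat(d).st_mtime_ns == dir_mtime)  # True: directory inode untouched
```

This is why "update the parent directory on every file write" adds metadata traffic that the filesystem does not do today.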
> In the case of hard links, where there are multiple directory entries
> pointing to the same inode, we don't even have an efficient way of
> knowing where the other "parent" directories are, and none of the
> directory entries are "distinguished" in any way; you don't know which
> directory entry was the "first" entry created, and in fact it may have
> already been deleted.
>
>
Yeah, a deal breaker at the moment. Maybe a "fix" for this is a
prerequisite for this idea.
Or maybe a workaround on the accounting side of things.
>
>> When we update a file we write into the inode and we had to load the
>> dentry to get there.
>>
>
> While a file is open for writing, the inode and the directory inode
> used to gain access to the file are pinned via the dentry cache,
> correct.
>
> Note however, that we can't change the meaning of i_blocks for
> directories without introducing serious compatibility problems;
I have just gotten started on this, and this is the first time i_blocks
has come up.
I need to look into the structures closer, but as I look at the VFS and
web-site explanations, it appears to be the C equivalent of an abstract
interface (not counting all the caching). So there are operation
pointers that are called by the kernel to do things like get the size.
I thought this system insulates us from the differences in individual
file systems. It looks more and more like all the changes are going
into the/a file system and not into the VFS.
> so if
> you want to have a "size of everything under this directory", you will
> need to allocate a new inode field; this means a filesystem-specific
> on-disk change.
>
Ok, I was afraid of that. While it would have been nice to have it at
the VFS level so it works for all filesystems, it doesn't look like that
is practical or possible.
>
>> The main addition in this idea is the low-priority task to walk the
>> tree to root, forcing adjustments along the way.
>>
>
> But it can't be low priority, or else the information will be horribly
> out of date if the system crashes and the filesystem isn't cleanly
> unmounted. To solve this problem, you have only two choices. (1)
> Update all of the parent directories at the time of the file write,
> and log these updates in the journal just as with all other metadata
> updates, with the attendant increase in overhead and performance loss,
> OR (2) force a limited fsck after each unclean reboot that walks the
> directory tree to correctly update the subdirectory summary sizes.
> Since you will need to walk the entire inode table to fix any broken
> summary sizes, the time it will take to do this is roughly equivalent
> to fsck pass 1 and pass 2 for ext2/3/4 filesystems (which is about 90%
> of a full fsck). Furthermore, it either needs to be done while the
> filesystem is unmounted or mounted read-only, or you will need to add
> a huge amount of complexity to deal with the fact that the filesystem
> might be getting modified while you are recalculating the summary
> statistics.
>
Admittedly I need to study up on how recovery works now, especially for
a journaled file system.
Perhaps there is a (3): a sort of distributed write-ahead log stored in
the inode of the directory itself.
If the process is completed, it is removed, adding a second write to the
dentry's inode. If not, the queued task is found during fsck and started
again. But this fails if the changes to a dentry's inode are just
sitting in the cache at the time of the crash. Is the journal a
write-through cache system? Is there a reference you can recommend? I
have just been googling and reading some source code.
When recovery is considered, the tasks will probably have to be idempotent.
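The idempotency point can be made concrete with a toy sketch (my own, not filesystem code): if the log records relative deltas, replaying it after a crash double-counts, whereas recording the absolute subtotal is safe to replay any number of times.

```python
# Two ways a recovery log entry could update a directory's subtotal.
def apply_delta(state, entry):
    state["subtotal"] += entry       # relative: replay double-counts

def apply_absolute(state, entry):
    state["subtotal"] = entry        # absolute: replay is harmless

s = {"subtotal": 100}
apply_delta(s, 50)
apply_delta(s, 50)                   # crash + replay of the same entry
print(s["subtotal"])                 # 200, not the intended 150

s = {"subtotal": 100}
apply_absolute(s, 150)
apply_absolute(s, 150)               # replaying changes nothing
print(s["subtotal"])                 # 150
```

This is the usual reason journal replay is designed around idempotent (often "redo to this absolute state") records.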
>
> As far as knowing whether or not a copy will succeed, that depends on
> whether the copy is going to be smart about preserving hard links, and
> whether or not the copy will be copying symlinked files as real files.
> So you can only give a very specialized answer assuming only one set
> of copy options.
>
> - Ted
>
>
I have a seemingly dumb question, but I need a good answer for it. What
good are hardlinks? Do we need them with the advent of symlinks? Are
they just a vestigial feature, or is there a real use case where they
are preferred? I don't think I have any hard links on this machine, but
of course, how would I know? ;-) I know I have seen them used in some
chroot environments at work.
I can see they allow du, for example, to give an accurate size of a
directory as it would be if copied to another drive; symlinks don't do
that AFAIK. In fact the copy is missing data. Guess I answered my own
question?
>
* Re: Starting a grad project that may change kernel VFS. Early research
From: Bryan Donlan @ 2009-08-26 15:09 UTC (permalink / raw)
To: Jeff Shanab; +Cc: Theodore Tso, linux-kernel
On Wed, Aug 26, 2009 at 10:42 AM, Jeff Shanab<jshanab@earthlink.net> wrote:
> Admittedly I need to study up on how recovery works now, especially for
> a journaled file system.
>
> Perhaps there is a (3): a sort of distributed write-ahead log stored in
> the inode of the directory itself.
> If the process is completed, it is removed, adding a second write to the
> dentry's inode. If not, the queued task is found during fsck and started
> again. But this fails if the changes to a dentry's inode are just
> sitting in the cache at the time of the crash. Is the journal a
> write-through cache system? Is there a reference you can recommend? I
> have just been googling and reading some source code.
With journalling filesystems, fsck will not be run after a crash, so a
flag such as you suggest is insufficient. Moreover, an fsck traverses
the entire filesystem, so it can (and must) just directly check that
all size fields are correct.
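The kind of full recount a checking pass would do can be sketched in user space (my own illustration): walk everything below a directory and total the sizes, counting each inode only once so hardlinks don't double-count.

```python
import os, tempfile

d = tempfile.mkdtemp()
os.makedirs(os.path.join(d, "sub"))
with open(os.path.join(d, "top"), "w") as f:
    f.write("aaaa")                      # 4 bytes
with open(os.path.join(d, "sub", "leaf"), "w") as f:
    f.write("bbbbbb")                    # 6 bytes

def subtotal(path):
    """Recompute a directory's total the way a checking pass would,
    counting each inode once via its (st_dev, st_ino) identity."""
    seen, total = set(), 0
    for root, _, files in os.walk(path):
        for name in files:
            st = os.stat(os.path.join(root, name))
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_size
    return total

print(subtotal(d))                       # 10
```

Deduplicating by inode is what the incremental per-directory scheme cannot do, because a parent only sees aggregated numbers from its children.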
In a typical journalling filesystem such as ext3, prior to making any
changes to the disk, it will write a copy of all changes it will make
to a seperate log. Once these are committed to disk, they will be
copied to their final locations in the real filesystem data and
metadata. Other filesystems such as btrfs use the 'dancing trees'
approach - where the new data is allocated into free space, and commit
is done by changing some root pointers to make this new data be used.
You may want to take a look at
http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
for an overview of how ext3 in particular works.
>>
>> As far as knowing whether or not a copy will succeed, that depends on
>> whether the copy is going to be smart about preserving hard links, and
>> whether or not the copy will be copying symlinked files as real files.
>> So you can only give a very specialized answer assuming only one set
>> of copy options.
>>
>> - Ted
>>
>>
> I have a seemingly dumb question, but I need a good answer for it. What
> good are hardlinks? Do we need them with the advent of symlinks? Are
> they just a vestigial feature, or is there a real use case where they
> are preferred? I don't think I have any hard links on this machine, but
> of course, how would I know? ;-) I know I have seen them used in some
> chroot environments at work.
Run: find / -xdev -type f -and -not -links 1
A typical Debian installation contains a number of hardlinks,
including some for identical timezones under different names, or for
wine's wrapper scripts, or for gunzip and uncompress, or for various
e2fstools, or many other examples.
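The scan Bryan's find command performs can also be done programmatically by checking `st_nlink` on each regular file — a sketch (built against a temporary directory so it is self-contained):

```python
import os, tempfile

d = tempfile.mkdtemp()
a, b, c = (os.path.join(d, n) for n in ("a", "b", "c"))
open(a, "w").close()
os.link(a, b)            # a and b now share an inode
open(c, "w").close()     # c stands alone

# Equivalent of: find d -type f -not -links 1
hardlinked = sorted(
    f for root, _, files in os.walk(d)
    for f in files
    if os.stat(os.path.join(root, f)).st_nlink > 1
)
print(hardlinked)        # ['a', 'b']
```

A link count greater than 1 means some other directory entry, somewhere, points at the same inode — but not where.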
> I can see they allow du, for example, to give an accurate size of a
> directory as it would be if copied to another drive; symlinks don't do
> that AFAIK. In fact the copy is missing data. Guess I answered my own
> question?
I have no idea what you mean by this...
Keep in mind du can dereference symlinks to get the size of the target
(-L option).