linux-fsdevel.vger.kernel.org archive mirror
* Recursive directory accounting for size, ctime, etc.
@ 2008-07-15 18:28 Sage Weil
  2008-07-15 19:47 ` Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sage Weil @ 2008-07-15 18:28 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: ceph-devel

All-

Ceph is a new distributed file system for Linux designed for scalability 
(terabytes to exabytes, tens to thousands of storage nodes), reliability, 
and performance.  The latest release (v0.3), aside from xattr support and 
the usual slew of bugfixes, includes a unique (?) recursive accounting 
infrastructure that allows statistics about all metadata nested beneath a 
point in the directory hierarchy to be efficiently propagated up the tree.  
Currently this includes a file and directory count, total bytes (summation 
over file sizes), and most recent inode ctime.  For example, for a 
directory like /home, Ceph can efficiently report the total number of 
files, directories, and bytes contained by that entire subtree of the 
directory hierarchy.
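The idea described above can be sketched as follows. This is a toy model, not Ceph code: each directory caches recursive tallies, and a change is pushed up through its ancestors as a small delta, so querying a directory like /home is O(1) instead of a full 'du'-style walk. All class and method names here are invented for illustration.

```python
class Dir:
    def __init__(self, parent=None):
        self.parent = parent
        self.rfiles = 0      # recursive file count
        self.rbytes = 0      # recursive sum of file sizes
        self.rctime = 0.0    # most recent ctime beneath this point

    def apply_delta(self, files, nbytes, ctime):
        # Propagate the change up toward the root; Ceph's MDS does this
        # lazily, but the bookkeeping is the same.
        d = self
        while d is not None:
            d.rfiles += files
            d.rbytes += nbytes
            d.rctime = max(d.rctime, ctime)
            d = d.parent

root = Dir()
home = Dir(parent=root)
home.apply_delta(1, 1024, 1215668428.0)   # create a 1 KiB file under /home
print(root.rfiles, root.rbytes)           # 1 1024
```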

The file size summation is the most interesting, as it effectively gives 
you directory-based quota space accounting with fine granularity.  In many 
deployments, the quota _accounting_ is more important than actual 
enforcement.  Anybody who has had to figure out what has filled/is filling 
up a large volume will appreciate how cumbersome and inefficient 'du' can 
be for that purpose--especially when you're in a hurry.

There are currently two ways to access the recursive stats via a standard 
shell.  The first simply sets the directory st_size value to the 
_recursive_ bytes ('rbytes') value (when the client is mounted with -o 
rbytes).  For example (watch the directory sizes),

$ tar jxf linux-2.6.24.3.tar.bz2
$ ls -l
total 8
drwxr-xr-x 1 root root         0 Jul 10 05:30 .
drwxr-xr-x 8 root root      4096 Jul  9 18:21 ..
drwxrwxr-x 1 root root 254025660 Feb 26 00:20 linux-2.6.24.3
$ du -s linux-2.6.24.3/
254237  linux-2.6.24.3/
$ ls -al linux-2.6.24.3/
total 281
drwxrwxr-x 1 root root 254025660 Feb 26 00:20 .
drwxr-xr-x 1 root root         0 Jul 10 05:30 ..
-rw-rw-r-- 1 root root       628 Feb 26 00:20 .gitignore
-rw-rw-r-- 1 root root      3657 Feb 26 00:20 .mailmap
-rw-rw-r-- 1 root root     18693 Feb 26 00:20 COPYING
-rw-rw-r-- 1 root root     92230 Feb 26 00:20 CREDITS
drwxrwxr-x 1 root root   8984828 Feb 26 00:20 Documentation
-rw-rw-r-- 1 root root      1596 Feb 26 00:20 Kbuild
-rw-rw-r-- 1 root root     93957 Feb 26 00:20 MAINTAINERS
-rw-rw-r-- 1 root root     53162 Feb 26 00:20 Makefile
-rw-rw-r-- 1 root root     16930 Feb 26 00:20 README
-rw-rw-r-- 1 root root      3119 Feb 26 00:20 REPORTING-BUGS
drwxrwxr-x 1 root root  44216036 Feb 26 00:20 arch
drwxrwxr-x 1 root root    349137 Feb 26 00:20 block
drwxrwxr-x 1 root root    959654 Feb 26 00:20 crypto
drwxrwxr-x 1 root root 118578205 Feb 26 00:20 drivers
drwxrwxr-x 1 root root  21526882 Feb 26 00:20 fs
drwxrwxr-x 1 root root  27456604 Feb 26 00:20 include
drwxrwxr-x 1 root root     99077 Feb 26 00:20 init
drwxrwxr-x 1 root root    170827 Feb 26 00:20 ipc
drwxrwxr-x 1 root root   2189735 Feb 26 00:20 kernel
drwxrwxr-x 1 root root    679502 Feb 26 00:20 lib
drwxrwxr-x 1 root root   1213804 Feb 26 00:20 mm
drwxrwxr-x 1 root root  12562134 Feb 26 00:20 net
drwxrwxr-x 1 root root      3940 Feb 26 00:20 samples
drwxrwxr-x 1 root root   1105977 Feb 26 00:20 scripts
drwxrwxr-x 1 root root    740395 Feb 26 00:20 security
drwxrwxr-x 1 root root  12888682 Feb 26 00:20 sound
drwxrwxr-x 1 root root     16269 Feb 26 00:20 usr

Note that st_blocks is _not_ recursively defined, so 'du' still behaves as 
expected.  If mounted with -o norbytes instead, the directory st_size is 
the number of entries in the directory.
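A quick arithmetic check on the figures in the listing above illustrates the difference: GNU 'du -s' reports 1 KiB blocks by default, while rbytes sums i_size directly, so du's figure comes out slightly larger.

```python
rbytes = 254_025_660          # directory st_size with -o rbytes
du_kib = 254_237              # 'du -s' output, in 1 KiB units
du_bytes = du_kib * 1024
# Slack attributable to block rounding and directory metadata:
print(du_bytes - rbytes)      # → 6313028
```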

The second interface takes advantage of the fact (?) that read() on a 
directory is more or less undefined.  (Okay, that's not really true, but 
it used to return encoded dirents or something similar, and more recently 
returns -EISDIR.  As far as I know, no sane application expects meaningful 
data from read() on a directory...)  So, assuming Ceph is mounted with -o 
dirstat,

$ cat linux-2.6.24.3/
entries:                     27
 files:                       9
 subdirs:                    18
rentries:                 24418
 rfiles:                  23062
 rsubdirs:                 1356
rbytes:               254025660
rctime:    1215668428.051898000

Fields prefixed with 'r' are recursively defined, while 
entries/files/subdirs are just for the one directory.  'rctime' is the most 
recent ctime within the hierarchy, which should be useful for backup 
software or anything else scanning the hierarchy for recent changes.
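The "name: value" layout shown above is easy to consume from a script. A minimal parser might look like this, assuming the field layout stays stable across versions (which the text does not guarantee):

```python
def parse_dirstat(text):
    """Parse the text returned by read() on a Ceph directory (-o dirstat)."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        stats[key] = float(value) if "." in value else int(value)
    return stats

# Sample taken verbatim from the 'cat linux-2.6.24.3/' output above.
sample = """\
entries:                     27
 files:                       9
 subdirs:                    18
rentries:                 24418
 rfiles:                  23062
 rsubdirs:                 1356
rbytes:               254025660
rctime:    1215668428.051898000
"""
stats = parse_dirstat(sample)
print(stats["rbytes"])   # 254025660
```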

Naturally, there are a few caveats:

 - There is some built-in delay before statistics fully propagate up 
toward the root of the hierarchy.  Changes are propagated 
opportunistically when lock/lease state allows, with an upper bound of (by 
default) ~30 seconds for each level of directory nesting.

 - Ceph internally distinguishes between multiple links to the same file 
(there is a single 'primary' link, and then zero or more 'remote' links).  
Only the primary link contributes toward the 'rbytes' total.

 - The 'rbytes' summation is over i_size, not blocks used.  That means 
sparse files "appear" larger than the storage space they actually consume.

 - Directories don't yet contribute anything to the 'rbytes' total.  They
should probably include an estimate of the storage consumed by directory 
metadata.  For this reason, and because the size isn't rounded up to the 
block size, the 'rbytes' total will usually be slightly smaller than what 
you get from 'du'.

 - Currently no stats for the root directory itself.
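The first caveat has a simple consequence worth spelling out: with a ~30 second per-level bound, worst-case staleness at a given ancestor grows linearly with nesting depth. A back-of-envelope sketch (the bound is the default quoted above; actual propagation is usually opportunistic and faster):

```python
PER_LEVEL_BOUND_S = 30          # default per-level upper bound

def worst_case_staleness(depth):
    """Upper bound on how stale an ancestor's stats can be for a change
    'depth' directory levels below it."""
    return depth * PER_LEVEL_BOUND_S

print(worst_case_staleness(4))  # a change 4 levels down: up to ~120 s
```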


I'm extremely interested in what people think of overloading the file 
system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 
know of any applications that rely on or expect meaningful values for a 
directory's i_size?  Or read() a directory?


More information on the recursive accounting at

	http://ceph.newdream.net/wiki/Recursive_accounting

and Ceph itself at

	http://ceph.newdream.net/

Cheers-
sage

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 18:28 Recursive directory accounting for size, ctime, etc Sage Weil
@ 2008-07-15 19:47 ` Andreas Dilger
  2008-07-15 20:26   ` Sage Weil
  2008-07-15 19:53 ` J. Bruce Fields
  2008-08-05 18:26 ` Pavel Machek
  2 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2008-07-15 19:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Jul 15, 2008  11:28 -0700, Sage Weil wrote:
> unique (?) recursive accounting 
> infrastructure that allows statistics about all metadata nested beneath a 
> point in the directory hierarchy to be efficiently propagated up the tree.  
> Currently this includes a file and directory count, total bytes (summation 
> over file sizes), and most recent inode ctime.

Interesting...

> Note that st_blocks is _not_ recursively defined, so 'du' still behaves as 
> expected.  If mounted with -o norbytes instead, the directory st_size is 
> the number of entries in the directory.

Is it possible to extract an environment variable from the process
in the kernel to decide what behaviour to have (e.g. like LS_COLORS)?

> The second interface takes advantage of the fact (?) that read() on a 
> directory is more or less undefined.  (Okay, that's not really true, but 
> it used to return encoded dirents or something similar, and more recently 
> returns -EISDIR.  As far as I know, no sane application expects meaningful 
> data from read() on a directory...)  So, assuming Ceph is mounted with -o 
> dirstat,

Hmm, what about just creating a virtual xattr that can be had with
getfattr user.dirstats?

>  - The 'rbytes' summation is over i_size, not blocks used.  That means 
> sparse files "appear" larger than the storage space they actually consume.

I'd think that in many cases it is more important to accumulate the
blocks count and not the size, since a single core file would throw
off the whole "hunt for the worst space consumer" approach.
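Andreas's point is easy to demonstrate locally: a sparse file's i_size can dwarf the blocks it actually occupies, so any sum over i_size overstates space use. A small self-contained example (exact block counts are filesystem-dependent):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 10 * 1024 * 1024)      # 10 MiB hole, no data written
    st = os.stat(path)
    # Apparent size vs bytes actually allocated (st_blocks is in 512 B units):
    print(st.st_size, st.st_blocks * 512)
finally:
    os.close(fd)
    os.unlink(path)
```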

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 18:28 Recursive directory accounting for size, ctime, etc Sage Weil
  2008-07-15 19:47 ` Andreas Dilger
@ 2008-07-15 19:53 ` J. Bruce Fields
  2008-07-15 20:41   ` Sage Weil
  2008-08-05 18:26 ` Pavel Machek
  2 siblings, 1 reply; 14+ messages in thread
From: J. Bruce Fields @ 2008-07-15 19:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, Jul 15, 2008 at 11:28:22AM -0700, Sage Weil wrote:
> Fields prefixed with 'r' are recursively defined, while 
> entries/files/subdirs is just for the one directory.  'rctime' is the most 
> recent ctime within the hierarchy, which should be useful for backup 
> software or anything else scanning the hierarchy for recent changes.
> 
> Naturally, there are a few caveats:
> 
>  - There is some built-in delay before statistics fully propagate up 
> toward the root of the hierarchy.  Changes are propagated 
> opportunistically when lock/lease state allows, with an upper bound of (by 
> default) ~30 seconds for each level of directory nesting.

That makes it less useful, e.g., for somebody with cached data trying to
validate their cache, or for something like git trying to check a
directory tree for changes.

>  - Ceph internally distinguishes between multiple links to the same file 
> (there is a single 'primary' link, and then zero or more 'remote' links).  
> Only the primary link contributes toward the 'rbytes' total.

Is that only true for 'rbytes'?

--b.

> 
>  - The 'rbytes' summation is over i_size, not blocks used.  That means 
> sparse files "appear" larger than the storage space they actually consume.
> 
>  - Directories don't yet contribute anything to the 'rbytes' total.  They
> should probably include an estimate of the storage consumed by directory 
> metadata.  For this reason, and because the size isn't rounded up to the 
> block size, the 'rbytes' total will usually be slightly smaller than what 
> you get from 'du'.
> 
>  - Currently no stats for the root directory itself.
> 
> 
> I'm extremely interested in what people think of overloading the file 
> system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 
> know of any applications that rely on or expect meaningful values for a 
> directory's i_size?  Or read() a directory?
> 
> 
> More information on the recursive accounting at
> 
> 	http://ceph.newdream.net/wiki/Recursive_accounting
> 
> and Ceph itself at
> 
> 	http://ceph.newdream.net/
> 
> Cheers-
> sage


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 19:47 ` Andreas Dilger
@ 2008-07-15 20:26   ` Sage Weil
  0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2008-07-15 20:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, 15 Jul 2008, Andreas Dilger wrote:
> > Note that st_blocks is _not_ recursively defined, so 'du' still behaves as 
> > expected.  If mounted with -o norbytes instead, the directory st_size is 
> > the number of entries in the directory.
> 
> Is it possible to extract an environment variable from the process
> in the kernel to decide what behaviour to have (e.g. like LS_COLORS)?

That could work too.  Currently the flag is changing the client's i_size, 
but the conditional can go in place of generic_fillattr, where st_size is 
set.  I would worry about the overhead of looking at the environment for 
every getattr, though.

> > The second interface takes advantage of the fact (?) that read() on a 
> > directory is more or less undefined.  (Okay, that's not really true, but 
> > it used to return encoded dirents or something similar, and more recently 
> > returns -EISDIR.  As far as I know, no sane application expects meaningful 
> > data from read() on a directory...)  So, assuming Ceph is mounted with -o 
> > dirstat,
> 
> Hmm, what about just creating a virtual xattr that can be had with
> getfattr user.dirstats?

Yeah, or ceph.dirstats, which hopefully backup software would ignore?  
(Not quite sure how the xattr 'namespaces' are intended to be used.)  Not 
quite as convenient as 'cat dir' for the user, but cleaner.

> >  - The 'rbytes' summation is over i_size, not blocks used.  That means 
> > sparse files "appear" larger than the storage space they actually consume.
> 
> I'd think that in many cases it is more important to accumulate the
> blocks count and not the size, since a single core file would throw
> off the whole "hunt for the worst space consumer" approach.

Yes.  If and when the MDS actually stores blocks used, that could 
trivially be supported as well.  But currently sparseness is a function of 
the objects on the storage nodes, so things like hole-finding and fiemap 
will require probing objects.

thanks-
sage


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 19:53 ` J. Bruce Fields
@ 2008-07-15 20:41   ` Sage Weil
  2008-07-15 20:48     ` J. Bruce Fields
  2008-07-15 21:56     ` Jamie Lokier
  0 siblings, 2 replies; 14+ messages in thread
From: Sage Weil @ 2008-07-15 20:41 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> >  - There is some built-in delay before statistics fully propagate up 
> > toward the root of the hierarchy.  Changes are propagated 
> > opportunistically when lock/lease state allows, with an upper bound of (by 
> > default) ~30 seconds for each level of directory nesting.
> 
> That makes it less useful, e.g., for somebody with cached data trying to
> validate their cache, or for something like git trying to check a
> directory tree for changes.

Having fully up to date values would definitely be nice, but unfortunately 
doesn't play nice with the fact that different parts of the directory 
hierarchy may be managed by different metadata servers.  A primary goal in 
implementing this was to minimize any impact on performance.  The uses I 
had in mind were more in line with quota-based accounting than cache 
validation.

I think I can adjust the propagation heuristics/timeouts to make updates 
seem more or less immediate to a user in most cases, but that won't be 
sufficient for a tool like git that needs to reliably identify very recent 
updates.  For backup software wanting a consistent file system image, it 
should really be operating on a snapshot as well, in which case a delay 
between taking the snapshot and starting the scan for changes would allow 
those values to propagate.

> >  - Ceph internally distinguishes between multiple links to the same file 
> > (there is a single 'primary' link, and then zero or more 'remote' links).  
> > Only the primary link contributes toward the 'rbytes' total.
> 
> Is that only true for 'rbytes'?

The same goes for rctime.  As far as the recursive stats go, the other 
stats (file/directory counts) aren't affected.  The primary/remote 
hard link distinction is fundamental to the way metadata is internally 
managed and stored by the MDS, though, if that's what you mean (inode 
content is embedded with the primary link's directory metadata).

sage


> 
> --b.
> 
> > 
> >  - The 'rbytes' summation is over i_size, not blocks used.  That means 
> > sparse files "appear" larger than the storage space they actually consume.
> > 
> >  - Directories don't yet contribute anything to the 'rbytes' total.  They
> > should probably include an estimate of the storage consumed by directory 
> > metadata.  For this reason, and because the size isn't rounded up to the 
> > block size, the 'rbytes' total will usually be slightly smaller than what 
> > you get from 'du'.
> > 
> >  - Currently no stats for the root directory itself.
> > 
> > 
> > I'm extremely interested in what people think of overloading the file 
> > system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 
> > know of any applications that rely on or expect meaningful values for a 
> > directory's i_size?  Or read() a directory?
> > 
> > 
> > More information on the recursive accounting at
> > 
> > 	http://ceph.newdream.net/wiki/Recursive_accounting
> > 
> > and Ceph itself at
> > 
> > 	http://ceph.newdream.net/
> > 
> > Cheers-
> > sage
> 
> 


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 20:41   ` Sage Weil
@ 2008-07-15 20:48     ` J. Bruce Fields
  2008-07-15 21:16       ` Sage Weil
  2008-07-15 21:44       ` Jamie Lokier
  2008-07-15 21:56     ` Jamie Lokier
  1 sibling, 2 replies; 14+ messages in thread
From: J. Bruce Fields @ 2008-07-15 20:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, Jul 15, 2008 at 01:41:25PM -0700, Sage Weil wrote:
> On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> > >  - There is some built-in delay before statistics fully propagate up 
> > > toward the root of the hierarchy.  Changes are propagated 
> > > opportunistically when lock/lease state allows, with an upper bound of (by 
> > > default) ~30 seconds for each level of directory nesting.
> > 
> > That makes it less useful, e.g., for somebody with cached data trying to
> > validate their cache, or for something like git trying to check a
> > directory tree for changes.
> 
> Having fully up to date values would definitely be nice, but unfortunately 
> doesn't play nice with the fact that different parts of the directory 
> hierarchy may be managed by different metadata servers.  A primary goal in 
> implementing this was to minimize any impact on performance.  The uses I 
> had in mind were more in line with quota-based accounting than cache 
> validation.

Fair enough.

> I think I can adjust the propagation heuristics/timeouts to make updates 
> seem more or less immediate to a user in most cases, but that won't be 
> sufficient for a tool like git that needs to reliably identify very recent 
> updates.  For backup software wanting a consistent file system image, it 
> should really be operating on a snapshot as well, in which case a delay 
> between taking the snapshot and starting the scan for changes would allow 
> those values to propagate.
> 
> > >  - Ceph internally distinguishes between multiple links to the same file 
> > > (there is a single 'primary' link, and then zero or more 'remote' links).  
> > > Only the primary link contributes toward the 'rbytes' total.
> > 
> > Is that only true for 'rbytes'?
> 
> The same goes for rctime.  As far as the recursive stats go, the other 
> stats (file/directory counts) aren't affected.  The primary/remote 
> hard link distinction is fundamental to the way metadata is internally 
> managed and stored by the MDS, though, if that's what you mean (inode 
> content is embedded with the primary link's directory metadata).

I just wonder how one would explain to users (or application writers)
why changes to a file are reflected in the parent's rctime in one case,
and not in another, especially if the primary link is otherwise
indistinguishable from the others.  The symptoms could be a bit
mysterious from their point of view.

--b.


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 20:48     ` J. Bruce Fields
@ 2008-07-15 21:16       ` Sage Weil
  2008-07-15 22:45         ` J. Bruce Fields
  2008-07-15 21:44       ` Jamie Lokier
  1 sibling, 1 reply; 14+ messages in thread
From: Sage Weil @ 2008-07-15 21:16 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> I just wonder how one would explain to users (or application writers)
> why changes to a file are reflected in the parent's rctime in one case,
> and not in another, especially if the primary link is otherwise
> indistinguishable from the others.  The symptoms could be a bit
> mysterious from their point of view.

Yes.  I'm not sure it can really be avoided, though.  I'm trying to lift 
the usual restriction of having to predefine what the 
volume/subvolume/qtree boundary is and then disallowing links/renames 
between them.  When all of a file's links are contained within the 
directory you're looking at (i.e. something that might be a subvolume 
under that paradigm), things look sensible.  If links span two directories 
and you're looking at recursive stats for a dir containing only one of 
them, then you're necessarily going to have some weirdness (you don't want 
to double-count).

Making the primary/remote-ness visible to users somehow (via, say, a 
virtual xattr) might help a bit.  The bottom line, though, is that links 
from multiple points in the namespace and a hierarchical view of file 
_content_ aren't particularly compatible concepts...

sage


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 20:48     ` J. Bruce Fields
  2008-07-15 21:16       ` Sage Weil
@ 2008-07-15 21:44       ` Jamie Lokier
  2008-07-15 21:51         ` Sage Weil
  1 sibling, 1 reply; 14+ messages in thread
From: Jamie Lokier @ 2008-07-15 21:44 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Sage Weil, linux-fsdevel, linux-kernel, ceph-devel

J. Bruce Fields wrote:
> I just wonder how one would explain to users (or application writers)
> why changes to a file are reflected in the parent's rctime in one case,
> and not in another, especially if the primary link is otherwise
> indistinguishable from the others.  The symptoms could be a bit
> mysterious from their point of view.

Btw, what happens when the primary link is deleted?  Does another link
become the primary link?

-- Jamie


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 21:44       ` Jamie Lokier
@ 2008-07-15 21:51         ` Sage Weil
  0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2008-07-15 21:51 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, ceph-devel

On Tue, 15 Jul 2008, Jamie Lokier wrote:
> J. Bruce Fields wrote:
> > I just wonder how one would explain to users (or application writers)
> > why changes to a file are reflected in the parent's rctime in one case,
> > and not in another, especially if the primary link is otherwise
> > indistinguishable from the others.  The symptoms could be a bit
> > mysterious from their point of view.
> 
> Btw, what happens when the primary link is deleted?  Does another link
> become the primary link?

Yeah.  It's initially moved to a hidden directory (along with 
open-but-unlinked files), and then moved back into the hierarchy the next 
time a remote link is used.

sage


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 20:41   ` Sage Weil
  2008-07-15 20:48     ` J. Bruce Fields
@ 2008-07-15 21:56     ` Jamie Lokier
  1 sibling, 0 replies; 14+ messages in thread
From: Jamie Lokier @ 2008-07-15 21:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, ceph-devel

Sage Weil wrote:
> Having fully up to date values would definitely be nice, but unfortunately 
> doesn't play nice with the fact that different parts of the directory 
> hierarchy may be managed by different metadata servers.  A primary goal in 
> implementing this was to minimize any impact on performance.  The uses I 
> had in mind were more in line with quota-based accounting than cache 
> validation.
> 
> I think I can adjust the propagation heuristics/timeouts to make updates 
> seem more or less immediate to a user in most cases, but that won't be 
> sufficient for a tool like git that needs to reliably identify very recent 
> updates.  For backup software wanting a consistent file system image, it 
> should really be operating on a snapshot as well, in which case a delay 
> between taking the snapshot and starting the scan for changes would allow 
> those values to propagate.

I have a similar thing in a distributed database (with some
filesystem-like characteristics) I'm working on.

The way I handle propagating compound values which are derived from
multiple metadata servers, like that, is using leases.  (Similar to
fcntl F_GETLEASE, Windows oplocks, and CPU MESI protocol).

E.g. when a single server is about to modify a file, it grabs a lease
covering the metadata for this file _plus_ leases for the aggregated
values for all parent directories, prior to allowing the file
modification.  The first file modification will be delayed briefly to
do this, but then subsequent modifications, including to other files
covered by the same directories, are instant because those servers
already have leases.  They can renew them asynchronously as needed.

When a client wants the aggregate values for a directory (i.e. total
size of all files recursively under it), it acquires a lease on that
directory only.  To do that, it has to query all the metadata servers
which currently hold a lease covering that.

The net effect is you can use the results for cache validation as the
git example.  There's a network ping-pong if someone is alternately
modifying a file under the tree and reading the aggregate value from a
parent directory elsewhere, but at least the values are always
consistent.  Most times, there is no ping-pong because that's not a
common scenario.

(In my project, you can also specify that some queries are allowed to
be a little out of date, to avoid lease acquisition delays if getting
an inaccurate result fast is better.  That's useful for GUIs, but not
suitable for git-like cache validation.)
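The lease scheme Jamie describes can be modeled in a few lines. This is a toy, single-process model with an invented API and no networking: a writer takes leases on the file *plus* every ancestor's aggregate before modifying, and a reader of a directory aggregate must first recall any write leases covering that subtree, which is what makes the values it then reads consistent.

```python
class LeaseTable:
    def __init__(self):
        self.held = {}                    # path -> holding server id

    def acquire_for_write(self, server, path):
        # Lease the file plus the aggregates of all its ancestors,
        # e.g. "/home/alice/file" -> /, /home, /home/alice, /home/alice/file.
        parts = path.strip("/").split("/")
        prefixes = ["/"] + ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
        for p in prefixes:
            self.held[p] = server
        return prefixes

    def read_aggregate(self, reader, dirpath):
        # Recall every write lease at or below dirpath before trusting
        # its aggregate values (the "query all servers" step).
        covered = [p for p in self.held
                   if p == dirpath or p.startswith(dirpath.rstrip("/") + "/")]
        for p in covered:
            del self.held[p]
        return covered

t = LeaseTable()
t.acquire_for_write("mds1", "/home/alice/file")
recalled = t.read_aggregate("client", "/home")
print(sorted(recalled))   # ['/home', '/home/alice', '/home/alice/file']
```

The lease on "/" survives the recall, since "/home"'s aggregate does not cover it; this mirrors the ping-pong Jamie mentions when writes and aggregate reads alternate.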

-- Jamie


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 21:16       ` Sage Weil
@ 2008-07-15 22:45         ` J. Bruce Fields
  0 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2008-07-15 22:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue, Jul 15, 2008 at 02:16:45PM -0700, Sage Weil wrote:
> On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> > I just wonder how one would explain to users (or application writers)
> > why changes to a file are reflected in the parent's rctime in one case,
> > and not in another, especially if the primary link is otherwise
> > indistinguishable from the others.  The symptoms could be a bit
> > mysterious from their point of view.
> 
> Yes.  I'm not sure it can really be avoided, though.  I'm trying to lift 
> the usual restriction of having to predefine what the 
> volume/subvolume/qtree boundary is and then disallowing links/renames 
> between them.  When all of a file's links are contained within the 
> directory you're looking at (i.e. something that might be a subvolume 
> under that paradigm), things look sensible.  If links span two directories 
> and you're looking at recursive stats for a dir containing only one of 
> them, then you're necessarily going to have some weirdness (you don't want 
> to double-count).

Yeah, there's no clear right answer--that's partly why I was curious
about rctime specifically.

--b.


* Re: Recursive directory accounting for size, ctime, etc.
  2008-07-15 18:28 Recursive directory accounting for size, ctime, etc Sage Weil
  2008-07-15 19:47 ` Andreas Dilger
  2008-07-15 19:53 ` J. Bruce Fields
@ 2008-08-05 18:26 ` Pavel Machek
  2008-08-08 13:11   ` John Stoffel
  2 siblings, 1 reply; 14+ messages in thread
From: Pavel Machek @ 2008-08-05 18:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-fsdevel, linux-kernel, ceph-devel

On Tue 2008-07-15 11:28:22, Sage Weil wrote:
> All-
> 
> Ceph is a new distributed file system for Linux designed for scalability 
> (terabytes to exabytes, tens to thousands of storage nodes), reliability, 
> and performance.  The latest release (v0.3), aside from xattr support and 
> the usual slew of bugfixes, includes a unique (?) recursive accounting 
> infrastructure that allows statistics about all metadata nested beneath a 
> point in the directory hierarchy to be efficiently propagated up the tree.  
> Currently this includes a file and directory count, total bytes (summation 
> over file sizes), and most recent inode ctime.  For example, for a 
> directory like /home, Ceph can efficiently report the total number of 
> files, directories, and bytes contained by that entire subtree of the 
> directory hierarchy.
> 
> The file size summation is the most interesting, as it effectively gives 
> you directory-based quota space accounting with fine granularity.  In many 
> deployments, the quota _accounting_ is more important than actual 
> enforcement.  Anybody who has had to figure out what has filled/is filling 
> up a large volume will appreciate how cumbersome and inefficient 'du' can 
> be for that purpose--especially when you're in a hurry.
> 
> There are currently two ways to access the recursive stats via a standard 
> shell.  The first simply sets the directory st_size value to the 
> _recursive_ bytes ('rbytes') value (when the client is mounted with -o 
> rbytes).  For example (watch the directory sizes),
...

> Naturally, there are a few caveats:
> 
>  - There is some built-in delay before statistics fully propagate up 
> toward the root of the hierarchy.  Changes are propagated 
> opportunistically when lock/lease state allows, with an upper bound of (by 
> default) ~30 seconds for each level of directory nesting.

Having instant rctime would be very nice -- for stuff like locate and
speeding up kde startup.

> I'm extremely interested in what people think of overloading the file 
> system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 

Too ugly to live.

What about new rstat() syscall?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Recursive directory accounting for size, ctime, etc.
  2008-08-05 18:26 ` Pavel Machek
@ 2008-08-08 13:11   ` John Stoffel
  2008-08-08 23:32     ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: John Stoffel @ 2008-08-08 13:11 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Sage Weil, linux-fsdevel, linux-kernel, ceph-devel

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

Pavel> On Tue 2008-07-15 11:28:22, Sage Weil wrote:
>> All-
>> 
>> Ceph is a new distributed file system for Linux designed for scalability 
>> (terabytes to exabytes, tens to thousands of storage nodes), reliability, 
>> and performance.  The latest release (v0.3), aside from xattr support and 
>> the usual slew of bugfixes, includes a unique (?) recursive accounting 
>> infrastructure that allows statistics about all metadata nested beneath a 
>> point in the directory hierarchy to be efficiently propagated up the tree.  
>> Currently this includes a file and directory count, total bytes (summation 
>> over file sizes), and most recent inode ctime.  For example, for a 
>> directory like /home, Ceph can efficiently report the total number of 
>> files, directories, and bytes contained by that entire subtree of the 
>> directory hierarchy.
>> 
>> The file size summation is the most interesting, as it effectively gives 
>> you directory-based quota space accounting with fine granularity.  In many 
>> deployments, the quota _accounting_ is more important than actual 
>> enforcement.  Anybody who has had to figure out what has filled/is filling 
>> up a large volume will appreciate how cumbersome and inefficient 'du' can 
>> be for that purpose--especially when you're in a hurry.
>> 
>> There are currently two ways to access the recursive stats via a standard 
>> shell.  The first simply sets the directory st_size value to the 
>> _recursive_ bytes ('rbytes') value (when the client is mounted with -o 
>> rbytes).  For example (watch the directory sizes),
Pavel> ...

>> Naturally, there are a few caveats:
>> 
>> - There is some built-in delay before statistics fully propagate up 
>> toward the root of the hierarchy.  Changes are propagated 
>> opportunistically when lock/lease state allows, with an upper bound of (by 
>> default) ~30 seconds for each level of directory nesting.

Pavel> Having instant rctime would be very nice -- for stuff like locate and
Pavel> speeding up kde startup.

>> I'm extremely interested in what people think of overloading the file 
>> system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 

Pavel> Too ugly to live.

Pavel> What about new rstat() syscall?

Or how about tying this into the quotactl() syscall and extending it a
bit?  Say quotactl2(cmd,device,id,addr,path) which is probably just as
ugly, but seems to make better sense.

Me, I'd love to have this type of reporting on my filesystems,
especially since it would help me in my day job.

How exports over NFS would look is an issue too.

John


* Re: Recursive directory accounting for size, ctime, etc.
  2008-08-08 13:11   ` John Stoffel
@ 2008-08-08 23:32     ` Sage Weil
  0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2008-08-08 23:32 UTC (permalink / raw)
  To: John Stoffel; +Cc: Pavel Machek, linux-fsdevel, linux-kernel, ceph-devel

On Fri, 8 Aug 2008, John Stoffel wrote:
> >>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:
> 
> Pavel> On Tue 2008-07-15 11:28:22, Sage Weil wrote:
> >> All-
> >> 
> >> Ceph is a new distributed file system for Linux designed for scalability 
> >> (terabytes to exabytes, tens to thousands of storage nodes), reliability, 
> >> and performance.  The latest release (v0.3), aside from xattr support and 
> >> the usual slew of bugfixes, includes a unique (?) recursive accounting 
> >> infrastructure that allows statistics about all metadata nested beneath a 
> >> point in the directory hierarchy to be efficiently propagated up the tree.  
> >> Currently this includes a file and directory count, total bytes (summation 
> >> over file sizes), and most recent inode ctime.  For example, for a 
> >> directory like /home, Ceph can efficiently report the total number of 
> >> files, directories, and bytes contained by that entire subtree of the 
> >> directory hierarchy.
> >> 
> >> The file size summation is the most interesting, as it effectively gives 
> >> you directory-based quota space accounting with fine granularity.  In many 
> >> deployments, the quota _accounting_ is more important than actual 
> >> enforcement.  Anybody who has had to figure out what has filled/is filling 
> >> up a large volume will appreciate how cumbersome and inefficient 'du' can 
> >> be for that purpose--especially when you're in a hurry.
> >> 
> >> There are currently two ways to access the recursive stats via a standard 
> >> shell.  The first simply sets the directory st_size value to the 
> >> _recursive_ bytes ('rbytes') value (when the client is mounted with -o 
> >> rbytes).  For example (watch the directory sizes),
> Pavel> ...
> 
> >> Naturally, there are a few caveats:
> >> 
> >> - There is some built-in delay before statistics fully propagate up 
> >> toward the root of the hierarchy.  Changes are propagated 
> >> opportunistically when lock/lease state allows, with an upper bound of (by 
> >> default) ~30 seconds for each level of directory nesting.
> 
> Pavel> Having instant rctime would be very nice -- for stuff like locate and
> Pavel> speeding up kde startup.
> 
> >> I'm extremely interested in what people think of overloading the file 
> >> system interface in this way.  Handy?  Crufty?  Dangerous?  Does anybody 
> 
> Pavel> Too ugly to live.

:)

> Pavel> What about new rstat() syscall?
> 
> Or how about tying this into the quotactl() syscall and extending it a
> bit?  Say quotactl2(cmd,device,id,addr,path) which is probably just as
> ugly, but seems to make better sense.

Introducing or modifying system calls makes for pretty interfaces, but is
a bit impractical (and overkill) to support something present in only one
filesystem.

So far I think Andreas' suggestion of using pseudo-xattrs is the cleanest 
and simplest: it doesn't interfere with any existing interfaces 
(provided the virtual xattr name is well chosen), and is usable via 
standard command line tools like getfattr.

sage


end of thread, other threads:[~2008-08-08 23:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-15 18:28 Recursive directory accounting for size, ctime, etc Sage Weil
2008-07-15 19:47 ` Andreas Dilger
2008-07-15 20:26   ` Sage Weil
2008-07-15 19:53 ` J. Bruce Fields
2008-07-15 20:41   ` Sage Weil
2008-07-15 20:48     ` J. Bruce Fields
2008-07-15 21:16       ` Sage Weil
2008-07-15 22:45         ` J. Bruce Fields
2008-07-15 21:44       ` Jamie Lokier
2008-07-15 21:51         ` Sage Weil
2008-07-15 21:56     ` Jamie Lokier
2008-08-05 18:26 ` Pavel Machek
2008-08-08 13:11   ` John Stoffel
2008-08-08 23:32     ` Sage Weil
