space compression (again)

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* space compression (again)
@ 2005-04-15 17:19 C. Scott Ananian
  2005-04-15 18:34 ` Linus Torvalds
  2005-04-15 18:50 ` Derek Fawcus
  0 siblings, 2 replies; 12+ messages in thread
From: C. Scott Ananian @ 2005-04-15 17:19 UTC (permalink / raw)
  To: git

I've been reading the archives (a bad idea, I know).  Here's a concrete 
suggestion for GIT space-compression which is (I believe) consistent with 
the philosophy of GIT.

Why are blobs per-file?  [After all, Linus insists that files are an 
illusion.]  Why not just have 'chunks', and assemble *these* 
into blobs (read, 'files')?  A good chunk size would fit evenly into some 
number of disk blocks (no wasted space!).

We already have the rsync algorithm which can scan through a file and 
efficiently tell which existing chunks match (portions of) it, using a 
rolling checksum. (Here's a refresher:
    http://samba.anu.edu.au/rsync/tech_report/node2.html
).  Why not treat the 'chunk' as the fundamental unit, and compose files 
from chunks?

This should get better space utilization: a small change to file X 
will only require storage to save the changed chunk, plus meta data to 
describe the chunks composing the new file.  I propose keeping this only 
one-level deep: we can only specify chunks, not pieces of files.

Unlike xdelta schemes, there is no 'file' dependency.  Chunks for a blob 
can be and are shared among *all the other files and versions in the 
repository*.  Moving pieces from file 'a' to file 'b' "just works".

Best of all, I believe this can be done in a completely layered fashion. 
From git's perspective, it's still 'open this blob' or 'write this blob'. 
It just turns out that the filesystem representation of a blob is slightly 
more fragmented.  Even better, you ought to be able to convert your 
on-disk store from one representation to the other: the named blob doesn't 
change, just 'how to fetch the blob' changes.  So, for example, Linus' 
tree can be unchunked for speed, but the release tree (say) can pull 
pruned history from Linus into a chunked on-disk representation that can 
be efficiently wget'ted (only new chunks need be transferred).

My first concern is possible fragmentation: would we end up with a large 
number of very small chunks, and end up representing files as a list of 
lines (effectively)?  Maybe someone can think of an effective coalescing 
strategy, or maybe it is sufficient just to avoid creating chunks smaller 
than a certain size (ie, possibly writing redundant data to a new chunk, 
just to improve the possibility of reuse).

I'm also not sure what the best 'chunk' size is.  Smaller chunks save more 
space but cost more to access (# of disk seeks per file/blob).  Picking a 
chunk half the average file size should reduce space by ~50% while only 
requiring ~2 additional seeks per file-read. OTOH, rsync experience 
suggests 500-1000 byte chunk sizes.  Probably empirical testing is best.

Lastly, we want to avoid hitting the dcache to check the existence of 
chunks while encoding.  In a large repository, there will be a very large 
number of chunks.  We don't *have* to index all of them, but our 
compression gets better the more chunks we know about.  The rsync 
algorithm creates hash tables of chunks at different levels of granularity 
to avoid doing a full check at every byte of the input file.  How large 
should this cached-on-disk chunk hash table be to avoid saturating it as 
the repository grows (maybe the standard grow-as-you-go hash table is 
fine; you only need one bit per entry anyway)?

Thoughts?  Is the constant-factor overhead of indirection-per-blob going 
to kill git's overwhelming speed?
  --scott

JUBILIST explosion MKULTRA HTAUTOMAT Indonesia Shoal Bay RUCKUS ammunition 
GPFLOOR Hager SDI MKDELTA KUBARK Dictionary Soviet  BLUEBIRD Delta Force
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 17:19 space compression (again) C. Scott Ananian
@ 2005-04-15 18:34 ` Linus Torvalds
  2005-04-15 18:45   ` C. Scott Ananian
  2005-04-15 18:50 ` Derek Fawcus
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2005-04-15 18:34 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: git



On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> 
> Why are blobs per-file?  [After all, Linus insists that files are an 
> illusion.]  Why not just have 'chunks', and assemble *these* 
> into blobs (read, 'files')?  A good chunk size would fit evenly into some 
> number of disk blocks (no wasted space!).

I actually considered that. I ended up not doing it, because it's not 
obvious how to "block" things up (and even more so because while I like 
the notion, it flies in the face of the other issues I had: performance 
and simplicity).

The problem with chunking is:
 - it complicates a lot of the routines. Things like "is this file 
   unchanged" suddenly become "is this file still the same set of chunks",
   which is just a _lot_ more code and a lot more likely to have bugs.
 - you have to find a blocking factor. I thought of just going it fixed 
   chunks, and that just doesn't help at all. 
 - we already have wasted space due to the low-level filesystem (as 
   opposed to "git") usually being block-based, which means that space 
   utilization for small objects tends to suck. So you really want to 
   prefer objects that are several kB (compressed), and a small block just
   wastes tons of space.
 - there _is_ a natural blocking factor already. That's what a file 
   boundary really is within the project, and finding any other is really 
   quite hard.

So I'm personally 100% sure that it's not worth it. But I'm not opposed to
the _concept_: it makes total sense in the "filesystem" view, and is 100%
equivalent to having an inode with pointers to blocks. I just don't think 
the concept plays out well in reality.

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 18:34 ` Linus Torvalds
@ 2005-04-15 18:45   ` C. Scott Ananian
  2005-04-15 19:00     ` Derek Fawcus
  2005-04-15 19:11     ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: C. Scott Ananian @ 2005-04-15 18:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Fri, 15 Apr 2005, Linus Torvalds wrote:

> The problem with chunking is:
> - it complicates a lot of the routines. Things like "is this file
>   unchanged" suddenly become "is this file still the same set of chunks",
>   which is just a _lot_ more code and a lot more likely to have bugs.

The blob still has the same hash; therefore the file is still the same.
Nothing looks inside blobs; they just want either the hash or the full 
contents (if I understand the algorithms correctly).
I agree it's more code, but I think it can be nicely layered.

> - you have to find a blocking factor. I thought of just going it fixed
>   chunks, and that just doesn't help at all.

rsync uses a fixed chunk size, but this chunk can start at any offset (ie, 
not constrained to fixed boundaries).  This means that adding a single 
line to the file works like you'd expect, even though all the chunk 
boundaries change.  [I think this is what you're talking about.]

> - we already have wasted space due to the low-level filesystem (as
>   opposed to "git") usually being block-based, which means that space
>   utilization for small objects tends to suck. So you really want to
>   prefer objects that are several kB (compressed), and a small block just
>   wastes tons of space.

Not on (say) reiserfs, and not over the network.  I'm proposing (at the 
moment) easy conversion from chunked to unchunked disk representation,
so that you can leave things unchunked if (for example) you know you're 
running ext2 with a large block size.

> - there _is_ a natural blocking factor already. That's what a file
>   boundary really is within the project, and finding any other is really
>   quite hard.

Well, yes, it may be nontrivial.  But 'quite hard' depends on your 
perspective, I guess.  Given a cache of existing chunks, it's just a 
few table lookups. =)

> So I'm personally 100% sure that it's not worth it. But I'm not opposed to
> the _concept_: it makes total sense in the "filesystem" view, and is 100%
> equivalent to having an inode with pointers to blocks. I just don't think
> the concept plays out well in reality.

So I guess I'll have to implement this and find out, won't I? =)
  --scott

AMLASH overthrow SDI Suharto HBDRILL SMOTH SUMAC SYNCARP kibo Blair 
Diplomat Kojarena CIA cracking counter-intelligence CABOUNCE anthrax
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 18:45   ` C. Scott Ananian
@ 2005-04-15 19:00     ` Derek Fawcus
  2005-04-15 19:11     ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Derek Fawcus @ 2005-04-15 19:00 UTC (permalink / raw)
  To: git

On Fri, Apr 15, 2005 at 02:45:55PM -0400, C. Scott Ananian wrote:
> > - we already have wasted space due to the low-level filesystem (as
> >   opposed to "git") usually being block-based, which means that space
> >   utilization for small objects tends to suck. So you really want to
> >   prefer objects that are several kB (compressed), and a small block just
> >   wastes tons of space.
> 
> Not on (say) reiserfs, and not over the network.  I'm proposing (at the 
> moment) easy conversion from chunked to unchunked disk representation,
> so that you can leave things unchunked if (for example) you know you're 
> running ext2 with a large block size.

Or if one does not care about space,  and simply want's speed,  add another
layer of indirection - a flattened container object which has hashses as
normal,  then as it's content simply has the 'chunk list object' and the
'chunk objects' concatenated.

It's then a per user / database as to if the flattened objects,  or the
heirarcal objects are storred locally.

DF

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 18:45   ` C. Scott Ananian
  2005-04-15 19:00     ` Derek Fawcus
@ 2005-04-15 19:11     ` Linus Torvalds
  2005-04-16 14:39       ` Martin Uecker
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2005-04-15 19:11 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: git



On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> 
> So I guess I'll have to implement this and find out, won't I? =)

The best way to shup somebody up is always to just do it, and say "hey, I 
told you so". It's hard to argue with numbers.

			Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 19:11     ` Linus Torvalds
@ 2005-04-16 14:39       ` Martin Uecker
  2005-04-16 15:11         ` C. Scott Ananian
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Uecker @ 2005-04-16 14:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On Fri, Apr 15, 2005 at 12:11:43PM -0700, Linus Torvalds wrote:


> On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> > 
> > So I guess I'll have to implement this and find out, won't I? =)
> 
> The best way to shup somebody up is always to just do it, and say "hey, I 
> told you so". It's hard to argue with numbers.

The right thing (TM) is to switch from SHA1 of compressed
content for the complete monolithic file to a merkle hash tree
of the uncompressed content. This would make the hash
independent of the actual storage method (chunked or not). 


Martin

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-16 14:39       ` Martin Uecker
@ 2005-04-16 15:11         ` C. Scott Ananian
  2005-04-16 17:37           ` Martin Uecker
  0 siblings, 1 reply; 12+ messages in thread
From: C. Scott Ananian @ 2005-04-16 15:11 UTC (permalink / raw)
  To: Martin Uecker; +Cc: git

On Sat, 16 Apr 2005, Martin Uecker wrote:

> The right thing (TM) is to switch from SHA1 of compressed
> content for the complete monolithic file to a merkle hash tree
> of the uncompressed content. This would make the hash
> independent of the actual storage method (chunked or not).

It would certainly be nice to change to a hash of the uncompressed 
content, rather than a hash of the compressed content, but it's not 
strictly necessary, since files are fetched all at once: there's not 'read 
subrange' operation on blobs.

I assume 'merkle hash tree' is talking about:
   http://www.open-content.net/specs/draft-jchapweske-thex-02.html
..which is very interesting, but not quite what I was thinking.
The merkle hash approach seems to require fixed chunk boundaries.
The rsync approach does not use fixed chunk boundaries; this is necessary 
to ensure good storage reuse for the expected case (ie; inserting a single 
line at the start or in the middle of the file, which changes all the 
chunk boundaries).

Further, in the absence of subrange reads on blobs, it's not entirely 
clear what using a merkle hash would buy you.
  --scott

WASHTUB supercomputer security Mk 48 justice ODUNIT radar COBRA JANE 
SSBN 731 BATF KUJUMP SECANT operation class struggle SYNCARP KGB ODACID
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-16 15:11         ` C. Scott Ananian
@ 2005-04-16 17:37           ` Martin Uecker
  2005-04-19 12:39             ` Martin Uecker
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Uecker @ 2005-04-16 17:37 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1895 bytes --]

On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
> On Sat, 16 Apr 2005, Martin Uecker wrote:
> 
> >The right thing (TM) is to switch from SHA1 of compressed
> >content for the complete monolithic file to a merkle hash tree
> >of the uncompressed content. This would make the hash
> >independent of the actual storage method (chunked or not).
> 
> It would certainly be nice to change to a hash of the uncompressed 
> content, rather than a hash of the compressed content, but it's not 
> strictly necessary, since files are fetched all at once: there's not 'read 
> subrange' operation on blobs.
> 
> I assume 'merkle hash tree' is talking about:
>   http://www.open-content.net/specs/draft-jchapweske-thex-02.html
> ..which is very interesting, but not quite what I was thinking.
> The merkle hash approach seems to require fixed chunk boundaries.

I don't know what is written there, but I don't
consider fixed chunk boundaries part of the definition.

> The rsync approach does not use fixed chunk boundaries; this is necessary 
> to ensure good storage reuse for the expected case (ie; inserting a single 
> line at the start or in the middle of the file, which changes all the 
> chunk boundaries).

Yes. The chunk boundaries should be determined deterministically
from local properties of the data. Use a rolling checksum over
some small window and split the file it it hits a special value (0).
This is what the rsyncable patch to zlib does.

> Further, in the absence of subrange reads on blobs, it's not entirely 
> clear what using a merkle hash would buy you.

The whole design of git is a hash tree. If you extend
this tree structure into files you end up with merkle
hash trees. Everything else is just more complicated.

Martin
 

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-16 17:37           ` Martin Uecker
@ 2005-04-19 12:39             ` Martin Uecker
  0 siblings, 0 replies; 12+ messages in thread
From: Martin Uecker @ 2005-04-19 12:39 UTC (permalink / raw)
  To: git; +Cc: Martin Uecker

[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On Sat, Apr 16, 2005 at 07:37:02PM +0200, Martin Uecker wrote:
> On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
 
> > The rsync approach does not use fixed chunk boundaries; this is necessary 
> > to ensure good storage reuse for the expected case (ie; inserting a single 
> > line at the start or in the middle of the file, which changes all the 
> > chunk boundaries).
> 
> Yes. The chunk boundaries should be determined deterministically
> from local properties of the data. Use a rolling checksum over
> some small window and split the file it it hits a special value (0).
> This is what the rsyncable patch to zlib does.

This is certainly uninteresting for source code repositories
but for people who manage repositories of rsyncable binary
packages this would save a lot of space, bandwidth and
cpu time (compared to rsync because the scanning phase is
not necessary anymore). 

Martin

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 17:19 space compression (again) C. Scott Ananian
  2005-04-15 18:34 ` Linus Torvalds
@ 2005-04-15 18:50 ` Derek Fawcus
  1 sibling, 0 replies; 12+ messages in thread
From: Derek Fawcus @ 2005-04-15 18:50 UTC (permalink / raw)
  To: git

On Fri, Apr 15, 2005 at 01:19:30PM -0400, C. Scott Ananian wrote:
> Why are blobs per-file?  [After all, Linus insists that files are an 
> illusion.]  Why not just have 'chunks', and assemble *these* 
> into blobs (read, 'files')?  A good chunk size would fit evenly into some 
> number of disk blocks (no wasted space!).

[ I've only been earwigging,  not paying a lot of attention,  however ...]

Funny I was just think of this having read Linus' discourse on
"files don't matter", the obvious chunking factor would be say
a function.

The problem being tending towards having very small files - I know
I tend to prefer small functions.  Hmm - a underlying filesystem that
efficiently stores small files - why does that ring a bell :-)

However the simple answer is to have a preparser for a file / tree
checkin which split say a .c file into it's associated chunks,  anf
represented it in git as a signed/hashed object.  i.e. a automatically
created extra level of indirection (as I seem to recall was added
somewhere else?).

  So say fred.c:

  /*
   * File boiler
   */
  #include <guff>
  #include <more guff>

  /*
   * Fn a boiler
   */
  int fn_a(args) {
  }

  /*
   * Fn b boiler
   */
  long fn_b(args) {
  }

Would be split into 4 parts within git,  the 'file object' which simply
points to the content objects,  and 3 contents objects,  being the stuff
before 'Fn a boiler',  fn_a and it's boiler,  fn_b and it's boiler.

The interesting bit is needing a preprocessor which can roughly parse
the code - i.e. detect where to place the boiler blocks.

You would then do most of your tree operations upon the file objects,
but get the space savings from the content objects being shared.

I suspect that simply to prevent pathological conditions you'd have to
arrange that the contents objects have a minimal size,  irrespective
of the number of desired chunks (functions) they would naturally
contain.  i.e. for compresion efficiency,  you may choose something like
2K as the minimal pre compression content object size.

DF

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
@ 2005-04-15 19:33 Ray Heasman
  2005-04-16 12:29 ` David Lang
  0 siblings, 1 reply; 12+ messages in thread
From: Ray Heasman @ 2005-04-15 19:33 UTC (permalink / raw)
  To: git

For for this email not threading properly, I have been lurking on the
mail list archives and just had to reply to this message.

I was planning to ask exactly this question, and Scott beat me to to. I
even wanted to call them "chunks" too. :-)

It's probably worthwhile for anyone discussing this subject to read this
link: http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf . I know it's
been posted before, but it really is worth reading. :-)

On Fri, 15 Apr 2005, Linus Torvalds wrote:
> On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> > 
> > Why are blobs per-file?  [After all, Linus insists that files are an 
> > illusion.]  Why not just have 'chunks', and assemble *these* 
> > into blobs (read, 'files')?  A good chunk size would fit evenly into some 
> > number of disk blocks (no wasted space!).
>
> I actually considered that. I ended up not doing it, because it's not 
> obvious how to "block" things up (and even more so because while I like 
> the notion, it flies in the face of the other issues I had: performance 
> and simplicity).

I don't think it's as bad as you think.

Let's conceptually have two types of files - Pobs (Proxy Objects, or
Pointer Objects), and chunks. Both are stored and referenced by their
content hash, as usual. Pobs just contain a list of hashes referencing
the chunks in a file. When a file is initially stored, we chunk it so
each chunk fits comfortably in a block, but otherwise we aren't too
critical about sizes. When a file is changed (say, a single line edit),
we update the chunk that contains that line, hash it and store it with
its new name, and update the Pob, which we rehash and restore. If a
chunk grows to be very large (say > 2 disk blocks), we can rechunk it
and update the Pob to include the new chunks.

> The problem with chunking is:
>  - it complicates a lot of the routines. Things like "is this file 
>    unchanged" suddenly become "is this file still the same set of chunks",
>    which is just a _lot_ more code and a lot more likely to have bugs.

You're half right; it will be more complex, but I don't think it's as
bad as you think. Pobs are stored by hash just like anything else. If
some chunks are different, the pob is different, which means it has a
different hash. It's exactly the same as dealing with changed file now.
Sure, when you have to fetch the data, you have to read the pob and get
a list of chunks to concatenate and return, but your example given
doesn't change.

>  - you have to find a blocking factor. I thought of just going it fixed 
>    chunks, and that just doesn't help at all. 

Just use the block size of the filesystem. Some filesystems do tail
packing, so space isn't an issue, though speed can be. We don't actually
care how big a chunk is, except to make it easy on the filesystem.
Individual chunks can be any size.

>  - we already have wasted space due to the low-level filesystem (as 
>    opposed to "git") usually being block-based, which means that space 
>    utilization for small objects tends to suck. So you really want to 
>    prefer objects that are several kB (compressed), and a small block just
>    wastes tons of space.

If a chunk is smaller than a disk block, this is true. However, if we
size it right this is no worse than any other file. Small files (less
than a block) can't be made any larger, so they waste space anyway.
Large files end up wasting space in one block unless they are a perfect
multiple of the block size.

When we increase the size of a chunk, it will waste space, but we would
have created an entire new file, so we win there too.

Admittedly, Pobs will be wasting space too.

On the other hand, I use ReiserFS, so I don't care. ;-)

>  - there _is_ a natural blocking factor already. That's what a file 
>    boundary really is within the project, and finding any other is really 
>    quite hard.

Nah. I think I've made a good case it isn't.

> So I'm personally 100% sure that it's not worth it. But I'm not opposed to
> the _concept_: it makes total sense in the "filesystem" view, and is 100%
> equivalent to having an inode with pointers to blocks. I just don't think 
> the concept plays out well in reality.

Well, the reason I think this would be worth it is that you really win
when you have multiple parallel copies of a source tree, and changes are
cheaper too. If you store all the chunks for all your git repositories
in one place, and otherwise treat your trees of Pobs as the real
repository, your copied trees only cost you space for the Pobs.
Obviously this also applies for file updates within past revisions of a
tree, but I don't know how much it would save. It fits beautifully into
the current abstraction, and saves space without having to resort to
rolling hashes or xdeltas.

The _real_ reason why I am excited about git is that I have a vision of
using this as the filesystem (in a FUSE wrapper or something) for my
home directory. MP3s and AVIs aside, it will make actual work much
easier for me. I have a dream; a dream where I save files using the same
name, safe in the knowledge that I can get to any version I want. I will
live in a world of autosaves, deletes without confirmation, and /etcs
immune from the vagaries of my package management systems, not to
mention users not asking me leading questions about backups. *sigh*
*sniff* Excuse me, I think I have to go now.

-Ray

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: space compression (again)
  2005-04-15 19:33 Ray Heasman
@ 2005-04-16 12:29 ` David Lang
  0 siblings, 0 replies; 12+ messages in thread
From: David Lang @ 2005-04-16 12:29 UTC (permalink / raw)
  To: Ray Heasman; +Cc: git

we alrady have the concept of objects that contain objects and therefor 
don'e need to be re-checked (directories), the chunks inside a file could 
be the same type of thing.

currently we say that if the hash on the directory is the same we don't 
need to re-check each of the files in that directory, this would be that 
if the hash on the file hasn't changed we don't need to re-check the 
chunks inside that file.

David Lang


  On Fri, 15 Apr 2005, Ray Heasman wrote:

> Date: Fri, 15 Apr 2005 12:33:03 -0700
> From: Ray Heasman <lists@mythral.org>
> To: git@vger.kernel.org
> Subject: Re: space compression (again)
> 
> For for this email not threading properly, I have been lurking on the
> mail list archives and just had to reply to this message.
>
> I was planning to ask exactly this question, and Scott beat me to to. I
> even wanted to call them "chunks" too. :-)
>
> It's probably worthwhile for anyone discussing this subject to read this
> link: http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf . I know it's
> been posted before, but it really is worth reading. :-)
>
> On Fri, 15 Apr 2005, Linus Torvalds wrote:
>> On Fri, 15 Apr 2005, C. Scott Ananian wrote:
>>>
>>> Why are blobs per-file?  [After all, Linus insists that files are an
>>> illusion.]  Why not just have 'chunks', and assemble *these*
>>> into blobs (read, 'files')?  A good chunk size would fit evenly into some
>>> number of disk blocks (no wasted space!).
>>
>> I actually considered that. I ended up not doing it, because it's not
>> obvious how to "block" things up (and even more so because while I like
>> the notion, it flies in the face of the other issues I had: performance
>> and simplicity).
>
> I don't think it's as bad as you think.
>
> Let's conceptually have two types of files - Pobs (Proxy Objects, or
> Pointer Objects), and chunks. Both are stored and referenced by their
> content hash, as usual. Pobs just contain a list of hashes referencing
> the chunks in a file. When a file is initially stored, we chunk it so
> each chunk fits comfortably in a block, but otherwise we aren't too
> critical about sizes. When a file is changed (say, a single line edit),
> we update the chunk that contains that line, hash it and store it with
> its new name, and update the Pob, which we rehash and restore. If a
> chunk grows to be very large (say > 2 disk blocks), we can rechunk it
> and update the Pob to include the new chunks.
>
>> The problem with chunking is:
>>  - it complicates a lot of the routines. Things like "is this file
>>    unchanged" suddenly become "is this file still the same set of chunks",
>>    which is just a _lot_ more code and a lot more likely to have bugs.
>
> You're half right; it will be more complex, but I don't think it's as
> bad as you think. Pobs are stored by hash just like anything else. If
> some chunks are different, the pob is different, which means it has a
> different hash. It's exactly the same as dealing with changed file now.
> Sure, when you have to fetch the data, you have to read the pob and get
> a list of chunks to concatenate and return, but your example given
> doesn't change.
>
>>  - you have to find a blocking factor. I thought of just going it fixed
>>    chunks, and that just doesn't help at all.
>
> Just use the block size of the filesystem. Some filesystems do tail
> packing, so space isn't an issue, though speed can be. We don't actually
> care how big a chunk is, except to make it easy on the filesystem.
> Individual chunks can be any size.
>
>>  - we already have wasted space due to the low-level filesystem (as
>>    opposed to "git") usually being block-based, which means that space
>>    utilization for small objects tends to suck. So you really want to
>>    prefer objects that are several kB (compressed), and a small block just
>>    wastes tons of space.
>
> If a chunk is smaller than a disk block, this is true. However, if we
> size it right this is no worse than any other file. Small files (less
> than a block) can't be made any larger, so they waste space anyway.
> Large files end up wasting space in one block unless they are a perfect
> multiple of the block size.
>
> When we increase the size of a chunk, it will waste space, but we would
> have created an entire new file, so we win there too.
>
> Admittedly, Pobs will be wasting space too.
>
> On the other hand, I use ReiserFS, so I don't care. ;-)
>
>>  - there _is_ a natural blocking factor already. That's what a file
>>    boundary really is within the project, and finding any other is really
>>    quite hard.
>
> Nah. I think I've made a good case it isn't.
>
>> So I'm personally 100% sure that it's not worth it. But I'm not opposed to
>> the _concept_: it makes total sense in the "filesystem" view, and is 100%
>> equivalent to having an inode with pointers to blocks. I just don't think
>> the concept plays out well in reality.
>
> Well, the reason I think this would be worth it is that you really win
> when you have multiple parallel copies of a source tree, and changes are
> cheaper too. If you store all the chunks for all your git repositories
> in one place, and otherwise treat your trees of Pobs as the real
> repository, your copied trees only cost you space for the Pobs.
> Obviously this also applies for file updates within past revisions of a
> tree, but I don't know how much it would save. It fits beautifully into
> the current abstraction, and saves space without having to resort to
> rolling hashes or xdeltas.
>
> The _real_ reason why I am excited about git is that I have a vision of
> using this as the filesystem (in a FUSE wrapper or something) for my
> home directory. MP3s and AVIs aside, it will make actual work much
> easier for me. I have a dream; a dream where I save files using the same
> name, safe in the knowledge that I can get to any version I want. I will
> live in a world of autosaves, deletes without confirmation, and /etcs
> immune from the vagaries of my package management systems, not to
> mention users not asking me leading questions about backups. *sigh*
> *sniff* Excuse me, I think I have to go now.
>
> -Ray
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-04-19 12:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-15 17:19 space compression (again) C. Scott Ananian
2005-04-15 18:34 ` Linus Torvalds
2005-04-15 18:45   ` C. Scott Ananian
2005-04-15 19:00     ` Derek Fawcus
2005-04-15 19:11     ` Linus Torvalds
2005-04-16 14:39       ` Martin Uecker
2005-04-16 15:11         ` C. Scott Ananian
2005-04-16 17:37           ` Martin Uecker
2005-04-19 12:39             ` Martin Uecker
2005-04-15 18:50 ` Derek Fawcus
  -- strict thread matches above, loose matches on Subject: below --
2005-04-15 19:33 Ray Heasman
2005-04-16 12:29 ` David Lang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).