* Re: Index/hash order
  [not found] ` <Pine.LNX.4.58.0504131144160.4501@ppc970.osdl.org>
@ 2005-04-13 20:02 ` Ingo Molnar
  2005-04-13 20:07   ` H. Peter Anvin
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2005-04-13 20:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, git

* Linus Torvalds <torvalds@osdl.org> wrote:

> > with a plaintext repository we could do the 'hardlink trick' (which
> > brings in other manageability problems and limitations but is at
> > least a partially good idea), which would make the working tree and
> > the repository share the same inode in most cases.
>
> However, the real issue is that you're really asking for trouble.
> There are tons of tools that modify files without breaking the
> hardlink. Even some editors do. So you just use the wrong tool on the
> tree by mistake, and not only is your archive corrupt, you've
> corrupted all other archives that might have shared the same object
> directory.

that's what i loosely meant by 'manageability problems'.

I mentioned one solution earlier: make the repository object an
immutable file (the +i flag on the inode) - it really wants to be
immutable, after all. That would solve a whole range of 'accidental
corruption' issues. Another solution (suggested by Christer Weinigel)
was to enforce immutability by making the object owned by another
user/group (git:git or whatever).

but having a binary compressed format is 'soft immutability', done
cleverly.

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Index/hash order
From: H. Peter Anvin @ 2005-04-13 20:07 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, git

Ingo Molnar wrote:
>
> that's what i loosely meant by 'manageability problems'.
>
> I mentioned one solution earlier: make the repository object an
> immutable file (the +i flag on the inode) - it really wants to be
> immutable, after all. That would solve a whole range of 'accidental
> corruption' issues.

I think abusing the immutable bit will quickly descend into the same
rathole which makes u-w often useless. u-w will actually be preserved
by more tools -- simply because they know about it -- than +i.

Either way, it feels to me that this idea has already been ruled out,
so it's probably pointless to keep debating just exactly what we're
not actually going to do.

	-hpa
* Re: Index/hash order
From: Ingo Molnar @ 2005-04-13 20:15 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Linus Torvalds, git

* H. Peter Anvin <hpa@zytor.com> wrote:

> > that's what i loosely meant by 'manageability problems'.
> >
> > I mentioned one solution earlier: make the repository object an
> > immutable file (the +i flag on the inode) - it really wants to be
> > immutable, after all. That would solve a whole range of 'accidental
> > corruption' issues.
>
> I think abusing the immutable bit will quickly descend into the same
> rathole which makes u-w often useless. u-w will actually be preserved
> by more tools -- simply because they know about it -- than +i.

well, the 'owned by another user' solution is valid though, and
doesn't have this particular problem. (We've got a secure multiuser
OS, so we might as well use it to protect the DB against corruption.)

> Either way, it feels to me that this idea has already been ruled
> out, so it's probably pointless to keep debating just exactly what
> we're not actually going to do.

(even if it sounds stupid, i keep discussing decisions that were made
for reasons i cannot fully agree with (yet), even if i happen to agree
with the net decision. It's all technological arguments, so it's not
like there's anything fuzzy about any of these issues.)

	Ingo
* Re: Index/hash order
From: Ingo Molnar @ 2005-04-13 20:18 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Linus Torvalds, git

* Ingo Molnar <mingo@elte.hu> wrote:

> > I think abusing the immutable bit will quickly descend into the
> > same rathole which makes u-w often useless. u-w will actually be
> > preserved by more tools -- simply because they know about it --
> > than +i.
>
> well, the 'owned by another user' solution is valid though, and
> doesn't have this particular problem. (We've got a secure multiuser
> OS, so we might as well use it to protect the DB against corruption.)

but ... this variant doesn't have any 'wow' feeling to it either, and
it clearly brings in a number of other limitations. I might as well
shut up until i can suggest something obviously superior :)

	Ingo
* Re: Index/hash order
From: Ingo Molnar @ 2005-04-13 20:21 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Linus Torvalds, git

* Ingo Molnar <mingo@elte.hu> wrote:

> > > I think abusing the immutable bit will quickly descend into the
> > > same rathole which makes u-w often useless. u-w will actually be
> > > preserved by more tools -- simply because they know about it --
> > > than +i.
> >
> > well, the 'owned by another user' solution is valid though, and
> > doesn't have this particular problem. (We've got a secure multiuser
> > OS, so we might as well use it to protect the DB against
> > corruption.)
>
> but ... this variant doesn't have any 'wow' feeling to it either, and
> it clearly brings in a number of other limitations. I might as well
> shut up until i can suggest something obviously superior :)

i think the killer argument is compression. A 2 GB compressed
repository will be a hard sell already; 4 GB is pretty much out of the
question. And once we accept that we have to have _some_ form of
compression, it's Linus' scheme that wins.

	Ingo
* Updated base64 patches
From: H. Peter Anvin @ 2005-04-13 20:26 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, git

I have uploaded two new base64 patches, one which uses the flat
repository and one which doesn't:

ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/git-0.04-base64-3.diff
ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/git-0.04-base64-flat-3.diff

... both are still against the git-0.04 tarball.

The only differences are changing "char" to "signed char" in places
where it actually matters (since plain char is unsigned on some
platforms), and, for the non-flat version, allowing the cache
subdirectories to be lazily created (if ENOENT is returned, try mkdir
before giving up).

	-=hpa
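The lazy-creation logic hpa describes amounts to a retry around ENOENT;
a rough sketch (in Python for brevity, with made-up names - the actual
patch is C against the git-0.04 tarball):

```python
import os
import tempfile

def open_object_file(path):
    """Open a new object file for writing; if the fan-out subdirectory
    is missing (ENOENT), create it and retry once before giving up."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        return os.open(path, flags, 0o444)
    except FileNotFoundError:
        # Lazily create the missing subdirectory, then retry the open.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        return os.open(path, flags, 0o444)

# Demo: the 'ab' fan-out directory does not exist yet.
root = tempfile.mkdtemp()
obj = os.path.join(root, "objects", "ab", "cdef")
fd = open_object_file(obj)
os.close(fd)
created = os.path.exists(obj)
```

The first open fails because neither `objects` nor `objects/ab` exists;
the retry after mkdir succeeds, so a fresh repository needs no
pre-created subdirectories.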
* Re: Index/hash order
From: Linus Torvalds @ 2005-04-13 21:04 UTC (permalink / raw)
To: Ingo Molnar; +Cc: H. Peter Anvin, git

On Wed, 13 Apr 2005, Ingo Molnar wrote:
>
> well, the 'owned by another user' solution is valid though, and
> doesn't have this particular problem. (We've got a secure multiuser
> OS, so we might as well use it to protect the DB against corruption.)

So now you need root to set up new repositories? No thanks.

		Linus
* enforcing DB immutability
From: Ingo Molnar @ 2005-04-20 7:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, git

* Linus Torvalds <torvalds@osdl.org> wrote:

> On Wed, 13 Apr 2005, Ingo Molnar wrote:
> >
> > well, the 'owned by another user' solution is valid though, and
> > doesn't have this particular problem. (We've got a secure multiuser
> > OS, so we might as well use it to protect the DB against
> > corruption.)
>
> So now you need root to set up new repositories? No thanks.

yeah, it's a bit awkward to protect uncompressed repositories - but it
will need some sort of kernel enforcement. (if userspace finds out the
DB contains uncompressed blobs, it _will_ try to use them.)

(perhaps having an in-kernel GIT-alike versioned filesystem would help
- but that brings up the same 'i have to be root' issues. The FS would
enforce the true immutability of objects.)

perhaps having a new 'immutable hardlink' feature in the Linux VFS
would help? I.e. a hardlink that can only be followed read-only, and
can be removed, but cannot be chmod-ed into a writable hardlink. That
i think would be a large enough barrier for editors/build-tools not to
play the tricks they already play that make 'readonly' files virtually
meaningless.

	Ingo
* Re: enforcing DB immutability
From: Ingo Molnar @ 2005-04-20 7:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, git

* Ingo Molnar <mingo@elte.hu> wrote:

> perhaps having a new 'immutable hardlink' feature in the Linux VFS
> would help? I.e. a hardlink that can only be followed read-only, and
> can be removed, but cannot be chmod-ed into a writable hardlink. That
> i think would be a large enough barrier for editors/build-tools not
> to play the tricks they already play that make 'readonly' files
> virtually meaningless.

immutable hardlinks have the following advantage: a hardlink by design
hides the information about where the link comes from. So even if an
editor wanted to play stupid games and override the immutability - it
doesn't know where the DB object is. (sure, it could find it if it
wanted to, but that needs real messing around - editors won't do
_that_)

i think this might work.

(the current chattr +i flag isn't quite what we need though, because
it works on the inode, and it's also a root-only feature, so it puts
us back to square one. What would be needed is an immutability flag on
hardlinks, settable by unprivileged users.)

	Ingo
* Re: enforcing DB immutability
From: Ingo Molnar @ 2005-04-20 7:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, git

* Ingo Molnar <mingo@elte.hu> wrote:

> > perhaps having a new 'immutable hardlink' feature in the Linux VFS
> > would help? I.e. a hardlink that can only be followed read-only,
> > and can be removed, but cannot be chmod-ed into a writable
> > hardlink. That i think would be a large enough barrier for
> > editors/build-tools not to play the tricks they already play that
> > make 'readonly' files virtually meaningless.
>
> immutable hardlinks have the following advantage: a hardlink by
> design hides the information about where the link comes from. So even
> if an editor wanted to play stupid games and override the
> immutability - it doesn't know where the DB object is. (sure, it
> could find it if it wanted to, but that needs real messing around -
> editors won't do _that_)

so the only sensible thing the editor/tool can do when it wants to
change the file is precisely what we want: it will copy the hardlinked
file's contents to a new file, and will replace the old file with the
new file - a copy on write. No accidental corruption of the DB's
contents.

(another in-kernel VFS solution would be to enforce that the file's
name always matches the sha1 hash. So if someone edits a DB object it
will automatically change its name. But this is complex, probably
cannot be done atomically, and brings up other problems as well.)

	Ingo
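The copy-on-write dance Ingo describes - write the new contents to a
separate file, then rename it over the old name, leaving the hardlinked
DB object untouched - can be sketched like this (a hypothetical helper
for illustration, not git code):

```python
import os
import tempfile

def replace_file(path, new_data):
    """Replace 'path' by writing a new file and renaming it into place.
    If 'path' was a hardlink into the object database, the DB copy is
    untouched: rename() only repoints the directory entry."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as f:
        f.write(new_data)
    os.rename(tmp, path)  # atomic replacement on POSIX

# Demo: 'work' and 'db' share an inode; editing 'work' this way
# leaves the 'db' object alone.
root = tempfile.mkdtemp()
db = os.path.join(root, "db-object")
work = os.path.join(root, "work-copy")
with open(db, "wb") as f:
    f.write(b"original")
os.link(db, work)           # the 'hardlink trick'
replace_file(work, b"edited")
db_data = open(db, "rb").read()
work_data = open(work, "rb").read()
```

A tool that instead opens the shared file for writing in place would
corrupt both names at once; the rename approach is what well-behaved
editors already do, which is exactly the behaviour the immutable
hardlink would force.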
* Re: enforcing DB immutability
From: Chris Wedgwood @ 2005-04-20 8:58 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, H. Peter Anvin, git

On Wed, Apr 20, 2005 at 09:53:20AM +0200, Ingo Molnar wrote:

> so the only sensible thing the editor/tool can do when it wants to
> change the file is precisely what we want: it will copy the
> hardlinked file's contents to a new file, and will replace the old
> file with the new file - a copy on write. No accidental corruption
> of the DB's contents.

editors that have SCM smarts and know about a file's different states
can do this.

i really like the way this works under BK, btw --- files are RO until
i do the magic thing which does a 'bk edit', and i can then do
checkins or similar as needed (this assumes you can do per-file
deltas)
* Re: enforcing DB immutability
From: Nick Craig-Wood @ 2005-04-20 14:57 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, H. Peter Anvin, git

On Wed, Apr 20, 2005 at 09:49:48AM +0200, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > perhaps having a new 'immutable hardlink' feature in the Linux VFS
> > would help? I.e. a hardlink that can only be followed read-only,
> > and can be removed, but cannot be chmod-ed into a writable
> > hardlink. That i think would be a large enough barrier for
> > editors/build-tools not to play the tricks they already play that
> > make 'readonly' files virtually meaningless.
>
> immutable hardlinks have the following advantage: a hardlink by
> design hides the information about where the link comes from. So even
> if an editor wanted to play stupid games and override the
> immutability - it doesn't know where the DB object is. (sure, it
> could find it if it wanted to, but that needs real messing around -
> editors won't do _that_)

This has already been implemented for the Linux vserver project. Take
a look in the patch here:

http://vserver.13thfloor.at/Experimental/patch-2.6.11.7-vs1.9.5.x5.diff.bz2

(It's not split out, but search for IMMUTABLE and you'll see what I
mean.)

It implements immutable linkage invert, which basically allows people
to delete hardlinks to immutable files, but not do anything else to
them. It uses another bit out of the attributes to "invert" the
immutability of the linkage of immutable files.

It's used in the vserver project so that individual vservers (which
are basically just fancy chroots) can share libraries, binaries and
hence memory, can't muck each other up, but can still upgrade their
libs/binaries.

-- 
Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick
* Re: enforcing DB immutability
From: Wout @ 2005-04-27 8:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: git

On Wed, Apr 20, 2005 at 09:49:48AM +0200, Ingo Molnar wrote:
>
> immutable hardlinks have the following advantage: a hardlink by
> design hides the information about where the link comes from. So even
> if an editor wanted to play stupid games and override the
> immutability - it doesn't know where the DB object is. (sure, it
> could find it if it wanted to, but that needs real messing around -
> editors won't do _that_)
>
> i think this might work.
>
> (the current chattr +i flag isn't quite what we need though, because
> it works on the inode, and it's also a root-only feature, so it puts
> us back to square one. What would be needed is an immutability flag
> on hardlinks, settable by unprivileged users.)

Slightly off-topic for this list. Apologies to those offended.

Would a filesystem that allows sharing of blocks between inodes be
useful here? Each block would need a reference count (refco). Writing
a block would be impossible once refco > 1. If someone attempts to
write to such a block, a new block is allocated for that particular
inode and the refco of the original is decreased.

Next to this there would have to be a clone_file() function:

	clone_file(src-file, dst-file, mode)

This function would create file dst-file with a new inode that
references the blocks belonging to src-file (increasing the blocks'
reference counts). The owner/group of dst-file are the caller, not the
owner of src-file. Things to check for are:

 - read permissions for src-file
 - write permissions for dst-file
 - are src-file and dst-file in the same filesystem (if not, one could
   implement copy)
 - ...?

Suppose I have a file foo:

	foo -> inode1(blk1[1], blk2[1], blk3[1], blk4[1])

The [n] value on the blocks is the reference count. I now call
clone_file("foo", "bar", 0644):

	foo -> inode1(blk1[2], blk2[2], blk3[2], blk4[2])
	bar -> inode2(blk1[2], blk2[2], blk3[2], blk4[2])

Next I modify blk2 of bar (write):

	foo -> inode1(blk1[2], blk2[1], blk3[2], blk4[2])
	bar -> inode2(blk1[2], blk5[1], blk3[2], blk4[2])

I see the following uses:

 - Checking out a tree of (uncompressed) files with git could be done
   using the clone_file() call on each file. This means no extra disk
   space is used unless files are edited later.
 - An easy way to freeze files for backups. A database (mysql, ...)
   could bring its files into an acceptable state, call clone_file()
   on them and proceed with its work.
 - It could be used to protect user files from external tampering.
   Someone mentioned the problems with malware killing his files. The
   impact of this could be reduced by having a script that did a
   clone_file() on everything as root periodically. If files are
   deleted, root would have a backup.

Notes:

 - Small changes to files would probably cause all the blocks to be
   copied, as programs (editors) usually write out the complete file.
 - I don't know anything about implementing filesystems, so all of the
   above could be complete nonsense.
 - The idea isn't mine; I've come across this before under the name of
   'snapshot filesystems', and I think it was patented. I've never
   heard of anyone doing this for individual files though.

	Wout
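Wout's block-sharing scheme can be modelled in a few lines (a toy
in-memory sketch, not a filesystem implementation; the name clone_file
mirrors the proposal):

```python
class BlockFS:
    """Toy model of the proposal: a file is a list of block IDs,
    blocks carry reference counts, and a write to a shared block
    allocates a fresh one (copy-on-write)."""
    def __init__(self):
        self.blocks = {}   # block id -> [data, refcount]
        self.files = {}    # name -> list of block ids
        self.next_id = 1

    def _alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = [data, 1]
        return bid

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def clone_file(self, src, dst):
        # New inode referencing the same blocks; bump each refcount.
        self.files[dst] = list(self.files[src])
        for bid in self.files[dst]:
            self.blocks[bid][1] += 1

    def write_block(self, name, idx, data):
        bid = self.files[name][idx]
        if self.blocks[bid][1] > 1:
            # Block is shared: drop our reference and copy on write.
            self.blocks[bid][1] -= 1
            self.files[name][idx] = self._alloc(data)
        else:
            self.blocks[bid][0] = data

# The foo/bar example from the mail above.
fs = BlockFS()
fs.create("foo", [b"blk1", b"blk2", b"blk3", b"blk4"])
fs.clone_file("foo", "bar")
fs.write_block("bar", 1, b"blk5")
```

After the write, foo still sees its original second block, bar sees the
new one, and the untouched blocks remain shared with refcount 2 - the
same before/after state as the inode diagrams above.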
* Re: Index/hash order
From: Linus Torvalds @ 2005-04-13 20:15 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Ingo Molnar, git

On Wed, 13 Apr 2005, H. Peter Anvin wrote:
>
> Either way, it feels to me that this idea has already been ruled out,
> so it's probably pointless to keep debating just exactly what we're
> not actually going to do.

Hey, isn't that how most discussions progress? ;)

I don't mind alternatives per se. I'm just lazy. I came up with one
solution to the issues I perceived, and I like that one. But dammit,
if somebody comes up with something _clearly_ superior, I'll just bow
down in your general direction, and promptly implement that.

		Linus
* Re: Index/hash order
  [not found] ` <20050413182909.GA25221@elte.hu>
From: Baruch Even @ 2005-04-13 20:28 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, H. Peter Anvin, git

Ingo Molnar wrote:

> with a plaintext repository we could do the 'hardlink trick' (which
> brings in other manageability problems and limitations but is at
> least a partially good idea), which would make the working tree and
> the repository share the same inode in most cases.
>
> While in the compressed case we'd have a separate compressed inode
> (taking up RAM with all its contents) and the working directory inode
> (taking up RAM) - summing up to more RAM than if we only had a single
> inode per object.
>
> furthermore, when generating/destroying large trees (which is a quite
> common thing), a hardlinked solution is faster, as it doesn't create
> 250MB+ of dirty RAM. In some cases (e.g. handling dozens of 'merge
> trees') it's dramatically faster.

You could still have the hardlink trick by way of a .git/cache that
keeps uncompressed files: keep the files with their hash names, but
uncompressed. It will be easy to find and fully hard-linkable, and
only the needed files are kept uncompressed, with the three-year-old
files staying compressed.

You can even save some CPU time by checking whether the file is in the
cache before decompressing it, though it does cost you an extra disk
access to see whether it's there or not. If you repeat the operation
often enough, you'll have the uncompressed version in the cache most
of the time anyway. Clear the cache weekly or so to avoid stale files
from an ancient version.

	Baruch
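The .git/cache idea Baruch sketches - a read-through cache of
uncompressed objects keyed by the same hash name - might look roughly
like this (illustrative layout and names, not actual git behaviour):

```python
import hashlib
import os
import tempfile
import zlib

def read_object(objects, cache, sha1):
    """Read an object, keeping an uncompressed copy in a cache
    directory under the same hash name. Repeated reads are served
    from the cache without decompression."""
    cached = os.path.join(cache, sha1)
    if os.path.exists(cached):
        return open(cached, "rb").read()
    data = zlib.decompress(open(os.path.join(objects, sha1), "rb").read())
    with open(cached, "wb") as f:   # populate the cache for next time
        f.write(data)
    return data

# Demo: one compressed object in the store, read twice.
root = tempfile.mkdtemp()
objects = os.path.join(root, "objects"); os.mkdir(objects)
cache = os.path.join(root, "cache"); os.mkdir(cache)
body = b"hello, git\n"
sha1 = hashlib.sha1(body).hexdigest()
with open(os.path.join(objects, sha1), "wb") as f:
    f.write(zlib.compress(body))
first = read_object(objects, cache, sha1)   # decompresses, fills cache
second = read_object(objects, cache, sha1)  # served from the cache
```

Since the cached copy carries the hash name, a checkout could hardlink
against it exactly as with a flat repository, and blowing the cache
away is always safe - it can be regenerated from the compressed store.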
* Re: Index/hash order
  [not found] ` <Pine.LNX.4.58.0504131008500.4501@ppc970.osdl.org>
From: Florian Weimer @ 2005-04-13 21:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, Ingo Molnar, git

* Linus Torvalds:

> - I want things to distribute well. This means that it has to be
>   based on an "append data" model, where historical data never
>   changes, and you only append on top of it (either by adding totally
>   new files, or by just letting the files grow).

Yes, I think this is something which can easily dominate the choice of
data structure.

> This works in a forward-delta environment (which is fundamentally
> based on the notion of "we know the old version, we're adding new
> stuff on top of it"), but does _not_ work in the backwards model of
> "we keep the old history as a delta against the new" model.

Forward deltas don't have to be terribly inefficient. You can get
O(log n) access to revision n fairly easily, using the trick described
here:

<http://svn.collab.net/repos/svn/trunk/notes/skip-deltas>

I've run a few tests, just to get some numbers on the overhead
involved. I used the last ~8,000 changesets from the BKCVS kernel
repository. A cold-cache checkout takes about 250 seconds on my
laptop. I don't have git numbers, but a mere copy of the kernel tree
needs 40 seconds.

For the hot-cache case, the difference is 140 seconds vs. 2.5 seconds
(or 6 seconds with decompression).

Uh-oh. I wouldn't have imagined the difference would be *that*
dramatic. The file system layer is *fast*. Subversion's delta
implementation is not a speed daemon (it handles arbitrarily large
files, which increases complexity significantly and slows things down,
compared to simpler in-memory algorithms), but it will be very hard to
come even close to the 2.5 seconds.

On the storage front, we have 220 MB for the skip deltas vs. 106 MB
for pure deltas-to-previous vs. 1.1 GB for uncompressed files
(directories are always delta-compressed, so to speak[1]). In the
first two cases, the first revision in the repository is deltaed
against /dev/null and is thus itself compressed, in case you think the
numbers are suspiciously low.

[1] AFAICS, you can't really avoid that if you want to track file
identity information without introducing arbitrary file IDs.
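The skip-delta trick in the note Florian links to boils down to
choosing each revision's delta base by clearing the lowest set bit of
the revision number, which bounds every delta chain by the number of
set bits in n; a sketch (a simplified model of the scheme, not
Subversion code):

```python
def skip_delta_base(rev):
    """Base revision against which revision 'rev' is stored as a delta
    in the skip-delta scheme: rev with its lowest set bit cleared."""
    return rev & (rev - 1)

def chain(rev):
    """Revisions touched when reconstructing 'rev' from revision 0."""
    out = [rev]
    while rev:
        rev = skip_delta_base(rev)
        out.append(rev)
    return out

# Reconstructing revision 8000 walks only a handful of deltas
# (one per set bit of 8000), not 8000 of them.
depth_8000 = len(chain(8000))
```

With deltas-to-previous, reconstructing revision n costs n delta
applications; here it costs popcount(n), at the price of each delta
spanning a wider range (hence Florian's 220 MB vs. 106 MB figures).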
* Re: Index/hash order
From: Linus Torvalds @ 2005-04-13 22:11 UTC (permalink / raw)
To: Florian Weimer; +Cc: H. Peter Anvin, Ingo Molnar, git

On Wed, 13 Apr 2005, Florian Weimer wrote:
>
> I've run a few tests, just to get some numbers on the overhead
> involved. I used the last ~8,000 changesets from the BKCVS kernel
> repository. A cold-cache checkout takes about 250 seconds on my
> laptop. I don't have git numbers, but a mere copy of the kernel tree
> needs 40 seconds.

I will bet you that a git checkout is _faster_ than a kernel source
tree copy. The time will be dominated by the IO costs (in particular
the read costs), and the IO costs are lower thanks to compression. So
I think that the cold-cache case will beat your 40 seconds by a clear
margin. It generally compresses to half the size, so 20 seconds is not
impossible (although seek costs would tend to stay constant, so I'd
expect it to be somewhere in between the two).

> For the hot-cache case, the difference is 140 seconds vs. 2.5 seconds
> (or 6 seconds with decompression).
>
> Uh-oh. I wouldn't have imagined the difference would be *that*
> dramatic. The file system layer is *fast*.

Did I mention that I designed git for speed? Yes. The whole damn
design is really about performance, distribution, and built-in
integrity checking.

> On the storage front, we have 220 MB for the skip deltas vs. 106 MB
> for pure deltas-to-previous vs. 1.1 GB for uncompressed files
> (directories are always delta-compressed, so to speak[1]).

That's actually pretty encouraging. Your 1.1 GB number implies to me
that a compressed file setup should be about half that, which in turn
says that the cost of full-file storage is not at all outrageous.

Sure, it's 2-3 times larger than your skip deltas, but considering
that the performance is about fifty times faster (and I can do
distributed stuff without any locking synchronization and you can't),
that's a tradeoff I'm more than happy with.

Or maybe I misunderstood what you were comparing?

Of course, the numbers will all depend on how the history looks etc.,
so this is all pretty much just guidelines.

		Linus
* Re: Index/hash order
From: Florian Weimer @ 2005-04-13 22:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, Ingo Molnar, git

* Linus Torvalds:

> I will bet you that a git checkout is _faster_ than a kernel source
> tree copy. The time will be dominated by the IO costs (in particular
> the read costs), and the IO costs are lower thanks to compression. So
> I think that the cold-cache case will beat your 40 seconds by a clear
> margin. It generally compresses to half the size, so 20 seconds is
> not impossible (although seek costs would tend to stay constant, so
> I'd expect it to be somewhere in between the two).

It's indeed slightly faster (34 seconds). The hot-cache case is about
6 seconds. Still okay.

However, I should redo these tests with a real git. The numbers could
be quite different because seek overhead is a bit hard to predict.
Which version should I try?

> That's actually pretty encouraging. Your 1.1 GB number implies to me
> that a compressed file setup should be about half that, which in turn
> says that the cost of full-file storage is not at all outrageous.

I usually try to avoid the typical O(f(n)) fallacy because constant
factors do matter in practice. But the way you put it -- maybe delta
compression isn't worth the complexity after all. At least I'm
beginning to have doubts. Especially since the same Subversion
repository, stored by the Berkeley DB backend, requires a whopping
1.3 GB of disk space.

> Or maybe I misunderstood what you were comparing?

My estimates only cover file data, not metadata. Based on the
Subversion dumps, it might be possible to get some rough estimates for
the cost of storing directory information. What is the average size of
a directory blob? Is it true that for each tree revision, you need to
store a new directory blob for each directory which indirectly
contains a modified file?

Does your 50% estimate include wasted space due to the file system
block size?
* Re: Index/hash order
From: Ingo Molnar @ 2005-04-14 7:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Florian Weimer, H. Peter Anvin, git

* Linus Torvalds <torvalds@osdl.org> wrote:

> > I've run a few tests, just to get some numbers on the overhead
> > involved. I used the last ~8,000 changesets from the BKCVS kernel
> > repository. A cold-cache checkout takes about 250 seconds on my
> > laptop. I don't have git numbers, but a mere copy of the kernel
> > tree needs 40 seconds.
>
> I will bet you that a git checkout is _faster_ than a kernel source
> tree copy. The time will be dominated by the IO costs (in particular
> the read costs), and the IO costs are lower thanks to compression. So
> I think that the cold-cache case will beat your 40 seconds by a clear
> margin. It generally compresses to half the size, so 20 seconds is
> not impossible (although seek costs would tend to stay constant, so
> I'd expect it to be somewhere in between the two).

i'd be surprised if it was twice as fast - cache-cold linear checkouts
are _seek_ limited, and it doesn't matter whether, after a 1-2 msec
track-to-track disk seek, the DMA engine spends another 30
microseconds DMA-ing 60K of uncompressed data instead of 30K
compressed... (there are other factors, but this is the main thing.)

	Ingo
* cache-cold repository performance 2005-04-14 7:04 ` Ingo Molnar @ 2005-04-14 10:50 ` Ingo Molnar 0 siblings, 0 replies; 20+ messages in thread From: Ingo Molnar @ 2005-04-14 10:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Florian Weimer, H. Peter Anvin, git * Ingo Molnar <mingo@elte.hu> wrote: > i'd be surprised if it was twice as fast - cache-cold linear checkouts > are _seek_ limited, and it doesnt matter whether after a 1-2 msec > track-to-track disk seek the DMA engine spends another 30 microseconds > DMA-ing 60K uncompressed data instead of 30K compressed... (there are > other factors, but this is the main thing.) i've benchmarked cache-cold compressed vs. uncompressed performance, to shed some more light on the performance differences between flat and compressed repositories. i did alot of testing, and i primarily concentrated on being able to _trust_ the benchmark results, not to generate some quick numbers. The major problem was that the timing of the reads associated with 'checking out a large tree' is very unstable, even on a completely isolated testsystem with very common (and predictable) IO hardware. the content i tested was a vanilla 2.6.10 kernel tree, with 19042 files in it, taking 246 MB uncompressed, and 110 MB compressed (via gzip -9). Average file size is 13.2 KB uncompressed, 5.9 KB compressed. Firstly, the timings are very sensitive to the way the tree was created. To have a 'fair' on-disk layout the trees have to be created in an identical fashion: e.g. it is not valid to copy the uncompressed tree and run gzip over it - that will create a 'sparse' on-disk layout penalizing the compressed layout and making it 30% slower than the uncompressed layout! I first created the two trees, then i "cp -a"-ed them over into a new directory one after each other, so that they get on similar on-disk positions as well. I also created 2 more pairs of such trees to make sure disk layout is fair. 
all timings were taken fresh after a reboot, on a UP 1 GB RAM Athlon64
3200+, using a large, top-of-the-line IDE disk. The kernel was
2.6.12-rc2, the filesystem was ext3 with enough free space not to be
fragmented, and both noatime and nodiratime were specified so that no
write activity whatsoever occurs during the 'checkout'.

the operation timed was a simple:

	time find . -type f | xargs cat > /dev/null

done in the root of the given tree. This generates the very same
read-only IO pattern for each test. I've run the tests 10 times (i.e.
have done 10 fresh reboots), but after every reboot i permutated the
order of the trees tested - to make sure there is no interaction
between trees. (there was no interaction.)

here are the raw numbers, elapsed real time in seconds:

  flat-1: 29.7 29.5 29.4 29.4 29.5 29.5 29.7 29.6 29.4 29.6 29.5 29.4:  29.5
  gzip-1: 41.2 40.9 40.7 40.7 40.5 41.7 41.0 40.3 40.6 40.8 40.8 40.9:  40.8
  flat-2: 28.0 28.2 27.7 27.9 27.8 27.9 27.7 27.9 27.9 28.1 27.9 28.0:  27.9
  gzip-2: 27.2 27.4 27.4 27.2 27.2 27.2 27.2 27.2 27.1 27.3 27.2 27.4:  27.2
  flat-3: 27.0 27.8 27.6 27.7 27.8 27.8 27.8 27.7 27.8 27.6 27.8 27.8:  27.6
  gzip-3: 25.8 26.8 26.6 26.5 26.5 26.5 26.6 26.4 26.5 26.7 26.6 26.7:  26.5

The final column is the average. (Standard deviation is below 0.1 sec,
less than 0.3%.)

flat-1 is the original tree, created via tar. gzip-1 is a "cp -a" copy
of it, per-file compressed afterwards. flat-2 is a "cp -a" copy of
flat-1; gzip-2 is a "cp -a" copy of gzip-1. flat-3/gzip-3 are "cp -a"
copies of flat-2/gzip-2.

note that gzip-1 is ~40% slower due to its 'sparse layout', so its
results approximate a repository with a 'bad' file layout. I'd not
expect GIT repositories to have such a layout normally, so we can
disregard it. flat-2/3 and gzip-2/3 can be directly compared.
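[Editorial note: the averages in the final column and the relative differences discussed in the follow-up can be reproduced from the raw rows (taking the directly comparable pairs only); a quick check:]

```python
# Recompute means and relative differences from the raw timing rows
# above (flat-2/3 and gzip-2/3, the directly comparable trees).
from statistics import mean, pstdev

runs = {
    "flat-2": [28.0, 28.2, 27.7, 27.9, 27.8, 27.9, 27.7, 27.9, 27.9, 28.1, 27.9, 28.0],
    "gzip-2": [27.2, 27.4, 27.4, 27.2, 27.2, 27.2, 27.2, 27.2, 27.1, 27.3, 27.2, 27.4],
    "flat-3": [27.0, 27.8, 27.6, 27.7, 27.8, 27.8, 27.8, 27.7, 27.8, 27.6, 27.8, 27.8],
    "gzip-3": [25.8, 26.8, 26.6, 26.5, 26.5, 26.5, 26.6, 26.4, 26.5, 26.7, 26.6, 26.7],
}

for name, xs in runs.items():
    print(f"{name}: mean {mean(xs):.2f}s, stddev {pstdev(xs):.2f}s")

# relative speedup of the compressed tree over its flat counterpart
speedup_2 = (mean(runs["flat-2"]) - mean(runs["gzip-2"])) / mean(runs["flat-2"])
speedup_3 = (mean(runs["flat-3"]) - mean(runs["gzip-3"])) / mean(runs["flat-3"])
print(f"gzip-2 vs flat-2: {speedup_2:.1%} faster")
print(f"gzip-3 vs flat-3: {speedup_3:.1%} faster")
```

The recomputed speedups come out at roughly 2.4% and 4.2%, matching the ~2.5% and ~4% figures in the analysis.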
Firstly, the results show that the on-disk layout cannot be constructed
fully reliably - there's a 1% systematic difference between flat-2 and
flat-3, and a 3% systematic difference between gzip-2 and gzip-3 - both
systematic errors are larger than the 0.5% standard deviation, so they
are not measurement errors but real layout properties of these trees.

the most interesting result is that gzip-2 is 2.5% faster than flat-2,
and gzip-3 is 4% faster than flat-3. These differences are close to the
layout-related systematic error, but slightly above it, so i'd conclude
that a compressed repository is ~3% faster on this hardware. (since
these results were in line with my expectations, i double-checked
everything again and did another 10 reboot tests - same results.)

conclusion [*]: there's a negligible cache-cold performance hit from
using an uncompressed repository, because cache-cold performance is
dominated by the number of seeks, which is almost identical in the two
cases.

	Ingo

[*] lots of conditionals apply: these weren't flat/compressed GIT
repositories (although they were quite similar to them), nor was the
GIT workload measured (although the workload that was measured should
be quite close to it).

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2005-04-27 8:10 UTC | newest]
Thread overview: 20+ messages
-- links below jump to the message on this page --
[not found] <425C3F12.9070606@zytor.com>
[not found] ` <Pine.LNX.4.58.0504121452330.4501@ppc970.osdl.org>
[not found] ` <20050412224027.GB20821@elte.hu>
[not found] ` <Pine.LNX.4.58.0504121554140.4501@ppc970.osdl.org>
[not found] ` <20050412230027.GA21759@elte.hu>
[not found] ` <20050412230729.GA22179@elte.hu>
[not found] ` <20050413111355.GB13865@elte.hu>
[not found] ` <425D4E1D.4040108@zytor.com>
[not found] ` <20050413165310.GA22428@elte.hu>
[not found] ` <425D4FB1.9040207@zytor.com>
[not found] ` <20050413171052.GA22711@elte.hu>
[not found] ` <Pine.LNX.4.58.0504131027210.4501@ppc970.osdl.org>
[not found] ` <20050413182909.GA25221@elte.hu>
[not found] ` <Pine.LNX.4.58.0504131144160.4501@ppc970.osdl.org>
2005-04-13 20:02 ` Index/hash order Ingo Molnar
2005-04-13 20:07 ` H. Peter Anvin
2005-04-13 20:15 ` Ingo Molnar
2005-04-13 20:18 ` Ingo Molnar
2005-04-13 20:21 ` Ingo Molnar
2005-04-13 20:26 ` Updated base64 patches H. Peter Anvin
2005-04-13 21:04 ` Index/hash order Linus Torvalds
2005-04-20 7:40 ` enforcing DB immutability Ingo Molnar
2005-04-20 7:49 ` Ingo Molnar
2005-04-20 7:53 ` Ingo Molnar
2005-04-20 8:58 ` Chris Wedgwood
2005-04-20 14:57 ` Nick Craig-Wood
2005-04-27 8:15 ` Wout
2005-04-13 20:15 ` Index/hash order Linus Torvalds
2005-04-13 20:28 ` Baruch Even
[not found] ` <Pine.LNX.4.58.0504131008500.4501@ppc970.osdl.org>
2005-04-13 21:40 ` Florian Weimer
2005-04-13 22:11 ` Linus Torvalds
2005-04-13 22:48 ` Florian Weimer
2005-04-14 7:04 ` Ingo Molnar
2005-04-14 10:50 ` cache-cold repository performance Ingo Molnar