From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Petr Baudis <pasky@ucw.cz>, "Randy.Dunlap" <rddunlap@osdl.org>,
Ross Vandegrift <ross@jose.lug.udel.edu>,
Kernel Mailing List <linux-kernel@vger.kernel.org>,
git@vger.kernel.org
Subject: [rfc] git: combo-blobs
Date: Mon, 11 Apr 2005 13:35:23 +0200 [thread overview]
Message-ID: <20050411113523.GA19256@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.58.0504091617000.1267@ppc970.osdl.org>
i think all of the 'repository size' and 'bandwidth' concerns could be
solved via a new (and pretty much simple and transparent) object type:
the 'combo-blob'.
Summary:
--------
This is a space/bandwidth-efficient blob that 'includes' arbitrary
portions of (one, two, or more) simple blobs by reference [1], with byte
granularity, plus an optional followup portion that includes the full
constructed state, uncompressed. [2] It can also conserve more RAM
compared to the current repository format.
Representation:
---------------
A combo-blob would have the 'simplest possible' and thus most obvious
representation: a list (the 'include-table') of "include X bytes at
offset Y from parent Z" operations:
<parent-blob-ID> <offset> <size>
[optional full constructed state]
e.g.:
6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
6d11b2dd7f169c29664ac0553090865b7b020973 64545 163656
'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary
form to have constant record sizes.]
in OS terms it's similar to an iovec representation. [3]
The hash of a combo-blob is calculated off the include-table alone: i.e.
it's _not_ equivalent to the hash of the included contents. I.e. you
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.
[ NOTE: further below you can find a 'Notes' section as well, which
might address some of the issues/ideas you might have at this point. ]
Cons:
-----
there are a number of disadvantages:
- performance hit. Linus is perfectly right, in terms of performance,
nothing beats having full objects.
Hence i kept the option to include the full constructed blob [4]
(uncompressed) as well in the combo-blob. When all combo-blobs are
'blown up' then they can be better in terms of performance than the
current repository format. [they still carry the small slice & dice
information as well]
the performance hit can be reduced in a finegrained way by introducing
occasional full objects in the history. E.g. after every 8 steps one
would include a full blob, to limit the number of blobs necessary to
construct a previously unconstructed combo-blob. This would still cut
the overhead of the current format substantially.
clearly, the most important cache is the current directory cache,
which this abstraction does not hurt.
- complexity. It's all pretty straightforward, but checking the
consistency of a combo object is not as simple as checking the
consistency of a simple object, as it would have to recursively check
all parent IDs as well. I think it's worth the price though.
- repository has optional components: the 'blown up' (cached) portion of
a combo-blob can be freely destructed. This means that two
repositories can now not only differ in their directory-cache, but
also in their objects/ hierarchy. I dont think this is a big issue,
BYMMV.
Pros:
-----
- the main advantage is space/bandwith: it's pretty much as efficient as
it gets: it can be used to represent compressed binary deltas. A fully
trimmed (uncached) repository is very efficient.
- the optional 'fully constructed' portion is not compressed, so once a
repository is 'cached', it is faster to process (in areas outside the
current directory cache) than the current repository format. (In fact,
when a previously unused portion of a repository is accessed _first_,
it is IO-bound by nature - so we can very well spend the extra CPU
cycles on uncompressing things.)
- a 'combo' blob will be more memory-efficient as well. So with given
amount of RAM one could access more history, with a small CPU cost -
as long as the level of 'history recursion' is kept in check (e.g. via
the previously mentioned 'at most 8-deep combinations').
Straightforward iovecs could be passed to Linux system-calls, when
constructing a 'view' of a file, without having to cache every step of
the file's history.
- a combo-blob directly represents the way humans code: combining
pre-existing pieces of information and adding relatively low amount of
new stuff. Having a natural representation for the type of activity
that a tool supports cannot hurt.
( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
an important feature though. )
Notes:
------
[1] the combo-blob is not a 'delta' thing. It combines pre-existing
parents. One of the parents may of course be a 'delta' that acts upon
the other parent - but the combo-blob does not know and does not care.
(A combo-blob might as well represent an act of someone consolidating
multiple small files into a big file, or splitting up a big file into
smaller files. Or a combo-blob might represent the trimming of a
preexisting file.)
[2]: a combo-blob is conceptually still a simple object with blob data
in it, nothing more. It can be referenced in other object types
equivalently to other blobs. It just happens to be a combination of
existing blobs, and hence the 'git filesystem' has to work harder (but
still quite efficiently) to get to the contents.
[3]: a combo-blob might reference any parent blob, including combo
blobs. This means that e.g. multiple small deltas can be represented
via:
<blob-#1>
|
|-----<blob-#2>
|
<combo-blob-#1>
|
|-----<blob-#3>
|
<combo-blob-#2>
where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.
[4] alternatively, it might also make sense to extend the simple
combo-blob concept with the concept of a 'cache-blob': a cache-blob
'blows up' combo blobs in that it fully constructs the blob contents,
but it is otherwise identical to the blob it caches. Simple (non-combo)
blob types are a cache of themselves.
Ingo
next prev parent reply other threads:[~2005-04-11 11:36 UTC|newest]
Thread overview: 194+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-04-09 19:45 more git updates Linus Torvalds
2005-04-09 19:56 ` Linus Torvalds
2005-04-09 20:07 ` Petr Baudis
2005-04-09 21:00 ` Linus Torvalds
2005-04-09 21:00 ` tony.luck
2005-04-10 16:01 ` Linus Torvalds
2005-04-12 17:34 ` Helge Hafting
2005-04-10 18:19 ` Paul Jackson
2005-04-10 23:04 ` Bernd Eckenfels
2005-04-11 9:27 ` Anton Altaparmakov
2005-04-09 21:08 ` Linus Torvalds
2005-04-09 23:31 ` Linus Torvalds
2005-04-10 2:41 ` Petr Baudis
2005-04-10 16:27 ` [ANNOUNCE] git-pasky-0.1 Petr Baudis
2005-04-10 16:55 ` Linus Torvalds
2005-04-10 19:49 ` Sean
2005-04-10 17:33 ` Ingo Molnar
2005-04-10 17:42 ` Willy Tarreau
2005-04-10 17:45 ` Ingo Molnar
2005-04-10 18:45 ` Petr Baudis
2005-04-10 19:13 ` Willy Tarreau
2005-04-10 21:27 ` Petr Baudis
2005-04-10 20:38 ` Linus Torvalds
2005-04-10 21:39 ` Linus Torvalds
2005-04-10 23:49 ` Petr Baudis
2005-04-10 22:27 ` Petr Baudis
2005-04-10 23:10 ` Linus Torvalds
2005-04-10 23:26 ` Petr Baudis
2005-04-10 23:46 ` Linus Torvalds
2005-04-10 23:56 ` Petr Baudis
2005-04-11 0:20 ` GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1) Linus Torvalds
2005-04-11 0:27 ` Petr Baudis
2005-04-11 7:45 ` Ingo Molnar
2005-04-11 8:40 ` Florian Weimer
2005-04-11 10:52 ` Petr Baudis
2005-04-11 16:05 ` Florian Weimer
2005-04-10 23:23 ` [ANNOUNCE] git-pasky-0.1 Paul Jackson
2005-04-11 0:15 ` Randy.Dunlap
2005-04-11 0:30 ` Re: " Petr Baudis
2005-04-11 1:11 ` Linus Torvalds
2005-04-10 20:41 ` Paul Jackson
2005-04-11 1:58 ` [ANNOUNCE] git-pasky-0.2 Petr Baudis
2005-04-11 2:46 ` Daniel Barkalow
2005-04-11 10:17 ` Petr Baudis
2005-04-11 8:50 ` Ingo Molnar
2005-04-11 10:16 ` Petr Baudis
2005-04-11 13:57 ` [ANNOUNCE] git-pasky-0.3 Petr Baudis
2005-04-12 12:47 ` Martin Schlemmer
2005-04-12 13:02 ` Petr Baudis
2005-04-12 13:13 ` Martin Schlemmer
2005-04-12 13:23 ` Petr Baudis
[not found] ` <1113375277.23299.25.camel@nosferatu.lan>
[not found] ` <20050413075441.GD16489@pasky.ji.cz>
[not found] ` <1113381672.23299.47.camel@nosferatu.lan>
[not found] ` <20050413092656.GO16489@pasky.ji.cz>
[not found] ` <1113394537.23299.51.camel@nosferatu.lan>
2005-04-13 22:19 ` Re: Re: Remove need to untrack before tracking new branch Petr Baudis
2005-04-14 6:55 ` Martin Schlemmer
2005-04-14 8:28 ` Martin Schlemmer
2005-04-14 8:38 ` Martin Schlemmer
2005-04-14 9:11 ` Petr Baudis
2005-04-14 9:40 ` Martin Schlemmer
2005-04-14 9:55 ` Martin Schlemmer
2005-04-14 22:35 ` Alex Riesen
2005-04-15 5:45 ` Martin Schlemmer
2005-04-15 6:42 ` Paul Jackson
2005-04-15 23:49 ` Re: Re: " Alex Riesen
2005-04-14 22:42 ` Petr Baudis
2005-04-14 23:01 ` Martin Schlemmer
2005-04-14 23:00 ` Petr Baudis
2005-04-14 23:09 ` Martin Schlemmer
2005-04-14 23:25 ` Martin Schlemmer
2005-04-12 13:07 ` [ANNOUNCE] git-pasky-0.3 David Woodhouse
2005-04-13 8:47 ` Russell King
2005-04-13 8:59 ` Petr Baudis
2005-04-13 9:06 ` H. Peter Anvin
2005-04-13 9:09 ` David Woodhouse
2005-04-13 9:25 ` David Woodhouse
2005-04-13 9:42 ` Petr Baudis
2005-04-13 10:24 ` David Woodhouse
2005-04-13 17:01 ` Daniel Barkalow
2005-04-13 18:07 ` Petr Baudis
2005-04-13 18:22 ` git mailing list (Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3) Linus Torvalds
2005-04-13 18:38 ` Re: Re: Re: [ANNOUNCE] git-pasky-0.3 Daniel Barkalow
2005-04-13 12:43 ` Xavier Bestel
2005-04-13 16:48 ` H. Peter Anvin
2005-04-13 18:15 ` Xavier Bestel
2005-04-13 23:05 ` bd
2005-04-13 14:38 ` Linus Torvalds
2005-04-13 14:47 ` David Woodhouse
2005-04-13 14:59 ` Linus Torvalds
2005-04-13 9:35 ` Russell King
2005-04-13 9:38 ` Russell King
2005-04-13 9:49 ` Petr Baudis
2005-04-13 11:02 ` Ingo Molnar
2005-04-13 14:50 ` Linus Torvalds
2005-04-13 9:46 ` Petr Baudis
2005-04-13 10:28 ` Russell King
2005-04-13 19:03 ` Russell King
2005-04-13 19:13 ` Petr Baudis
2005-04-13 19:21 ` Russell King
2005-04-13 19:23 ` H. Peter Anvin
2005-04-10 6:53 ` more git updates Christopher Li
2005-04-10 11:48 ` Ralph Corderoy
2005-04-10 19:23 ` Paul Jackson
2005-04-10 18:42 ` Christopher Li
2005-04-10 22:30 ` Petr Baudis
2005-04-11 13:58 ` H. Peter Anvin
2005-04-20 20:29 ` Kai Henningsen
2005-04-24 0:42 ` Paul Jackson
2005-04-24 1:29 ` Bernd Eckenfels
2005-04-24 4:13 ` Paul Jackson
2005-04-24 4:38 ` Bernd Eckenfels
2005-04-24 4:53 ` Paul Jackson
2005-04-25 11:57 ` Theodore Ts'o
2005-04-25 16:40 ` David Wagner
2005-04-25 20:35 ` Bernd Eckenfels
2005-04-24 16:52 ` Horst von Brand
2005-04-24 8:00 ` Kai Henningsen
[not found] ` <6f6293f10504210220744af114@mail.gmail.com>
2005-04-24 8:01 ` Kai Henningsen
2005-04-11 11:35 ` Ingo Molnar [this message]
2005-04-11 14:45 ` [rfc] git: combo-blobs Paul Jackson
2005-04-11 15:12 ` Ingo Molnar
2005-04-11 15:32 ` Linus Torvalds
2005-04-11 15:39 ` Ingo Molnar
2005-04-11 15:57 ` Ingo Molnar
2005-04-11 16:01 ` Linus Torvalds
2005-04-11 16:33 ` Ingo Molnar
2005-04-12 5:42 ` Barry K. Nathan
2005-04-11 18:13 ` Chris Wedgwood
2005-04-11 18:30 ` Linus Torvalds
2005-04-11 20:18 ` Linus Torvalds
2005-04-11 18:40 ` Petr Baudis
2005-04-11 17:50 ` Paul Jackson
2005-04-11 15:28 ` Ingo Molnar
2005-04-11 15:31 ` Ingo Molnar
2005-04-12 4:05 ` more git updates David Eger
2005-04-12 8:16 ` Petr Baudis
2005-04-12 20:44 ` David Eger
2005-04-12 21:21 ` Linus Torvalds
2005-04-12 22:29 ` Krzysztof Halasa
2005-04-12 22:49 ` Linus Torvalds
2005-04-13 4:32 ` Matthias Urlichs
2005-04-12 22:36 ` David Eger
2005-04-12 23:48 ` Panagiotis Issaris
2005-04-12 23:40 ` Andrea Arcangeli
2005-04-12 23:45 ` Linus Torvalds
2005-04-13 0:14 ` Andrea Arcangeli
2005-04-13 1:10 ` Linus Torvalds
2005-04-13 10:59 ` Andrea Arcangeli
2005-04-13 20:44 ` Matt Mackall
2005-04-13 23:42 ` Krzysztof Halasa
2005-04-14 0:13 ` Matt Mackall
2005-04-13 9:30 ` Russell King
2005-04-13 10:20 ` Andrea Arcangeli
2005-04-13 14:43 ` Linus Torvalds
2005-04-10 2:07 ` Paul Jackson
2005-04-10 2:20 ` Paul Jackson
2005-04-10 2:09 ` Paul Jackson
2005-04-10 7:51 ` Junio C Hamano
2005-04-10 5:53 ` Christopher Li
2005-04-10 9:28 ` Junio C Hamano
2005-04-10 7:06 ` Christopher Li
2005-04-10 11:38 ` tony.luck
2005-04-10 9:48 ` Petr Baudis
2005-04-10 9:40 ` Wichert Akkerman
2005-04-10 9:41 ` Petr Baudis
2005-04-10 7:09 ` Christopher Li
2005-04-10 11:21 ` Proposal for shell-patch-format [was: Re: more git updates..] Rutger Nijlunsing
2005-04-10 15:44 ` more git updates Linus Torvalds
2005-04-10 17:00 ` Rutger Nijlunsing
2005-04-10 18:50 ` Paul Jackson
2005-04-10 20:57 ` Linus Torvalds
2005-04-10 19:03 ` Christopher Li
2005-04-10 22:38 ` Linus Torvalds
2005-04-10 19:53 ` Christopher Li
2005-04-10 23:21 ` Linus Torvalds
2005-04-10 21:28 ` Christopher Li
2005-04-12 5:14 ` David Lang
2005-04-12 6:00 ` Paul Jackson
2005-04-12 7:05 ` Barry K. Nathan
2005-04-11 6:57 ` bert hubert
2005-04-11 7:20 ` Christer Weinigel
2005-04-10 23:14 ` Paul Jackson
2005-04-10 23:38 ` Linus Torvalds
2005-04-11 0:19 ` Paul Jackson
2005-04-11 15:49 ` Randy.Dunlap
2005-04-11 18:30 ` Petr Baudis
2005-04-11 0:10 ` Petr Baudis
2005-04-09 22:00 ` Paul Jackson
2005-04-09 23:21 ` Ralph Corderoy
2005-04-10 0:39 ` Paul Jackson
2005-04-10 1:14 ` Bernd Eckenfels
2005-04-10 1:33 ` Paul Jackson
2005-04-10 10:22 ` Ralph Corderoy
2005-04-10 17:30 ` Paul Jackson
2005-04-10 17:31 ` Rik van Riel
2005-04-10 17:35 ` Ingo Molnar
2005-04-11 16:46 ` ross
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20050411113523.GA19256@elte.hu \
--to=mingo@elte.hu \
--cc=git@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pasky@ucw.cz \
--cc=rddunlap@osdl.org \
--cc=ross@jose.lug.udel.edu \
--cc=torvalds@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.