From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Petr Baudis <pasky@ucw.cz>, "Randy.Dunlap" <rddunlap@osdl.org>,
Ross Vandegrift <ross@jose.lug.udel.edu>,
Kernel Mailing List <linux-kernel@vger.kernel.org>,
git@vger.kernel.org
Subject: [rfc] git: combo-blobs
Date: Mon, 11 Apr 2005 13:35:23 +0200 [thread overview]
Message-ID: <20050411113523.GA19256@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.58.0504091617000.1267@ppc970.osdl.org>
i think all of the 'repository size' and 'bandwidth' concerns could be
solved via a new (and pretty much simple and transparent) object type:
the 'combo-blob'.
Summary:
--------
This is a space/bandwidth-efficient blob that 'includes' arbitrary
portions of (one, two, or more) simple blobs by reference [1], with byte
granularity, plus an optional followup portion that includes the full
constructed state, uncompressed. [2] It can also conserve more RAM
compared to the current repository format.
Representation:
---------------
A combo-blob would have the 'simplest possible' and thus most obvious
representation: a list (the 'include-table') of "include X bytes at
offset Y from parent Z" operations:
<parent-blob-ID> <offset> <size>
[optional full constructed state]
e.g.:
6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
6d11b2dd7f169c29664ac0553090865b7b020973 64545 163656
'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary
form to have constant record sizes.]
in OS terms it's similar to an iovec representation. [3]
The hash of a combo-blob is calculated off the include-table alone: i.e.
it's _not_ equivalent to the hash of the included contents. I.e. you
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.
[ NOTE: further below you can find a 'Notes' section as well, which
might address some of the issues/ideas you might have at this point. ]
Cons:
-----
there are a number of disadvantages:
- performance hit. Linus is perfectly right, in terms of performance,
nothing beats having full objects.
Hence i kept the option to include the full constructed blob [4]
(uncompressed) as well in the combo-blob. When all combo-blobs are
'blown up' then they can be better in terms of performance than the
current repository format. [they still carry the small slice & dice
information as well]
the performance hit can be reduced in a finegrained way by introducing
occasional full objects in the history. E.g. after every 8 steps one
would include a full blob, to limit the number of blobs necessary to
construct a previously unconstructed combo-blob. This would still cut
the overhead of the current format substantially.
clearly, the most important cache is the current directory cache,
which this abstraction does not hurt.
- complexity. It's all pretty straightforward, but checking the
consistency of a combo object is not as simple as checking the
consistency of a simple object, as it would have to recursively check
all parent IDs as well. I think it's worth the price though.
- repository has optional components: the 'blown up' (cached) portion of
a combo-blob can be freely destructed. This means that two
repositories can now not only differ in their directory-cache, but
also in their objects/ hierarchy. I dont think this is a big issue,
BYMMV.
Pros:
-----
- the main advantage is space/bandwith: it's pretty much as efficient as
it gets: it can be used to represent compressed binary deltas. A fully
trimmed (uncached) repository is very efficient.
- the optional 'fully constructed' portion is not compressed, so once a
repository is 'cached', it is faster to process (in areas outside the
current directory cache) than the current repository format. (In fact,
when a previously unused portion of a repository is accessed _first_,
it is IO-bound by nature - so we can very well spend the extra CPU
cycles on uncompressing things.)
- a 'combo' blob will be more memory-efficient as well. So with given
amount of RAM one could access more history, with a small CPU cost -
as long as the level of 'history recursion' is kept in check (e.g. via
the previously mentioned 'at most 8-deep combinations').
Straightforward iovecs could be passed to Linux system-calls, when
constructing a 'view' of a file, without having to cache every step of
the file's history.
- a combo-blob directly represents the way humans code: combining
pre-existing pieces of information and adding relatively low amount of
new stuff. Having a natural representation for the type of activity
that a tool supports cannot hurt.
( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
an important feature though. )
Notes:
------
[1] the combo-blob is not a 'delta' thing. It combines pre-existing
parents. One of the parents may of course be a 'delta' that acts upon
the other parent - but the combo-blob does not know and does not care.
(A combo-blob might as well represent an act of someone consolidating
multiple small files into a big file, or splitting up a big file into
smaller files. Or a combo-blob might represent the trimming of a
preexisting file.)
[2]: a combo-blob is conceptually still a simple object with blob data
in it, nothing more. It can be referenced in other object types
equivalently to other blobs. It just happens to be a combination of
existing blobs, and hence the 'git filesystem' has to work harder (but
still quite efficiently) to get to the contents.
[3]: a combo-blob might reference any parent blob, including combo
blobs. This means that e.g. multiple small deltas can be represented
via:
<blob-#1>
|
|-----<blob-#2>
|
<combo-blob-#1>
|
|-----<blob-#3>
|
<combo-blob-#2>
where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.
[4] alternatively, it might also make sense to extend the simple
combo-blob concept with the concept of a 'cache-blob': a cache-blob
'blows up' combo blobs in that it fully constructs the blob contents,
but it is otherwise identical to the blob it caches. Simple (non-combo)
blob types are a cache of themselves.
Ingo
next prev parent reply other threads:[~2005-04-11 11:36 UTC|newest]
Thread overview: 178+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-04-09 19:45 more git updates Linus Torvalds
2005-04-09 19:56 ` Linus Torvalds
2005-04-09 20:07 ` Petr Baudis
2005-04-09 21:00 ` Linus Torvalds
2005-04-09 21:00 ` tony.luck
2005-04-10 16:01 ` Linus Torvalds
2005-04-12 17:34 ` Helge Hafting
2005-04-10 18:19 ` Paul Jackson
2005-04-10 23:04 ` Bernd Eckenfels
2005-04-11 9:27 ` Anton Altaparmakov
2005-04-09 21:08 ` Linus Torvalds
2005-04-09 23:31 ` Linus Torvalds
2005-04-10 2:41 ` Petr Baudis
2005-04-10 16:27 ` [ANNOUNCE] git-pasky-0.1 Petr Baudis
2005-04-10 16:55 ` Linus Torvalds
2005-04-10 19:49 ` Sean
2005-04-10 17:33 ` Ingo Molnar
2005-04-10 17:42 ` Willy Tarreau
2005-04-10 17:45 ` Ingo Molnar
2005-04-10 18:45 ` Petr Baudis
2005-04-10 19:13 ` Willy Tarreau
2005-04-10 21:27 ` Petr Baudis
2005-04-10 20:38 ` Linus Torvalds
2005-04-10 21:39 ` Linus Torvalds
2005-04-10 23:49 ` Petr Baudis
2005-04-10 22:27 ` Petr Baudis
2005-04-10 23:10 ` Linus Torvalds
2005-04-10 23:26 ` Petr Baudis
2005-04-10 23:46 ` Linus Torvalds
2005-04-10 23:56 ` Petr Baudis
2005-04-11 0:20 ` GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1) Linus Torvalds
2005-04-11 0:27 ` Petr Baudis
2005-04-11 7:45 ` Ingo Molnar
2005-04-11 8:40 ` Florian Weimer
2005-04-11 10:52 ` Petr Baudis
2005-04-11 16:05 ` Florian Weimer
2005-04-10 23:23 ` [ANNOUNCE] git-pasky-0.1 Paul Jackson
2005-04-11 0:15 ` Randy.Dunlap
2005-04-11 0:30 ` Re: " Petr Baudis
2005-04-11 1:11 ` Linus Torvalds
2005-04-10 20:41 ` Paul Jackson
2005-04-11 1:58 ` [ANNOUNCE] git-pasky-0.2 Petr Baudis
2005-04-11 2:46 ` Daniel Barkalow
2005-04-11 10:17 ` Petr Baudis
2005-04-11 8:50 ` Ingo Molnar
2005-04-11 10:16 ` Petr Baudis
2005-04-11 13:57 ` [ANNOUNCE] git-pasky-0.3 Petr Baudis
2005-04-12 12:47 ` Martin Schlemmer
2005-04-12 13:02 ` Petr Baudis
2005-04-12 13:13 ` Martin Schlemmer
2005-04-12 13:23 ` Petr Baudis
2005-04-12 13:07 ` David Woodhouse
2005-04-13 8:47 ` Russell King
2005-04-13 8:59 ` Petr Baudis
2005-04-13 9:06 ` H. Peter Anvin
2005-04-13 9:09 ` David Woodhouse
2005-04-13 9:25 ` David Woodhouse
2005-04-13 9:42 ` Petr Baudis
2005-04-13 10:24 ` David Woodhouse
2005-04-13 17:01 ` Daniel Barkalow
2005-04-13 18:07 ` Petr Baudis
2005-04-13 18:22 ` git mailing list (Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3) Linus Torvalds
2005-04-13 18:38 ` Re: Re: Re: [ANNOUNCE] git-pasky-0.3 Daniel Barkalow
2005-04-13 12:43 ` Xavier Bestel
2005-04-13 16:48 ` H. Peter Anvin
2005-04-13 18:15 ` Xavier Bestel
2005-04-13 23:05 ` bd
2005-04-13 14:38 ` Linus Torvalds
2005-04-13 14:47 ` David Woodhouse
2005-04-13 14:59 ` Linus Torvalds
2005-04-13 9:35 ` Russell King
2005-04-13 9:38 ` Russell King
2005-04-13 9:49 ` Petr Baudis
2005-04-13 11:02 ` Ingo Molnar
2005-04-13 14:50 ` Linus Torvalds
2005-04-13 9:46 ` Petr Baudis
2005-04-13 10:28 ` Russell King
2005-04-13 19:03 ` Russell King
2005-04-13 19:13 ` Petr Baudis
2005-04-13 19:21 ` Russell King
2005-04-13 19:23 ` H. Peter Anvin
2005-04-10 6:53 ` more git updates Christopher Li
2005-04-10 11:48 ` Ralph Corderoy
2005-04-10 19:23 ` Paul Jackson
2005-04-10 18:42 ` Christopher Li
2005-04-10 22:30 ` Petr Baudis
2005-04-11 13:58 ` H. Peter Anvin
2005-04-20 20:29 ` Kai Henningsen
2005-04-24 0:42 ` Paul Jackson
2005-04-24 1:29 ` Bernd Eckenfels
2005-04-24 4:13 ` Paul Jackson
2005-04-24 4:38 ` Bernd Eckenfels
2005-04-24 4:53 ` Paul Jackson
2005-04-25 11:57 ` Theodore Ts'o
2005-04-25 16:40 ` David Wagner
2005-04-25 20:35 ` Bernd Eckenfels
2005-04-24 16:52 ` Horst von Brand
2005-04-24 8:00 ` Kai Henningsen
[not found] ` <6f6293f10504210220744af114@mail.gmail.com>
2005-04-24 8:01 ` Kai Henningsen
2005-04-11 11:35 ` Ingo Molnar [this message]
2005-04-11 14:45 ` [rfc] git: combo-blobs Paul Jackson
2005-04-11 15:12 ` Ingo Molnar
2005-04-11 15:32 ` Linus Torvalds
2005-04-11 15:39 ` Ingo Molnar
2005-04-11 15:57 ` Ingo Molnar
2005-04-11 16:01 ` Linus Torvalds
2005-04-11 16:33 ` Ingo Molnar
2005-04-12 5:42 ` Barry K. Nathan
2005-04-11 18:13 ` Chris Wedgwood
2005-04-11 18:30 ` Linus Torvalds
2005-04-11 20:18 ` Linus Torvalds
2005-04-11 18:40 ` Petr Baudis
2005-04-11 17:50 ` Paul Jackson
2005-04-11 15:28 ` Ingo Molnar
2005-04-11 15:31 ` Ingo Molnar
2005-04-12 4:05 ` more git updates David Eger
2005-04-12 8:16 ` Petr Baudis
2005-04-12 20:44 ` David Eger
2005-04-12 21:21 ` Linus Torvalds
2005-04-12 22:29 ` Krzysztof Halasa
2005-04-12 22:49 ` Linus Torvalds
2005-04-13 4:32 ` Matthias Urlichs
2005-04-12 22:36 ` David Eger
2005-04-12 23:48 ` Panagiotis Issaris
2005-04-12 23:40 ` Andrea Arcangeli
2005-04-12 23:45 ` Linus Torvalds
2005-04-13 0:14 ` Andrea Arcangeli
2005-04-13 1:10 ` Linus Torvalds
2005-04-13 10:59 ` Andrea Arcangeli
2005-04-13 20:44 ` Matt Mackall
2005-04-13 23:42 ` Krzysztof Halasa
2005-04-14 0:13 ` Matt Mackall
2005-04-13 9:30 ` Russell King
2005-04-13 10:20 ` Andrea Arcangeli
2005-04-13 14:43 ` Linus Torvalds
2005-04-10 2:07 ` Paul Jackson
2005-04-10 2:20 ` Paul Jackson
2005-04-10 2:09 ` Paul Jackson
2005-04-10 7:51 ` Junio C Hamano
2005-04-10 5:53 ` Christopher Li
2005-04-10 9:28 ` Junio C Hamano
2005-04-10 7:06 ` Christopher Li
2005-04-10 11:38 ` tony.luck
2005-04-10 9:48 ` Petr Baudis
2005-04-10 9:40 ` Wichert Akkerman
2005-04-10 9:41 ` Petr Baudis
2005-04-10 7:09 ` Christopher Li
2005-04-10 11:21 ` Proposal for shell-patch-format [was: Re: more git updates..] Rutger Nijlunsing
2005-04-10 15:44 ` more git updates Linus Torvalds
2005-04-10 17:00 ` Rutger Nijlunsing
2005-04-10 18:50 ` Paul Jackson
2005-04-10 20:57 ` Linus Torvalds
2005-04-10 19:03 ` Christopher Li
2005-04-10 22:38 ` Linus Torvalds
2005-04-10 19:53 ` Christopher Li
2005-04-10 23:21 ` Linus Torvalds
2005-04-10 21:28 ` Christopher Li
2005-04-12 5:14 ` David Lang
2005-04-12 6:00 ` Paul Jackson
2005-04-12 7:05 ` Barry K. Nathan
2005-04-11 6:57 ` bert hubert
2005-04-11 7:20 ` Christer Weinigel
2005-04-10 23:14 ` Paul Jackson
2005-04-10 23:38 ` Linus Torvalds
2005-04-11 0:19 ` Paul Jackson
2005-04-11 15:49 ` Randy.Dunlap
2005-04-11 18:30 ` Petr Baudis
2005-04-11 0:10 ` Petr Baudis
2005-04-09 22:00 ` Paul Jackson
2005-04-09 23:21 ` Ralph Corderoy
2005-04-10 0:39 ` Paul Jackson
2005-04-10 1:14 ` Bernd Eckenfels
2005-04-10 1:33 ` Paul Jackson
2005-04-10 10:22 ` Ralph Corderoy
2005-04-10 17:30 ` Paul Jackson
2005-04-10 17:31 ` Rik van Riel
2005-04-10 17:35 ` Ingo Molnar
2005-04-11 16:46 ` ross
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20050411113523.GA19256@elte.hu \
--to=mingo@elte.hu \
--cc=git@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pasky@ucw.cz \
--cc=rddunlap@osdl.org \
--cc=ross@jose.lug.udel.edu \
--cc=torvalds@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox