All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Petr Baudis <pasky@ucw.cz>, "Randy.Dunlap" <rddunlap@osdl.org>,
	Ross Vandegrift <ross@jose.lug.udel.edu>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	git@vger.kernel.org
Subject: [rfc] git: combo-blobs
Date: Mon, 11 Apr 2005 13:35:23 +0200	[thread overview]
Message-ID: <20050411113523.GA19256@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.58.0504091617000.1267@ppc970.osdl.org>


i think all of the 'repository size' and 'bandwidth' concerns could be 
solved via a new (and pretty much simple and transparent) object type: 
the 'combo-blob'.

Summary:
--------

This is a space/bandwidth-efficient blob that 'includes' arbitrary 
portions of (one, two, or more) simple blobs by reference [1], with byte 
granularity, plus an optional followup portion that includes the full 
constructed state, uncompressed. [2] It can also conserve more RAM 
compared to the current repository format.

Representation:
---------------

A combo-blob would have the 'simplest possible' and thus most obvious 
representation: a list (the 'include-table') of "include X bytes at 
offset Y from parent Z" operations:

  <parent-blob-ID> <offset> <size>
  [optional full constructed state]

e.g.:

  6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
  6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
  6d11b2dd7f169c29664ac0553090865b7b020973 64545 163656

'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it 
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary 
form to have constant record sizes.]

in OS terms it's similar to an iovec representation. [3]

The hash of a combo-blob is calculated off the include-table alone: i.e.  
it's _not_ equivalent to the hash of the included contents. I.e. you 
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.

[ NOTE: further below you can find a 'Notes' section as well, which
  might address some of the issues/ideas you might have at this point. ]

Cons:
-----

there are a number of disadvantages:

- performance hit. Linus is perfectly right, in terms of performance, 
  nothing beats having full objects.

  Hence i kept the option to include the full constructed blob [4]
  (uncompressed) as well in the combo-blob. When all combo-blobs are
  'blown up' then they can be better in terms of performance than the 
  current repository format. [they still carry the small slice & dice
  information as well]

  the performance hit can be reduced in a finegrained way by introducing 
  occasional full objects in the history. E.g. after every 8 steps one
  would include a full blob, to limit the number of blobs necessary to
  construct a previously unconstructed combo-blob. This would still cut
  the overhead of the current format substantially.

  clearly, the most important cache is the current directory cache, 
  which this abstraction does not hurt.

- complexity. It's all pretty straightforward, but checking the
  consistency of a combo object is not as simple as checking the 
  consistency of a simple object, as it would have to recursively check 
  all parent IDs as well. I think it's worth the price though.

- repository has optional components: the 'blown up' (cached) portion of 
  a combo-blob can be freely destructed. This means that two 
  repositories can now not only differ in their directory-cache, but 
  also in their objects/ hierarchy. I dont think this is a big issue, 
  BYMMV.

Pros:
-----

- the main advantage is space/bandwith: it's pretty much as efficient as 
  it gets: it can be used to represent compressed binary deltas. A fully 
  trimmed (uncached) repository is very efficient.

- the optional 'fully constructed' portion is not compressed, so once a
  repository is 'cached', it is faster to process (in areas outside the
  current directory cache) than the current repository format. (In fact,
  when a previously unused portion of a repository is accessed _first_, 
  it is IO-bound by nature - so we can very well spend the extra CPU
  cycles on uncompressing things.)

- a 'combo' blob will be more memory-efficient as well. So with given 
  amount of RAM one could access more history, with a small CPU cost - 
  as long as the level of 'history recursion' is kept in check (e.g. via 
  the previously mentioned 'at most 8-deep combinations').  
  Straightforward iovecs could be passed to Linux system-calls, when 
  constructing a 'view' of a file, without having to cache every step of 
  the file's history.

- a combo-blob directly represents the way humans code: combining 
  pre-existing pieces of information and adding relatively low amount of 
  new stuff. Having a natural representation for the type of activity 
  that a tool supports cannot hurt.

( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
    an important feature though.  )

Notes:
------

[1] the combo-blob is not a 'delta' thing. It combines pre-existing 
parents. One of the parents may of course be a 'delta' that acts upon 
the other parent - but the combo-blob does not know and does not care.  
(A combo-blob might as well represent an act of someone consolidating 
multiple small files into a big file, or splitting up a big file into 
smaller files. Or a combo-blob might represent the trimming of a 
preexisting file.)

[2]: a combo-blob is conceptually still a simple object with blob data 
in it, nothing more. It can be referenced in other object types 
equivalently to other blobs. It just happens to be a combination of 
existing blobs, and hence the 'git filesystem' has to work harder (but 
still quite efficiently) to get to the contents.

[3]: a combo-blob might reference any parent blob, including combo 
blobs. This means that e.g. multiple small deltas can be represented 
via:

   <blob-#1>
      |
      |-----<blob-#2>
      |
   <combo-blob-#1>
      |
      |-----<blob-#3>
      |
   <combo-blob-#2>

where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.

[4] alternatively, it might also make sense to extend the simple 
combo-blob concept with the concept of a 'cache-blob': a cache-blob 
'blows up' combo blobs in that it fully constructs the blob contents, 
but it is otherwise identical to the blob it caches. Simple (non-combo) 
blob types are a cache of themselves.

	Ingo

  parent reply	other threads:[~2005-04-11 11:36 UTC|newest]

Thread overview: 194+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-09 19:45 more git updates Linus Torvalds
2005-04-09 19:56 ` Linus Torvalds
2005-04-09 20:07 ` Petr Baudis
2005-04-09 21:00   ` Linus Torvalds
2005-04-09 21:00     ` tony.luck
2005-04-10 16:01       ` Linus Torvalds
2005-04-12 17:34         ` Helge Hafting
2005-04-10 18:19       ` Paul Jackson
2005-04-10 23:04         ` Bernd Eckenfels
2005-04-11  9:27           ` Anton Altaparmakov
2005-04-09 21:08     ` Linus Torvalds
2005-04-09 23:31       ` Linus Torvalds
2005-04-10  2:41         ` Petr Baudis
2005-04-10 16:27           ` [ANNOUNCE] git-pasky-0.1 Petr Baudis
2005-04-10 16:55             ` Linus Torvalds
2005-04-10 19:49               ` Sean
2005-04-10 17:33             ` Ingo Molnar
2005-04-10 17:42               ` Willy Tarreau
2005-04-10 17:45                 ` Ingo Molnar
2005-04-10 18:45                   ` Petr Baudis
2005-04-10 19:13                     ` Willy Tarreau
2005-04-10 21:27                       ` Petr Baudis
2005-04-10 20:38                     ` Linus Torvalds
2005-04-10 21:39                       ` Linus Torvalds
2005-04-10 23:49                         ` Petr Baudis
2005-04-10 22:27                       ` Petr Baudis
2005-04-10 23:10                         ` Linus Torvalds
2005-04-10 23:26                           ` Petr Baudis
2005-04-10 23:46                             ` Linus Torvalds
2005-04-10 23:56                               ` Petr Baudis
2005-04-11  0:20                                 ` GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1) Linus Torvalds
2005-04-11  0:27                                   ` Petr Baudis
2005-04-11  7:45                                   ` Ingo Molnar
2005-04-11  8:40                                     ` Florian Weimer
2005-04-11 10:52                                       ` Petr Baudis
2005-04-11 16:05                                         ` Florian Weimer
2005-04-10 23:23                         ` [ANNOUNCE] git-pasky-0.1 Paul Jackson
2005-04-11  0:15                           ` Randy.Dunlap
2005-04-11  0:30                       ` Re: " Petr Baudis
2005-04-11  1:11                         ` Linus Torvalds
2005-04-10 20:41                     ` Paul Jackson
2005-04-11  1:58             ` [ANNOUNCE] git-pasky-0.2 Petr Baudis
2005-04-11  2:46               ` Daniel Barkalow
2005-04-11 10:17                 ` Petr Baudis
2005-04-11  8:50               ` Ingo Molnar
2005-04-11 10:16                 ` Petr Baudis
2005-04-11 13:57               ` [ANNOUNCE] git-pasky-0.3 Petr Baudis
2005-04-12 12:47                 ` Martin Schlemmer
2005-04-12 13:02                   ` Petr Baudis
2005-04-12 13:13                     ` Martin Schlemmer
2005-04-12 13:23                       ` Petr Baudis
     [not found]                         ` <1113375277.23299.25.camel@nosferatu.lan>
     [not found]                           ` <20050413075441.GD16489@pasky.ji.cz>
     [not found]                             ` <1113381672.23299.47.camel@nosferatu.lan>
     [not found]                               ` <20050413092656.GO16489@pasky.ji.cz>
     [not found]                                 ` <1113394537.23299.51.camel@nosferatu.lan>
2005-04-13 22:19                                   ` Re: Re: Remove need to untrack before tracking new branch Petr Baudis
2005-04-14  6:55                                     ` Martin Schlemmer
2005-04-14  8:28                                       ` Martin Schlemmer
2005-04-14  8:38                                         ` Martin Schlemmer
2005-04-14  9:11                                           ` Petr Baudis
2005-04-14  9:40                                             ` Martin Schlemmer
2005-04-14  9:55                                               ` Martin Schlemmer
2005-04-14 22:35                                                 ` Alex Riesen
2005-04-15  5:45                                                   ` Martin Schlemmer
2005-04-15  6:42                                                     ` Paul Jackson
2005-04-15 23:49                                                     ` Re: Re: " Alex Riesen
2005-04-14 22:42                                               ` Petr Baudis
2005-04-14 23:01                                                 ` Martin Schlemmer
2005-04-14 23:00                                                   ` Petr Baudis
2005-04-14 23:09                                                     ` Martin Schlemmer
2005-04-14 23:25                                                       ` Martin Schlemmer
2005-04-12 13:07                 ` [ANNOUNCE] git-pasky-0.3 David Woodhouse
2005-04-13  8:47                   ` Russell King
2005-04-13  8:59                     ` Petr Baudis
2005-04-13  9:06                       ` H. Peter Anvin
2005-04-13  9:09                         ` David Woodhouse
2005-04-13  9:25                       ` David Woodhouse
2005-04-13  9:42                         ` Petr Baudis
2005-04-13 10:24                           ` David Woodhouse
2005-04-13 17:01                           ` Daniel Barkalow
2005-04-13 18:07                             ` Petr Baudis
2005-04-13 18:22                               ` git mailing list (Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3) Linus Torvalds
2005-04-13 18:38                               ` Re: Re: Re: [ANNOUNCE] git-pasky-0.3 Daniel Barkalow
2005-04-13 12:43                         ` Xavier Bestel
2005-04-13 16:48                           ` H. Peter Anvin
2005-04-13 18:15                             ` Xavier Bestel
2005-04-13 23:05                           ` bd
2005-04-13 14:38                         ` Linus Torvalds
2005-04-13 14:47                           ` David Woodhouse
2005-04-13 14:59                             ` Linus Torvalds
2005-04-13  9:35                 ` Russell King
2005-04-13  9:38                   ` Russell King
2005-04-13  9:49                     ` Petr Baudis
2005-04-13 11:02                       ` Ingo Molnar
2005-04-13 14:50                         ` Linus Torvalds
2005-04-13  9:46                   ` Petr Baudis
2005-04-13 10:28                     ` Russell King
2005-04-13 19:03                   ` Russell King
2005-04-13 19:13                     ` Petr Baudis
2005-04-13 19:21                       ` Russell King
2005-04-13 19:23                         ` H. Peter Anvin
2005-04-10  6:53         ` more git updates Christopher Li
2005-04-10 11:48           ` Ralph Corderoy
2005-04-10 19:23           ` Paul Jackson
2005-04-10 18:42             ` Christopher Li
2005-04-10 22:30               ` Petr Baudis
2005-04-11 13:58           ` H. Peter Anvin
2005-04-20 20:29             ` Kai Henningsen
2005-04-24  0:42               ` Paul Jackson
2005-04-24  1:29                 ` Bernd Eckenfels
2005-04-24  4:13                   ` Paul Jackson
2005-04-24  4:38                     ` Bernd Eckenfels
2005-04-24  4:53                       ` Paul Jackson
2005-04-25 11:57                       ` Theodore Ts'o
2005-04-25 16:40                         ` David Wagner
2005-04-25 20:35                         ` Bernd Eckenfels
2005-04-24 16:52                   ` Horst von Brand
2005-04-24  8:00                 ` Kai Henningsen
     [not found]               ` <6f6293f10504210220744af114@mail.gmail.com>
2005-04-24  8:01                 ` Kai Henningsen
2005-04-11 11:35         ` Ingo Molnar [this message]
2005-04-11 14:45           ` [rfc] git: combo-blobs Paul Jackson
2005-04-11 15:12             ` Ingo Molnar
2005-04-11 15:32               ` Linus Torvalds
2005-04-11 15:39                 ` Ingo Molnar
2005-04-11 15:57                   ` Ingo Molnar
2005-04-11 16:01                   ` Linus Torvalds
2005-04-11 16:33                     ` Ingo Molnar
2005-04-12  5:42                       ` Barry K. Nathan
2005-04-11 18:13                     ` Chris Wedgwood
2005-04-11 18:30                       ` Linus Torvalds
2005-04-11 20:18                         ` Linus Torvalds
2005-04-11 18:40                       ` Petr Baudis
2005-04-11 17:50               ` Paul Jackson
2005-04-11 15:28             ` Ingo Molnar
2005-04-11 15:31               ` Ingo Molnar
2005-04-12  4:05         ` more git updates David Eger
2005-04-12  8:16           ` Petr Baudis
2005-04-12 20:44             ` David Eger
2005-04-12 21:21               ` Linus Torvalds
2005-04-12 22:29                 ` Krzysztof Halasa
2005-04-12 22:49                   ` Linus Torvalds
2005-04-13  4:32                     ` Matthias Urlichs
2005-04-12 22:36                 ` David Eger
2005-04-12 23:48                   ` Panagiotis Issaris
2005-04-12 23:40                 ` Andrea Arcangeli
2005-04-12 23:45                   ` Linus Torvalds
2005-04-13  0:14                     ` Andrea Arcangeli
2005-04-13  1:10                       ` Linus Torvalds
2005-04-13 10:59                         ` Andrea Arcangeli
2005-04-13 20:44                         ` Matt Mackall
2005-04-13 23:42                           ` Krzysztof Halasa
2005-04-14  0:13                             ` Matt Mackall
2005-04-13  9:30                     ` Russell King
2005-04-13 10:20                       ` Andrea Arcangeli
2005-04-13 14:43                       ` Linus Torvalds
2005-04-10  2:07     ` Paul Jackson
2005-04-10  2:20       ` Paul Jackson
2005-04-10  2:09     ` Paul Jackson
2005-04-10  7:51     ` Junio C Hamano
2005-04-10  5:53       ` Christopher Li
2005-04-10  9:28         ` Junio C Hamano
2005-04-10  7:06           ` Christopher Li
2005-04-10 11:38             ` tony.luck
2005-04-10  9:48           ` Petr Baudis
2005-04-10  9:40         ` Wichert Akkerman
2005-04-10  9:41         ` Petr Baudis
2005-04-10  7:09           ` Christopher Li
2005-04-10 11:21       ` Proposal for shell-patch-format [was: Re: more git updates..] Rutger Nijlunsing
2005-04-10 15:44       ` more git updates Linus Torvalds
2005-04-10 17:00         ` Rutger Nijlunsing
2005-04-10 18:50         ` Paul Jackson
2005-04-10 20:57           ` Linus Torvalds
2005-04-10 19:03             ` Christopher Li
2005-04-10 22:38               ` Linus Torvalds
2005-04-10 19:53                 ` Christopher Li
2005-04-10 23:21                   ` Linus Torvalds
2005-04-10 21:28                     ` Christopher Li
2005-04-12  5:14                       ` David Lang
2005-04-12  6:00                         ` Paul Jackson
2005-04-12  7:05                         ` Barry K. Nathan
2005-04-11  6:57                 ` bert hubert
2005-04-11  7:20                   ` Christer Weinigel
2005-04-10 23:14             ` Paul Jackson
2005-04-10 23:38               ` Linus Torvalds
2005-04-11  0:19                 ` Paul Jackson
2005-04-11 15:49                 ` Randy.Dunlap
2005-04-11 18:30                   ` Petr Baudis
2005-04-11  0:10               ` Petr Baudis
2005-04-09 22:00 ` Paul Jackson
2005-04-09 23:21 ` Ralph Corderoy
2005-04-10  0:39   ` Paul Jackson
2005-04-10  1:14     ` Bernd Eckenfels
2005-04-10  1:33       ` Paul Jackson
2005-04-10 10:22     ` Ralph Corderoy
2005-04-10 17:30       ` Paul Jackson
2005-04-10 17:31 ` Rik van Riel
2005-04-10 17:35   ` Ingo Molnar
2005-04-11 16:46 ` ross

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050411113523.GA19256@elte.hu \
    --to=mingo@elte.hu \
    --cc=git@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pasky@ucw.cz \
    --cc=rddunlap@osdl.org \
    --cc=ross@jose.lug.udel.edu \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.