public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Petr Baudis <pasky@ucw.cz>, "Randy.Dunlap" <rddunlap@osdl.org>,
	Ross Vandegrift <ross@jose.lug.udel.edu>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	git@vger.kernel.org
Subject: [rfc] git: combo-blobs
Date: Mon, 11 Apr 2005 13:35:23 +0200	[thread overview]
Message-ID: <20050411113523.GA19256@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.58.0504091617000.1267@ppc970.osdl.org>


i think all of the 'repository size' and 'bandwidth' concerns could be 
solved via a new (and pretty much simple and transparent) object type: 
the 'combo-blob'.

Summary:
--------

This is a space/bandwidth-efficient blob that 'includes' arbitrary 
portions of (one, two, or more) simple blobs by reference [1], with byte 
granularity, plus an optional followup portion that includes the full 
constructed state, uncompressed. [2] It can also conserve more RAM 
compared to the current repository format.

Representation:
---------------

A combo-blob would have the 'simplest possible' and thus most obvious 
representation: a list (the 'include-table') of "include X bytes at 
offset Y from parent Z" operations:

  <parent-blob-ID> <offset> <size>
  [optional full constructed state]

e.g.:

  6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
  6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
  6d11b2dd7f169c29664ac0553090865b7b020973 64545 163656

'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it 
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary 
form to have constant record sizes.]

in OS terms it's similar to an iovec representation. [3]

The hash of a combo-blob is calculated off the include-table alone: i.e.  
it's _not_ equivalent to the hash of the included contents. I.e. you 
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.

[ NOTE: further below you can find a 'Notes' section as well, which
  might address some of the issues/ideas you might have at this point. ]

Cons:
-----

there are a number of disadvantages:

- performance hit. Linus is perfectly right, in terms of performance, 
  nothing beats having full objects.

  Hence i kept the option to include the full constructed blob [4]
  (uncompressed) as well in the combo-blob. When all combo-blobs are
  'blown up' then they can be better in terms of performance than the 
  current repository format. [they still carry the small slice & dice
  information as well]

  the performance hit can be reduced in a finegrained way by introducing 
  occasional full objects in the history. E.g. after every 8 steps one
  would include a full blob, to limit the number of blobs necessary to
  construct a previously unconstructed combo-blob. This would still cut
  the overhead of the current format substantially.

  clearly, the most important cache is the current directory cache, 
  which this abstraction does not hurt.

- complexity. It's all pretty straightforward, but checking the
  consistency of a combo object is not as simple as checking the 
  consistency of a simple object, as it would have to recursively check 
  all parent IDs as well. I think it's worth the price though.

- repository has optional components: the 'blown up' (cached) portion of 
  a combo-blob can be freely destructed. This means that two 
  repositories can now not only differ in their directory-cache, but 
  also in their objects/ hierarchy. I dont think this is a big issue, 
  BYMMV.

Pros:
-----

- the main advantage is space/bandwith: it's pretty much as efficient as 
  it gets: it can be used to represent compressed binary deltas. A fully 
  trimmed (uncached) repository is very efficient.

- the optional 'fully constructed' portion is not compressed, so once a
  repository is 'cached', it is faster to process (in areas outside the
  current directory cache) than the current repository format. (In fact,
  when a previously unused portion of a repository is accessed _first_, 
  it is IO-bound by nature - so we can very well spend the extra CPU
  cycles on uncompressing things.)

- a 'combo' blob will be more memory-efficient as well. So with given 
  amount of RAM one could access more history, with a small CPU cost - 
  as long as the level of 'history recursion' is kept in check (e.g. via 
  the previously mentioned 'at most 8-deep combinations').  
  Straightforward iovecs could be passed to Linux system-calls, when 
  constructing a 'view' of a file, without having to cache every step of 
  the file's history.

- a combo-blob directly represents the way humans code: combining 
  pre-existing pieces of information and adding relatively low amount of 
  new stuff. Having a natural representation for the type of activity 
  that a tool supports cannot hurt.

( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
    an important feature though.  )

Notes:
------

[1] the combo-blob is not a 'delta' thing. It combines pre-existing 
parents. One of the parents may of course be a 'delta' that acts upon 
the other parent - but the combo-blob does not know and does not care.  
(A combo-blob might as well represent an act of someone consolidating 
multiple small files into a big file, or splitting up a big file into 
smaller files. Or a combo-blob might represent the trimming of a 
preexisting file.)

[2]: a combo-blob is conceptually still a simple object with blob data 
in it, nothing more. It can be referenced in other object types 
equivalently to other blobs. It just happens to be a combination of 
existing blobs, and hence the 'git filesystem' has to work harder (but 
still quite efficiently) to get to the contents.

[3]: a combo-blob might reference any parent blob, including combo 
blobs. This means that e.g. multiple small deltas can be represented 
via:

   <blob-#1>
      |
      |-----<blob-#2>
      |
   <combo-blob-#1>
      |
      |-----<blob-#3>
      |
   <combo-blob-#2>

where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.

[4] alternatively, it might also make sense to extend the simple 
combo-blob concept with the concept of a 'cache-blob': a cache-blob 
'blows up' combo blobs in that it fully constructs the blob contents, 
but it is otherwise identical to the blob it caches. Simple (non-combo) 
blob types are a cache of themselves.

	Ingo

  parent reply	other threads:[~2005-04-11 11:36 UTC|newest]

Thread overview: 178+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-09 19:45 more git updates Linus Torvalds
2005-04-09 19:56 ` Linus Torvalds
2005-04-09 20:07 ` Petr Baudis
2005-04-09 21:00   ` Linus Torvalds
2005-04-09 21:00     ` tony.luck
2005-04-10 16:01       ` Linus Torvalds
2005-04-12 17:34         ` Helge Hafting
2005-04-10 18:19       ` Paul Jackson
2005-04-10 23:04         ` Bernd Eckenfels
2005-04-11  9:27           ` Anton Altaparmakov
2005-04-09 21:08     ` Linus Torvalds
2005-04-09 23:31       ` Linus Torvalds
2005-04-10  2:41         ` Petr Baudis
2005-04-10 16:27           ` [ANNOUNCE] git-pasky-0.1 Petr Baudis
2005-04-10 16:55             ` Linus Torvalds
2005-04-10 19:49               ` Sean
2005-04-10 17:33             ` Ingo Molnar
2005-04-10 17:42               ` Willy Tarreau
2005-04-10 17:45                 ` Ingo Molnar
2005-04-10 18:45                   ` Petr Baudis
2005-04-10 19:13                     ` Willy Tarreau
2005-04-10 21:27                       ` Petr Baudis
2005-04-10 20:38                     ` Linus Torvalds
2005-04-10 21:39                       ` Linus Torvalds
2005-04-10 23:49                         ` Petr Baudis
2005-04-10 22:27                       ` Petr Baudis
2005-04-10 23:10                         ` Linus Torvalds
2005-04-10 23:26                           ` Petr Baudis
2005-04-10 23:46                             ` Linus Torvalds
2005-04-10 23:56                               ` Petr Baudis
2005-04-11  0:20                                 ` GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1) Linus Torvalds
2005-04-11  0:27                                   ` Petr Baudis
2005-04-11  7:45                                   ` Ingo Molnar
2005-04-11  8:40                                     ` Florian Weimer
2005-04-11 10:52                                       ` Petr Baudis
2005-04-11 16:05                                         ` Florian Weimer
2005-04-10 23:23                         ` [ANNOUNCE] git-pasky-0.1 Paul Jackson
2005-04-11  0:15                           ` Randy.Dunlap
2005-04-11  0:30                       ` Re: " Petr Baudis
2005-04-11  1:11                         ` Linus Torvalds
2005-04-10 20:41                     ` Paul Jackson
2005-04-11  1:58             ` [ANNOUNCE] git-pasky-0.2 Petr Baudis
2005-04-11  2:46               ` Daniel Barkalow
2005-04-11 10:17                 ` Petr Baudis
2005-04-11  8:50               ` Ingo Molnar
2005-04-11 10:16                 ` Petr Baudis
2005-04-11 13:57               ` [ANNOUNCE] git-pasky-0.3 Petr Baudis
2005-04-12 12:47                 ` Martin Schlemmer
2005-04-12 13:02                   ` Petr Baudis
2005-04-12 13:13                     ` Martin Schlemmer
2005-04-12 13:23                       ` Petr Baudis
2005-04-12 13:07                 ` David Woodhouse
2005-04-13  8:47                   ` Russell King
2005-04-13  8:59                     ` Petr Baudis
2005-04-13  9:06                       ` H. Peter Anvin
2005-04-13  9:09                         ` David Woodhouse
2005-04-13  9:25                       ` David Woodhouse
2005-04-13  9:42                         ` Petr Baudis
2005-04-13 10:24                           ` David Woodhouse
2005-04-13 17:01                           ` Daniel Barkalow
2005-04-13 18:07                             ` Petr Baudis
2005-04-13 18:22                               ` git mailing list (Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3) Linus Torvalds
2005-04-13 18:38                               ` Re: Re: Re: [ANNOUNCE] git-pasky-0.3 Daniel Barkalow
2005-04-13 12:43                         ` Xavier Bestel
2005-04-13 16:48                           ` H. Peter Anvin
2005-04-13 18:15                             ` Xavier Bestel
2005-04-13 23:05                           ` bd
2005-04-13 14:38                         ` Linus Torvalds
2005-04-13 14:47                           ` David Woodhouse
2005-04-13 14:59                             ` Linus Torvalds
2005-04-13  9:35                 ` Russell King
2005-04-13  9:38                   ` Russell King
2005-04-13  9:49                     ` Petr Baudis
2005-04-13 11:02                       ` Ingo Molnar
2005-04-13 14:50                         ` Linus Torvalds
2005-04-13  9:46                   ` Petr Baudis
2005-04-13 10:28                     ` Russell King
2005-04-13 19:03                   ` Russell King
2005-04-13 19:13                     ` Petr Baudis
2005-04-13 19:21                       ` Russell King
2005-04-13 19:23                         ` H. Peter Anvin
2005-04-10  6:53         ` more git updates Christopher Li
2005-04-10 11:48           ` Ralph Corderoy
2005-04-10 19:23           ` Paul Jackson
2005-04-10 18:42             ` Christopher Li
2005-04-10 22:30               ` Petr Baudis
2005-04-11 13:58           ` H. Peter Anvin
2005-04-20 20:29             ` Kai Henningsen
2005-04-24  0:42               ` Paul Jackson
2005-04-24  1:29                 ` Bernd Eckenfels
2005-04-24  4:13                   ` Paul Jackson
2005-04-24  4:38                     ` Bernd Eckenfels
2005-04-24  4:53                       ` Paul Jackson
2005-04-25 11:57                       ` Theodore Ts'o
2005-04-25 16:40                         ` David Wagner
2005-04-25 20:35                         ` Bernd Eckenfels
2005-04-24 16:52                   ` Horst von Brand
2005-04-24  8:00                 ` Kai Henningsen
     [not found]               ` <6f6293f10504210220744af114@mail.gmail.com>
2005-04-24  8:01                 ` Kai Henningsen
2005-04-11 11:35         ` Ingo Molnar [this message]
2005-04-11 14:45           ` [rfc] git: combo-blobs Paul Jackson
2005-04-11 15:12             ` Ingo Molnar
2005-04-11 15:32               ` Linus Torvalds
2005-04-11 15:39                 ` Ingo Molnar
2005-04-11 15:57                   ` Ingo Molnar
2005-04-11 16:01                   ` Linus Torvalds
2005-04-11 16:33                     ` Ingo Molnar
2005-04-12  5:42                       ` Barry K. Nathan
2005-04-11 18:13                     ` Chris Wedgwood
2005-04-11 18:30                       ` Linus Torvalds
2005-04-11 20:18                         ` Linus Torvalds
2005-04-11 18:40                       ` Petr Baudis
2005-04-11 17:50               ` Paul Jackson
2005-04-11 15:28             ` Ingo Molnar
2005-04-11 15:31               ` Ingo Molnar
2005-04-12  4:05         ` more git updates David Eger
2005-04-12  8:16           ` Petr Baudis
2005-04-12 20:44             ` David Eger
2005-04-12 21:21               ` Linus Torvalds
2005-04-12 22:29                 ` Krzysztof Halasa
2005-04-12 22:49                   ` Linus Torvalds
2005-04-13  4:32                     ` Matthias Urlichs
2005-04-12 22:36                 ` David Eger
2005-04-12 23:48                   ` Panagiotis Issaris
2005-04-12 23:40                 ` Andrea Arcangeli
2005-04-12 23:45                   ` Linus Torvalds
2005-04-13  0:14                     ` Andrea Arcangeli
2005-04-13  1:10                       ` Linus Torvalds
2005-04-13 10:59                         ` Andrea Arcangeli
2005-04-13 20:44                         ` Matt Mackall
2005-04-13 23:42                           ` Krzysztof Halasa
2005-04-14  0:13                             ` Matt Mackall
2005-04-13  9:30                     ` Russell King
2005-04-13 10:20                       ` Andrea Arcangeli
2005-04-13 14:43                       ` Linus Torvalds
2005-04-10  2:07     ` Paul Jackson
2005-04-10  2:20       ` Paul Jackson
2005-04-10  2:09     ` Paul Jackson
2005-04-10  7:51     ` Junio C Hamano
2005-04-10  5:53       ` Christopher Li
2005-04-10  9:28         ` Junio C Hamano
2005-04-10  7:06           ` Christopher Li
2005-04-10 11:38             ` tony.luck
2005-04-10  9:48           ` Petr Baudis
2005-04-10  9:40         ` Wichert Akkerman
2005-04-10  9:41         ` Petr Baudis
2005-04-10  7:09           ` Christopher Li
2005-04-10 11:21       ` Proposal for shell-patch-format [was: Re: more git updates..] Rutger Nijlunsing
2005-04-10 15:44       ` more git updates Linus Torvalds
2005-04-10 17:00         ` Rutger Nijlunsing
2005-04-10 18:50         ` Paul Jackson
2005-04-10 20:57           ` Linus Torvalds
2005-04-10 19:03             ` Christopher Li
2005-04-10 22:38               ` Linus Torvalds
2005-04-10 19:53                 ` Christopher Li
2005-04-10 23:21                   ` Linus Torvalds
2005-04-10 21:28                     ` Christopher Li
2005-04-12  5:14                       ` David Lang
2005-04-12  6:00                         ` Paul Jackson
2005-04-12  7:05                         ` Barry K. Nathan
2005-04-11  6:57                 ` bert hubert
2005-04-11  7:20                   ` Christer Weinigel
2005-04-10 23:14             ` Paul Jackson
2005-04-10 23:38               ` Linus Torvalds
2005-04-11  0:19                 ` Paul Jackson
2005-04-11 15:49                 ` Randy.Dunlap
2005-04-11 18:30                   ` Petr Baudis
2005-04-11  0:10               ` Petr Baudis
2005-04-09 22:00 ` Paul Jackson
2005-04-09 23:21 ` Ralph Corderoy
2005-04-10  0:39   ` Paul Jackson
2005-04-10  1:14     ` Bernd Eckenfels
2005-04-10  1:33       ` Paul Jackson
2005-04-10 10:22     ` Ralph Corderoy
2005-04-10 17:30       ` Paul Jackson
2005-04-10 17:31 ` Rik van Riel
2005-04-10 17:35   ` Ingo Molnar
2005-04-11 16:46 ` ross

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050411113523.GA19256@elte.hu \
    --to=mingo@elte.hu \
    --cc=git@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pasky@ucw.cz \
    --cc=rddunlap@osdl.org \
    --cc=ross@jose.lug.udel.edu \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox