From: Jakub Narebski <jnareb@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org,
Linus Torvalds <torvalds@linux-foundation.org>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: How would Git chapter look like in "The Architecture of Open Source Applications"?
Date: Mon, 30 May 2011 12:30:10 +0200
Message-ID: <201105301230.11772.jnareb@gmail.com>
In-Reply-To: <20110530034044.GC27691@sigill.intra.peff.net>
On Mon, 30 May 2011, Jeff King <peff@peff.net> wrote:
> On Sat, May 28, 2011 at 02:17:38PM +0200, Jakub Narebski wrote:
>
> > Among covered programs is Mercurial (chapter by Dirkjan Ochtman)...
> > but unfortunately no Git (they probably thought that one DVCS is enough).
> >
> > How would such chapter on Git look like? Authors of this book
> > encourage (among others) to write new chapters.
>
> I just skimmed the Mercurial chapter, but they do cover a fair bit of
> general DVCS architecture. For git, I would guess a good approach would
> be to describe the data structures (i.e., content-addressable object
> database, DAG of commits, refs storing branches and tags), as everything
> else falls out from there. Most of the basic commands can be explained
> as "do some simple operation to the history graph or object db" and the
> more complex commands are compositions of the simple ones. So the
> architecture is really about having a data structure that represents the
> problem, exposing it to the user, and then building some niceties around
> the basic data structure operations.
The repository model that Git uses is quite well described in "Pro Git",
in the "Discussion" section of the git(1) manpage, in the "Git concepts"
section of the Git User's Manual, and in gitcore-tutorial(7).
What I am more interested in are the design *goals*, i.e. the reasoning
behind choosing this architecture and not another.
The chapter on Mercurial, in the '12.2. Data Structures > 12.2.1. Challenges'
subsection, lists the limiting technology factors (quoting [Mac06]):
* speed: CPU
* capacity: disk and memory
* bandwidth: memory, LAN, disk, and WAN
* disk seek rate
This was for Mercurial; from what I remember from KernelTrap articles,
which covered the beginnings of Git development quite well, and from other
sources, the main limiting factor considered for Git was __speed__.
Not disk space. At first Git had only the 'loose' format -- do you remember
the "disk space is cheap" comment by Linus? Admittedly Git used zlib
compression from the very beginning (which works well for text). IIRC,
when the _model_ that Git uses for the repository was first being drafted,
LAN/WAN bandwidth wasn't a consideration; AFAIK the first transport that Git
used was the nowadays deprecated rsync:// (the UNIX philosophy of prototyping
and developing using existing ready tools, see [TAOUP], [Ben86]).
I think it was assumed that the operating system would be good enough that
we don't have to worry about seek rates: Git is optimized for the "hot cache"
case. Note however that adoption of the 'packed' format as the on-disk format
was driven by speed (disk seek rate) as well as disk capacity, i.e.
reducing repository size. Well, at least from what I remember.
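To make the 'loose' format concrete, here is a minimal sketch of how a
single object is named and stored, written against how current Git lays
out loose blobs (the very first prototype may have differed in details).
The function name `loose_object` and its return shape are made up for
illustration; only the "<type> <size>\0" header, the SHA-1 naming, and
the zlib compression come from Git itself.

```python
import hashlib
import zlib


def loose_object(content: bytes, kind: str = "blob"):
    """Return (object id, relative path, compressed bytes) for one object.

    Hypothetical helper for illustration; it does not write to disk.
    """
    # Git prefixes the payload with "<type> <size>\0" before hashing.
    header = f"{kind} {len(content)}\0".encode("ascii")
    store = header + content

    # The object name is the SHA-1 of header + content, so identical
    # content always maps to the same name (content addressing).
    object_id = hashlib.sha1(store).hexdigest()

    # Each loose object is stored as its own zlib-compressed file.
    compressed = zlib.compress(store)

    # The first two hex digits become a fan-out directory name.
    path = f"objects/{object_id[:2]}/{object_id[2:]}"
    return object_id, path, compressed


if __name__ == "__main__":
    oid, path, data = loose_object(b"hello world\n")
    print(oid, path, len(data))
```

One file per object, with no deltas against other objects, is what makes
this format simple and fast to write but costly in disk space and seeks,
which is the trade-off the 'packed' format later addressed.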
The Mercurial chapter's '12.2.1. Challenges' subsection continues:
The paper [i.e. [Mac06]] goes on to review common scenarios or
criteria for evaluating the performance of such a system at
the file level:
* Storage compression: what kind of compression is best suited
to save the file history on disk? Effectively, what algorithm
makes the most out of the I/O performance while preventing
CPU time from becoming a bottleneck?
* Retrieving arbitrary file revisions: a number of version control
systems will store a given revision in such a way that a large
number of older revisions must be read to reconstruct the newer
one (using deltas). We want to control this to make sure that
retrieving old revisions is still fast.
* Adding file revisions: we regularly add new revisions. We don't
want to rewrite old revisions every time we add a new one, because
that would become too slow when there are many revisions.
* Showing file history: we want to be able to review a history of
all changesets that touched a certain file. This also allows us
to do annotations (which used to be called `blame` in CVS but was
renamed to `annotate` in some later systems to remove the negative
connotation): reviewing the originating changeset for each line
currently in a file.
From what *I* understand, Linus approached the problem of DVCS design
from a different direction: he is a maintainer rather than an ordinary
developer, and (from what he said) a filesystem designer at heart, not a
version control developer. Thus the common scenarios or criteria were different:
* Merging and applying patches
* Showing _subsystem_ history
* ???
That is what I am interested in.
Some of Git's history, and I think some of the motivations behind its
design, can be found in the "Git Chronicle" slides by Junio from GitTogether.
> Of course that's just my perspective. Linus might have written something
> totally different. :)
Well, only Linus can be the definitive source for the initial *design goals*
(behind the core design of Git)...
References:
~~~~~~~~~~~
[Mac06]: Matt Mackall: "Towards a Better SCM: Revlog and Mercurial".
2006 Ottawa Linux Symposium, 2006.
http://selenic.com/mercurial/wiki/index.cgi/Presentations?action=AttachFile&do=get&target=ols-mercurial-paper.pdf
(see also http://mercurial.selenic.com/wiki/Presentations)
[TAOUP]: Eric Raymond: "The Art of Unix Programming", 2003
http://www.faqs.org/docs/artu/
http://www.catb.org/~esr/writings/taoup/
[Ben86]: Jon Bentley: "Programming Pearls", chapter about implementing
and prototyping UNIX 'spell' program (from Polish translation).
--
Jakub Narebski
Poland