From: Jeff King <peff@peff.net>
To: Albretch Mueller <lbrtchx@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: diff'ing files ...
Date: Mon, 6 Jun 2011 18:43:56 -0400 [thread overview]
Message-ID: <20110606224356.GC13697@sigill.intra.peff.net> (raw)
In-Reply-To: <BANLkTi=1vaoLVmhyahDttmUmqw7RTp=8-A@mail.gmail.com>
On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:
> I could not find specific information on what diff'ing algorithm does
> git use. Any white papers about git internals, mostly about its
> diff'ing and synching algorithm? (and I am not scared at all of
> "Directed Acyclic Graph" or any kind of Math terminology; in fact, I
> love it ;-))
It depends on what level you want to talk about. Each commit has a tree
state represented by a single "tree object". Each tree object is a list
of paths mapping to more objects, each of which may be a binary stream
of data ("blob") or another tree object (i.e., a subtree).
Diffing commits means diffing their trees. Diffing trees means looking
for blobs at the same path in each commit and diffing them[1]. How we
diff the content in the blobs depends on:
1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the
environment, or a per-file diff driver; see "git help attributes").
In that case, we just show whatever output the external diff
produces.
2. We use xdiff by default, which does a line-oriented diff using the
Myers algorithm (just like most "diff" implementations). See
xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference
Algorith and its Variations".
3. If you give the --patience option, we use the patience algorithm
instead:
http://bramcohen.livejournal.com/73318.html
That's the inventor's original description, but you can find other
discussion by googling for "patience diff".
4. If you use --word-diff (or --color-words), we find differences by
line, and then break lines down by configurable delimiters into
words, and then do a Myers diff on the resulting list of words.
5. If files are not text, we usually just say "Binary files differ"
(like most diff implementations). We can also generate binary
diffs, though I don't know offhand the details of the algorithm.
I think that more or less covers 2-way diffing. There's also a 3-way
comparison when merging, both at the tree level and at the blob level.
Syncing actually has very little to do with our regular diffing. We
generate compact binary deltas (just like in (5) above) between objects,
regardless of their path, and send the other side a set of needed
objects. This is the same format used for compact "packfile" storage.
If the last bit of that paragraph didn't make any sense, then read up on
git's object model in something like "Pro Git":
http://progit.org/book/ch9-0.html
-Peff
[1] Actually, tree comparisons are a little more complex than I made
them out to be. For instance, a path might exist on only one side (which
we usually show as a comparison to an empty blob), or might exist as a
tree on one side and a blob on the other. We also look for renames of
paths at the blob level by looking for similar blobs.
next prev parent reply other threads:[~2011-06-06 22:44 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-06-06 18:49 diff'ing files Albretch Mueller
2011-06-06 22:43 ` Jeff King [this message]
2011-06-07 22:12 ` Albretch Mueller
2011-06-07 22:19 ` Jeff King
2011-06-10 2:19 ` Nicolas Pitre
2011-06-10 13:12 ` Jakub Narebski
2011-06-10 20:37 ` Nicolas Pitre
2011-06-12 1:02 ` Albretch Mueller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110606224356.GC13697@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=lbrtchx@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).