git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Albretch Mueller <lbrtchx@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: diff'ing files ...
Date: Mon, 6 Jun 2011 18:43:56 -0400	[thread overview]
Message-ID: <20110606224356.GC13697@sigill.intra.peff.net> (raw)
In-Reply-To: <BANLkTi=1vaoLVmhyahDttmUmqw7RTp=8-A@mail.gmail.com>

On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:

>  I could not find specific information on what diff'ing algorithm does
> git use. Any white papers about git internals, mostly about its
> diff'ing and synching algorithm? (and I am not scared at all of
> "Directed Acyclic Graph" or any kind of Math terminology; in fact, I
> love it ;-))

It depends on what level you want to talk about. Each commit has a tree
state represented by a single "tree object". Each tree object is a list
of paths mapping to more objects, each of which may be a binary stream
of data ("blob") or another tree object (i.e., a subtree).

Diffing commits means diffing their trees. Diffing trees means looking
for blobs at the same path in each commit and diffing them[1]. How we
diff the content in the blobs depends on:

  1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the
     environment, or a per-file diff driver; see "git help attributes").
     In that case, we just show whatever output the external diff
     produces.

  2. We use xdiff by default, which does a line-oriented diff using the
     Myers algorithm (just like most "diff" implementations). See
     xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference
     Algorith and its Variations".

  3. If you give the --patience option, we use the patience algorithm
     instead:

       http://bramcohen.livejournal.com/73318.html

     That's the inventor's original description, but you can find other
     discussion by googling for "patience diff".

  4. If you use --word-diff (or --color-words), we find differences by
     line, and then break lines down by configurable delimiters into
     words, and then do a Myers diff on the resulting list of words.

  5. If files are not text, we usually just say "Binary files differ"
     (like most diff implementations). We can also generate binary
     diffs, though I don't know offhand the details of the algorithm.

I think that more or less covers 2-way diffing. There's also a 3-way
comparison when merging, both at the tree level and at the blob level.
Syncing actually has very little to do with our regular diffing. We
generate compact binary deltas (just like in (5) above) between objects,
regardless of their path, and send the other side a set of needed
objects. This is the same format used for compact "packfile" storage.
If the last bit of that paragraph didn't make any sense, then read up on
git's object model in something like "Pro Git":

  http://progit.org/book/ch9-0.html

-Peff

[1] Actually, tree comparisons are a little more complex than I made
them out to be. For instance, a path might exist on only one side (which
we usually show as a comparison to an empty blob), or might exist as a
tree on one side and a blob on the other. We also look for renames of
paths at the blob level by looking for similar blobs.

  reply	other threads:[~2011-06-06 22:44 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-06 18:49 diff'ing files Albretch Mueller
2011-06-06 22:43 ` Jeff King [this message]
2011-06-07 22:12   ` Albretch Mueller
2011-06-07 22:19     ` Jeff King
2011-06-10  2:19       ` Nicolas Pitre
2011-06-10 13:12         ` Jakub Narebski
2011-06-10 20:37           ` Nicolas Pitre
2011-06-12  1:02             ` Albretch Mueller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110606224356.GC13697@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=lbrtchx@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).