diff'ing files ...

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* diff'ing files ...
@ 2011-06-06 18:49 Albretch Mueller
  2011-06-06 22:43 ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Albretch Mueller @ 2011-06-06 18:49 UTC (permalink / raw)
  To: git

 I research on collaborative editing/revision control utilities
(related mostly to NLP) and at the corpora list
~
 http://listserv.linguistlist.org/cgi-bin/wa?A1=ind1106&L=CORPORA#18
~
 ([Corpora-List] Managing texts and their edition history ...) they
told me they use git. However at:
~
 http://git-scm.com/documentation
~
 I could not find specific information on what diff'ing algorithm does
git use. Any white papers about git internals, mostly about its
diff'ing and synching algorithm? (and I am not scared at all of
"Directed Acyclic Graph" or any kind of Math terminology; in fact, I
love it ;-))
~
 Also, of the wave of git related or general books coming out, which
one actually explains the general concepts behind distributed version
Control? Again, my ultimate interest is not computer
programming/compiler-fed languages, but there is much that can be
learned from them
~
 Thank you
 lbrtchx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-06 18:49 diff'ing files Albretch Mueller
@ 2011-06-06 22:43 ` Jeff King
  2011-06-07 22:12   ` Albretch Mueller
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2011-06-06 22:43 UTC (permalink / raw)
  To: Albretch Mueller; +Cc: git

On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:

>  I could not find specific information on what diff'ing algorithm does
> git use. Any white papers about git internals, mostly about its
> diff'ing and synching algorithm? (and I am not scared at all of
> "Directed Acyclic Graph" or any kind of Math terminology; in fact, I
> love it ;-))

It depends on what level you want to talk about. Each commit has a tree
state represented by a single "tree object". Each tree object is a list
of paths mapping to more objects, each of which may be a binary stream
of data ("blob") or another tree object (i.e., a subtree).

Diffing commits means diffing their trees. Diffing trees means looking
for blobs at the same path in each commit and diffing them[1]. How we
diff the content in the blobs depends on:

  1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the
     environment, or a per-file diff driver; see "git help attributes").
     In that case, we just show whatever output the external diff
     produces.

  2. We use xdiff by default, which does a line-oriented diff using the
     Myers algorithm (just like most "diff" implementations). See
     xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference
     Algorith and its Variations".

  3. If you give the --patience option, we use the patience algorithm
     instead:

       http://bramcohen.livejournal.com/73318.html

     That's the inventor's original description, but you can find other
     discussion by googling for "patience diff".

  4. If you use --word-diff (or --color-words), we find differences by
     line, and then break lines down by configurable delimiters into
     words, and then do a Myers diff on the resulting list of words.

  5. If files are not text, we usually just say "Binary files differ"
     (like most diff implementations). We can also generate binary
     diffs, though I don't know offhand the details of the algorithm.

I think that more or less covers 2-way diffing. There's also a 3-way
comparison when merging, both at the tree level and at the blob level.
Syncing actually has very little to do with our regular diffing. We
generate compact binary deltas (just like in (5) above) between objects,
regardless of their path, and send the other side a set of needed
objects. This is the same format used for compact "packfile" storage.
If the last bit of that paragraph didn't make any sense, then read up on
git's object model in something like "Pro Git":

  http://progit.org/book/ch9-0.html

-Peff

[1] Actually, tree comparisons are a little more complex than I made
them out to be. For instance, a path might exist on only one side (which
we usually show as a comparison to an empty blob), or might exist as a
tree on one side and a blob on the other. We also look for renames of
paths at the blob level by looking for similar blobs.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-06 22:43 ` Jeff King
@ 2011-06-07 22:12   ` Albretch Mueller
  2011-06-07 22:19     ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Albretch Mueller @ 2011-06-07 22:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> ... binary diffs, though I don't know offhand the details of the algorithm.
~
 this is the part that I need ;-)
~
 Reading the source code without knowing the basic underlying
ideas/algorithm (just an outline if possible) won't help much
~
 Thank you
 lbrtchx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-07 22:12   ` Albretch Mueller
@ 2011-06-07 22:19     ` Jeff King
  2011-06-10  2:19       ` Nicolas Pitre
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2011-06-07 22:19 UTC (permalink / raw)
  To: Albretch Mueller; +Cc: Nicolas Pitre, git

On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:

> > ... binary diffs, though I don't know offhand the details of the algorithm.
> ~
>  this is the part that I need ;-)
> ~
>  Reading the source code without knowing the basic underlying
> ideas/algorithm (just an outline if possible) won't help much

You could read the comments in the source:

  $ head -n 7 diff-delta.c
  /*
   * diff-delta.c: generate a delta between two buffers
   *
   * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
   * http://www.xmailserver.org/xdiff-lib.html
   *
   * Rewritten for GIT by Nicolas Pitre <nico@fluxnic.net>, (C) 2005-2007

According to the xdiff page linked:

  For binary files, LibXDiff implements both (with some modification)
  the algorithm described in File System Support for Delta Compression
  by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
  By Random Polynomials by Michael O. Rabin.

Nicolas (cc'd) might be able to say what, if any, substantive changes
were made from those works.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-07 22:19     ` Jeff King
@ 2011-06-10  2:19       ` Nicolas Pitre
  2011-06-10 13:12         ` Jakub Narebski
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2011-06-10  2:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Albretch Mueller, git

On Tue, 7 Jun 2011, Jeff King wrote:

> On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:
> 
> > > ... binary diffs, though I don't know offhand the details of the algorithm.
> > ~
> >  this is the part that I need ;-)
> > ~
> >  Reading the source code without knowing the basic underlying
> > ideas/algorithm (just an outline if possible) won't help much
> 
> You could read the comments in the source:
> 
>   $ head -n 7 diff-delta.c
>   /*
>    * diff-delta.c: generate a delta between two buffers
>    *
>    * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
>    * http://www.xmailserver.org/xdiff-lib.html
>    *
>    * Rewritten for GIT by Nicolas Pitre <nico@fluxnic.net>, (C) 2005-2007
> 
> According to the xdiff page linked:
> 
>   For binary files, LibXDiff implements both (with some modification)
>   the algorithm described in File System Support for Delta Compression
>   by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
>   By Random Polynomials by Michael O. Rabin.
> 
> Nicolas (cc'd) might be able to say what, if any, substantive changes
> were made from those works.

The libxdiff code was pretty generic so to be highly portable and usable 
for many application types.  What I did is to get rid of everything that 
git strictly didn't need in order to make the code as simple as 
possible, and most importantly as fast as possible.  And then the code 
was optimized even further, sacrificing on clarity a bit, to make it 
even faster.  Since this code is the very inner loop of every delta 
search for best delta matches, every bit of optimization counts.

And then further modifications were made to avoid pathological corner 
cases which were taking too much time for little gain in the Git 
context.

I also changed the output encoding to make it tighter.

So, strictly speaking, the current code in Git doesn't bear any 
resemblance with the libxdiff code at all.  However the basic algorithm 
behind both implementations is the same.  Studying the libxdiff version 
is probably easier in order to gain an understanding of how this works.

Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-10  2:19       ` Nicolas Pitre
@ 2011-06-10 13:12         ` Jakub Narebski
  2011-06-10 20:37           ` Nicolas Pitre
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Narebski @ 2011-06-10 13:12 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jeff King, Albretch Mueller, git

Nicolas Pitre <nico@fluxnic.net> writes:
> On Tue, 7 Jun 2011, Jeff King wrote:
> > On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:
> > 
> > > > ... binary diffs, though I don't know offhand the details of the algorithm.
> > > ~
> > >  this is the part that I need ;-)

[...]

> > According to the xdiff page linked:
> > 
> >   For binary files, LibXDiff implements both (with some modification)
> >   the algorithm described in File System Support for Delta Compression
> >   by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
> >   By Random Polynomials by Michael O. Rabin.
> > 
> > Nicolas (cc'd) might be able to say what, if any, substantive changes
> > were made from those works.
> 
> The libxdiff code was pretty generic so to be highly portable and usable 
> for many application types.  What I did is to get rid of everything that 
> git strictly didn't need in order to make the code as simple as 
> possible, and most importantly as fast as possible. [...]
> 
> And then further modifications were made to avoid pathological corner 
> cases which were taking too much time for little gain in the Git 
> context.
> 
> I also changed the output encoding to make it tighter.

Nicolas, do you know how binary diff used by git compares with respect
to performance and compression with other binary diff algorithms:

  * original LibXDiff
  * bsdiff
  * xdelta (vcdif algorithm)
  * vbindiff

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-10 13:12         ` Jakub Narebski
@ 2011-06-10 20:37           ` Nicolas Pitre
  2011-06-12  1:02             ` Albretch Mueller
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2011-06-10 20:37 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Jeff King, Albretch Mueller, git

On Fri, 10 Jun 2011, Jakub Narebski wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> > The libxdiff code was pretty generic so to be highly portable and usable 
> > for many application types.  What I did is to get rid of everything that 
> > git strictly didn't need in order to make the code as simple as 
> > possible, and most importantly as fast as possible. [...]
> > 
> > And then further modifications were made to avoid pathological corner 
> > cases which were taking too much time for little gain in the Git 
> > context.
> > 
> > I also changed the output encoding to make it tighter.
> 
> Nicolas, do you know how binary diff used by git compares with respect
> to performance and compression with other binary diff algorithms:
> 
>   * original LibXDiff
>   * bsdiff
>   * xdelta (vcdif algorithm)
>   * vbindiff

No idea.  But you can test that pretty easily if you wish.  I would be 
interested in the results of course. Just do:

	make test-delta

and then, to compress something:

	./test-delta -d <input_file> <reference_file> <output_file>

Of course this will produce <output_file> which is only the bare binary 
diff annotation data against the reference file.  In Git that output is 
also deflated with zlib, before it is stored in a pack.  The other 
binary diff tools are usually doing a similar post-deflation pass as 
well.

It should be noted that the algorithm that Git uses won't produce the 
absolute smallest output.  When I tried that, computation time went up 
of course, but surprizingly the final deflated result was slightly 
_larger_ as well, probably due to the fact that zlib was less efficient 
with the increased randomness in the tighter delta output to deflate.

Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: diff'ing files ...
  2011-06-10 20:37           ` Nicolas Pitre
@ 2011-06-12  1:02             ` Albretch Mueller
  0 siblings, 0 replies; 8+ messages in thread
From: Albretch Mueller @ 2011-06-12  1:02 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Jeff King, git

> further modifications were made to avoid pathological corner cases ...
~
 Nicolas, did you keep test samples of those pathological corner cases
you would share?
~
 lbrtchx

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-06-12  1:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-06 18:49 diff'ing files Albretch Mueller
2011-06-06 22:43 ` Jeff King
2011-06-07 22:12   ` Albretch Mueller
2011-06-07 22:19     ` Jeff King
2011-06-10  2:19       ` Nicolas Pitre
2011-06-10 13:12         ` Jakub Narebski
2011-06-10 20:37           ` Nicolas Pitre
2011-06-12  1:02             ` Albretch Mueller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).