* diff'ing files ... @ 2011-06-06 18:49 Albretch Mueller 2011-06-06 22:43 ` Jeff King 0 siblings, 1 reply; 8+ messages in thread From: Albretch Mueller @ 2011-06-06 18:49 UTC (permalink / raw) To: git I research on collaborative editing/revision control utilities (related mostly to NLP) and at the corpora list ~ http://listserv.linguistlist.org/cgi-bin/wa?A1=ind1106&L=CORPORA#18 ~ ([Corpora-List] Managing texts and their edition history ...) they told me they use git. However at: ~ http://git-scm.com/documentation ~ I could not find specific information on what diff'ing algorithm does git use. Any white papers about git internals, mostly about its diff'ing and synching algorithm? (and I am not scared at all of "Directed Acyclic Graph" or any kind of Math terminology; in fact, I love it ;-)) ~ Also, of the wave of git related or general books coming out, which one actually explains the general concepts behind distributed version Control? Again, my ultimate interest is not computer programming/compiler-fed languages, but there is much that can be learned from them ~ Thank you lbrtchx ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-06 18:49 diff'ing files Albretch Mueller @ 2011-06-06 22:43 ` Jeff King 2011-06-07 22:12 ` Albretch Mueller 0 siblings, 1 reply; 8+ messages in thread From: Jeff King @ 2011-06-06 22:43 UTC (permalink / raw) To: Albretch Mueller; +Cc: git On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote: > I could not find specific information on what diff'ing algorithm does > git use. Any white papers about git internals, mostly about its > diff'ing and synching algorithm? (and I am not scared at all of > "Directed Acyclic Graph" or any kind of Math terminology; in fact, I > love it ;-)) It depends on what level you want to talk about. Each commit has a tree state represented by a single "tree object". Each tree object is a list of paths mapping to more objects, each of which may be a binary stream of data ("blob") or another tree object (i.e., a subtree). Diffing commits means diffing their trees. Diffing trees means looking for blobs at the same path in each commit and diffing them[1]. How we diff the content in the blobs depends on: 1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the environment, or a per-file diff driver; see "git help attributes"). In that case, we just show whatever output the external diff produces. 2. We use xdiff by default, which does a line-oriented diff using the Myers algorithm (just like most "diff" implementations). See xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference Algorith and its Variations". 3. If you give the --patience option, we use the patience algorithm instead: http://bramcohen.livejournal.com/73318.html That's the inventor's original description, but you can find other discussion by googling for "patience diff". 4. If you use --word-diff (or --color-words), we find differences by line, and then break lines down by configurable delimiters into words, and then do a Myers diff on the resulting list of words. 5. If files are not text, we usually just say "Binary files differ" (like most diff implementations). We can also generate binary diffs, though I don't know offhand the details of the algorithm. I think that more or less covers 2-way diffing. There's also a 3-way comparison when merging, both at the tree level and at the blob level. Syncing actually has very little to do with our regular diffing. We generate compact binary deltas (just like in (5) above) between objects, regardless of their path, and send the other side a set of needed objects. This is the same format used for compact "packfile" storage. If the last bit of that paragraph didn't make any sense, then read up on git's object model in something like "Pro Git": http://progit.org/book/ch9-0.html -Peff [1] Actually, tree comparisons are a little more complex than I made them out to be. For instance, a path might exist on only one side (which we usually show as a comparison to an empty blob), or might exist as a tree on one side and a blob on the other. We also look for renames of paths at the blob level by looking for similar blobs. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-06 22:43 ` Jeff King @ 2011-06-07 22:12 ` Albretch Mueller 2011-06-07 22:19 ` Jeff King 0 siblings, 1 reply; 8+ messages in thread From: Albretch Mueller @ 2011-06-07 22:12 UTC (permalink / raw) To: Jeff King; +Cc: git > ... binary diffs, though I don't know offhand the details of the algorithm. ~ this is the part that I need ;-) ~ Reading the source code without knowing the basic underlying ideas/algorithm (just an outline if possible) won't help much ~ Thank you lbrtchx ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-07 22:12 ` Albretch Mueller @ 2011-06-07 22:19 ` Jeff King 2011-06-10 2:19 ` Nicolas Pitre 0 siblings, 1 reply; 8+ messages in thread From: Jeff King @ 2011-06-07 22:19 UTC (permalink / raw) To: Albretch Mueller; +Cc: Nicolas Pitre, git On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote: > > ... binary diffs, though I don't know offhand the details of the algorithm. > ~ > this is the part that I need ;-) > ~ > Reading the source code without knowing the basic underlying > ideas/algorithm (just an outline if possible) won't help much You could read the comments in the source: $ head -n 7 diff-delta.c /* * diff-delta.c: generate a delta between two buffers * * This code was greatly inspired by parts of LibXDiff from Davide Libenzi * http://www.xmailserver.org/xdiff-lib.html * * Rewritten for GIT by Nicolas Pitre <nico@fluxnic.net>, (C) 2005-2007 According to the xdiff page linked: For binary files, LibXDiff implements both (with some modification) the algorithm described in File System Support for Delta Compression by Joshua P. MacDonald, and the algorithm described in Fingerprinting By Random Polynomials by Michael O. Rabin. Nicolas (cc'd) might be able to say what, if any, substantive changes were made from those works. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-07 22:19 ` Jeff King @ 2011-06-10 2:19 ` Nicolas Pitre 2011-06-10 13:12 ` Jakub Narebski 0 siblings, 1 reply; 8+ messages in thread From: Nicolas Pitre @ 2011-06-10 2:19 UTC (permalink / raw) To: Jeff King; +Cc: Albretch Mueller, git On Tue, 7 Jun 2011, Jeff King wrote: > On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote: > > > > ... binary diffs, though I don't know offhand the details of the algorithm. > > ~ > > this is the part that I need ;-) > > ~ > > Reading the source code without knowing the basic underlying > > ideas/algorithm (just an outline if possible) won't help much > > You could read the comments in the source: > > $ head -n 7 diff-delta.c > /* > * diff-delta.c: generate a delta between two buffers > * > * This code was greatly inspired by parts of LibXDiff from Davide Libenzi > * http://www.xmailserver.org/xdiff-lib.html > * > * Rewritten for GIT by Nicolas Pitre <nico@fluxnic.net>, (C) 2005-2007 > > According to the xdiff page linked: > > For binary files, LibXDiff implements both (with some modification) > the algorithm described in File System Support for Delta Compression > by Joshua P. MacDonald, and the algorithm described in Fingerprinting > By Random Polynomials by Michael O. Rabin. > > Nicolas (cc'd) might be able to say what, if any, substantive changes > were made from those works. The libxdiff code was pretty generic so to be highly portable and usable for many application types. What I did is to get rid of everything that git strictly didn't need in order to make the code as simple as possible, and most importantly as fast as possible. And then the code was optimized even further, sacrificing on clarity a bit, to make it even faster. Since this code is the very inner loop of every delta search for best delta matches, every bit of optimization counts. And then further modifications were made to avoid pathological corner cases which were taking too much time for little gain in the Git context. I also changed the output encoding to make it tighter. So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all. However the basic algorithm behind both implementations is the same. Studying the libxdiff version is probably easier in order to gain an understanding of how this works. Nicolas ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-10 2:19 ` Nicolas Pitre @ 2011-06-10 13:12 ` Jakub Narebski 2011-06-10 20:37 ` Nicolas Pitre 0 siblings, 1 reply; 8+ messages in thread From: Jakub Narebski @ 2011-06-10 13:12 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, Albretch Mueller, git Nicolas Pitre <nico@fluxnic.net> writes: > On Tue, 7 Jun 2011, Jeff King wrote: > > On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote: > > > > > > ... binary diffs, though I don't know offhand the details of the algorithm. > > > ~ > > > this is the part that I need ;-) [...] > > According to the xdiff page linked: > > > > For binary files, LibXDiff implements both (with some modification) > > the algorithm described in File System Support for Delta Compression > > by Joshua P. MacDonald, and the algorithm described in Fingerprinting > > By Random Polynomials by Michael O. Rabin. > > > > Nicolas (cc'd) might be able to say what, if any, substantive changes > > were made from those works. > > The libxdiff code was pretty generic so to be highly portable and usable > for many application types. What I did is to get rid of everything that > git strictly didn't need in order to make the code as simple as > possible, and most importantly as fast as possible. [...] > > And then further modifications were made to avoid pathological corner > cases which were taking too much time for little gain in the Git > context. > > I also changed the output encoding to make it tighter. Nicolas, do you know how binary diff used by git compares with respect to performance and compression with other binary diff algorithms: * original LibXDiff * bsdiff * xdelta (vcdif algorithm) * vbindiff -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-10 13:12 ` Jakub Narebski @ 2011-06-10 20:37 ` Nicolas Pitre 2011-06-12 1:02 ` Albretch Mueller 0 siblings, 1 reply; 8+ messages in thread From: Nicolas Pitre @ 2011-06-10 20:37 UTC (permalink / raw) To: Jakub Narebski; +Cc: Jeff King, Albretch Mueller, git On Fri, 10 Jun 2011, Jakub Narebski wrote: > Nicolas Pitre <nico@fluxnic.net> writes: > > The libxdiff code was pretty generic so to be highly portable and usable > > for many application types. What I did is to get rid of everything that > > git strictly didn't need in order to make the code as simple as > > possible, and most importantly as fast as possible. [...] > > > > And then further modifications were made to avoid pathological corner > > cases which were taking too much time for little gain in the Git > > context. > > > > I also changed the output encoding to make it tighter. > > Nicolas, do you know how binary diff used by git compares with respect > to performance and compression with other binary diff algorithms: > > * original LibXDiff > * bsdiff > * xdelta (vcdif algorithm) > * vbindiff No idea. But you can test that pretty easily if you wish. I would be interested in the results of course. Just do: make test-delta and then, to compress something: ./test-delta -d <input_file> <reference_file> <output_file> Of course this will produce <output_file> which is only the bare binary diff annotation data against the reference file. In Git that output is also deflated with zlib, before it is stored in a pack. The other binary diff tools are usually doing a similar post-deflation pass as well. It should be noted that the algorithm that Git uses won't produce the absolute smallest output. When I tried that, computation time went up of course, but surprizingly the final deflated result was slightly _larger_ as well, probably due to the fact that zlib was less efficient with the increased randomness in the tighter delta output to deflate. Nicolas ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: diff'ing files ... 2011-06-10 20:37 ` Nicolas Pitre @ 2011-06-12 1:02 ` Albretch Mueller 0 siblings, 0 replies; 8+ messages in thread From: Albretch Mueller @ 2011-06-12 1:02 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jakub Narebski, Jeff King, git > further modifications were made to avoid pathological corner cases ... ~ Nicolas, did you keep test samples of those pathological corner cases you would share? ~ lbrtchx ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-06-12 1:02 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-06-06 18:49 diff'ing files Albretch Mueller 2011-06-06 22:43 ` Jeff King 2011-06-07 22:12 ` Albretch Mueller 2011-06-07 22:19 ` Jeff King 2011-06-10 2:19 ` Nicolas Pitre 2011-06-10 13:12 ` Jakub Narebski 2011-06-10 20:37 ` Nicolas Pitre 2011-06-12 1:02 ` Albretch Mueller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).