A better approach to diffing and merging

All of lore.kernel.org
 help / color / mirror / Atom feed

* A better approach to diffing and merging
@ 2008-11-29 18:12 Ian Clarke
  2008-11-29 23:40 ` Boyd Stephen Smith Jr.
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Ian Clarke @ 2008-11-29 18:12 UTC (permalink / raw)
  To: git

Apologies if this is off-topic, but I recently had an idea for a
better way to do diffs and merging which I thought may be of interest
to those on this list.

I described it in a blog entry here: http://budurl.com/jyt6

For your convenience text is pasted below (although missing a few hyperlinks):

A plan for better source code diffs and merging
======================================

I've been using Subversion for years, but a few months ago I was
really starting to feel the limitations of being able to create and
merge branches easily. I'd heard that Git made this very easy indeed,
and so I decided to try it.

Anyway, this isn't yet another "I discovered Git and now I've achieved
self-actualization" blog post, so to cut a long story short, I now use
git for everything (together with the excellent GitHub).

Even though merging is a lot better with Git than Subversion, I've
still found myself getting into situations where it requires a lot of
work to merge a branch back into another branch, and it got me
thinking about better ways to do merging.

While I'm no merging expert, it seems that most merging algorithms do
it on a line-by-line basis, treating source code as nothing but a list
of lines of text.  It got me thinking, what if the merging algorithm
understood the structure of the source code it is trying to merge?

So the idea is this:

Provide the merge algorithm with the grammar of the programming
language, perhaps in the form of a Bison grammar file, or some other
standardized way to represent a grammar.

The merge algorithm then uses this to parse the files to be diffed
and/or merged into trees, and then the diff and merge are treated as
operations on these trees.  These operations may include creating,
deleting, or moving nodes or branches, renaming nodes, etc.  There has
been quite a bit (pdf) of academic research on this topic, although I
haven't yet found off-the-shelf code that will do what we need.
Still, it shouldn't be terribly hard to implement.

The beauty of this approach is that the merge algorithm should be far
less likely to be confused by formatting changes, and much more likely
to correctly identify what has changed.

I can't think of any reason that such a tool wouldn't work in the
exact same way as existing diff/merge tools from the programmer's
perspective. The tool would automatically select the correct grammar
based on the file extension, or fall-back to line-based diffs if the
extension is unrecognized (or the file isn't valid for the selected
grammar). Thus, it should be trivial to use this new tool with
existing version control systems.

I'd love to have the time to implement this, although regretfully it
is at the bottom of a very large pile of "some day" projects.  I think
this is an interesting enough idea, and one that would be immediately
useful, that if I put it out there someone somewhere might be able to
make it a reality.

Any takers? I've set up a Google Group for further discussion, please
join if interested.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-11-29 18:12 A better approach to diffing and merging Ian Clarke
@ 2008-11-29 23:40 ` Boyd Stephen Smith Jr.
  2008-11-30  2:54   ` Miklos Vajna
  2008-11-30  1:56 ` Brian Dessent
  2008-12-01 11:41 ` Jakub Narebski
  2 siblings, 1 reply; 7+ messages in thread
From: Boyd Stephen Smith Jr. @ 2008-11-29 23:40 UTC (permalink / raw)
  To: Ian Clarke; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2374 bytes --]

On Saturday 2008 November 29 12:12, Ian Clarke wrote:
> While I'm no merging expert, it seems that most merging algorithms do
> it on a line-by-line basis, treating source code as nothing but a list
> of lines of text.  It got me thinking, what if the merging algorithm
> understood the structure of the source code it is trying to merge?

Unfortunately, this is hard to do in general.  Not impossible, but very hard.  
Heck, some languages don't really have a formal grammar, or have one that is 
undecidable without doing deeper analysis.  Perl 6 is supposed to have some 
support for language constructs that change the grammar.

Also, this generally takes a lot of time.  Automatic merges are only useful if 
they take less (or only a little more) time than doing the merge manually.  
If your mergetool has to think about something for 30 minutes that you could 
have resolved in 5, it's not normally a "win".

Also, it slightly changes the format of a "patch" file.  Currently, patch 
files are a line-by-line diff.  If you instead made changes based on mapping 
parse trees to parse trees, you'd (probably) want to 
store/transfer/communicate your patches using a different format, to preserve 
the proper amount of "context" and make the patch easy to apply.  (I.e., do 
the hard work once.)

> Any takers? I've set up a Google Group for further discussion, please
> join if interested.

You might look deeper into Darcs development.  This level of 
pluggable "understanding" of the file(s) being modified fits in well with a 
Grand Unified Theory of Patching.  Also "understanding" patches better allows 
Darcs to reorder patches (and calculate "reverse patches") better -- reducing 
the time to do existing automatic merging (or reject the merge as 
non-automatable) and make merges automatic that are currently not handled 
automatically.

I'm not going to come out and discourage you or other from adding the 
functionality to git, but I think there are more useful and practical ways to 
improve git.  (Line-by-line merging is generally "good enough", the worst 
enemy of "good" software.)
-- 
Boyd Stephen Smith Jr.                     ,= ,-_-. =. 
bss03@volumehost.net                      ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy           `-'(. .)`-' 
http://iguanasuicide.org/                      \_/     

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-11-29 18:12 A better approach to diffing and merging Ian Clarke
  2008-11-29 23:40 ` Boyd Stephen Smith Jr.
@ 2008-11-30  1:56 ` Brian Dessent
  2008-12-01  9:54   ` Karl Hasselström
  2008-12-01 11:41 ` Jakub Narebski
  2 siblings, 1 reply; 7+ messages in thread
From: Brian Dessent @ 2008-11-30  1:56 UTC (permalink / raw)
  To: Ian Clarke; +Cc: git

Ian Clarke wrote:

> Provide the merge algorithm with the grammar of the programming
> language, perhaps in the form of a Bison grammar file, or some other
> standardized way to represent a grammar.
> 
> The merge algorithm then uses this to parse the files to be diffed
> and/or merged into trees, and then the diff and merge are treated as
> operations on these trees.  These operations may include creating,
> deleting, or moving nodes or branches, renaming nodes, etc.  There has
> been quite a bit (pdf) of academic research on this topic, although I
> haven't yet found off-the-shelf code that will do what we need.
> Still, it shouldn't be terribly hard to implement.

There's a huge flaw in that approach for C/C++: in order to parse C/C++
you have to first preprocess it -- consider the twisty mazes that
#ifdef/#else/#endif can create.  But in order to preprocess source code
you need a whole heap of extra information that is not in the repository
(or if it is, cannot be automatically extracted.)

For example, you'd have to know all the -D/-U/-I flags that the makefile
or the user might pass to the compiler.  You'd have to replicate the
compiler's complicated header search path algorithm, which can depend on
the directives in the code as well as command line arguments,
environment variables, and values specific to the toolchain.  (Don't
forget that you can have code in a repository that's meant to be
cross-compiled and which uses a toolchain that has its own headers and
not the ones in /usr/include.)  You'd have to know all the built-in
predefined symbols of that toolchain, e.g. what's the value of
__GNUC_MINOR__ or __GNUC_PATCHLEVEL__, is __mips__ or __i386__ defined,
and on and on.  And of course the natural conclusion of this
progression: a change can be perfectly grammatically correct for one
particular platform/toolchain/setting of CFLAGS, and completely broken
for another.  There's no way for a VCS to know any of this, it takes
human comprehension.

If you look at a tool like doxygen that attempts to parse C/C++, it
don't actually do full preprocessing, only a very limited subset: it
only expands macros that the user names as relevant in the config file,
and it only preprocesses included headers that match a pathspec the user
provides.  Consequently it cannot fully parse the code to see if it's
grammatically correct, only to the limited extent that it can infer the
location where things appear to be defined.  And it is easily confused,
e.g. it will "see" code in both halves of an #ifdef section if it wasn't
told anything about the value of the macro in the config file, which can
cause it to incorrectly think that a function or variable was defined
there when in reality that section was discarded.

The idea may have value for langauges that are easy to parse and do not
have all this preprocessor cruft, but I just don't see how it would be
able to provide anything useful for non-trivial changes to real world
C/C++, which require human eyes to decipher.

Brian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-11-29 23:40 ` Boyd Stephen Smith Jr.
@ 2008-11-30  2:54   ` Miklos Vajna
  0 siblings, 0 replies; 7+ messages in thread
From: Miklos Vajna @ 2008-11-30  2:54 UTC (permalink / raw)
  To: Boyd Stephen Smith Jr.; +Cc: Ian Clarke, git

[-- Attachment #1: Type: text/plain, Size: 1187 bytes --]

On Sat, Nov 29, 2008 at 05:40:02PM -0600, "Boyd Stephen Smith Jr." <bss03@volumehost.net> wrote:
> You might look deeper into Darcs development.  This level of 
> pluggable "understanding" of the file(s) being modified fits in well with a 
> Grand Unified Theory of Patching.  Also "understanding" patches better allows 
> Darcs to reorder patches (and calculate "reverse patches") better -- reducing 
> the time to do existing automatic merging (or reject the merge as 
> non-automatable) and make merges automatic that are currently not handled 
> automatically.
> 
> I'm not going to come out and discourage you or other from adding the 
> functionality to git, but I think there are more useful and practical ways to 
> improve git.  (Line-by-line merging is generally "good enough", the worst 
> enemy of "good" software.)

I think this was already discussed:

http://thread.gmane.org/gmane.comp.version-control.git/60457/focus=60512

If you mean just looking at the code moves/copies between the trees (but
no other history), then a merge strategy which makes use of git blame's
code move/copy detection would be indeed nice, though nobody created it
so far.

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-11-30  1:56 ` Brian Dessent
@ 2008-12-01  9:54   ` Karl Hasselström
  0 siblings, 0 replies; 7+ messages in thread
From: Karl Hasselström @ 2008-12-01  9:54 UTC (permalink / raw)
  To: Brian Dessent; +Cc: Ian Clarke, git

On 2008-11-29 17:56:44 -0800, Brian Dessent wrote:

> Ian Clarke wrote:
>
> > Provide the merge algorithm with the grammar of the programming
> > language, perhaps in the form of a Bison grammar file, or some
> > other standardized way to represent a grammar.
>
> There's a huge flaw in that approach for C/C++: in order to parse
> C/C++ you have to first preprocess it -- consider the twisty mazes
> that #ifdef/#else/#endif can create. But in order to preprocess
> source code you need a whole heap of extra information that is not
> in the repository (or if it is, cannot be automatically extracted.)

But it's probably not necessary to parse the input files exactly. All
you have to do is parse it well enough that the diff of the parse
trees is interesting.

And in practice, you'd probably also generate the "normal" diff, and
then fall back to that one if the parse tree diff was worse.

> The idea may have value for langauges that are easy to parse and do
> not have all this preprocessor cruft, but I just don't see how it
> would be able to provide anything useful for non-trivial changes to
> real world C/C++, which require human eyes to decipher.

I think it could work. But there would be quite a bit of heuristics
involved to get the "approximate" parsing right, so I'm pretty sure
there's no way to find out without actually trying to build the thing.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-11-29 18:12 A better approach to diffing and merging Ian Clarke
  2008-11-29 23:40 ` Boyd Stephen Smith Jr.
  2008-11-30  1:56 ` Brian Dessent
@ 2008-12-01 11:41 ` Jakub Narebski
  2008-12-02  8:37   ` Karl Hasselström
  2 siblings, 1 reply; 7+ messages in thread
From: Jakub Narebski @ 2008-12-01 11:41 UTC (permalink / raw)
  To: Ian Clarke; +Cc: git

"Ian Clarke" <ian.clarke@gmail.com> writes:

> Apologies if this is off-topic, but I recently had an idea for a
> better way to do diffs and merging which I thought may be of interest
> to those on this list.

[...]
> While I'm no merging expert, it seems that most merging algorithms do
> it on a line-by-line basis, treating source code as nothing but a list
> of lines of text.  It got me thinking, what if the merging algorithm
> understood the structure of the source code it is trying to merge?
> 
> So the idea is this:
> 
> Provide the merge algorithm with the grammar of the programming
> language, perhaps in the form of a Bison grammar file, or some other
> standardized way to represent a grammar.
> 
> The merge algorithm then uses this to parse the files to be diffed
> and/or merged into trees, and then the diff and merge are treated as
> operations on these trees.  These operations may include creating,
> deleting, or moving nodes or branches, renaming nodes, etc.  There has
> been quite a bit (pdf) of academic research on this topic, although I
> haven't yet found off-the-shelf code that will do what we need.

First, as Brian Dessent said it would be hard to generate parse tree
in the presence of compile-time configuration (using preprocessor
in C/C++, but in principle this applies to programs in any language;
not only you have to know conditionals, but also compile options).
And for dynamic languages you would have to take care about
self-modifying programs.

Second, from what I understand we have _good_, established algorithms
for merging sequences (which includes sequence of lines, or sequence
of words), and for merging special kinds of trees that are
representations of directory structure.  I haven't read link to
mentioned research, but I think that it is still unproven research,
and not something well established and well tested.

Third, it would require embedding knowledge about various programming
languages (including C, shell, Perl, TeX) and document formats
(including XML, HTML, AsciiDoc) in version control system...

> Still, it shouldn't be terribly hard to implement.

So, try to provide us with some proof-of-concept patches, then.
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A better approach to diffing and merging
  2008-12-01 11:41 ` Jakub Narebski
@ 2008-12-02  8:37   ` Karl Hasselström
  0 siblings, 0 replies; 7+ messages in thread
From: Karl Hasselström @ 2008-12-02  8:37 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Ian Clarke, git

On 2008-12-01 03:41:38 -0800, Jakub Narebski wrote:

> Third, it would require embedding knowledge about various programming
> languages (including C, shell, Perl, TeX) and document formats
> (including XML, HTML, AsciiDoc) in version control system...

Really? I was under the impression that you could specify external
diff and merge programs in .gitattributes, which would be precisely
the right way to hook this stuff into git.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-12-02  8:38 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-29 18:12 A better approach to diffing and merging Ian Clarke
2008-11-29 23:40 ` Boyd Stephen Smith Jr.
2008-11-30  2:54   ` Miklos Vajna
2008-11-30  1:56 ` Brian Dessent
2008-12-01  9:54   ` Karl Hasselström
2008-12-01 11:41 ` Jakub Narebski
2008-12-02  8:37   ` Karl Hasselström

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.