All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Dessent <brian@dessent.net>
To: Ian Clarke <ian.clarke@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: A better approach to diffing and merging
Date: Sat, 29 Nov 2008 17:56:44 -0800	[thread overview]
Message-ID: <4931F2DC.CE9B1E35@dessent.net> (raw)
In-Reply-To: 823242bd0811291012g15c4d442qa5d7afc9cc762b20@mail.gmail.com

Ian Clarke wrote:

> Provide the merge algorithm with the grammar of the programming
> language, perhaps in the form of a Bison grammar file, or some other
> standardized way to represent a grammar.
> 
> The merge algorithm then uses this to parse the files to be diffed
> and/or merged into trees, and then the diff and merge are treated as
> operations on these trees.  These operations may include creating,
> deleting, or moving nodes or branches, renaming nodes, etc.  There has
> been quite a bit (pdf) of academic research on this topic, although I
> haven't yet found off-the-shelf code that will do what we need.
> Still, it shouldn't be terribly hard to implement.

There's a huge flaw in that approach for C/C++: in order to parse C/C++
you have to first preprocess it -- consider the twisty mazes that
#ifdef/#else/#endif can create.  But in order to preprocess source code
you need a whole heap of extra information that is not in the repository
(or if it is, cannot be automatically extracted.)

For example, you'd have to know all the -D/-U/-I flags that the makefile
or the user might pass to the compiler.  You'd have to replicate the
compiler's complicated header search path algorithm, which can depend on
the directives in the code as well as command line arguments,
environment variables, and values specific to the toolchain.  (Don't
forget that you can have code in a repository that's meant to be
cross-compiled and which uses a toolchain that has its own headers and
not the ones in /usr/include.)  You'd have to know all the built-in
predefined symbols of that toolchain, e.g. what's the value of
__GNUC_MINOR__ or __GNUC_PATCHLEVEL__, is __mips__ or __i386__ defined,
and on and on.  And of course the natural conclusion of this
progression: a change can be perfectly grammatically correct for one
particular platform/toolchain/setting of CFLAGS, and completely broken
for another.  There's no way for a VCS to know any of this, it takes
human comprehension.

If you look at a tool like doxygen that attempts to parse C/C++, it
don't actually do full preprocessing, only a very limited subset: it
only expands macros that the user names as relevant in the config file,
and it only preprocesses included headers that match a pathspec the user
provides.  Consequently it cannot fully parse the code to see if it's
grammatically correct, only to the limited extent that it can infer the
location where things appear to be defined.  And it is easily confused,
e.g. it will "see" code in both halves of an #ifdef section if it wasn't
told anything about the value of the macro in the config file, which can
cause it to incorrectly think that a function or variable was defined
there when in reality that section was discarded.

The idea may have value for langauges that are easy to parse and do not
have all this preprocessor cruft, but I just don't see how it would be
able to provide anything useful for non-trivial changes to real world
C/C++, which require human eyes to decipher.

Brian

  parent reply	other threads:[~2008-11-30  2:38 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-11-29 18:12 A better approach to diffing and merging Ian Clarke
2008-11-29 23:40 ` Boyd Stephen Smith Jr.
2008-11-30  2:54   ` Miklos Vajna
2008-11-30  1:56 ` Brian Dessent [this message]
2008-12-01  9:54   ` Karl Hasselström
2008-12-01 11:41 ` Jakub Narebski
2008-12-02  8:37   ` Karl Hasselström

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4931F2DC.CE9B1E35@dessent.net \
    --to=brian@dessent.net \
    --cc=git@vger.kernel.org \
    --cc=ian.clarke@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.