From: Brian Dessent
Subject: Re: A better approach to diffing and merging
Date: Sat, 29 Nov 2008 17:56:44 -0800
Message-ID: <4931F2DC.CE9B1E35@dessent.net>
References: <823242bd0811291012g15c4d442qa5d7afc9cc762b20@mail.gmail.com>
To: Ian Clarke
Cc: git@vger.kernel.org

Ian Clarke wrote:

> Provide the merge algorithm with the grammar of the programming
> language, perhaps in the form of a Bison grammar file, or some other
> standardized way to represent a grammar.
>
> The merge algorithm then uses this to parse the files to be diffed
> and/or merged into trees, and then the diff and merge are treated as
> operations on these trees.
> These operations may include creating,
> deleting, or moving nodes or branches, renaming nodes, etc. There has
> been quite a bit (pdf) of academic research on this topic, although I
> haven't yet found off-the-shelf code that will do what we need.
> Still, it shouldn't be terribly hard to implement.

There's a huge flaw in that approach for C/C++: in order to parse C/C++
you have to first preprocess it -- consider the twisty mazes that
#ifdef/#else/#endif can create. But in order to preprocess source code
you need a whole heap of extra information that is not in the repository
(or, if it is, cannot be automatically extracted).

For example, you'd have to know all the -D/-U/-I flags that the makefile
or the user might pass to the compiler. You'd have to replicate the
compiler's complicated header search path algorithm, which can depend on
the directives in the code as well as command line arguments,
environment variables, and values specific to the toolchain. (Don't
forget that you can have code in a repository that's meant to be
cross-compiled and which uses a toolchain that has its own headers and
not the ones in /usr/include.) You'd have to know all the built-in
predefined symbols of that toolchain, e.g. what's the value of
__GNUC_MINOR__ or __GNUC_PATCHLEVEL__, is __mips__ or __i386__ defined,
and on and on.

And of course the natural conclusion of this progression: a change can
be perfectly grammatically correct for one particular
platform/toolchain/setting of CFLAGS, and completely broken for another.
There's no way for a VCS to know any of this; it takes human
comprehension.

If you look at a tool like doxygen that attempts to parse C/C++, it
doesn't actually do full preprocessing, only a very limited subset: it
only expands macros that the user names as relevant in the config file,
and it only preprocesses included headers that match a pathspec the user
provides.
Consequently, it cannot fully parse the code to check that it is
grammatically correct; it can only infer where things appear to be
defined. And it is easily confused: e.g. it will "see" code in both
halves of an #ifdef section if it wasn't told anything about the value
of the macro in the config file, which can cause it to incorrectly think
that a function or variable was defined there when in reality that
section was discarded.

The idea may have value for languages that are easy to parse and do not
have all this preprocessor cruft, but I just don't see how it would be
able to provide anything useful for non-trivial changes to real-world
C/C++, which require human eyes to decipher.

Brian
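
To make the #ifdef point concrete, here is a minimal sketch in C. It is
only an illustration, not code from any real project: the macro
USE_WIDE_IDS and the functions next_id and id_variant are hypothetical
names invented for this example. The header of next_id is selected by
the preprocessor, so the text in the repository does not correspond to a
single parse tree until someone decides whether -DUSE_WIDE_IDS is in
effect.

```c
/* USE_WIDE_IDS is a hypothetical configuration macro. It would normally
 * be set with -DUSE_WIDE_IDS on the compiler command line, which is
 * information the repository itself does not carry. */

/* The function header is chosen by the preprocessor. A parser that
 * skips the directive lines and keeps both branches sees two headers
 * followed by one body, which is not valid C. */
#ifdef USE_WIDE_IDS
unsigned long long next_id(unsigned long long current)
#else
unsigned int next_id(unsigned int current)
#endif
{
    return current + 1;
}

/* Reports which branch the preprocessor selected for this build. */
const char *id_variant(void)
{
#ifdef USE_WIDE_IDS
    return "wide";
#else
    return "narrow";
#endif
}
```

Built without the flag, id_variant() returns "narrow"; built with
-DUSE_WIDE_IDS it returns "wide", and next_id has a different signature.
A tree-based merge tool would have to pick one of the two, or somehow
represent both.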