* Comments on "Understanding Version Control" by Eric S. Raymond
From: Jakub Narebski @ 2009-02-02 18:48 UTC
To: git; +Cc: Eric S. Raymond
Some time ago I found the paper "Understanding Version-Control Systems"
(http://www.catb.org/esr/writings/version-control/version-control.html),
in draft version, by Eric S. Raymond, via Joel Spolsky's
recommendation in the 'Podcast #37' entry on the Stackoverflow Blog
(http://blog.stackoverflow.com/2009/01/podcast-37/).
This is a very nice _survey paper_, currently in development, so there
is still time for comments, corrections and suggestions. It is
currently on hiatus (it seems that the last changes are from Jan 2008,
slightly more than a year ago), but it is not abandoned. ESR plans to
get back to it after Wesnoth and gpsd do their upcoming releases.
In my opinion the most important problem is that the paper
concentrates on "container identity" instead of on the underlying
issue of renames in version control, which includes intelligent,
rename-aware merge; it should talk about the issues and not about one
possible solution. I will concentrate on this issue for now, and
leave for example the issue of workflows, and of VCS history, for
possible later posts (this one is long enough as is).
Below you can find my comments; quoted fragments of the "Understanding
Version Control" essay are prefixed with 'UVC> '. 'TODO' refers to
http://www.catb.org/esr/writings/version-control/TODO.html
Please do participate in this discussion, especially if you have
something to say with respect to the rename detection versus rename
tracking issue. Thanks in advance.
----------------------------------------------------------------------
http://www.catb.org/esr/writings/version-control/version-control.html
(Independent comment)
[...]
UVC> This leads to an awkward case called a cross-merge which tends to
UVC> confuse history-aware merging tools.
First, please remember that history-unaware merging tools, such as the
CVS merge operation (unless you manually specify points in history),
have trouble with the much simpler case of repeated merging into the
same branch.
Second, I think that current DVCSes can deal with criss-cross merges
correctly; for example git with its recursive merge strategy, or the
so-called MarkMerge from revctrl.org [TODO: probably needs a survey of
how other modern merging DVCSes, such as Mercurial, Bazaar or Monotone,
deal with criss-cross merges].
......................................................................
UVC> == Container identity ==
UVC>
UVC> In a VCS with container identity, files and directories have a
UVC> stable internal identity, initialized when the file is added to
UVC> the repository. This identity follows them through renames and
UVC> moves. This means that filenames and directories are versioned,
UVC> so that it is possible (for example) for the VCS to notice while
UVC> doing a merge between branches that two files with different
UVC> names are descendants of the same file and do a smarter merge of
UVC> their contents.
This whole subsection in my opinion confuses the goal (rename-aware
merge) with the means (container identity). It should concentrate on
intelligent merging after renames or moves, not on the possible ways
to do this. It should list what a modern VCS is expected to do, and
_then_ enumerate possible ways to do it.
But the most important thing is that "container identity" is the wrong
idea to have: a wholesale rename (or copy) of a file is just a special
case of more generic code movement and copying.
UVC>
UVC> Container identity can be implemented by giving each file in the
UVC> repository a 'true name' analogous to a Unix inode,
UVC> Alternatively, it can be implemented implicitly by keeping all
UVC> records of renames in history and chasing through them each time
UVC> the VCS needs to check what file X was called in revision Y.
If I understand correctly, 'file-ids' is the way the original Arch (and
its descendants) did this, and how Bazaar deals with file and directory
renames, while 'tracking renames' is what Mercurial does (a bit
incompletely).
But there is _another_ way to have rename-aware merge, and it is how
Git does it. Git uses heuristic similarity based _rename detection_
to find how to do a merge in presence of renames. This works quite
well _in practice_ (but it makes it harder to _test_ this solution).
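To make the idea of similarity-based rename detection concrete, here
is a minimal sketch in Python. It is NOT Git's actual algorithm: the
function name detect_renames, the use of difflib as the similarity
measure, and the 0.5 threshold are all assumptions made for
illustration. It pairs each deleted path with the most similar added
path, and records a rename only if the similarity is high enough.

```python
from difflib import SequenceMatcher


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair deleted paths with added paths whose contents are similar.

    old_tree and new_tree map path -> file contents (str).
    Returns a dict {old_path: new_path}. The threshold is an
    illustrative assumption, not a value taken from any real VCS.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}

    # Score every (deleted, added) pair; best matches come first.
    candidates = sorted(
        ((SequenceMatcher(None, old, new).ratio(), src, dst)
         for src, old in deleted.items()
         for dst, new in added.items()),
        reverse=True,
    )

    renames, used = {}, set()
    for score, src, dst in candidates:
        if score < threshold:
            break                      # remaining pairs are even less similar
        if src in renames or dst in used:
            continue                   # each path takes part in one rename only
        renames[src] = dst
        used.add(dst)
    return renames


old = {"util.c": "int add(int a, int b)\n{\n    return a + b;\n}\n",
       "README": "hello\n"}
new = {"lib/math.c": "int add(int a, int b)\n{\n    return a + b;  /* sum */\n}\n",
       "README": "hello\n"}
renames = detect_renames(old, new)
print(renames)
```

Nothing about the rename is stored in either tree; the pairing is
recomputed from the two snapshots each time it is needed.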
Those three ways of dealing with renames have the following advantages
and disadvantages:
1. Need to use VCS to do file and directory renames and copies.
Both 'file-ids' and 'rename tracking' have this limitation.
This means that you cannot use your favorite file manager, or an IDE
that doesn't have support for your VCS, or patches to move and
copy files. This might be mitigated a bit by doing rename
detection at commit time (Mercurial has this ability via
hg-addremove, but it is not automatic), but that means freezing
the results of the current algorithm; with rename detection you can
always take advantage of improved rename detection.
2. Ability to improve and correct errors in rename information.
Both 'file-ids' and 'rename tracking' make it hard to correct
errors in rename info, for example if you incorrectly marked a file
as a copy or rename, or forgot to mark it as one. If
'file-ids' info is stored separately from history, it is less of an
issue, but I don't see how you can correct erroneous rename
information in the 'rename tracking' solution short of rewriting
history.
In the case of the 'rename detection' solution that Git uses it is
a non-issue, as no rename information is stored.
3. Dealing with independently (in separate branches) added files.
(This issue can be found in the "Test suite" section of TODO.)
Most 'file-ids' solutions have a problem if you add the same file
independently on different branches (e.g. adding a file via a patch).
First, one of the histories must vanish; second, you have to repeat
the resolution of the file-id conflict for every repeated merge.
According to TODO, the 'rename tracking' solution used in Mercurial
doesn't have this problem, and of course Git's 'rename detection'
doesn't have it either.
Note that there are different versions of this issue: same name
and same contents, different name and same contents (rename without
ancestor), and same name and different contents. The last case is an
issue with automatic file-ids if they depend on the file name; the
first case is an issue with automatic file-ids if they depend on
branch/committer/time, i.e. if file-ids differ on different
branches. Note: same contents here might mean _almost_ the same
contents.
4. Creating new files in renamed directories.
Usually called support for directory renames; the problem arises if
one side (on one branch) renames some directory, and the other side
(the other branch) creates new files in the old-name directory.
For the 'file-ids' solution this means that there have to be
'file-ids' ('inodes') for directories as well; Bazaar has this
feature. For the 'rename tracking' solution this means that rename
information for directories has to be stored somewhere; from what I
understand Mercurial, with its filename-hashed storage and per-file
stored rename information, doesn't have this feature. For both the
'file-id' and 'rename tracking' ('rename info') solutions this, I
think, usually means that you have to do directory renames using VCS
tools.
While at first glance it might seem that the 'rename detection'
solution (like the one used in Git) cannot deal with this problem,
that is not true. A VCS employing similarity-based rename detection can
detect a wholesale directory rename based on the pattern of file
renames, and can put the new file in the new-name directory. Moreover
it should be able to deal quite sanely with more complicated cases
like splitting or merging of directories. _However_ Git currently
does not support this case (but see the note below), although there
were some preliminary patches adding 'wholesale directory rename
detection', so it is not purely theoretical.
Note: usually you cannot simply move a file to a new directory without
any other changes, so automatically creating a new file (from the
point of view of the branch we merge into) in the new-name directory
instead of the old-name directory might not be a good idea without
stopping for manual merge conflict resolution, as it could result in
a semantic conflict.
5. Misdetection of file renames, and remembering manual corrections.
(This problem concerns only the 'rename detection' solution.)
Of course the 'rename detection' algorithm is not perfect. It can find
a rename when there isn't any if many files consist mainly of
identical boilerplate (e.g. copyright, license, etc.). It can fail
to detect a rename if files differ too much (usually such files
cannot be merged automatically anyway), or if files are too
small (as is usual in simple test cases). In the presence of
multiple file renames and copies it can assign a move or copy source
to the wrong file.
Closely related to this issue is the problem of remembering manual
corrections to the rename detection algorithm, and manual hints (like
in the 'rename tracking' solution). Both were discussed on the git
mailing list, the former under the name of having git-rerere2,
remembering (and reusing) recorded resolutions of tree-level
conflicted merges, but there was (if I remember correctly) no
conclusion and there were no patches.
6. Performance bottlenecks in managing renames during merge.
I don't know if it is a problem for the 'file-id' solution (Bazaar), or
for the 'rename tracking' solution (Mercurial), but with 'rename
detection' (Git), if there was a lot of reorganization of the directory
hierarchy between merge points, then the merge can take a lot of time.
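The directory-rename case from point 4 can be sketched as follows.
This is only an illustration of inferring a directory rename from the
pattern of file renames; it does not describe any existing Git code,
and the names infer_directory_renames and relocate, as well as the
"most files win" voting rule, are assumptions of this sketch.

```python
from collections import Counter
import posixpath


def infer_directory_renames(file_renames):
    """Guess directory renames from a {old_path: new_path} mapping.

    A directory is taken to be renamed to whichever new directory
    received most of its files (hypothetical voting rule). Only files
    that kept their basename are counted.
    """
    votes = {}
    for old, new in file_renames.items():
        old_dir, old_name = posixpath.split(old)
        new_dir, new_name = posixpath.split(new)
        if old_dir != new_dir and old_name == new_name:
            votes.setdefault(old_dir, Counter())[new_dir] += 1
    return {d: c.most_common(1)[0][0] for d, c in votes.items()}


def relocate(path, dir_renames):
    """Map a file added under an old directory name to the new name."""
    directory, name = posixpath.split(path)
    return posixpath.join(dir_renames.get(directory, directory), name)


# One branch moved src/ to lib/; the other branch added src/new.c.
file_renames = {"src/a.c": "lib/a.c", "src/b.c": "lib/b.c"}
dir_renames = infer_directory_renames(file_renames)
print(relocate("src/new.c", dir_renames))
```

As the note under point 4 says, a real tool would probably want to
stop and ask instead of relocating the file silently.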
UVC>
UVC> Absence of container identity has the symptom that file
UVC> rename/move operations have to be modeled as a file add followed
UVC> by a delete, with the deleted file's history magically copied
UVC> during the add.
I don't know if this paragraph was meant to be about another issue
with renames in VCS, namely dealing with renames and copies during
history browsing (which includes both the log of changes, and tracking
line-wise file history, a.k.a. annotate/blame/praise).
1. Rename-aware changelog (commit logs).
1.1. Showing renames in "<scm> log", i.e. in whole project history.
The only complication with rename detection here is choosing a
level of rename and copy detection, as it might be CPU
intensive. In Git those are 'detect renames', 'detect copies as
well as renames' and 'find copies harder'. That can, of course,
be configured.
1.2. Following renames in "<scm> log <filename>", i.e. in single file
history.
Currently Git does not support it very well. There is a '--follow'
option for git-log, but it is more of a hack to have something
similar to what for example Subversion provides than a full
solution. It works for simple histories, and might fail for more
complicated ones; it is not however a fundamental limitation. One
can always use "git log -- <old name> <new name>"...
On the other hand, for developers single-file history should be a
second-class citizen in a VCS supporting changesets; full history
is _more_ than the sum of single-file histories.
2. Rename-aware line-wise file history (which means annotating file
with history of each line, something like "cvs annotate").
Note that line-wise file history cannot deal with _deletions_;
it is a known limitation of this tool.
2.1. Following wholesale file renames and copying.
The issue is not stopping at a file rename when tracking where a
given line came from... and of course representing this
information in blame/annotate output. git-blame supports this,
and I think other VCSes (Bazaar, Mercurial) do as well.
[TODO: check this]
2.2. Following code movement and copying.
You can request that git-blame detect lines moved within the file
(e.g. moving code around), and detect lines copied from other
files (code moved across files). I think it is (together with the
ability to ignore changes in whitespace in blame/annotate)
currently (?) a feature unique to Git; it also shows that the idea of
"container identities" is limited and narrow-minded, and that one
should think of file renames as a special case of code
movement.
[TODO: perhaps a screenshot of "git gui blame" or equivalent?]
[TODO: some stats about file renames and other code movement]
UVC>
UVC> Usually VCSes that lack container identity also create parent
UVC> directories on the fly whenever a file is added or checked out, and
UVC> you cannot actually have an empty directory in a repository.
In my opinion this is a fundamentally unrelated issue. The question
of whether directories are created and deleted on demand is a
behavioral issue; note that _not_ having directories deleted on the
fly results in a higher probability of file/directory conflicts.
Git is a bit peculiar in this case: empty directories _can_ be
represented in the repository, but currently cannot be represented in
the (flat) staging area, a.k.a. the index, although that shouldn't be
too hard to add, and they are removed on the fly, which again
shouldn't be too hard to change.
Note that you can track empty directories in Git (even though the VCS
adds and removes them on the fly) by the trick of putting, for
example, an empty '.gitignore' file in each of them.
UVC> == Snapshots vs. changesets ==
UVC>
UVC> There are two ways to model the history of a line of
UVC> development. One is as a series of snapshots of an evolving tree
UVC> of files. The other is as a series of changesets transforming
UVC> that tree from a root state (usually empty) to a tip state.
Git is snapshot-based (although the packed format uses [binary] delta
compression). Mercurial is changeset-based. Bazaar uses a different
representation altogether, I think[1] (it used to use weave); the wiki
says that it is snapshot-oriented, but it has file-ids.
[1] http://bazaar-vcs.org/BazaarFormats is a bit lacking in details
[...]
UVC> Changeset-based systems have some further distinctions based on
UVC> what kinds of data a changeset carries. At minimum, a changeset
UVC> is a group of deltas to individual files, but there are
UVC> variations in what kind of file-tree operations are represented
UVC> in changesets.
I think the most famous _example_ here is Darcs, with the _idea_ of
operations other than simple text delta, like the "rename this
variable to that name" example in its documentation, but I don't know
if it actually went beyond hand-waving.
Note that the patch commutation algebra of Darcs (the most
pure-changeset VCS there is, in my opinion) might look like a neat
idea, but please remember that Darcs had (and perhaps still has)
exponentially bad merge performance in some cases.
UVC>
UVC> Changesets which include an explicit representation of
UVC> file/directory moves and renames make it easy to implement
UVC> container identity. (Container identity could also be implemented
UVC> as a separate sequence of transaction records running parallel to
UVC> a snapshot-sequence representation, but I know of no VCS that
UVC> actually does this.)
[TODO: check how Bazaar actually does this, as it is example of VCS
with 'file-ids', also for directories; check where Mercurial stores
'rename tracking' information].
[...]
UVC> Snapshot and changesets are not perfectly dual
UVC> representations. It took a long time for VCS designers to notice
UVC> this; the broken symmetry was at the core of a well-known
UVC> argument between the designers of Arch and Subversion in 2003,
UVC> and did not begin to become widely understood until after Martin
UVC> Pool's 2004 essay "Integrals and Derivatives"[2]. Pool, a
UVC> co-author of bzr, correctly noted that attempts to stick with the
UVC> more intuitive sequence-of-snapshots representation have several
UVC> troubling consequences, including making container identity and
UVC> past merges between branches more difficult to track.
UVC>
UVC> [2] http://sourcefrog.net/weblog/software/vc/derivatives.html
Below is an excerpt from the OLD "Integrals and Derivatives" essay
(I hope I have chosen the relevant part of this essay).
ID> Working in terms of changesets, or at least having the option to
ID> do so allows more powerful operation.
ID>
ID> For example, consider repeated merges among a related set of
ID> trees. Arch and Darcs [which work primarily in the changeset
ID> domain] handle this well, because they can easily remember which
ID> changesets have already come across. Subversion and CVS tend to
ID> handle it poorly, because merely tracking which version from the
ID> other tree has merged doesn't really capture the right
ID> information.
This is, as we now know, _wrong_. Subversion and CVS handle this
poorly not because they are snapshot-based (CVS most certainly is not:
it is file-level delta based), but because they do not store merge
information (which revisions were merged to form a merge commit,
i.e. all parents of a merge commit): they lack merge tracking.
Therefore it is not possible to find common ancestors (merge bases),
because there is not enough information. _Not_ because of being
snapshot-based.
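The role of recorded merge parents can be shown with a small sketch.
It is a naive illustration, not how any particular VCS implements it;
the functions ancestors and merge_bases and the toy history are
assumptions of this example. With all parents recorded, the merge
bases of a criss-cross history are the two most recent common
commits; with only first parents recorded (no merge tracking), the
search falls all the way back to the root.

```python
def ancestors(parents, commit):
    """Return the set containing commit and all of its ancestors."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, ()))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another
    common ancestor (naive quadratic version, for illustration)."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}


# commit -> list of parents; M1 and M2 form a criss-cross merge.
with_merge_info = {
    "root": [],
    "A": ["root"], "B": ["root"],
    "M1": ["A", "B"],    # first branch merged B
    "M2": ["B", "A"],    # second branch merged A
}
# The same history if only the first parent of each commit is recorded.
first_parent_only = {c: p[:1] for c, p in with_merge_info.items()}

print(sorted(merge_bases(with_merge_info, "M1", "M2")))
print(sorted(merge_bases(first_parent_only, "M1", "M2")))
```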
Additionally, it is now I think universally acknowledged that three-way
merge (with a bit extra to deal with possibly multiple merge bases,
e.g. in the presence of criss-cross merges) is the superior merge
algorithm; it is certainly superior to the "reapply patches, skipping
those already present" algorithm that the quoted example seems to imply
Arch and Darcs use (which is available in modern DVCSes as well,
either as some patch management / patch queue extension, or as
git-rebase, Mercurial's transplant extension, or Bazaar's graft
extension).
[...]
UVC> = What, if anything, have we learned from history? =
UVC>
UVC> There's a folk saying that "It's not what you don't know that
UVC> hurts you, it's what you think you know that ain't so." In
UVC> examining the pattern of development of VCSes, it seems to me
UVC> that the this sub-field of computer science has been less
UVC> hampered than most by difficulties in finding appropriate
UVC> techniques, but more hampered than most by wrong assumptions that
UVC> hung on far longer than they should have.
UVC>
UVC> First wrong assumption: Conflict resolution by merging is
UVC> intractably difficult, so we'll have to settle for locking. It
UVC> took at least fifteen and arguably twenty years for VCS designers
UVC> to get shut of that one. But it's historical now.
I'm not sure if it has a place here, i.e. if it was a wrong assumption
or just lack of thought, but I would emphasize that the
commit-before-merge workflow is much, much better than
merge-before-commit (or update-before-commit, as it is an _implicit_
merge that might need to be performed).
One of the more important tasks of a VCS is to not lose your changes;
merge-before-commit does not fulfill this task completely, in my
opinion, especially nowadays with networked large-group collaboration.
?merge strategies
UVC>
UVC> Second wrong assumption: Change history representation as a
UVC> snapshot sequence is perfectly dual to the representation as
UVC> change/add/delete/rename sequences.. This folk theorem is well
UVC> expressed in the 2004 essay "On Arch and Subversion"[3]. It is
UVC> appealing, widely held, and dead wrong.
UVC>
UVC> File renames break the apparent symmetry. The failure of
UVC> snapshot-based models to correctly address this has caused
UVC> endless design failures, subtle bugs, and user misery.
That is not true. The example of snapshot-based Git, which with its
rename detection deals very well in practice with file renames,
contradicts this theory. Bazaar, which is supposedly snapshot-based yet
supports "container identities" ('file-ids'), contradicts it further.
The symmetry might be broken _only_ if there are other operators in
changesets than simple delta. But one has to remember that if
representation is not equivalent to snapshot-based in the sense that
you can do a (3-way) merge based on endpoints (branches to be merged
and common ancestors aka. merge bases) _only_, and one has to take
_whole_ history into account when merging, then there is high
probability of cases where merge performance suffers badly. Please do
not say that performance doesn't matter, because it does; if operation
takes minutes (or more) rather than seconds, then this would affect
workflow used, as people try to avoid unpleasant operations...
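A small sketch of the duality for plain deltas, under the assumption
that a changeset is nothing more than "new contents or deletion per
path" (the names diff and apply_change are made up for this example).
A changeset can be derived from two snapshots, and applying it to the
first snapshot gives back the second. A rename shows up only as a
delete plus an add; a dedicated "rename" operator would be extra
information that the endpoints alone do not contain.

```python
def diff(old, new):
    """Derive a changeset from two snapshots (dicts path -> contents).

    The changeset maps path -> new contents, or None for a deletion.
    """
    change = {p: new[p] for p in new if old.get(p) != new[p]}
    change.update({p: None for p in old if p not in new})
    return change


def apply_change(snapshot, change):
    """Apply a changeset to a snapshot, returning the next snapshot."""
    result = dict(snapshot)
    for path, content in change.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


v1 = {"a.txt": "one\n", "b.txt": "two\n"}
v2 = {"a.txt": "one\n", "c.txt": "two\n"}   # b.txt was renamed to c.txt

change = diff(v1, v2)
print(change)
print(apply_change(v1, change) == v2)
```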
UVC>
UVC> Practically speaking, failure to address this broken symmetry
UVC> goes a long way towards explaining why CVS became such a
UVC> disaster. But the damage didn't end there, which is why I'm
UVC> courting controversy by pointing out that it underlies a debate
UVC> about third-generation designs that is still live today. Should
UVC> VCSes be purely content-addressable filesystems (the Mercurial
UVC> and git approach) or should they have container identity (as in
UVC> Arch, monotone, and bzr)?
UVC>
UVC> That debate is not over, but at least VCS designers are grappling
UVC> with it now.
CVS was a disaster not because of a (supposedly) broken symmetry between
snapshots (snapshot-based operations) and changesets; it was a disaster
because its per-file versioning roots were very visible in: non-atomic
commits, no support for changesets, heavyweight (badly performing)
branching and tagging, and complete lack of support for renaming files.
I agree that the issue of content-addressed filesystem versus
container identities (file-ids) is an important one, but I would say
that neither has won over the other...
UVC>
UVC> I have a guess about the third wrong assumption. I think it goes
UVC> something like this: The correct choice of abstractions,
UVC> operations, and containers in a VCS is the one that makes the
UVC> cleverest sorts of data-shuffling possible. I suspect that, as
UVC> our algorithms get better, we're going to find that the best
UVC> choices are not the most theoretically clever ones but rather the
UVC> ones that are easiest for human beings to intuitively model.
Here I think the Git model wins hands down: a DAG (Directed Acyclic
Graph) of commits (Monotone had it first), and a clear relation between
objects: a commit object contains metadata (like authorship and commit
message), links to zero (root / initial commit), one (usual case), or
two or more (merges) parent commits, and a link to a snapshot of the
state of the project at the given version... Branches and tags (and
remote-tracking branches) exist outside the DAG of commits, in
separate namespaces; not like that disaster of conventions for
imitating branches and tags in Subversion. The current branch is just
a pointer to one of the branches.
Having delta compression "under the hood" in the packfiles allows
combining the efficiency of deltas for repository size and network
transfer with the very clear model of a snapshot-based VCS.
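The model described above is small enough to write down. The sketch
below is a toy in-memory version, not Git's object format: the Commit
class, its field names, and the example history are assumptions made
for illustration. It shows commits linking to parents and a snapshot,
with branches, tags and the current branch kept as plain names
outside the graph.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Commit:
    message: str
    author: str
    tree: Tuple[Tuple[str, str], ...]     # snapshot as (path, contents) pairs
    parents: Tuple["Commit", ...] = ()    # 0 = root, 1 = usual, 2+ = merge


root = Commit("initial", "alice", (("README", "hi\n"),))
c1 = Commit("add feature", "alice",
            (("README", "hi\n"), ("f.c", "x\n")), (root,))
c2 = Commit("fix typo", "bob", (("README", "hello\n"),), (root,))
merge = Commit("merge", "alice",
               (("README", "hello\n"), ("f.c", "x\n")), (c1, c2))

# Branches and tags live outside the DAG: they only name commits.
branches: Dict[str, Commit] = {"master": merge, "topic": c2}
tags: Dict[str, Commit] = {"v0.1": root}
head = "master"                           # current branch: a pointer to a branch

print(len(branches[head].parents))
```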
UVC>
UVC> This is why, even though I find constructions like today's
UVC> elaborate merge theory[4] fascinating to think about. I'm not
UVC> sure it is actually going anywhere useful. Naive merge algorithms
UVC> with poor behavior at edge cases may actually be preferable to
UVC> more sophisticated ones that handle edge cases well but that
UVC> humans have trouble anticipating effectively.
Nowadays even Bram Cohen of Codeville merge agrees[citation needed]
that 3-way merge, which works in most cases and when it fails is easy
to understand, is better than advanced merge strategies which can
merge a few more cases automatically, but when they fail make it
difficult to resolve conflicts; note that failure includes a wrongly
resolved merge (the merge strategy reports no conflicts, but the merge
resolution is wrong).
UVC>
UVC> An even more interesting question is this: what are the fourth,
UVC> and nth wrong assumptions --- the ones we haven't noticed we're
UVC> making yet?
UVC>
UVC> [3] http://www.reverberate.org/computers/ArchAndSVN.html
UVC> [4] http://revctrl.org/CategoryMergeAlgorithm
Well, Linus probably would say that concentrating on special case of
file and directory renames instead of more generic code movement is a
wrong idea.
But what do YOU think?
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
From: Theodore Tso @ 2009-02-02 20:24 UTC
To: Jakub Narebski; +Cc: git, Eric S. Raymond
On Mon, Feb 02, 2009 at 07:48:53PM +0100, Jakub Narebski wrote:
> In my opinion the most important issue is concentrating on "container
> identity" instead of on the underlying issue of renames in version
> control, which includes intelligent, rename-aware merge; talk about
> issues and not about possible solution. I will concentrate on this
> issue for now, and leave for example issue of workflows, and of VCS
> history for possible later posts (it is long enough as is).
This was discussed in no small amount of detail on the mailing list
uvc-reviewers, which used to be hosted here:
http://thyrsus.com/mailman/listinfo/uvc-reviewers
Unfortunately, it looks like Eric has taken down his mailman instance
on thyrsus.com. I have personal archives of the list, and the list
used to have public archives, so I don't feel any hesitation about
sharing it with interested parties.
> Below you can find my comments; quoted fragments of "Understanding
> Version Control' essay are prefixed with 'UVC> '. 'TODO' refers to
> http://www.catb.org/esr/writings/version-control/TODO.html
>
> Please do participate in this discussion, especially if you have
> something to say with respect to rename detection versus rename
> tracking issue. Thanks in advance.
Heh. A lot of this has been said already. I think one of the reasons
why Eric kept things short in his paper, and did *not* say a lot about
whether or not container identity tracking was fundamentally needed or
not was because we didn't come to any real consensus on the
uvc-reviewers mailing list. I believe it is extremely difficult to do
so given that it's very hard to avoid the slippery slope of advocating
for one SCM system versus another.
I'll include some of my writings on the subject from the uvc-reviewers
mailing list so folks can see where some of this discussion went last
time... (All of this dates from January, 2008, when Eric was last
aggressively updating the paper in question.) BTW, when I referred to
SCM's being a horrible hack and "guessing" and "fit only to be used by
amateurs" if they didn't record function-level identity tracking,
there were those who were seriously arguing that any SCM (i.e., like
git) that didn't track container identity was fundamentally a "hack".
Yes, there are people who seriously take that view, some of whom were
very bitter that their DSCM didn't win the market/popularity wars, and
saw their pet projects overtaken by SCM's such as git, describing
$THEIR_PET_SCM_WITH_PROVABLY_CORRECT_SEMANTICS as Betamax, and git as
VHS. The argument that without rename-tracking, if git was used to
develop software for an Air Traffic Control application, airplanes
could be dropping out of the sky was also made by these advocates, no
kidding. (So was the argument that using a DSCM that didn't do
container identity tracking might be considered Programming
Malpractice.)
So be careful about wanting to reopen this discussion; if some of
the wrong people join in, you may be very sorry! :-)
- Ted
> Here then are some types of identity
> tracking one might imagine:
>
> * File identity tracking: tracks the identity of a file through
> renames and moves.
> * Simple file content tracking: tracks the identity of content
> using adds and deletes within a single file. (Note, there is a
> question that could be asked here about the resolution of the
> tracking. Most current systems that track do so on a line-by-line
> basis, but one could imagine tracking bytes. I wont say any more
> about this in this email.)
> * Movement of content within a file: tracks the identity of
> content within a file when lines are moved.
> * Movement of content between files: tracks the identity of
> content when lines are moved between files.
One obvious one which isn't in this list is "Directory Identity
Tracking", that is when you move a directory, new files which are
created in one branch at the original directory location will be moved
when you merge with another branch where the directory has been moved.
In private conversations with Tom Lord, he tells me that he had also
played with the concept of "Function/variable (more generally,
programming structure identifier) identity tracking". That is,
suppose you had an editor like Eclipse which has as a primitive,
"Rename Java identifier (class/method function/variable)", and this
information was passed into the SCM so it could be tracked. Then in
one branch, a Java identifier could be renamed, and then in another
branch, the use of that same Java identifier could be added in 20
different places --- and since the SCM knows, at a deep semantic
level, that a rename had taken place in Branch A, when it is merged
with Branch B, it could DTRT and change the newly added uses of the
renamed identifier to its new name.
And like with Directory Identity Tracking, it's not hard to come up
with scenarios where without this level of tracking, something
horribly wrong could take place as a result of the SCM not tracking
function identities and using them when doing merges. At the very
least, the program would fail to compile, and if the example involved
an Air Traffic Control system and multiple function renames taking
place, you could even come up with a contrived horror scenario where
planes would be falling out of the sky --- that is, if you ignore
regression testing, and simple coding practices that would prevent
something like this from happening.
Of course, the flip argument by people who are trying to promote their
brand-spanking new SCM that did function identity tracking (FIT) would
be that FIT is critical since SCM's are all about ACCOUNTING, and
without FIT, systems that try to do merges are just GUESSING, and
*obviously* a system which did FIT is far superior to a SCM that
didn't; in fact, a SCM that didn't do FIT is just a Horrible Hack Done
By Amateurs. Furthermore, using an SCM without this feature would
(according to promoters of this hypothetical new SCM) be Programming
Malpractice.
And if this sounds silly, I'm just repeating the exact same arguments
that proponents of systems like arch, Bk, et al., which store the user
intention information of file and directory renames, have recently
advanced against git since it doesn't store this sort of information.
(It may reconstruct rename information in a lazy fashion when it is
needed, but it doesn't store it.) But if file and directory renames
is a type of user intention which MUST be stored in order for an SCM
not to be a hack, why not function, variable, and class renames? That
too would be another type of user intention.
- Ted
---------------------------
> The second could be called "location". Which file should this patch
> be applied to? Which lines within a file should this hunk be applied
> to? I argue in [2] that Darcs does strictly better at the task of
> location than do SVN or GNU diff3. (I think that SCCS, BK and CDV do
> as well, but I don't understand them well enough to be sure.) I
> argue that Darcs does strictly better in the sense that its answers
> to the location question are often better and never worse, and that
> it does so *not* by having a more sophisticated heuristic or by
> getting lucky more often, but by a simple, provably-correct algorithm
> which uses valuable information that other algorithms overlook.
How are you defining "provably correct"? In order to show
correctness, you need to define what correctness means.
One approach is that you force the user to tell you --- and if you are
in the middle of applying a series of 500 patches, you throw up the
Annoying GUI Dialog Box which stops the application of patches dead in
its tracks, and force the user to confirm whether this is a rename, or
a delete followed by an add of remarkably similar content. Or, if a
patch removes all the files from one directory and creates them all
in another directory, whether the user's intention was a directory
rename, and the SCM records it as such. Here, you are *assuming* that
what the user tells you is correct, and that's part of the lemma you
use for proving correctness. If the user, who is seriously annoyed at
the popup boxes, says, "Yeah, yeah, yeah", and dismisses the dialog
box without changing the defaults (which were selected via a
heuristic and which were wrong), well, it's not the SCM's fault since
the user told it what it wanted, and the user was wrong. GIGO.
In your case, you're saying that Darcs is using "valuable information"
that other algorithms overlook. OK, so the Darcs people were more
clever about designing a heuristic which tries to approximate user
intention, and having designed the heuristic which uses said "valuable
information" you can prove whether or not Darcs' algorithm correctly
implemented said heuristic.
But at the end of the day, it's still a heuristic, and the use of
words "provably correct" is just a semantic trick. Even svn's lack of
directory rename support could be considered "provably correct", if
the definition of "correct" is an algorithm which deterministically
creates new files in their original location, even if all other files
in that directory were deleted and new files with the same name and
same content were created somewhere else. It's still an algorithm,
and you can prove whether or not it meets its design specs, but to the
extent that it is less likely to approximate user intentions, people
would say that svn might be less useful in such cases than some other
SCM.
> Let's be careful not to lump these three things together, say "All
> merging involves guessing.", and thus overlook the interesting fact
> that some merge algorithms involve strictly more guessing than do
> others.
Part of the problem is that words like "guessing" and claims of some
algorithm being "provably correct" are basically marketing words.
They are generally used to denigrate one SCM, and promote someone's
favorite SCM as being **better**.
Fundamentally, the goal of merging is to Do The Right Thing --- from a
semantic point of view, which means that the user's intentions are
what's important at the end of the day. The question then is whether
you record the user's intentions, or try to determine them from a
heuristics point of view.
The people who claim that recording the user's intentions is superior
will claim that you can never know for sure what the user meant, so
you have to ask him or her to provide that information. In some cases
that's relatively easy; you require the user to use commands like "bk
mv" and "bk cp" and "bk rm" which not only perform the function, but
also records the user's intention. Unfortunately, if you are applying
a patch, and the patch file hasn't been enhanced to carry this kind of
information, you have to use heuristics and then somehow get the user
to confirm them --- hence the use of the Annoying Popup Dialog Box.
In other cases, you can't determine it easily, such as the "rename a
Java method function" case, unless you have a specialized editor which
has this as a primitive, or, alternatively, even more Annoying Dialog
Boxes that pop up as you try to commit a change. Once you have
recorded this information, using it in the merge is relatively easy
--- or if not easy, at least relatively easy to specify and then show
whether or not the information was used correctly.
What then tends to happen is that people whose SCM does one kind of
user intention recording (such as file renaming) will use this as a
huge club to say their system is better than another system, and that
a system which doesn't record this information is "Guessing". They
will also say that their system is "Provably Correct". So both of
these words are really Red Meat Marketing words, which get used when
people try to say that Their SCM Is Superior.
On the flip side, the people who don't do any recording of this
information will point out that trying to record the information is
hard, especially when changesets go through a lossy medium, such as
patch files which are e-mailed around, which doesn't record this kind
of user intention. This may or may not be applicable for a particular
project, but for some projects, it is extremely important, especially
for those where e-mail is the primary communication channel and how
patches get reviewed and passed around.
The other point which people on "we don't record user intentions; we
just record content" tend to use is that you can always add more
heuristics later, but in practice, if you didn't record the user
intention when you made the commit, it's almost impossible to add it
later. So for example, suppose git, which doesn't have function
rename support today, has a new merge engine added later which works
specifically for Java files that correctly intuits and deals with
method function renames. (Maybe some crazy Java programming
methodology does this all the time, so people get motivated to write
such a thing.) A SCM which works on recorded user intentions will
need to add that support, in a way which doesn't break backwards
compatibility of their distributed repository, *OR* accept the fact
that for function renaming, it will have to use heuristics that are
run at merge time.
So this seems to be fundamentally a tradeoff. The two *objective*
things that can be said are how many user intentions are recorded at
commit time (file rename? directory rename? function/variable
rename?), and what sort of information is used at merge time via
heuristics to determine user intention. And if you want to call those
heuristics an "algorithm" because it sounds more mathy and provable,
sure, whatever. The fact is, the algorithm is still an approximation
on trying to determine user intent, with the goal of making the merge
do the right thing from a Semantic Point of View.
* * *
From a project point of view, how often you actually *do* merges is an
interesting question. If merges for which complicated user intention
tracking is necessary don't happen very often, maybe focusing on
this issue to the extent that SCM geeks tend to do isn't very
productive. In my opinion, the overall usability of the system is
*far* more important. In the OSS world, every project is competing
for programmers, and if the tool is easier to use, the more likely it
will be that you can get people to contribute to your project. Even
if they aren't bright enough to work on the core algorithms for your
project, they could at least improve the documentation. That was what
drove my decision to switch to Mercurial back in 2005. Last year, I
moved my projects to git because of various features that worked well
with my development workflow and which generally improved programmer
productivity. Whether or not file or directory renames were being
tracked in the SCM had *no* bearing on my decision to switch, because
merges happen rarely, and I have an extensive regression test suite
which I run after almost every commit, and *definitely* after every
merge. So if a merge doesn't do the right thing, that's OK; I'll fix
it up, and use "git commit --amend" to correct the merge commit.
And yet, people seem to focus on recording of user intention because
it reflects some holy grail of Perfection and Correctness. And maybe,
because it is easily measurable, whereas usability and improving
programmer productivity are inherently more subjective measures.
What's very sad are the people who feel profoundly hurt that they
spent a huge amount of their life working on SCM Correctness, only to
find that people chose other SCM's based on other metrics and other
issues other than the one that they felt was most important.
Unfortunately there's not a lot that can be done about that.
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-02 20:24 ` Theodore Tso
@ 2009-02-02 20:35 ` Eric S. Raymond
2009-02-03 20:57 ` Jakub Narebski
2009-02-04 2:04 ` Jakub Narebski
1 sibling, 1 reply; 21+ messages in thread
From: Eric S. Raymond @ 2009-02-02 20:35 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jakub Narebski, git
Theodore Tso <tytso@MIT.EDU>:
> Unfortunately, it looks like Eric has taken down his mailman instance
> on thyrsus.com. I have personal archives of the list, and the list
> used to have public archives, so I don't feel any hesitation
> sharing it with interested parties.
Oops! I think it got lost during a system upgrade. I should be able
to restore it and the archives, now you've reminded me.
The paper isn't dead, by the way, I've just been buried in other stuff
the last year or so. It's been on my mind lately to resume work on it.
--
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-02 20:35 ` Eric S. Raymond
@ 2009-02-03 20:57 ` Jakub Narebski
0 siblings, 0 replies; 21+ messages in thread
From: Jakub Narebski @ 2009-02-03 20:57 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: Theodore Tso, git
On Mon, 2 Feb 2009, Eric S. Raymond <esr@thyrsus.com> wrote:
> Theodore Tso <tytso@MIT.EDU>:
> > Unfortunately, it looks like Eric has taken down his mailman instance
> > on thyrsus.com. I have personal archives of the list, and the list
> > used to have public archives, so I don't feel any hesitation
> > sharing it with interested parties.
>
> Oops! I think it got lost during a system upgrade. I should be able
> to restore it and the archives, now you've reminded me.
By the way, Mercurial repository of "Understanding Version Control"
sources doesn't work either: http://thyrsus.com/hg/uvc returns
404 Not Found error.
What is strange is that http://thyrsus.com works...
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-02 20:24 ` Theodore Tso
2009-02-02 20:35 ` Eric S. Raymond
@ 2009-02-04 2:04 ` Jakub Narebski
2009-02-04 23:54 ` Theodore Tso
1 sibling, 1 reply; 21+ messages in thread
From: Jakub Narebski @ 2009-02-04 2:04 UTC (permalink / raw)
To: Theodore Tso; +Cc: git, Eric S. Raymond
On Mon, Feb 02, 2009, Theodore Tso <tytso@mit.edu> wrote:
> On Mon, Feb 02, 2009 at 07:48:53PM +0100, Jakub Narebski wrote:
> > In my opinion the most important issue is concentrating on "container
> > identity" instead of on the underlying issue of renames in version
> > control, which includes intelligent, rename-aware merge; talk about
> > issues and not about possible solution. I will concentrate on this
> > issue for now, and leave for example issue of workflows, and of VCS
> > history for possible later posts (it is long enough as is).
>
> This was discussed in no small amount of detail on the mailing list
> uvc-reviewers [...]
I guess that this mailing list is subscribe-only, isn't it? So doing
CC to uvc-reviewers wouldn't, unfortunately, cut it?
> > Below you can find my comments; quoted fragments of "Understanding
> > Version Control" essay are prefixed with 'UVC> '. 'TODO' refers to
> > http://www.catb.org/esr/writings/version-control/TODO.html
> >
> > Please do participate in this discussion, especially if you have
> > something to say with respect to rename detection versus rename
> > tracking issue. Thanks in advance.
>
> Heh. A lot of this has been said already. I think one of the reasons
> why Eric kept things short in his paper, and did *not* say a lot about
> whether or not container identity tracking was fundamentally needed or
> not was because we didn't come to any real consensus on the
> uvc-reviewers mailing list. I believe it is extremely difficult to do
> so given that it's very hard to avoid the slippery slope of advocating
> for one SCM system versus another.
Well, I tried to be objective, but I know I am biased towards Git.
I tried to list advantages and disadvantages of all three methods of
dealing with renames in VCS: 'container identity' aka. 'file-ids'
(and 'directory-ids') which I think is solution used by Bazaar;
'rename tracking' or 'recording rename information' which I think
is the solution used by Mercurial, and 'rename detection' used by
Git.
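To make the 'rename detection' approach concrete, here is a minimal
sketch (an illustration, not something taken from the discussion
itself). It assumes a throwaway repository; the file names foo and bar
and the committer identity are made up. The file is renamed with plain
mv, so nothing is recorded about the rename, and the rename shows up
only when the diff is asked to look for it.

```shell
set -e
# Placeholder identity so that commits work in a clean environment.
export GIT_AUTHOR_NAME=tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=tester GIT_COMMITTER_EMAIL=tester@example.com

repo=$(mktemp -d) && cd "$repo"
git init -q

printf 'line %s of some content\n' 1 2 3 4 5 6 7 8 >foo
git add foo
git commit -q -m 'add foo'

# Rename with plain mv: no "scm mv", nothing tells git this is a rename.
mv foo bar
git add -A
git commit -q -m 'foo becomes bar'

# The commit only stores the new tree; the rename is worked out on demand.
with_detection=$(git diff -M --name-status HEAD^ HEAD)
without_detection=$(git diff --no-renames --name-status HEAD^ HEAD)

echo "with -M:"
echo "$with_detection"
echo "with --no-renames:"
echo "$without_detection"
```

With detection switched on the change is reported as a single rename
entry; with it switched off the same commit reads as one deletion plus
one addition.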
BTW, if I remember correctly (either from comments in UVC, or from
TODO), Eric wanted to have a test suite which he could run to examine
how well a given VCS supports renames, both in the form of intelligent
rename-aware merges and in the form of following a file through
renames when examining history.
When writing this email I wanted to add as an appendix a proposal
for such a test, which would also clarify what the expectations are
wrt. rename support... but I plainly forgot to add it. You can
find some very bare bones version in my post "Git at Better SCM
Initiative comparison of VCS (long)"[1][2] on git mailing list,
around where I was talking about intelligent_renames section.
(The post was meant to correct invalid information about Git in
'Better SCM Initiative' comparison[3]).
[1] http://thread.gmane.org/gmane.comp.version-control.git/95809
[2] My earlier posts with similar title were about _adding_ Git
to the comparison.
[3] http://better-scm.berlios.de/comparison/
>
> I'll include some of my writings on the subject from the uvc-reviewers
> mailing list so folks can see where some of this discussion went last
> time... (All of this dates from January, 2008, when Eric was last
> aggressively updating the paper in question.)
Thank you very much for those excerpts / fragments, even though
I'd rather have your fresh comments either on current state of
"Understanding Version-Control Systems", or on my post.
> BTW, when I referred to
> SCM's being a horrible hack and "guessing" and "fit only to be used by
> amateurs" if they didn't record function-level identity tracking,
> there were those who were seriously arguing that any SCM (i.e., like
> git) that didn't track container identity was fundamentally a "hack".
> Yes, there are people who seriously take that view, some of whom were
> very bitter that their DSCM didn't win the market/popularity wars, and
> that their pet projects were overtaken by SCM's such as git, describing
> $THEIR_PET_SCM_WITH_PROVABLY_CORRECT_SEMANTICS as Betamax, and git as
> VHS. The argument that without rename-tracking, if git were used to
> develop software for an Air Traffic Control application, airplanes
> could be dropping out of the sky was also made by these advocates, no
> kidding. (So was the argument that using a DSCM that didn't do
> container identity tracking might be considered Programming
> Malpractice.)
That is, I think, the difference between being 'perfect in theory' and
'good enough in practice', and reminds me of times of discussion about
"perfect" contents merge algorithm (Codeville merge and precise
Codeville merge, folks at revctrl.org and mark-merge, etc.). Nowadays
even Bram Cohen (of Codeville) agrees that the 3-way merge algorithm is good
enough ("Version Control Recommended Practices", '3. Use 3-way merge',
at http://bramcohen.livejournal.com/52148.html).
Failing gracefully matters (i.e. if there is a conflict it can be
easily understood), low probability of falsely clean merges matters,
and performance (to not be exponential in some real cases) matters
too. More than theoretical perfection.
And there are many, many examples of using heuristics over perfect
algorithmic solution (if algorithmic solution exists at all).
>
> So be careful about wanting to reopen this discussion; if the some of
> the wrong people join in, you may be very sorry! :-)
Thanks for the warning! :-) Well, I hope they are not present on git
mailing list...
>
> - Ted
>
>
> > Here then are some types of identity
> > tracking one might imagine:
> >
> > * File identity tracking: tracks the identity of a file through
> > renames and moves.
Git tracks _contents_ of a file. Not file identity. And this works
quite well in practice (although it is not free from disadvantages).
> > * Simple file content tracking: tracks the identity of content
> > using adds and deletes within a single file. (Note, there is a
> > question that could be asked here about the resolution of the
> > tracking. Most current systems that track do so on a line-by-line
> > basis, but one could imagine tracking bytes. I wont say any more
> > about this in this email.)
> > * Movement of content within a file: tracks the identity of
> > content within a file when lines are moved.
> > * Movement of content between files: tracks the identity of
> > content when lines are moved between files.
"git blame -C -C <file>" supports all three. I think it is the only
VCS which supports detection of code movement and copying in single
file and across files. That's by the way, of course...
>
> One obvious one which isn't in this list is "Directory Identity
> Tracking", that is when you move a directory, new files which are
> created in one branch at the original directory location will be moved
> when you merge with another branch where the directory has been moved.
Or "Detecting [Wholesale] Directory Renames"... which can be done
using the 'rename detection' paradigm, and we have patches to prove it![4]
Unfortunately the code didn't make it (yet!) into git. And it can,
I think, deal with splitting files into two directories, something
which I guess is simply impossible in a 'container identity'
(directory-id) based solution.
This issue would be in my planned test suite of VCS rename support.
[4] e.g. http://thread.gmane.org/gmane.comp.version-control.git/99529
> In private conversations with Tom Lord, he tells me that he had also
> played with the concept of "Function/variable (more generally,
> programming structure identifier) identity tracking". [...]
True, it is a _theoretical_ advantage of the changeset based model that
one can use a richer set of operators than simple delta (this is IIRC
touched on in the Darcs manual). But first, nobody has implemented such
a thing, even as a proof-of-concept prototype. Second, I GUESS that
_in practice_ this would require something like Darcs' theory of
patches and Darcs' patch commutation, and a merge strategy which would
have to take
into account _whole_ history (and not only endpoints: branches to
be merged and common ancestor(s) like in 3-way merge)... which could
and would lead to exponential time; extremely bad performance. But
that is just my guess.
> ---------------------------
>
> > The second could be called "location". Which file should this patch
> > be applied to? Which lines within a file should this hunk be applied
> > to? I argue in [2] that Darcs does strictly better at the task of
> > location than do SVN or GNU diff3. [...]
>
Does Darcs still have the bug with exponential time for some merges?
> How are you defining "provably correct"? In order to show
> correctness, you need to define what correctness means.
Additionally, even if it were possible to decide and mathematically
define what the correct way of merge resolution is, it still wouldn't
help in the case of a semantic conflict. And no tool would ever be
able to solve that.
>
> One approach is that you force the user to tell you [...]
And this is of course _wrong_ solution from the usability point of
view. Well, at least I think it is.
> > Let's be careful not to lump these three things together, say "All
> > merging involves guessing.", and thus overlook the interesting fact
> > that some merge algorithms involve strictly more guessing than do
> > others.
>
> Part of the problem is that words like "guessing" and claims of some
> algorithm being "provably correct" are basically marketing words.
> They are generally used to denigrate one SCM, and promote someone's
> favorite SCM as being **better**.
>
> Fundamentally, the goal of merging is to Do The Right Thing --- from a
> semantic point of view, which means that the user's intentions are
> what's important at the end of the day. The question then is whether
> you record the user's intentions, or try to determine them from a
> heuristics point of view.
As I wrote: sometimes heuristic algorithms are better; take for
example the NP-C 'travelling salesman' problem...
And what matters is if given merge strategy (algorithm) is good
in practice, and not if it is good in theory... (I'm repeating
myself.)
>
> The people who claim that recording the user's intentions is superior
> will claim that you can never know for sure what the user meant, so
> you have to ask him or her to provide that information. In some cases
> that's relatively easy; you require the user to use commands like "bk
> mv" and "bk cp" and "bk rm" which not only perform the function, but
> also records the user's intention. Unfortunately, if you are applying
> a patch, and the patch file hasn't been enhanced to carry this kind of
> information, you have to use heuristics and then somehow get the user
> to confirm them --- hence the use of the Annoying Popup Dialog Box.
Ordinary patches, file managers, your favorite editor (if it doesn't
have support for the 'rename recording' VCS you use)...
BTW, using heuristics to find renames and recording information about
renames during commit wouldn't help in the case where the file was
independently added, for example under a different filename: for
this I think it is necessary to have rename detection at merge time.
[...]
> And yet, people seem to focus on recording of user intention because
> it reflects some holy grail of Perfection and Correctness. And maybe,
> because it is easily measurable, whereas usability and improving
> programmer productivity are inherently more subjective measures.
> What's very sad are the people who feel profoundly hurt that they
> spent a huge amount of their life working on SCM Correctness, only to
> find that people chose other SCM's based on other metrics and other
> issues other than the one that they felt was most important.
> Unfortunately there's not a lot that can be done about that.
>
> - Ted
Erm... this certainly shows that discussion on uvc-reviewers mailing
list drifted _badly_ away from what "Understanding Version Control"
is about :-(
P.S. I hoped that Linus, who is a strong proponent of 'contents is the
king' and of the superiority of rename detection, would write something,
but perhaps it was too long...
--
Jakub Narebski
Poland
* Tests for "Understanding Version Control" by Eric S. Raymond
2009-02-02 18:48 Comments on "Understanding Version Control" by Eric S. Raymond Jakub Narebski
2009-02-02 20:24 ` Theodore Tso
@ 2009-02-04 22:14 ` Jakub Narebski
2009-02-10 1:20 ` Comments on " Jakub Narebski
2 siblings, 0 replies; 21+ messages in thread
From: Jakub Narebski @ 2009-02-04 22:14 UTC (permalink / raw)
To: git; +Cc: Eric S. Raymond, Theodore Tso
Sometimes code speaks louder than words. I think it is the case in
understanding how intelligent merge should deal with the presence
of renames; well commented examples (use cases) are (might be?) better
than talking about it.
Instead of talking about 'container identities' it would be better, in
my opinion, to examine what challenges a version control system has to
overcome if it wants to claim that it supports renames. Example tests,
example use cases might be a good choice. With those examples we
can check what are the limitations, strong and weak sides of different
ways of dealing with renames: container identities (or 'file-ids'),
tracking rename information, and heuristic rename detection.
Below are proposed tests which are meant to check how well the merge
algorithm (strategy) used by a given SCM supports renames. In all cases
we have two branches: branch 'a' (ours) and branch 'b' (theirs), and we
usually would be merging branch 'b' into (ours) branch 'a'.
Note that for Git we want the example files to be large enough that
similarity based heuristic rename detection works. At the bottom
there are proposed files, taken from t/t6022-merge-rename.sh test
from git.git repository (which, along with t/t6032-merge-large-rename.sh,
might be a good start for more comprehensive test suite for SCMs).
* merging renames: if one side renamed file you should get rename
on merge; renaming a file and then merging that rename.
<switch to branch b>
[on branch b]$ scm mv foo bar
[on branch b]$ scm commit ... # to commit file rename
<switch to branch a>
[on branch a]$ scm commit ... # to not have fast-forward case
[on branch a]$ scm merge b # merge branch b (with rename)
expected result:
you have file 'bar', and do not have file 'foo'
* applying change to correct file: if our side renamed a file (or
renamed the directory it is in, which indirectly renames the full
pathname of the file), and possibly changed it, and the other side
changed the file, we would want the merge to bring the changes to the
file after the rename.
<switch to branch a>
[on branch a]$ scm mv foo bar
[on branch a]$ edit bar && scm commit # optionally
<switch to branch b>
[on branch b]$ edit foo
[on branch b]$ scm commit -m 'FOO' # commit changes
<switch to branch a>
[on branch a]$ scm merge b
expected result:
you have changes made on branch 'b' to file 'foo'
(commit 'FOO') in file 'bar'
Note that like in example in previous item all operations take place
_after_ branching point (after creation of branch b off branch a).
This is I guess what most people think when talking about
rename-aware (intelligent) merging.
* renamed directories bring another complication (described for example
on Mark Shuttleworth blog in articles about DVCS, promoting Bazaar-NG),
namely how to deal with merging changes where other side creates
_new files_ in renamed directory.
<switch to branch a>
[on branch a]$ scm mv subdir-foo/ subdir-bar/
<switch to branch b>
[on branch b]$ scm add subdir-foo/baz # add new file in old dir
<switch to branch a>
[on branch a]$ scm merge b
expected result:
New file subdir-bar/baz
Either automatic merge, or a conflict (no commit)
There is a bit of controversy about this feature, as for example in
some programming languages (e.g. Java) or in some project build tool
info it is not possible to simply move a file (or create a new file in
a different directory) without changing file contents. Some say that
it is better to fail than to produce a wrongly clean merge.
* independent adding of a file: this is the case where both sides add
the same (or nearly the same) file independently, so the file in
question doesn't have common ancestor in per-file history. It
might happen because of applying patch independently, for example.
I _suspect_ that 'file-id' based solutions would have problems...
Below there is table of cases that might happen:
| 1 | 2 | 3 | 4 | 5 |
---------|-----|-----|-----|-----|-----|
filename | = | = | != | = | != |
contents | = | != | = | ~= | ~= |
where
'=' means that both sides use the same filename,
or exactly the same contents;
'!=' means files have different contents,
or files _started_ with different names
'~=' means that sides have slightly different contents,
but similarity score is high enough for rename detection.
Let us examine the most complicated case 5 in the above table; one can
simply omit some commands to get cases 1, 3 and 4. Note that the
COPYING file is the GNU GPL text.
<switch to branch a>
[on branch a]$ cp ../COPYING COPYING
[on branch a]$ scm add COPYING
[on branch a]$ scm commit -m 'Added COPYING'
<switch to branch b>
[on branch b]$ cp ../COPYING LICENSE # optional rename
[on branch b]$ sed -e 's/HOWEVER/However/' <LICENSE >LICENSE.1 &&
mv -f LICENSE.1 LICENSE # optional change
[on branch b]$ scm add LICENSE
[on branch b]$ scm commit -m 'Added LICENSE'
[on branch b]$ scm mv LICENSE COPYING
[on branch b]$ scm commit -m 'Renamed LICENSE to COPYING'
<switch to branch a>
[on branch a]$ scm merge b
expected result:
Either clean merging changes from branch 'b', or cleanly
marked conflict, e.g. CONFLICT (add/add). What we do not
want is one side vanishing.
But that series of commands was only a preparation. We now
want to repeat merging from branch 'b' into branch 'a', or
do reverse merging 'a' into 'b'.
<switch to branch b>
[on branch b]$ sed -e 's/GPL/G.P.L/g' <COPYING >COPYING1 &&
mv -f COPYING1 COPYING
[on branch b]$ scm commit COPYING -m 'Changed COPYING'
<switch to branch a>
[on branch a]$ scm commit ... # to not have fast-forward case
[on branch a]$ scm merge b # merge branch b
expected result:
Clean merge of changes done on branch 'b' into COPYING
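The renamed-directory case from the list above can be tried against Git
with a sketch like the following. It assumes a throwaway repository;
the file names inside subdir-foo/ and the committer identity are made
up. The result is deliberately only reported and not asserted: depending
on the Git version and its configuration the new file may stay in
subdir-foo/, follow the rename into subdir-bar/, or the merge may stop
with a conflict.

```shell
set -e
# Placeholder identity so that commits work in a clean environment.
export GIT_AUTHOR_NAME=tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=tester GIT_COMMITTER_EMAIL=tester@example.com

repo=$(mktemp -d) && cd "$repo"
git init -q

# A directory with a few files to rename wholesale.
mkdir subdir-foo
for f in one two three; do
    for i in 1 2 3 4 5 6 7 8; do
        echo "$f: line $i"
    done >"subdir-foo/$f"
done
git add subdir-foo
git commit -q -m 'initial'
git branch -M a
git branch b

# branch a: rename the directory
git mv subdir-foo subdir-bar
git commit -q -m 'rename subdir-foo to subdir-bar'

# branch b: add a new file in the old directory
git checkout -q b
echo 'new file' >subdir-foo/baz
git add subdir-foo/baz
git commit -q -m 'add subdir-foo/baz'

# merge b into a and report what happened, whatever it is
git checkout -q a
if git merge -q -m 'merge b' b >/dev/null 2>&1; then
    outcome=clean
else
    outcome=conflict
fi
baz=$(ls subdir-*/baz 2>/dev/null || true)
echo "merge outcome: $outcome; baz ended up at: $baz"
```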
==================================================
##################################################
==================================================
Files for testing rename detection in Git
........................................
t/t6022-merge-rename.sh
t/t6022-merge-rename-nocruft.sh
cat >A <<\EOF &&
a aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
b bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
c cccccccccccccccccccccccccccccccccccccccccccccccc
d dddddddddddddddddddddddddddddddddddddddddddddddd
e eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
f ffffffffffffffffffffffffffffffffffffffffffffffff
g gggggggggggggggggggggggggggggggggggggggggggggggg
h hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
i iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
j jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
k kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
l llllllllllllllllllllllllllllllllllllllllllllllll
m mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
n nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
o oooooooooooooooooooooooooooooooooooooooooooooooo
EOF
cat >M <<\EOF &&
A AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
B BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
D DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
E EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
F FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
G GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
H HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
I IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
J JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
K KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
L LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
M MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
N NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
O OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
EOF
sed -e "/^g /s/.*/g : master changes a line/" <A >A+
sed -e "/^g /s/.*/g : white changes a line/" <A >B
sed -e "/^G /s/.*/G : colored branch changes a line/" <M >N
sed -e "/^g /s/.*/g : red changes a line/" <A >B
sed -e "/^G /s/.*/G : colored branch changes a line/" <M >N
[...]
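Using a file in the style of A above, the "applying change to correct
file" test can be sketched for Git as follows. This is only an
illustration under stated assumptions: a throwaway repository, a
made-up committer identity, and the names foo and bar from the test
description. One side renames the file, the other side edits line 'g',
and the merge is expected to carry the edit into the renamed file.

```shell
set -e
# Placeholder identity so that commits work in a clean environment.
export GIT_AUTHOR_NAME=tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=tester GIT_COMMITTER_EMAIL=tester@example.com

repo=$(mktemp -d) && cd "$repo"
git init -q

# 15 long lines, large enough for similarity based rename detection.
for c in a b c d e f g h i j k l m n o; do
    printf '%s %s\n' "$c" "$(printf '%48s' '' | tr ' ' "$c")"
done >foo
git add foo
git commit -q -m 'initial'
git branch -M a
git branch b

# branch a: rename foo to bar
git mv foo bar
git commit -q -m 'rename foo to bar'

# branch b: change a line in foo
git checkout -q b
sed -e '/^g /s/.*/g : branch b changes a line/' foo >foo.new &&
mv foo.new foo
git commit -q -a -m 'FOO'

# merge b into a: the change should land in bar
git checkout -q a
git merge -q -m 'merge b' b
ls
grep '^g ' bar
```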
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-04 2:04 ` Jakub Narebski
@ 2009-02-04 23:54 ` Theodore Tso
2009-02-05 0:04 ` Junio C Hamano
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Theodore Tso @ 2009-02-04 23:54 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Eric S. Raymond
On Wed, Feb 04, 2009 at 03:04:02AM +0100, Jakub Narebski wrote:
>
> I guess that this mailing list is subscribe-only, isn't it? So doing
> CC to uvc-reviewers wouldn't, unfortunately, cut it?
According to the Wayback Archive's record of the uvc-reviewers mailman
listinfo page, the list was open for anyone to join, and the archives
were public, which is why I don't mind sharing the archives with
anyone who asks.
> > I'll include some of my writings on the subject from the uvc-reviewers
> > mailing list so folks can see where some of this discussion went last
> > time... (All of this dates from January, 2008, when Eric was last
> > aggressively updating the paper in question.)
>
> Thank you very much for those excerpts / fragments, even though
> I'd rather have your fresh comments either on current state of
> "Understanding Version-Control Systems", or on my post.
My comments haven't changed; as you probably noted, I agree with you,
and my arguments largely parallel yours. I was using a reductio ad
absurdum argument to show that the same argument that claims that Git
is a primitive, hackish SCM because it doesn't record user intention
vis-a-vis file renames could also be extended to say that use of all
current DSCMs amounts to "Programming Malpractice" because they don't
allow the recording of higher level "user intentions" such as the
renaming of variables, functions, types, and class names.
My comments date from the very end of January 2008, when Eric stopped
updating his paper, and before he could start doing an extensive
description and evaluation of bzr, Mercurial and Git, so it's not
surprising that they are still relevant today. I suspect that when he
picks up this draft again, and starts writing these sections covering
modern distributed SCM's, the sections for Mercurial, Git, Bzr,
et al., will cause a huge amount of controversy, because even though
he is claiming to be unbiased, it is very clear in the draft to
date that he would very much like to draw a grand sweeping picture of
progress and evolution starting from "first generation systems" (RCS,
SCCS, et al.), to "second generation systems" (CVS, SVN, et al.), to
"third generation systems" (Arch, Monotone, git, Mercurial, etc.).
There are hints in the draft that he views "container identity" as
the next "evolutionary idea" which "more primitive" systems do not
have, and "more evolved" systems do have. This can be seen from this
excerpt from his draft:
First wrong assumption: Conflict resolution by merging is
intractably difficult, so we'll have to settle for locking. It
took at least fifteen and arguably twenty years for VCS
designers to get shut of that one. But it's historical now.
Second wrong assumption: Change history representation as a
snapshot sequence is perfectly dual to the representation as
change/add/delete/rename sequences. This folk theorem is well
expressed in the 2004 essay On Arch and Subversion. It is
appealing, widely held, and dead wrong.
File renames break the apparent symmetry. The failure of
snapshot-based models to correctly address this has caused
endless design failures, subtle bugs, and user misery.
So you can see that Eric seems to believe quite strongly that the
failure to track file renames is as fundamental an error as what he
terms the "First Wrong Assumption". He later admits that the idea is
controversial, and that people are still "grappling" with it, but I
think he's tipped his hand about what he believes the ultimate correct
answer is with respect to this issue.
I believe, as I think you do, that the hysteria that states that you
*must* record user intention leads inexorably to the requirement to
force users to indicate "intention" by popping up Annoying Dialog
Boxes whenever they suck in a patch that was sent via e-mail so that
the SCM can record information about whether a file rename had
happened in a particular commit. I believe this requirement to
record user intentions and to pop up these Annoying Dialog Boxes is a
blind alley, a la the vast amount of time wasted arguing over algorithms
such as Codeville precise merges. I also believe that forcing users
to record "user intention" makes about as much sense as forcing users
to declare they are about to edit a file by explicitly taking locks on
files a la RCS.
I suspect Eric will disagree with me, but regardless of how he
completes his paper, it will almost certainly end up taking sides one
way or another on this controversy, at which point one side or the
other of this particular disagreement will argue that Eric is really
writing an advocacy paper pushing Bzr, Mercurial, or Git (depending on
how he comes out on this issue).
Your suggestion that the proof is going to be in the code makes a lot
of sense. The examples I would suggest that we create, and then
demonstrate (or make enhancements to git) so that it can handle these
real world examples are:
1) In branch A, the directory src/plugin/innodb-experimental is
renamed to src/plugin/innodb, and in branch B, a commit (i)
modifies a file src/plugin/innodb-experimental/table.c, and (ii)
creates a file src/plugin/innodb-experimental/mod-schema.c. This
commit in branch B is then pulled into branch A, where the
directory rename has taken place. The user may not know that a
directory rename had taken place under the covers, so they don't
give any magic options when they run the "git cherry-pick" or "git
merge" command. Does the right thing happen such that the right
file in src/plugin/innodb is modified, and the new file is created
in src/plugin/innodb, even though in the original commit, the
changes were made to files in src/plugin/innodb-experimental?
2) And does the right thing happen if the situation is as described
above, but in branch C, which is descended from branch B, a new
directory, src/plugin/innodb-experimental is created, such that
src/plugin/innodb and src/plugin/innodb-experimental both exist.
Now the same commit from branch A is pulled into branch C. Will
the correct thing happen in that the correct files in
src/plugin/innodb are modified and created, even though there is a
new directory containing a completely unrelated plugin that happens
to have the name, "innodb-experimental"?
BTW, it has been asserted that there exists at least one major open
source project where this sort of thing happens quite often, and
the fact that git did not do the right thing in these conditions
was a factor in their choosing another DSCM.
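Case (1) can be pinned down as a small executable model. The Python
sketch below is only an illustration under loud assumptions: trees are
plain dicts mapping path to content, renames are paired by exact content
match (a stand-in for similarity-based rename detection), and
detect_renames / apply_modification are hypothetical helper names, not
git internals. It covers only part (i), the modification following the
renamed file; where the newly created file should land is a separate
question.

```python
def detect_renames(base, ours):
    """Pair paths deleted from `base` with paths added in `ours` holding identical content."""
    deleted = {p: c for p, c in base.items() if p not in ours}
    added = {p: c for p, c in ours.items() if p not in base}
    renames = {}
    for old, content in deleted.items():
        for new, new_content in added.items():
            # Exact match only; each added path is used for at most one rename.
            if content == new_content and new not in renames.values():
                renames[old] = new
                break
    return renames


def apply_modification(ours, renames, path, new_content):
    """Apply a change made to `path` on the other branch, following any detected rename."""
    target = renames.get(path, path)
    result = dict(ours)
    result[target] = new_content
    return result


# Merge base: the plugin still lives in innodb-experimental/.
base = {"src/plugin/innodb-experimental/table.c": "table v1\n"}
# Branch A renamed the directory without touching the content.
branch_a = {"src/plugin/innodb/table.c": "table v1\n"}

renames = detect_renames(base, branch_a)
# Branch B modified the file under its old name; replay that onto branch A.
merged = apply_modification(
    branch_a, renames, "src/plugin/innodb-experimental/table.c", "table v2\n"
)
```

In this toy model the modification lands in src/plugin/innodb/table.c
even though the commit from branch B only ever mentioned the old path.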
> Or "Detecting [Wholesame] Directory Renames"... which can be done
> using 'rename detection' paradigm, and we have patches to prove it![4]
> but unfortunately code didn't made it (yet!) into git. And it can,
> I think, deal with splitting files into two directories, something
> which I guess in 'container identity' (directory-id) based solution
> is simply impossible
It may be that Yann Dirson's patches will handle case (1) above.
Handling case (2) is much harder, especially without slowing
everything down massively, since it would effectively mean needing to
look for directory renames along every single commit on the branch.
(This would obviously have to be cached in some cache file.)
It can be done, I'm sure, but it would require a lot of code to get
right. Whether or not it's worth it is a question which is open to
debate, but I believe the bzr folks have asserted that bzr can handle
both cases (1) and (2) above, and there are some folks who apparently
care.
Whether or not a particular open source project will really and truly
run into this problem is a different question, and one can argue that
renaming plugins, and then creating new plugins with the same name as
older plugins that have since been renamed will lead to programmer
confusion, and so that's a good enough reason to avoid doing such
crazy things. Unfortunately, you know how some programmers
are.... telling someone they shouldn't do something is often an
invitation to do exactly what you tell them is a bad idea, and then
they complain when your filesystem or your DSCM doesn't handle that
case particularly gracefully.
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-04 23:54 ` Theodore Tso
@ 2009-02-05 0:04 ` Junio C Hamano
2009-02-05 2:43 ` Theodore Tso
2009-02-05 0:08 ` Jakub Narebski
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2009-02-05 0:04 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jakub Narebski, git, Eric S. Raymond
Theodore Tso <tytso@mit.edu> writes:
> 1) In branch A, the directory src/plugin/innodb-experimental is
> renamed to src/plugin/innodb, and in branch B, a commit (i)
> modifies a file src/plugin/innodb-experimental/table.c, and (ii)
> creates a file src/plugin/innodb-experimental/mod-schema.c. This
> commit in branch B is then pulled into branch A, where the
> directory rename has taken place. The user may not know that a
> directory rename had taken place under the covers, so they don't
> give any magic options when they run the "git cherry-pick" or "git
> merge" command. Does the right thing happen such that the right
> file in src/plugin/innodb is modified, and the new file is created
> in src/plugin/innodb, even though in the original commit, the
> changes were made to files in src/plugin/innodb-experimental?
Careful.
Although it is reasonable to expect that existing file's modification will
move to innodb/ directory, it is not as clear-cut as some people seem to
assume that the new file should always be created in the new directory
innodb/. You seem to imply you understand the issues by having the second
example, though.
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-04 23:54 ` Theodore Tso
2009-02-05 0:04 ` Junio C Hamano
@ 2009-02-05 0:08 ` Jakub Narebski
2009-02-05 0:49 ` Theodore Tso
2009-02-05 6:01 ` Miles Bader
2009-02-05 11:23 ` Jakub Narebski
3 siblings, 1 reply; 21+ messages in thread
From: Jakub Narebski @ 2009-02-05 0:08 UTC (permalink / raw)
To: Theodore Tso; +Cc: git, Eric S. Raymond
On Thu, 5 Feb 2009, Theodore Tso wrote:
> On Wed, Feb 04, 2009 at 03:04:02AM +0100, Jakub Narebski wrote:
> >
> > I guess that this mailing list is subscribe-only, isn't it? So doing
> > CC to uvc-reviewers wouldn't, unfortunately, cut?
>
> According to the Wayback Archive's record of the uvc-reviewers mailman
> listinfo was open for anyone to join, and the archives were public,
> which is why I don't mind sharing the archives with anyone who asks.
What I meant here was whether to send email to uvc-reviewers mailing
list I have to be subscribed or not? Because if you have to be
subscribed, even if it is open subscription and not by invitation only,
it would make it impossible to have discussion on git mailing list as
a main venue, and just CC: (send copies) to uvc-reviewers list.
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 0:08 ` Jakub Narebski
@ 2009-02-05 0:49 ` Theodore Tso
0 siblings, 0 replies; 21+ messages in thread
From: Theodore Tso @ 2009-02-05 0:49 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Eric S. Raymond
On Thu, Feb 05, 2009 at 01:08:13AM +0100, Jakub Narebski wrote:
> On Thu, 5 Feb 2009, Theodore Tso wrote:
> > On Wed, Feb 04, 2009 at 03:04:02AM +0100, Jakub Narebski wrote:
> > >
> > > I guess that this mailing list is subscribe-only, isn't it? So doing
> > > CC to uvc-reviewers wouldn't, unfortunately, cut?
> >
> > According to the Wayback Archive's record of the uvc-reviewers mailman
> > listinfo was open for anyone to join, and the archives were public,
> > which is why I don't mind sharing the archives with anyone who asks.
>
> What I meant here was whether to send email to uvc-reviewers mailing
> list I have to be subscribed or not? Because if you have to be
> subscribed, even if it is open subscription and not by invitation only,
> it would make it impossible to have discussion on git mailing list as
> a main venue, and just CC: (send copies) to uvc-reviewers list.
Well, given that the mailman web interface is dead and gone, I'm
guessing the mailing list itself is also dead, so the point is rather
moot. I assume that if Eric thinks the list is useful, he'll set it
up again when he starts active development on his paper again --- or
not.
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 0:04 ` Junio C Hamano
@ 2009-02-05 2:43 ` Theodore Tso
2009-02-05 6:24 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2009-02-05 2:43 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jakub Narebski, git, Eric S. Raymond
On Wed, Feb 04, 2009 at 04:04:43PM -0800, Junio C Hamano wrote:
> Theodore Tso <tytso@mit.edu> writes:
>
> > 1) In branch A, the directory src/plugin/innodb-experimental is
> > renamed to src/plugin/innodb, and in branch B, a commit (i)
> > modifies a file src/plugin/innodb-experimental/table.c, and (ii)
> > creates a file src/plugin/innodb-experimental/mod-schema.c. This
> > commit in branch B is then pulled into branch A, where the
> > directory rename has taken place. The user may not know that a
> > directory rename had taken place under the covers, so they don't
> > give any magic options when they run the "git cherry-pick" or "git
> > merge" command. Does the right thing happen such that the right
> > file in src/plugin/innodb is modified, and the new file is created
> > in src/plugin/innodb, even though in the original commit, the
> > changes were made to files in src/plugin/innodb-experimental?
>
> Careful.
>
> Although it is reasonable to expect that existing file's modification will
> move to innodb/ directory, it is not as clear-cut as some people seem to
> assume that the new file should always be created in the new directory
> innodb/.
Careful; that's actually an argument for recording the directory
rename. If the intention is to rename the directory containing some
plugin, where all of the associated files are for the plugin foobar,
and we are renaming the directory because the plugin has had its name
changed to fooblatz, then a new file introduced by a commit, say
table.c, probably does want to get created in the new directory ---
especially if one of the changes was to foobar/Makefile:
--- Makefile.in.orig 2009-02-04 21:28:38.830212569 -0500
+++ Makefile.in 2009-02-04 21:28:43.977052347 -0500
@@ -60,7 +60,7 @@
 #
 #MCHECK= -DMCHECK
-OBJS= crc32.o dict.o unix.o pass1.o pass1b.o pass2.o \
+OBJS= table.o crc32.o dict.o unix.o pass1.o pass1b.o pass2.o \
 pass3.o pass4.o pass5.o journal.o badblocks.o util.o dirinfo.o \
 dx_dirinfo.o ehandler.o problem.o message.o recovery.o region.o
In other cases, maybe the right thing *is* to drop the new file in the
original directory. So as the Hg and Bzr apologists might say, if the
SCM actually records whether the user intention was a *directory*
rename, versus a series of *file* rename/moves, then it becomes
obvious what the right thing to do is.
But if the SCM is tracking *content* as git does, then we don't have
the benefit of having recorded the user's intention, so we have to use
heuristics. We can say that if *all* of the files in the directory
foobar had been moved to fooblatz as part of a commit, it's likely
that new commits that create new files in foobar should create them in
fooblatz --- especially if the new file name is mentioned in a file
named "Makefile" or "Makefile.in". Or we can give up and ask the user
what was intended at merge time. That's admittedly as annoying as
throwing up the Annoying Dialog Box at commit time when folding in a
patch (which is how systems that record user intention have to do
things). Ultimately, there are only three choices; either the user
tells the SCM, the SCM asks the user, or the SCM applies some
heuristic (i.e., the SCM "guesses"). This can happen at commit time
or it can happen at merge time, and in some cases the SCM can use
a combination of techniques depending on whether a patch is being
imported or whether the user is explicitly telling the scm via "scm
mv", "scm mvdir", "scm cp", et. al.
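The "all files moved" rule from the paragraph above can be sketched as
follows. This is a minimal Python illustration, not anything git ships:
the function names, the dict of detected file renames, and the list of
surviving paths are all assumptions made for the example, and the
Makefile hint is left out.

```python
import posixpath


def wholesale_directory_renames(file_renames, surviving_paths):
    """Return {old_dir: new_dir} when every file of old_dir moved to the same new_dir."""
    candidates = {}
    for old, new in file_renames.items():
        old_dir, new_dir = posixpath.dirname(old), posixpath.dirname(new)
        if old_dir != new_dir:
            candidates.setdefault(old_dir, set()).add(new_dir)
    # A directory that still holds files was not renamed wholesale.
    still_populated = {posixpath.dirname(p) for p in surviving_paths}
    return {
        old_dir: next(iter(new_dirs))
        for old_dir, new_dirs in candidates.items()
        if len(new_dirs) == 1 and old_dir not in still_populated
    }


def place_new_file(path, dir_renames):
    """Guess where a file newly created on the other branch should land."""
    directory, name = posixpath.split(path)
    return posixpath.join(dir_renames.get(directory, directory), name)


# Renames detected on our side: everything in foobar/ went to fooblatz/.
file_renames = {
    "foobar/Makefile.in": "fooblatz/Makefile.in",
    "foobar/dict.c": "fooblatz/dict.c",
}
surviving_paths = ["fooblatz/Makefile.in", "fooblatz/dict.c", "README"]

dir_renames = wholesale_directory_renames(file_renames, surviving_paths)
# The other branch created foobar/table.c; the heuristic redirects it.
guess = place_new_file("foobar/table.c", dir_renames)
```

If even one file is left behind in foobar/, this sketch declines to
guess and the new file stays where the commit put it, which is the
point at which an SCM would have to ask the user instead.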
But if we throw up our hands and say, it's impossible to guess
correctly, then that leaves us open to the argument from people from
the Hg or Bzr camp to say, "You see, tracking *content* is really for
the birds; you really have to let/force the user to *tell you* and
record whether the directory is being moved, or a series of files
in the directory is being moved."
Instead, I would argue that just as we've shown that while collisions
can happen, the costs of locking files a la RCS outweigh the benefits, it
is also better to use heuristics to determine whether a file should be
created in the original directory or the apparently renamed directory
location, and then let the user fix things up afterwards if the
algorithm gets it wrong. If the corner case only happens 1% of the
time, and our algorithm gets it right 99% of the time, the resulting
0.01% error rate is probably quite acceptable.
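The arithmetic behind that figure, as a small Python check. It assumes
the 99% success rate is measured only over merges that hit the corner
case, so the two rates simply multiply; the variable names are made up
for the illustration.

```python
from fractions import Fraction

# The corner case arises in 1% of merges.
corner_case_rate = Fraction(1, 100)
# The heuristic guesses right 99% of the time, so it misses 1% of the time.
heuristic_miss_rate = 1 - Fraction(99, 100)

# Fraction of all merges where the corner case occurs AND the guess is wrong.
overall_error = corner_case_rate * heuristic_miss_rate
overall_error_percent = overall_error * 100
```

That is one wrong guess per 10,000 merges, i.e. 0.01%.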
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-04 23:54 ` Theodore Tso
2009-02-05 0:04 ` Junio C Hamano
2009-02-05 0:08 ` Jakub Narebski
@ 2009-02-05 6:01 ` Miles Bader
2009-02-05 9:34 ` Eric S. Raymond
2009-02-05 11:23 ` Jakub Narebski
3 siblings, 1 reply; 21+ messages in thread
From: Miles Bader @ 2009-02-05 6:01 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jakub Narebski, git, Eric S. Raymond
Theodore Tso <tytso@mit.edu> writes:
> I suspect Eric will disagree with me, but regardless of how he
> completes his paper, it will almost certainly end up taking sides one
> way or another on this controversy, at which point one side or the
> other of this particular disagreement will argue that Eric is really
> writing an advocacy paper pushing Bzr, Mercurial, or Git (depending on
> how he comes out on this issue).
That was pretty clear from his comments on the emacs-devel mailing list
(2008-05 roughly).
He spent a lot of time trying to sound impartial (and that he was "still
doing research"), but strongly gave the impression that he had already
made up his mind.
-Miles
--
Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven
lies about us.' The world begins lying about us pretty soon afterward.
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 2:43 ` Theodore Tso
@ 2009-02-05 6:24 ` Junio C Hamano
2009-02-05 13:28 ` Theodore Tso
0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2009-02-05 6:24 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jakub Narebski, git, Eric S. Raymond
Theodore Tso <tytso@mit.edu> writes:
> Careful; that's actually an argument for recording the directory
> rename.
I do not think so. More precisely, I can see people could make that
argument, but I think that argument is weak.
Suppose the original project's implementor only knew about the innodb
interface, so he had the "database interface" directory and the innodb access
method file in the source tree, perhaps at <db/inno.c>.
I forked the project, and added gdbm support at <db/gdbm.c>.
You also forked the project without knowing what I was working on, and you
started working on refining the innodb support.
All the while, the development community started discussing how the source
tree should be organized to support multiple backends, and you learned
that the plan is to have one directory per larger backend, while keeping
single file ones in <db/*.c>. Specifically, you learned that innodb
related code will be stored in <innodb/*.c>, and there may be other
<somedb/*.c> and <someotherdb/*.c> groups added, but you are not
interested in anything but enhancing innodb support.
You run "scm mv db innodb" and then add <innodb/enhanced.c>, or perhaps
you may have done it the other way, i.e. added <db/enhanced.c> and then
renamed "scm mv db innodb".
Suppose you would want to merge my changes, but the upstream's plan hasn't
happened yet. Neither of us merged from the upstream in the meantime.
Recording your "scm mv db innodb" as "the user's intention to rename
directory" does not help when you want to merge with me to handle the new
file <db/gdbm.c> I added. You not only need to record the "intent to
rename db to innodb", but need to know that the validity of that "intent
to rename" is contingent on the absence of anything unrelated to innodb in
db/ directory, in order to merge the two branches correctly. Otherwise
you will end up moving my <db/gdbm.c> to <innodb/gdbm.c>. The correct
outcome in this case would probably be to leave it as it is.
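To make the failure concrete, here is a minimal Python sketch of what a
blindly applied "recorded directory rename" does to this example. The
helper name and the list-of-paths model are assumptions for
illustration only; no real SCM is claimed to work exactly this way.

```python
def apply_recorded_directory_rename(paths, old_dir, new_dir):
    """Blindly rewrite every path under old_dir/ to new_dir/ (the recorded 'intent')."""
    prefix = old_dir + "/"
    return sorted(
        new_dir + "/" + p[len(prefix):] if p.startswith(prefix) else p
        for p in paths
    )


# Files arriving from the fork that added gdbm support and never saw the rename.
incoming = ["db/gdbm.c", "db/inno.c"]
blind_result = apply_recorded_directory_rename(incoming, "db", "innodb")
```

The recorded intent drags gdbm.c into innodb/ along with inno.c, which
is exactly the outcome argued against above: the record says "db became
innodb", but not the condition under which that was meant.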
> In other cases, maybe the right thing *is* to drop the new file in the
> original directory. So as the Hg and Bzr apologists might say, if the
> SCM actually records whether the user intention was a *directory*
> rename, versus a series of *file* rename/moves, then it becomes
> obvious what the right thing to do.
See how that argument is flawed? The point of my example is that the line
between your example (1) and (2) in the previous message is blurry.
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 6:01 ` Miles Bader
@ 2009-02-05 9:34 ` Eric S. Raymond
0 siblings, 0 replies; 21+ messages in thread
From: Eric S. Raymond @ 2009-02-05 9:34 UTC (permalink / raw)
To: Miles Bader; +Cc: Theodore Tso, Jakub Narebski, git
Miles Bader <miles@gnu.org>:
> Theodore Tso <tytso@mit.edu> writes:
> > I suspect Eric will disagree with me, but regardless of how he
> > completes his paper, it will almost certainly end up taking sides one
> > way or another on this controversy, at which point one side or the
> > other of this particular disagreement will argue that Eric is really
> > writing an advocacy paper pushing Bzr, Mercurial, or Git (depending on
> > how he comes out on this issue).
>
> That was pretty clear from his comments on the emacs-devel mailing list
> (2008-05 roughly).
>
> He spent a lot of time trying to sound impartial (and that he was "still
> doing research"), but strongly gave the impression that he had already
> made up his mind.
At the time, I leaned slightly towards Mercurial, but my reasons had
nothing to do with the cluster of issues Ted is pointing at; rather, I
liked hg for its interface simplicity.
I remain agnostic about the deep issues around renaming and user
intentions - in part because I'm by no means sure I completely
understand them yet.
--
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-04 23:54 ` Theodore Tso
` (2 preceding siblings ...)
2009-02-05 6:01 ` Miles Bader
@ 2009-02-05 11:23 ` Jakub Narebski
2009-02-05 13:16 ` Theodore Tso
3 siblings, 1 reply; 21+ messages in thread
From: Jakub Narebski @ 2009-02-05 11:23 UTC (permalink / raw)
To: Theodore Tso; +Cc: git, Eric S. Raymond
On Thu, Feb 05, 2009, Theodore Tso wrote:
> On Wed, Feb 04, 2009 at 03:04:02AM +0100, Jakub Narebski wrote:
> My comments date from the very end of January 2008, when Eric stopped
> updating his paper, and before he could start doing an extensive
> description and evaluation of bzr, Mercurial and Git,
That evaluation is very important, as Git, Mercurial (hg) and Bazaar
(bzr), in addition to Subversion (svn), dominate the field of open-source
version control systems, with Darcs and Monotone having their own niches.
> so it's not
> surprising that they are still relevant today. I suspect that when he
> picks up this draft again, and starts writing these sections covering
> modern distributed SCM's, the sections for Mercurial, Git, Bzr,
> et. al, will cause a huge amount of controversy, because even though
> he is claiming to be unbiased, there is very clear in the draft to
> date that he would very much like to draw a grand sweeping picture of
> progress and evolution starting from "first generation systems" (RCS,
> SCCS, et. al), to "second generation systems" (CVS, SVN, et. al), to
> "third generation systems" (Arch, Monotone, git, Mercurial, etc.)
There is progress and evolution...
* locking -> update-then-commit (or merge-then-commit) ->
-> commit-then-merge (and alternate/additional workflow of rebase
aka. commit-merge-recommit-push)
* local -> client / server -> distributed
Perhaps also
? per-file history -> whole tree commits
But there are still controversial issues, such as the one discussed
here: _how_ to deal with renames.
>
> There are hints in the draft that he views "container identity" has
> the next "evolutionary idea" which "more primitive" systems do not
> have, and "more evolved" systems do have. This can be seen from this
> excerpt from his draft:
>
> First wrong assumption: Conflict resolution by merging is
> intractably difficult, so we'll have to settle for locking. It
> took at least fifteen and arguably twenty years for VCS
> designers to get shut of that one. But it's historical now.
There I think everybody would agree. Modern VCSs rarely, if ever,
support the locking model.
>
> Second wrong assumption: Change history representation as a
> snapshot sequence is perfectly dual to the representation as
> change/add/delete/rename sequences.. This folk theorem is well
> expressed in the 2004 essay On Arch and Subversion. It is
> appealing, widely held, and dead wrong.
>
> File renames break the apparent symmetry. The failure of
> snapshot-based models to correctly address this has caused
> endless design failures, subtle bugs, and user misery.
First, as I have stressed already, the issue of 'container identities'
for dealing with renames is totally ORTHOGONAL to the issue of whether
an SCM is snapshot based or changeset based. Case in point: Bazaar (bzr).
Bazaar uses file-ids and directory-ids to deal with renames (here it
is a spiritual child of Arch), but on the Bazaar wiki (http://bazaar-vcs.org)
it is mentioned in passing that it is _snapshot based_. I think
that it had those file-ids even when it used 'weave' in its repository
format (not deltas / changesets).
Second, as I also wrote already, the article cited as an argument
for changeset based SCM (which you don't have in the above excerpt) is not
to the point, and moreover is totally, utterly _wrong_. The troubles
with merging in CVS and Subversion are not caused by the fact that they
are snapshot based (CVS isn't, by the way), but by the fact that they
don't (or in the case of Subversion didn't) track merges.
>
> So you can see that Eric seems to believe quite strongly that the
> failure to track file renames is as fundamental an error as what he
> terms the "First Wrong Assumption". He later admits that the idea is
> controversial, and that people are still "grapling" with it, but I
> think he's tipped his hand about what he believes the ultimate correct
> answer is with respect to this issue.
What I'd like to see in the next version of "Understanding Version
Control Systems" is more focus on the _issue_ of managing
renames than on one specific solution to this problem. And I very much
would like to see 'rename detection' mentioned...
But I think that the issue of renames is not the main point. The main
point is that in modern VCS _merging_ has to be easy[1], from which
naturally follows that VCS needs intelligent merge which can deal well
with file renames. Managing renames is needed for easy merging; all
else is glitter.
Or, from another point of view, the important thing is _branching_:
both creating branches and merging branches (and having a
large number of branches, and being able to delete branches, and having
local (unpublished) and global (published) branches, etc.).
BTW, here is an excerpt from Junio C. Hamano's blog post "FLOSS weekly #19
follow-up (3)", http://gitster.livejournal.com/9970.html
By the time the basic structure as we currently know has stabilized,
we had help from literally dozens of contributors to add many things
on top of the very original version:
[...]
* We did not envision that multiple branches in a single repository
would turn out to be such a useful way to work, and did not have
support for switching branches.
>
[...]
> I suspect Eric will disagree with me, but regardless of how he
> completes his paper, it will almost certainly end up taking sides one
> way or another on this controversy, at which point one side or the
> other of this particular disagreement will argue that Eric is really
> writing an advocacy paper pushing Bzr, Mercurial, or Git (depending on
> how he comes out on this issue).
I think, and I hope, that Eric would manage to keep proper scientific
decorum[2], balancing or at least mentioning all problems and all
possible solutions, even if he is biased, and even if this bias shows
(hopefully a little).
[2] The thing that distinguishes true science from cargo-cult science
(pseudo-science), which shows only arguments "for".
>
>
> Your suggestion that the proof is going to be in the code makes a lot
> of sense.
I was thinking more that having 'use case' examples would
be cleaner. It would also make it possible to test against them...
> The examples I would suggest that we create, and then
> demonstrate (or make enhancements to git) so that it can handle these
> real world examples are:
>
> 1) In branch A, the directory src/plugin/innodb-experimental is
> renamed to src/plugin/innodb, and in branch B, a commit (i)
> modifies a file src/plugin/innodb-experimental/table.c, and (ii)
> creates a file src/plugin/innodb-experimental/mod-schema.c. This
> commit in branch B is then pulled into branch A, where the
> directory rename has taken place. The user may not know that a
> directory rename had taken place under the covers, so they don't
> give any magic options when they run the "git cherry-pick" or "git
> merge" command. Does the right thing happen such that the right
> file in src/plugin/innodb is modified, and the new file is created
> in src/plugin/innodb, even though in the original commit, the
> changes were made to files in src/plugin/innodb-experimental?
This (or similar, at least) example you can find in 'Tests for
"Understanding Version Control" by Eric S. Raymond' subthread...
>
> 2) And does the right thing happen if the situation is as described
> above, but in, branch C, which is descended from branch B, a new
> directory, src/plugin/innodb-experimental is created, such that
> src/plugin/innodb and src/plugin/innodb-experimental both exist.
> Now the same commit from branch A is pulled into branch C. Will
> the correct thing happen in that the correct files in
> src/plugin/innodb are modified and created, even though there is a
> new directory containing a completely unrelated plugin that happens
> to have the name, "innodb-experimental"?
Errr... I think that you confused branch 'B' (with innodb-experimental)
with branch 'A' (with innodb only) here.
>
> BTW, it has been asserted that there exists at least one major open
> source project where this sort of thing happens quite often, and
> the fact that git did not do the right thing in these conditions
> was a factor their choosing another DSCM.
I think that they should change their filesystem hierarchy naming
conventions and/or use branches more. But that is not terribly
relevant...
>
> > Or "Detecting [Wholesame] Directory Renames"... which can be done
> > using 'rename detection' paradigm, and we have patches to prove it![4]
> > but unfortunately code didn't made it (yet!) into git. And it can,
> > I think, deal with splitting files into two directories, something
> > which I guess in 'container identity' (directory-id) based solution
> > is simply impossible
>
> It may be that Yann Dirson's patches will handle case (1) above.
> Handling case (2) is much harder, especially without slowing
> everything down massively, since it would effectively mean needing to
> looking for directory renames along every single commit on the branch.
> (This would obviously have to be cached in some cache file.)
Well, I think it would be a bit simpler: for each _new_ file in a merge
you have to see where the other files from the directory it was created in
have gone. But I agree that it would be costly; perhaps it should be triggered by a
separate option / config, like diff.renames = copies?
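A rough Python sketch of that per-new-file lookup, under stated
assumptions: file renames have already been detected and are given as a
dict, siblings that stayed in place are ignored, and guess_directory is
a hypothetical name, not an existing git function. An evenly split
directory is reported as ambiguous rather than guessed.

```python
from collections import Counter
import posixpath


def guess_directory(new_file, file_renames):
    """Vote on a destination using where the new file's former siblings went."""
    directory = posixpath.dirname(new_file)
    votes = Counter(
        posixpath.dirname(new)
        for old, new in file_renames.items()
        if posixpath.dirname(old) == directory
    )
    if not votes:
        return directory  # no sibling moved: leave the file where it was created
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # directory was split evenly: ambiguous, ask the user
    return ranked[0][0]


# lib/ was split: two files went to core/, one to util/.
renames = {
    "lib/a.c": "core/a.c",
    "lib/b.c": "core/b.c",
    "lib/c.c": "util/c.c",
}
destination = guess_directory("lib/new.c", renames)
```

Only the siblings of each new file are inspected, which is cheaper than
scanning every commit on the branch, but it is still extra work per
merge, hence the idea of hiding it behind an option.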
>
> It can be done, I'm sure, but it would require a lot of code to get
> right. Whether or not it's worth it is a question which is open to
> debate, but I believe the bzr folks have asserted that bzr can handle
> both cases (1) and (2) above, and there are some folks who apparently
> care.
On the other hand I think that fundamentally 'container identity'
solutions cannot deal with the case of splitting contents (or the reverse,
joining contents), e.g. splitting a file into smaller files, or splitting
a directory into a few directories (grouping files). And from what
I understand, at least the current implementations of the 'file-id' solution
have problems with repeated merging in the case of an independently added
file.
>
> Whether or not a particular open source project will really and truly
> run into this problem is a different question, and one can argue that
> renaming plugins, and then creating new plugins with the same name as
> older plugins that have since been renamed will lead to programmer
> confusion, and so that's a good enough reason to avoid doing such
> crazy things. Unfortunately, you know how some programmers
> are.... telling someone they shouldn't do something is often an
> invitation to do exactly what you tell them is a bad idea, and then
> they complain when your filesystem or your DSCM doesn't handle that
> case particularly gracefully.
>
> - Ted
Final words: there is no race. We aren't here to achieve world
domination. Sometimes one SCM, with its different choices, might
be a better solution than another. For example, if you have large
media files, then a centralized SCM with partial checkout support might
be the best choice. Another example is how we on #git (sic!) pointed
people from IPsec, who wanted each commit to be signed or equivalent,
towards Monotone.
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 11:23 ` Jakub Narebski
@ 2009-02-05 13:16 ` Theodore Tso
2009-02-05 17:36 ` Jakub Narebski
0 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2009-02-05 13:16 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Eric S. Raymond
On Thu, Feb 05, 2009 at 12:23:37PM +0100, Jakub Narebski wrote:
> >
> > 2) And does the right thing happen if the situation is as described
> > above, but in, branch C, which is descended from branch B, a new
> > directory, src/plugin/innodb-experimental is created, such that
> > src/plugin/innodb and src/plugin/innodb-experimental both exist.
> > Now the same commit from branch A is pulled into branch C. Will
> > the correct thing happen in that the correct files in
> > src/plugin/innodb are modified and created, even though there is a
> > new directory containing a completely unrelated plugin that happens
> > to have the name, "innodb-experimental"?
>
> Errr... I think that you confused branch 'B' (with innodb-experimental)
> with branch 'A' (with innodb only) here.
>
No, I didn't. Let me try again.
At time T: Project grows a plugin in directory src/plugins/foo-new
At time T+1: Project releases a stable release, and branches off "maint"
At time T+2: Project renames the plugin to be src/plugins/foo, using
"scm mvdir src/plugins/foo-new src/plugins/foo" on the
devel branch:
At time T+3: A developer wants to implement a new experimental
'foo-new' plug in so she creates a completely new
src/plugins/foo-new directory. At this point the
devel branch has 'src/plugins/foo' and
'src/plugins/foo-new', where src/plugins/foo contains
the plugin which is in the directory
src/plugins/foo-new on the maint branch (since the
maint branch branched off before the directory
renames started happening).
At time T+4: A fix goes into the maint branch that modifies
src/plugins/foo-new/interface.c. The fix needs to be
pulled into the devel branch. Does the right thing
happen? (Suppose "interface.c" is a commonly used
filename in all plugins and exists in both the 'foo'
and 'foo-new' directories on the devel branch. Does
the SCM figure out what is the correct file to
modify?)
At Time T+5: A commit goes into the maint branch which creates a
new file, src/plugins/foo-new/table.c, and modifies
src/plugins/foo-new/Makefile to compile table.c.
Which directory does the SCM drop table.c into?
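
To make the timeline concrete, here is a toy Python model of what a
directory-identity scheme would have to do. The Tree class and its
methods are hypothetical illustrations, not the design of bzr or any
other real SCM:

```python
class Tree:
    """Toy working tree in which every directory has a stable identity."""

    def __init__(self):
        self.dir_ids = {}   # current path -> directory id
        self._next_id = 1

    def mkdir(self, path):
        self.dir_ids[path] = self._next_id
        self._next_id += 1
        return self.dir_ids[path]

    def mvdir(self, old, new):
        # The identity travels with the directory; only the name changes.
        self.dir_ids[new] = self.dir_ids.pop(old)

    def path_of(self, dir_id):
        for path, known_id in self.dir_ids.items():
            if known_id == dir_id:
                return path
        raise KeyError(dir_id)

    def copy(self):
        # Simplification: ids stay unique only because maint creates no
        # directories of its own in this scenario.
        clone = Tree()
        clone.dir_ids = dict(self.dir_ids)
        clone._next_id = self._next_id
        return clone


devel = Tree()
plugin_id = devel.mkdir("src/plugins/foo-new")          # time T
maint = devel.copy()                                    # time T+1: branch off "maint"
devel.mvdir("src/plugins/foo-new", "src/plugins/foo")   # time T+2
devel.mkdir("src/plugins/foo-new")                      # time T+3: unrelated plugin

# Time T+4: maint changes foo-new/interface.c; resolve the target on devel.
touched_id = maint.dir_ids["src/plugins/foo-new"]
by_identity = devel.path_of(touched_id) + "/interface.c"   # follows the rename
by_path = "src/plugins/foo-new/interface.c"                # naive path lookup
```

Under this model the same lookup would also answer the T+5 question:
table.c goes wherever the directory carrying the old identity now lives.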
The point is that if the project is organized around plugins, which are
considered bundles of code written in a modular way, and there is a
desire to rename directories which are the top-level modules, an SCM
that can easily deal with directory renames is important. In practice
this doesn't come up in the Linux kernel, and many other OSS projects,
and if the project's development style doesn't do directory
reorganizations often, then this isn't an issue. If an OSS project
does do this type of reorganization more frequently, then the argument
"don't do that" would seem to be an unnecessary restriction.
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 6:24 ` Junio C Hamano
@ 2009-02-05 13:28 ` Theodore Tso
2009-02-05 23:06 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2009-02-05 13:28 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jakub Narebski, git, Eric S. Raymond
On Wed, Feb 04, 2009 at 10:24:57PM -0800, Junio C Hamano wrote:
> All the while, the development community started discussing how the source
> tree should be organized to support multiple backends, and you learned
> that the plan is to have one directory per larger backend, while keeping
> single file ones in <db/*.c>. Specifically, you learned that innodb
> related code will be stored in <innodb/*.c>, and there may be other
> <somedb/*.c> and <someotherdb/*.c> groups added, but you are not
> interested in anything but enhancing innodb support.
>
> You rename "scm mv db innodb" and then add <innodb/enhanced.c>, or perhaps
> you may have done it the other way, i.e. added <db/enhanced.c> and then
> renamed "scm mv db innodb".
The argument would be that for an SCM that properly tracked user
intentions, you did the wrong thing. If the SCM properly understood
directory renames, there is a big difference between this:
scm mvdir db innodb
and this
scm mv db/* innodb
You see? The first moves the *directory* db to innodb. The second
moves all of the *files* that are in db to a new directory, innodb.
If, in your example, you had learned that the goal was to keep single
file ones in <db/*.c>, and larger backends in <innodb/*.c>, the
correct thing to tell the SCM is *not* to rename the directory db to
innodb, but rather, to move all of the files currently in <db/*.c>,
which implement innodb, into the innodb directory. If an SCM properly
handles directory renames, it would distinguish between these two
cases and record them differently, since each implies a different
intention about what should happen to new files created in <db/*.c> in
other branches when it comes time to merge them.
Of course, this distinction does not exist in git, because we track
content only. And a number of other SCMs like Hg, which only track
file renames, wouldn't get this right either. In order to get this
right, you need to treat directory renames as separate and distinct
operations from file renames, because they have different merge
implications.
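
As a rough Python sketch of that difference, assume a hypothetical SCM
that records "mvdir" and "mv" as distinct operations. The function
place_new_file, the tuple layout and the file names are made up for
illustration:

```python
def place_new_file(path, recorded_ops):
    """Map a path added on the other branch through recorded operations.

    recorded_ops: list of (kind, source, destination) tuples, where kind is
    "mvdir" (the directory itself was renamed) or "mv" (one file was moved).
    """
    for kind, src, dst in recorded_ops:
        if kind == "mvdir" and path.startswith(src + "/"):
            # New files follow the renamed directory.
            path = dst + path[len(src):]
        elif kind == "mv" and path == src:
            # Only this exact file moved; siblings stay where they are.
            path = dst
    return path


dir_rename = [("mvdir", "db", "innodb")]
file_moves = [("mv", "db/innodb.c", "innodb/innodb.c"),
              ("mv", "db/btree.c", "innodb/btree.c")]

followed = place_new_file("db/gdbm.c", dir_rename)   # "innodb/gdbm.c"
stayed = place_new_file("db/gdbm.c", file_moves)     # "db/gdbm.c"
```

The two histories leave identical trees behind, yet a file added to db/
on another branch ends up in different places.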
> See how that argument is flawed? The point of my example is that the line
> between your example (1) and (2) in the previous message is blurry.
It's blurry if you don't properly make the distinction between file
and directory renames, yes. An SCM that only handles file renames
can't record the difference between "move all the files in directory
foo to bar" and "rename directory foo to bar". Just as an SCM (like
git) that only handles content can't tell the difference between "move
all of the lines of content from foo.c to bar.c" and "rename foo.c to
bar.c".
Our argument for git is that with sufficiently smart merge algorithms
it doesn't matter, since we can intuit the right thing at merge time.
However, your argument that it's not possible to determine whether the
new file should appear as db/gdbm.c or innodb/gdbm.c is an argument
that content-tracking alone isn't enough.
Personally, I think the scenario I used of renaming plugins is more
likely than the sort of source reorganization which you've posited,
but I agree they are both possible scenarios. The question for git
development is whether these sorts of issues are ones that we should
try to handle or not? After all, one possibility is just to tell
people that if they are folks who like to go wild with source tree
reorganizations all the time, they should go to some other SCM like
bzr or Hg; that in git's view, the costs of being able to handle
random file and directory renames isn't worth the benefits for what is
normally a rare occurrence (and if it's happening all the time, the
project is probably doing something else wrong....)
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 13:16 ` Theodore Tso
@ 2009-02-05 17:36 ` Jakub Narebski
2009-02-05 21:45 ` Theodore Tso
0 siblings, 1 reply; 21+ messages in thread
From: Jakub Narebski @ 2009-02-05 17:36 UTC (permalink / raw)
To: Theodore Tso; +Cc: git, Eric S. Raymond
On Thu, Feb 05, 2009, Theodore Tso wrote:
> On Thu, Feb 05, 2009 at 12:23:37PM +0100, Jakub Narebski wrote:
> > >
> > > 2) And does the right thing happen if the situation is as described
> > > above, but in, branch C, which is descended from branch B, a new
> > > directory, src/plugin/innodb-experimental is created, such that
> > > src/plugin/innodb and src/plugin/innodb-experimental both exist.
> > > Now the same commit from branch A is pulled into branch C. Will
> > > the correct thing happen in that the correct files in
> > > src/plugin/innodb are modified and created, even though there is a
> > > new directory containing a completely unrelated plugin that happens
> > > to have the name, "innodb-experimental"?
> >
> > Errr... I think that you confused branch 'B' (with innodb-experimental)
> > with branch 'A' (with innodb only) here.
> >
>
> No, I didn't. Let me try again.
>
> At time T: Project grows a plugin in directory src/plugins/foo-new
>
> At time T+1: Project releases a stable release, and branches off "maint"
>
> At time T+2: Project renames the plugin to be src/plugins/foo, using
> "scm mvdir src/plugins/foo-new src/plugins/foo" on the
> devel branch:
And it is on branch 'A' that it happens. But it doesn't matter...
The example is of the 'independent add' case (same filename, different
contents) that I put in "Tests for...", but for a directory and not
for a filename. Well, slightly more complicated than that...
What I wonder is how a directory-id solution deals with a situation
where (for example due to some reorganization) there once was a single
directory (e.g. lib/) and now there are two (include/ and src/); how
would it deal with a new file in the old directory, hmmm...?
--
Jakub Narebski
Poland
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 17:36 ` Jakub Narebski
@ 2009-02-05 21:45 ` Theodore Tso
0 siblings, 0 replies; 21+ messages in thread
From: Theodore Tso @ 2009-02-05 21:45 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Eric S. Raymond
On Thu, Feb 05, 2009 at 06:36:42PM +0100, Jakub Narebski wrote:
> What I wonder is how directory-id solution deals with situation
> where (for example die to some reorganization) where once was single
> directory (e.g. lib/) now there are two (include/ and src/); how it
> would deal with the new file at old directory, hmmm...?
In that case, it wouldn't be a directory rename, it would be a series
of file moves. So in a hypothetical scm that recorded all of these
sorts of things, you'd have something like this:
scm mv lib/*.c src
scm mv lib/*.h include
scm rmdir lib
Now if you try merging in a commit that creates files in lib (e.g.,
creates lib/foo.c and lib/foo.h and modifies lib/Makefile), presumably
either a super smart heuristic algorithm might be able to figure out
the pattern and drop the new files in src and include --- or, more
likely, it would flag a merge conflict and ask the user to figure it
out by hand.
So yes, there will always be cases where directory-id won't be able to
handle a hypothetical source tree reorganization. It really only
helps in the case where you are doing a true, full move of the
directory, i.e.:
scm mvdir src/plugin/innodb src/plugin/innodb-legacy
scm mvdir src/plugin/innodb-experimental src/plugin/innodb
- Ted
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-05 13:28 ` Theodore Tso
@ 2009-02-05 23:06 ` Junio C Hamano
0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2009-02-05 23:06 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jakub Narebski, git, Eric S. Raymond
Theodore Tso <tytso@mit.edu> writes:
> The argument would be that for SCM that properly tracked user
> intentions, you did the wrong thing. If the SCM properly understood
> directory renames, there is a big differene between this:
>
> scm mvdir db innodb
>
> and this
>
> scm mv db/* innodb
>
> You see? The first moves the *directory* db to innodb. The second
Then please s/scm mv/scm mvdir/ before reading my example. The
hypothetical "scm mv" command in my example just knew it was fed a
directory and interpreted it as an intention to move the directory, not
all its contents.
> Of course, this distinction does not exist in git, because we track
> content only. And a number of other SCM's like Hg, which only track
> file renames, wouldn't get this right either. In order to get this
> right, you need to treat directory renames as separate and distinct
> operations from file renames, because they have different merge
> implications.
> ...
>> See how that argument is flawed? The point of my example is that the line
>> between your example (1) and (2) in the previous message is blurry.
>
> It's blurry if you don't properly make the distinction between file
> and directory renames, yes.
My point was even if you (in the example, who said "I want to move db
directory to innodb directory") had two different operations, it is not
enough, because you cannot capture that "I want to move db directory to
innodb directory" was contingent on "because I know everything in my db
directory should belong to innodb -- in fact in my history of db/, there
is nothing but innodb support". The other person you will eventually be
merging with may not share that precondition, as the project started out
to hold anything databasey in db/ and between the two branches being
merged, only you changed the semantics of what each directory means.
> However, your argument that it's not possible to determine whether the
> new file should appear as db/gdbm.c or innodb/gdm.c is an argument
> content-tracking alone isn't enough.
Yes, but it is stronger than that. It is not just "content-tracking alone"
is not enough. Even systems that have distinction between "scm mv" and
"scm mvdir" are not enough. That is what I was trying to illustrate.
Your plug-in example differentiates two cases, one in which the
renaming branch moved the directory and the other in which the branch moved
files under one directory to a new directory while keeping the original
directory, and the two cases should produce different results. If I
understand your argument correctly, it is that in the latter case the
outcome may be ambiguous, but in the former case, it is clear that the
intention of the renaming branch is to rename the directory itself and the
addition to the directory done in the other branch being merged should
automatically be done to the renamed directory while merging. Most
importantly, the argument makes the assumption that the intention of the
non-renaming branch (iow, why he added the new files in the directory)
does not matter and does not affect the outcome.
The source tree restructuring example I brought in questions that
assumption. It illustrates that the intention of the side that added the
new file matters. Is it an innodb support enhancement? Then it should
follow the renamer's intention to rename db/ to innodb/. Is it
adding something unrelated to the new meaning of "innodb" directory given
by the renaming side? Then it is very likely that it should not go to the
renamed innodb/ directory, even though we may not be able to decide where
it *should* go automatically. The point to consider is that recording the
renamer's intention to rename the directory and not just its contents is
not enough and does not help the merge.
> Personally, I think the scenario I used of renaming plugins is more
> likely that the sort of source reorganization which you've posited,
> but I agree they are both possible scenarios. The question for git
> development is whether these sorts of issues ar ones that we should
> try to handle or not?
That entirely depends on your definition of "handle", I think.
I personally think that it is better for the tool to make its best effort
but stop and let the human inspect the result if the validity of the
result is not so cut-and-dried, than blindly saying "the user said move
the directory, so I'll move the directory and move any and all new files
to it" and produce a potentially wrong result. My comment in the previous
message about <db/gdbm.c> was not that it is 100% correct to leave it
there, nor that it is 100% correct to move it elsewhere. The point was it is a
case you cannot say what is correct even with help from "scm mvdir", and
the tool should stop and ask for confirmation.
Boasting that "unlike git that does not record renames, we correctly
resolve this case automatically, because our superior design records the
user's intention to rename directory" is simply embarrassing yourself.
You may be silently producing a wrong result, which is nothing to boast
about.
I think it is Ok to assume that most of the time it is correct to move the
new file if the other branch "renamed" the directory in the situation the
example depicts, and I do not mind if the "best effort" the tool makes is
to move it to make it easy for the user to say "Yup, that is the right
outcome" and conclude the merge, but I think a tool is broken if it does
not give the user an opportunity to examine the situation and say "Oh, no,
that is not correct in *this* case" and fix it up.
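
One way to picture that behaviour, as a small Python sketch: the tool
makes its best guess but marks the result as needing confirmation.
Placement and propose_placement are hypothetical names and do not
describe how git or any other SCM reports merge results:

```python
from dataclasses import dataclass


@dataclass
class Placement:
    path: str            # where the tool proposes to put the new file
    needs_review: bool   # True when the user must confirm before committing
    reason: str


def propose_placement(new_file, renamed_dirs):
    """renamed_dirs: dict of old directory -> new directory on the other side."""
    directory, _, name = new_file.rpartition("/")
    if directory in renamed_dirs:
        target = renamed_dirs[directory] + "/" + name
        # Best effort: follow the rename, but do not conclude the merge silently.
        return Placement(target, True,
                         f"{directory}/ was renamed to {renamed_dirs[directory]}/")
    return Placement(new_file, False, "directory unchanged")


result = propose_placement("db/gdbm.c", {"db": "innodb"})
```

Here result.path is the suggestion, and result.needs_review is what keeps
the merge from finishing until somebody has said "Yup" or "Oh, no".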
* Re: Comments on "Understanding Version Control" by Eric S. Raymond
2009-02-02 18:48 Comments on "Understanding Version Control" by Eric S. Raymond Jakub Narebski
2009-02-02 20:24 ` Theodore Tso
2009-02-04 22:14 ` Tests for " Jakub Narebski
@ 2009-02-10 1:20 ` Jakub Narebski
2 siblings, 0 replies; 21+ messages in thread
From: Jakub Narebski @ 2009-02-10 1:20 UTC (permalink / raw)
To: git; +Cc: Eric S. Raymond, Theodore Tso
On Mon, 2 Feb 2009, Jakub Narebski wrote:
UVC = "Understanding Version-Control Systems" (draft),
http://www.catb.org/esr/writings/version-control/version-control.html
> UVC> = What, if anything, have we learned from history? =
> UVC>
> UVC> There's a folk saying that "It's not what you don't know that
> UVC> hurts you, it's what you think you know that ain't so." In
> UVC> examining the pattern of development of VCSes, it seems to me
> UVC> that the this sub-field of computer science has been less
> UVC> hampered than most by difficulties in finding appropriate
> UVC> techniques, but more hampered than most by wrong assumptions that
> UVC> hung on far longer than they should have.
> UVC>
> UVC> First wrong assumption: Conflict resolution by merging is
> UVC> intractably difficult, so we'll have to settle for locking. It
> UVC> took at least fifteen and arguably twenty years for VCS designers
> UVC> to get shut of that one. But it's historical now.
> UVC>
> UVC> Second wrong assumption: Change history representation as a
> UVC> snapshot sequence is perfectly dual to the representation as
> UVC> change/add/delete/rename sequences.. This folk theorem is well
> UVC> expressed in the 2004 essay "On Arch and Subversion"[3]. It is
> UVC> appealing, widely held, and dead wrong.
> UVC>
> UVC> File renames break the apparent symmetry. The failure of
> UVC> snapshot-based models to correctly address this has caused
> UVC> endless design failures, subtle bugs, and user misery.
>
> It is not true. Example of snapshot-based Git, which with its rename
> detection deals very well in practice with file renames contradict
> this theory. Bazaar which is supposedly snapshot-based, yet support
> "container identities" ('file-ids') contradict this further.
Now after thinking about this a bit, I reckon that the second wrong
assumption is not that the snapshot-sequence representation
is perfectly dual to the changeset representation, because in practice
(as in: merge doesn't take exponential time in history size) the two are.
It is not even the assumption that renames are not important, or in other
words not dealing correctly with renames and copies.
No, the second wrong assumption (if we want to phrase knowledge from
the history of version control in these terms) is not realizing that it
is _merging_ that has to be easy. Both to be able to do branching
(stable, development; feature branches), and for collaboration: the
distributed part of distributed version control systems (Linus'
"network of trust"). And an intelligent, rename-aware merge strategy
is a _necessary_ component for doing automated merges. Necessary,
and very important, but only a _component_.
That is what Subversion, at least up to Subversion 1.5, got wrong.
It made branching (or a facsimile / cheap imitation of branching)
easy, but it *didn't* make merging easy. Even in SVN 1.5 it is not,
from what I understand, very easy.
Easy merging is extremely important for DVCS in OSS development, as
a centralized VCS, with its need for commit rights, usually simply does
not scale up to the sizes required by larger OSS projects, especially
those with diverse developers.
P.S. By the way, the hgbook contains quite a good description of DVCS;
a description of the beginnings of Git can be found on the GitHistory page
of the Git Wiki; you can find the history of adding features and changing
the design and UI of Git in Junio C Hamano's "Git Chronicles", presented at
GitTogether'08.
--
Jakub Narebski
Poland