GSOC remote-svn: branch detection


* GSOC remote-svn: branch detection
@ 2012-08-03  9:43 Florian Achleitner
  2012-08-03 18:17 ` Jonathan Nieder
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Achleitner @ 2012-08-03  9:43 UTC (permalink / raw)
  To: git; +Cc: David Michael Barr, Jonathan Nieder, Andrew Sayers

Hi!

I'm playing around in vcs-svn/ to start a framework for detecting and 
processing branches in svndumps. So I wanted to let you know about my ideas.

Two approaches:
1. Import linearly and split later:
One idea is to import from svn linearly, i.e. one revision on top of its 
predecessor, like now, and detect and split branches afterwards. The svn 
metadata is stored in git notes, so the required information would be 
available.
+ allows recovery, because the linear history is always here.
+ it's easier to peek around in the git history than in the svn dump during 
import to do the branch detection.
- requires creation of new commits in the branch detection stage.
- this results in double commits and awkward history, linear vs. branched.

2. Split during import:
Detect branches as they are created while reading the svn dump and identify to 
which branch a following node belongs.
First step is to restructure svndump.c to be able to buffer one complete 
revision for inspection before starting to write a commit to fast import.
Probably it's possible to feed the blobs to fast import directly and only 
buffer node data and defer commit creation, but not the data.
Currently, at the beginning of a new revision on the svn side, a new commit is 
created on top of a constant ref. Once we support branches, we don't know the 
ref, i.e. which branch(es) the revision changes, until all of its 'Node-*' 
lines have been read.
+ feels more 'right'
- requires revision buffering

Generally:
Detect branches as they are created by 'Node-copyfrom*' to some commonly used 
branch directories, like branches/. More complex branch detection can be 
implemented later, of course.
Store detected branches permanently (necessary for incremental fetches), and 
assign every file modification to one of those branches, if possible. Else 
assign them to, hm .. 
If a revision modifies more than one branch, create multiple commits.
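
A minimal sketch of the detection and assignment steps above, written in Python for brevity rather than in the C of vcs-svn/. The Node fields mirror the 'Node-*' headers of an svn dump; BRANCH_ROOTS, BranchMap and all function names are hypothetical, and treating trunk as a pre-existing branch is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Assumption: conventional layout; only direct children of these
# directories are treated as branches.
BRANCH_ROOTS = ("branches", "tags")


@dataclass
class Node:
    path: str                            # Node-path
    kind: str                            # Node-kind: "file" or "dir"
    action: str                          # Node-action: "add", "change", ...
    copyfrom_path: Optional[str] = None  # Node-copyfrom-path


@dataclass
class BranchMap:
    # Would have to be persisted between runs for incremental fetches.
    branches: Set[str] = field(default_factory=lambda: {"trunk"})

    def note_node(self, node: Node) -> None:
        """Record a branch if this node copies a directory to <root>/<name>."""
        parts = node.path.strip("/").split("/")
        if (node.kind == "dir" and node.action == "add"
                and node.copyfrom_path is not None
                and len(parts) == 2 and parts[0] in BRANCH_ROOTS):
            self.branches.add("/".join(parts))

    def branch_of(self, path: str) -> Optional[str]:
        """Return the known branch containing path, or None if there is none."""
        path = path.strip("/")
        for branch in self.branches:
            if path == branch or path.startswith(branch + "/"):
                return branch
        return None


def split_revision(bmap: BranchMap,
                   nodes: List[Node]) -> Dict[Optional[str], List[Node]]:
    """Group one revision's nodes by branch; each group would become a commit."""
    groups: Dict[Optional[str], List[Node]] = {}
    for node in nodes:
        bmap.note_node(node)
        groups.setdefault(bmap.branch_of(node.path), []).append(node)
    return groups
```

Nodes that fall outside every known branch end up under the None key; the sketch deliberately leaves open where those should go.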

Thanks for your comments and ideas! 

--
Florian


* Re: GSOC remote-svn: branch detection
  2012-08-03  9:43 GSOC remote-svn: branch detection Florian Achleitner
@ 2012-08-03 18:17 ` Jonathan Nieder
  2012-08-04  6:40   ` Dmitry Ivankov
  2012-08-04 18:23   ` Ramkumar Ramachandra
  0 siblings, 2 replies; 5+ messages in thread
From: Jonathan Nieder @ 2012-08-03 18:17 UTC (permalink / raw)
  To: Florian Achleitner
  Cc: git, David Michael Barr, Andrew Sayers, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain

Hi,

Florian Achleitner wrote:

> Two approaches:
> 1. Import linearly and split later:
> One idea is to import from svn linearly, i.e. one revision on top of its 
> predecessor, like now, and detect and split branches afterwards. The svn 
> metadata is stored in git notes, so the required information would be 
> available.
> + allows recovery, because the linear history is always here.
> + it's easier to peek around in the git history than in the svn dump during 
> import to do the branch detection.
> - requires creation of new commits in the branch detection stage.
> - this results in double commits and awkward history, linear vs. branched.

I don't think you've captured the real pros and cons here.

+ Divides responsibility between a component that fetches and a component
that splits branches, making for easier debugging, independent refactoring
of components, reuse in other contexts (e.g., splitting out branches in
other similar VCSen, etc)

- Divides responsibility between a component that fetches and a component
that splits branches, which is tricky because it involves designing an
interface between them and documenting it.  And maybe a different
interface would be better.

There are also performance and history-clarity ramifications as you've
mentioned, but they do not seem as important.

Hope that helps,
Jonathan

> 2. Split during import:


* Re: GSOC remote-svn: branch detection
  2012-08-03 18:17 ` Jonathan Nieder
@ 2012-08-04  6:40   ` Dmitry Ivankov
  2012-08-04 18:23   ` Ramkumar Ramachandra
  1 sibling, 0 replies; 5+ messages in thread
From: Dmitry Ivankov @ 2012-08-04  6:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Florian Achleitner, git, David Michael Barr, Andrew Sayers,
	Ramkumar Ramachandra, Sam Vilain

Hi,

On Sat, Aug 4, 2012 at 12:17 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Hi,
>
> Florian Achleitner wrote:
>
>> Two approaches:
>> 1. Import linearly and split later:
>> One idea is to import from svn linearly, i.e. one revision on top of its
>> predecessor, like now, and detect and split branches afterwards. The svn
>> metadata is stored in git notes, so the required information would be
>> available.
>> + allows recovery, because the linear history is always here.
This is a good one, but I'd put questions another way:
- do we want to query svn server only for newer revisions even if our
settings changed (branch layout ones for example), maybe we don't mind
some queries in settings change case (like git-svn.perl)?
- do we want to be able to filter svn history early (like take
trunk,branches,tags, skip tests_data as it's huge but sometimes there
are svn cp to/from it, or maybe the repo has weird permissions or even
is corrupted)?
- do we just want a completely separate (fast) (local) storage like
svn dump file to use it for imports and settings changes?

I personally still haven't decided on those. My set of pros/cons:
+ should be the simplest thing for simple small repos
+ keeps all the original data details and looks quite robust
- becomes complicated if we don't want or can't import some parts of
the history, while git-svn.perl somehow handles it.
- looks like a thing to store and access svn dump information, do we
really want it to be in a form of git objects (almost sure), how
stable, flexible, independent from svn helper should it be (that's
what Jonathan talks about).

Weird idea: what if we keep everything in one huge git tree like
rXX/{data,props,copy-from,..}/path/path/path/file. It should represent
all the known svn info so far. Ok, I know it's a late stage now and
this thing is completely raw, just posting to have it written out
somewhere :)
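
To make the layout concrete, here is a tiny sketch of how paths in such a tree might be formed. Only the rXX/{data,props,copy-from}/path shape comes from the description above; the function name, the SECTIONS tuple and the unpadded r<N> formatting are assumptions.

```python
import posixpath

# The ".." in the description is left open; only the named sections are listed.
SECTIONS = ("data", "props", "copy-from")


def tree_path(rev: int, section: str, svn_path: str) -> str:
    """Map (revision, kind of information, svn path) to a path in the big tree."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    return posixpath.join(f"r{rev}", section, svn_path.strip("/"))
```

For example, tree_path(12, "data", "/trunk/src/main.c") gives "r12/data/trunk/src/main.c".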

>> + it's easier to peek around in the git history than in the svn dump during
>> import to do the branch detection.
>> - requires creation of new commits in the branch detection stage.
>> - this results in double commits and awkward history, linear vs. branched.
>
> I don't think you've captured the real pros and cons here.
>
> + Divides responsibility between a component that fetches and a component
> that splits branches, making for easier debugging, independent refactoring
> of components, reuse in other contexts (e.g., splitting out branches in
> other similar VCSen, etc)
>
> - Divides responsibility between a component that fetches and a component
> that splits branches, which is tricky because it involves designing an
> interface between them and documenting it.  And maybe a different
> interface would be better.
>
> There are also performance and history-clarity ramifications as you've
> mentioned, but they do not seem as important.
>
> Hope that helps,
> Jonathan
>
>> 2. Split during import:


* Re: GSOC remote-svn: branch detection
  2012-08-03 18:17 ` Jonathan Nieder
  2012-08-04  6:40   ` Dmitry Ivankov
@ 2012-08-04 18:23   ` Ramkumar Ramachandra
  2012-08-07 21:26     ` Florian Achleitner
  1 sibling, 1 reply; 5+ messages in thread
From: Ramkumar Ramachandra @ 2012-08-04 18:23 UTC (permalink / raw)
  To: Florian Achleitner
  Cc: Jonathan Nieder, git, David Michael Barr, Andrew Sayers,
	Dmitry Ivankov, Sam Vilain

Hi,

Florian Achleitner wrote:
> 1. Import linearly and split later:

I think this approach will be a lot less messy if you can cleanly
separate the fetching component from the mapper.  Currently, svndump
re-creates the layout of the SVN repository.  And the series you
posted last week contains a patch that attaches a note with SVN
metadata to each commit.  Do you have thoughts on how the mapping will
take place?

Ram


* Re: GSOC remote-svn: branch detection
  2012-08-04 18:23   ` Ramkumar Ramachandra
@ 2012-08-07 21:26     ` Florian Achleitner
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Achleitner @ 2012-08-07 21:26 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Florian Achleitner, Jonathan Nieder, git, David Michael Barr,
	Andrew Sayers, Dmitry Ivankov, Sam Vilain

On Saturday 04 August 2012 23:53:58 Ramkumar Ramachandra wrote:
> Hi,
> 
> Florian Achleitner wrote:
> > 1. Import linearly and split later:
> I think this approach will be a lot less messy if you can cleanly
> separate the fetching component from the mapper.  Currently, svndump
> re-creates the layout of the SVN repository.  And the series you
> posted last week contains a patch that attaches a note with SVN
> metadata to each commit.  Do you have thoughts on how the mapping will
> take place?

The mapping itself is currently a black box for me; its internals could be 
rather complex. It could get a function like is_branch_start, that is called 
with a node ctx and tells if this is likely to be the start of a branch. The 
detected branches are stored and upcoming changes in the associated 
directories are mapped to a commit on a branch.
The detection of branch starts and the list of existing branches can be taken 
from whatever logic we want. So that's approx. the idea.

Currently I'm working on more basic preparations. I want to split the creation 
of commits and the creation of blobs in svndump.c.
This is necessary because fast import requires a branch name as an argument to 
the 'commit' command, and
currently a 'commit' command is started when a new revision is encountered in 
the svndump.
But to decide on which branch the commit should go, or even if it will be more 
than one commit, it is necessary to read all the nodes first.
To prevent buffering the node content, I want to replace the inline data format 
(currently used) by 'blob' commands.
While parsing the dump, every node change creates a blob command to feed the 
data immediately into fast-import while the node metadata (struct node_ctx) is 
stored at least until the revision ends. Then the blobs can be put on a linear 
master tree and other branch trees. The node metadata could also be read from 
notes, if remapping branches.
That's not so easy to do, because the current implementation mixes tree 
operations and blob operations heavily, and relies on only one global 
node_ctx.
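
A rough sketch of that split, in Python rather than C and not taken from svndump.c: each node's data goes out at once as a fast-import 'blob' command with a mark, and only (branch, path, mark) is kept until the revision ends, when one 'commit' per touched branch is written. RevisionBuffer and its methods are hypothetical names; the file mode is fixed at 100644, and the 'from' line needed when a branch starts is omitted.

```python
from typing import Dict, List, Tuple


class RevisionBuffer:
    """Emit blobs immediately; hold back only node metadata per revision."""

    def __init__(self) -> None:
        self.out: List[bytes] = []   # stands in for the pipe to fast-import
        self.next_mark = 1
        self.pending: List[Tuple[str, str, int]] = []  # (branch, path, mark)

    def add_node(self, branch: str, path: str, content: bytes) -> None:
        mark = self.next_mark
        self.next_mark += 1
        # 'blob' command: the data leaves right away and is never buffered.
        self.out.append(b"blob\nmark :%d\ndata %d\n%s\n"
                        % (mark, len(content), content))
        self.pending.append((branch, path, mark))

    def end_revision(self, committer: str, when: str, log: str) -> None:
        # Only now is it known which branches the revision touched.
        by_branch: Dict[str, List[Tuple[str, int]]] = {}
        for branch, path, mark in self.pending:
            by_branch.setdefault(branch, []).append((path, mark))
        msg = log.encode()
        for branch, files in by_branch.items():
            cmd = b"commit refs/heads/%s\n" % branch.encode()
            cmd += b"committer %s %s\n" % (committer.encode(), when.encode())
            cmd += b"data %d\n%s\n" % (len(msg), msg)
            for path, mark in files:
                # Refer to the already-sent blob by its mark.
                cmd += b"M 100644 :%d %s\n" % (mark, path.encode())
            self.out.append(cmd + b"\n")
        self.pending = []
```

Here the caller passes the branch and the path relative to it, so the branch decision itself stays outside the buffer.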

> 
> Ram

Flo

