From: Andrew Sayers <andrew-git@pileofstuff.org>
To: Florian Achleitner <florian.achleitner2.6.31@gmail.com>
Cc: Jonathan Nieder <jrnieder@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Ramkumar Ramachandra <artagnon@gmail.com>,
	David Barr <davidbarr@google.com>,
	Sverre Rabbelier <srabbelier@gmail.com>,
	Dmitry Ivankov <divanorama@gmail.com>
Subject: Re: GSOC Proposal draft: git-remote-svn
Date: Thu, 12 Apr 2012 23:30:29 +0100
Message-ID: <4F875785.6040103@pileofstuff.org>
In-Reply-To: <2104868.dCxFQtJHdU@flomedio>

On 12/04/12 16:28, Florian Achleitner wrote:
> 
> I'm not sure if storing this in a separate directory tree makes sense, mostly 
> looking at performance. All these files will only contain some bytes, I guess.
> Andrew, why did you choose JSON?
> 

JSON has become my default storage format in recent years, so it seemed
like the natural thing to use for a format I wanted to chuck in and get
on with my work :)

JSON is my default format because it's reasonably space-efficient,
human-readable, widely supported and can represent everything I care
about except recursive data structures (which I didn't need for this
job).  You can do cleverer things if you don't mind being
language-specific (e.g. Perl's "Storable" module supports recursive data
structures but can't be used with other languages) or if you don't mind
needing special tools (e.g. git's index is highly efficient but can't be
debugged with `less`).  I've found you won't go far wrong if you start
with JSON and pick something else when the requirements become more obvious.

I gzipped the file because JSON isn't *that* space-efficient, and
because very large repositories are likely to produce enough JSON that
people will notice.  I found that gzipping the file significantly
reduced its size without having too much effect on run time.
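
A minimal sketch of the size trade-off described above. The record layout
here is hypothetical (the schema of the attached file is not shown in this
message); the point is only that repetitive JSON compresses well and
round-trips without loss.

```python
import gzip
import json

# Hypothetical records: the real file's schema is an assumption here.
revisions = [
    {"revision": n, "action": "copy", "from": "/trunk", "to": f"/branches/b{n}"}
    for n in range(1, 1001)
]

raw = json.dumps(revisions).encode("utf-8")
packed = gzip.compress(raw)

# Round-trip to confirm nothing is lost by compressing.
assert json.loads(gzip.decompress(packed)) == revisions

# Print the uncompressed and compressed sizes in bytes.
print(len(raw), len(packed))
```

The actual ratio depends heavily on how repetitive the real data is, so
treat the numbers from this sketch as illustrative only.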

I've attached a sample file representing the first few commits from the
GNU R repository.  The problem I referred to obliquely before isn't with
JSON, but with gzip - how would you add more revisions to the end of the
file without gunzipping it, adding one line, then gzipping it again?
One very nice feature of a directory structure is that you could store
it in git and get all that stuff for free.
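
One possible answer to the append question, offered as a sketch rather than
a recommendation: the gzip format allows several compressed members to be
concatenated in one file, and Python's gzip module reads them back as a
single stream. This assumes the file holds one JSON object per line rather
than a single JSON array (an assumption; the attachment may be laid out
differently). The function names and the file name are made up for
illustration.

```python
import gzip
import json
import os
import tempfile

def append_revision(path, record):
    # Mode "ab" writes a new gzip member after the existing ones;
    # nothing already in the file is decompressed or rewritten.
    with gzip.open(path, "ab") as f:
        f.write((json.dumps(record) + "\n").encode("utf-8"))

def read_revisions(path):
    # The gzip reader walks through all concatenated members in order.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "repo.jsonl.gz")
for n in range(1, 4):
    append_revision(path, {"revision": n})

records = read_revisions(path)
print(records)
```

The cost is that every append carries its own gzip header and trailer and
is compressed on its own, so many tiny appends compress worse than one
large file would.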

To be clear, I'm not pushing any particular solution to this problem,
just offering some anecdotal evidence.  I'm pretty sure that SVN branch
export is an I/O-bound problem - David Barr has said much the same about
svn-fe, but I was surprised to see I/O was still the bottleneck with a
program that stripped out almost all the data from the dump and pushed
it through not-particularly-optimised Perl.  Having said that, the
initial import problem (potentially hundreds of thousands of revisions
needing manual attention) doesn't necessarily want the same solution as
update (tens of revisions that can almost always be read automatically).

>>  . tracing history past branch creation events, using the now-saved
>>    copyfrom information.
>>
>>  . tracing second-parent history using svn:mergeinfo properties.
> 
> This is about detection when to create a git merge-commit, right?

Yes - SVN has always stored metadata about where a directory was copied
from (unlike git, which prefers to detect it automatically), and since
version 1.5, SVN has added "svn:mergeinfo" metadata to files and
directories specifying which revisions of which other files or
directories have been cherry-picked into them.

If you know a directory is a branch, "copyfrom" metadata is a very
useful signal for detecting branches created from it.  Unfortunately,
"svn:mergeinfo" is not as useful - aside from anything else, older
repositories often exhibit a period where there's no metadata at all,
then a gradual migration through SVN's early experiments with merge
tracking (like svnmerge.py), before everyone gradually standardises on
svn:mergeinfo and leaves the other tools behind.  Oh, and the interface
doesn't tell you about unmerged revisions, so if anybody ever forgets to
merge a revision then you'll probably never notice.
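
To make the shape of that metadata concrete, here is a small sketch that
parses an svn:mergeinfo property value. It assumes the usual layout of one
"path:rangelist" entry per line, with comma-separated revisions or ranges
and a trailing "*" marking a non-inheritable range. The function name and
the sample value are made up for illustration; this is not code from any
existing tool.

```python
def parse_mergeinfo(value):
    """Parse an svn:mergeinfo value into {path: [(start, end, inheritable)]}."""
    result = {}
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last colon, since the path itself may contain colons.
        path, _, rangelist = line.rpartition(":")
        ranges = []
        for item in rangelist.split(","):
            inheritable = not item.endswith("*")
            item = item.rstrip("*")
            start, _, end = item.partition("-")
            # A single revision "17" is stored as the range (17, 17).
            ranges.append((int(start), int(end or start), inheritable))
        result[path] = ranges
    return result

# Hypothetical property value, for illustration only.
example = "/branches/feature:10-14,17\n/trunk:3-9*"
merged = parse_mergeinfo(example)
print(merged)
```

Note that the parsed result only lists what was merged.  In the sample,
revisions 15 and 16 of /branches/feature are simply absent, and nothing in
the property says whether they were skipped on purpose or forgotten.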

I'm planning to tackle this stuff in the work I'm doing, but I expect
people will be reporting edge cases until the day the last SVN
repository shuts down.  You shouldn't need to worry about it much on the
git side of SBL, which is probably best for your sanity ;)

	- Andrew

[-- Attachment #2: repo.json.gz --]
[-- Type: application/x-gzip, Size: 466 bytes --]
