* Proposed design of fast-export helper
From: Ramkumar Ramachandra @ 2011-04-01 6:14 UTC
To: Jonathan Nieder; +Cc: Git Mailing List, Junio C Hamano
Hi Jonathan,
(+CC: Git List, Junio)
This is the proposed design for a new "fast-export helper library".
Normally, this would be an RFC patch; however, I'm not happy with the
design, and I'd like some input before starting off.
The Problem
-----------
svn-fi, a program to convert a git-fast-import stream to
a Subversion dumpfile, has hit a dead-end [1]. This is because it
doesn't know how to handle any `<dataref>` except 'inline' (this
appears inside `filemodify`, `notemodify`, `ls` and `cat-blob`
commands).
The other two kinds of `<dataref>` that exporters can produce are:
1. A mark reference (`:<idnum>`) set by a prior `blob` command
2. A full 40-byte SHA-1 of an existing Git blob object.
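For illustration (a made-up fragment, not from any real stream), the
same filemodify can legitimately show up in any of these three forms:

    M 100644 inline greeting.txt
    data 6
    hello

    M 100644 :7 greeting.txt

    M 100644 0123456789abcdef0123456789abcdef01234567 greeting.txt

(the SHA-1 above is just a placeholder). svn-fi can currently only cope
with the first form.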
The most naive solution would be to modify svn-fi to keep every blob
in memory as soon as it sees one after a `blob` command (this is the
approach that git2svn uses). However, this is expensive in memory and
does not scale. Also, svn-fi's job is lightweight parsing and
conversion from one format to another -- I don't want to clutter it up
with this additional complexity.
The other alternative, which svn-fi currently uses, is --inline-blobs [2]:
a modification to git-fast-export so that it only ever produces
inlined blobs. However, this has severe drawbacks, the main
one being that every exporter must implement it before it can be
relied upon. You also pointed out another problem: one blob may be
referenced multiple times in the same stream, especially when dealing
with cherry-picks and rebases (once branch support is added to
svn-fi); writing it out explicitly each time bloats the stream with
redundant data. At best,
this option can serve as a hint to git-fast-export to minimize
the work that the helper has to do.
Junio suggested using a fast-import-filter that can convert a
fast-import stream from the current format to one that contains only
inlined blobs [3]. This proposal differs from that, because I don't
like the idea of having to parse the data twice and do the same error
handling in two different places (svn-fi and the fast-import-filter).
The library's API
-----------------
I've thought of building a sort of library which applications can link
to. The API is as follows:
int write_blob(uint32_t, char *, size_t, FILE *);
int fetch_blob_mark(uint32_t, struct strbuf *);
int fetch_blob_sha1(char *sha1, struct strbuf *); /* sha1[20] */
The svn-fi parser should call write_blob when it encounters some data
that it wants to persist. The arguments are:
1. A mark with which the blob can later be recalled via fetch_blob_mark
(optional: pass 0 to omit).
2. The delimiter string, in the case of the delimited format. Should be
NULL when the format is non-delimited.
3. The size of the delimiter in the case of the delimited format;
otherwise, the size of the blob itself.
4. The FILE * to read the blob from, already positioned at the start
of the blob data.
The library then parses this data and dumps it into a storage backend
(described later) after computing its SHA1.
fetch_blob_mark and fetch_blob_sha1 can then be used to fetch blobs
by mark or SHA1. Fetching a blob by mark should be O(1), while
locating a blob by SHA1 will require a bisect of sorts: slightly
better than O(log n).
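To make the calling convention concrete, here is roughly what the
declarations might look like with argument names spelled out (the
names are illustrative, nothing is final):

    /* Sketch only; argument names are illustrative. */
    int write_blob(uint32_t mark,   /* 0 if the blob has no mark */
                   char *delim,     /* NULL for the exact-byte-count format */
                   size_t len,      /* delimiter length, or blob size */
                   FILE *src);      /* stream positioned at the blob data */

    int fetch_blob_mark(uint32_t mark, struct strbuf *out);
    int fetch_blob_sha1(char *sha1, struct strbuf *out); /* sha1[20] */

A caller in svn-fi would do something like write_blob(mark, NULL, size,
stdin) while parsing a 'blob' command, and fetch_blob_mark(mark, &buf)
when it later hits a filemodify that refers to :mark.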
How the library works
---------------------
It maintains a sorted list of (SHA1, off_t, off_t) triplets in a
buffer along with a 256-entry fanout table (call this blob_index) --
this is mmap'ed, and only munmap'ed at the end of the program. When
write_blob is invoked, the blob is read from the FILE, and its SHA1 is
computed. Then it is written to another big buffer (call this
blob_buffer) after an inexpensive zlib deflate, along with its deflated
size, and its (SHA1, offset1, offset2) is written to the blob_index --
the first offset refers to the blob_buffer number (there are many
blob_buffers), and the second offset refers to the offset within that
blob_buffer. No doubt this is an expensive table to maintain, but we
don't have a choice in the matter -- there's nothing in the spec
preventing a dataref from referring to blobs using their marks and
SHA1s interchangeably.
For marks, there is another marks_buffer which stores (uint32_t,
off_t, off_t) triplets.
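To put the layout in code, the entries might look something like this
(field names are only illustrative):

    /* One blob_index entry: kept sorted by SHA1 for bisection. */
    struct blob_index_entry {
        unsigned char sha1[20];
        off_t buffer_nr;   /* which blob_buffer holds the deflated blob */
        off_t offset;      /* offset of the deflated blob within it */
    };

    /*
     * 256-entry fanout, presumably in the same spirit as the pack
     * index: fanout[b] = number of entries whose first SHA1 byte
     * is <= b, so a lookup only bisects one small slice.
     */
    uint32_t fanout[256];

    /* marks_buffer entry: the mark plays the role of the key. */
    struct mark_index_entry {
        uint32_t mark;
        off_t buffer_nr;
        off_t offset;
    };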
So, what do you think?
-- Ram
[1]: http://thread.gmane.org/gmane.comp.version-control.git/170290
[2]: http://thread.gmane.org/gmane.comp.version-control.git/170290/focus=170292
[3]: http://thread.gmane.org/gmane.comp.version-control.git/165237/focus=165289
* Re: Proposed design of fast-export helper
From: Jonathan Nieder @ 2011-04-07 23:03 UTC
To: Ramkumar Ramachandra
Cc: Git Mailing List, Junio C Hamano, David Barr, Sverre Rabbelier
Hi,
Ramkumar Ramachandra wrote:
> The other two kinds of `<dataref>` that exporters can produce are:
> 1. A mark reference (`:<idnum>`) set by a prior `blob` command
> 2. A full 40-byte SHA-1 of an existing Git blob object.
The above is very git-specific --- arbitrary foreign vcs-es are
unlikely to all use 40-byte hashes as <dataref>. So far I've been
assuming that a <dataref> is sufficiently "nice" (not containing
spaces, NULs, quotation marks, or newlines nor starting with a colon).
It would be better to come up with a more formal rule and document it.
> The most naive solution would be to modify svn-fi to keep every blob
> in memory as soon as it sees one after a `blob` command (this is the
> approach that git2svn uses). However, this is expensive in memory and
> does not scale.
I suppose it also means trouble with large (>4 GiB) files on 32-bit
architectures. Of course git proper does not support those anyway.
[...]
> The other alternative, which svn-fi currently uses, is --inline-blobs [2]:
> a modification to git-fast-export so that it only ever produces
> inlined blobs. However, this has severe drawbacks
[... explanation of drawbacks to relying on "inline" data only ...]
> At best,
> this option can serve as a hint to git-fast-export to minimize
> the work that the helper has to do.
Yes.
> The library's API
> -----------------
> I've thought of building a sort of library which applications can link
> to. The API is as follows:
> int write_blob(uint32_t, char *, size_t, FILE *);
> int fetch_blob_mark(uint32_t, struct strbuf *);
> int fetch_blob_sha1(char *sha1, struct strbuf *); /* sha1[20] */
Hmm.
What the first function does is not obvious without argument names.
I had guessed it would write a blob to file. More nits:
- the type of marks in fast-import is uintmax_t (yech) --- they're
not restricted to 32-bit integers as far as I know.
- missing "const" before char *. :)
- lengths of subsets of files should be of type off_t, if we want
to allow 64-bit lengths on 32-bit architectures. If this is
git-specific, it would be more consistent to use "unsigned long".
- the size that toggles between size of a delimiter and length of
file seems strange to me. I'd rather have two separate functions.
- this requires the frontend to think in terms of git blob object
names, which doesn't fit well with most applications I'm thinking
of (think: fast-import backends in C that if written in python
would use python-fastimport).
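To illustrate (a sketch of one possible direction, certainly not a
demand), the revised signatures might look like:

    /* Hypothetical revision incorporating the nits above. */
    int write_blob(uintmax_t mark, off_t len, FILE *src);
    int write_blob_delimited(uintmax_t mark, const char *delim, FILE *src);

    /* Fetch by mark, or by whatever <dataref> string the stream used. */
    int fetch_blob_mark(uintmax_t mark, struct strbuf *out);
    int fetch_blob_dataref(const char *dataref, struct strbuf *out);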
I assume the delimited format works as in fast-import's "data" command
(and only supports blobs ending with LF)?
[...]
> The library then parses this data and dumps it into a storage backend
> (described later) after computing its SHA1.
Right, so a nice aspect of this exercise is that the nature of the
key/value store is hidden from the caller.
> fetch_blob_mark and fetch_blob_sha1 can then be used to fetch blobs
> by mark or SHA1. Fetching a blob by mark should be O(1), while
> locating a blob by SHA1 will require a bisect of sorts: slightly
> better than O(log n).
http://fanf.livejournal.com/101174.html
> How the library works
I wonder if it would be sensible to make it run as a separate process.
The upside: writing to and from pipes is easy in a variety of
programming languages (including the shell), even easier than calling
C code. So in particular that would make testing it easier. But
performance considerations might outweigh that.
I also wonder if it is possible or makes sense to make the API less
git-specific. If the buffers were in-memory, something like
set(key, value);
value = get(key);
would do. Since they are not, maybe something vaguely like
FILE *f = kvstore_fopen(key, O_WRONLY);
fwrite(value, sz, 1, f);
kvstore_fclose(f);
FILE *f = kvstore_fopen(key, O_RDONLY);
strbuf_fread(&value, SIZE_MAX, f);
kvstore_fclose(f);
would be something to aim for. For the getter case, fmemopen is
portable (in case one wants to just put the value in memory) but
fopencookie (in case one doesn't) is not, so the idea does not work as
nicely as one might like. And it's not quite the right abstraction
--- for a fast-import backend, I suppose the operations needed are:
* get length
* dump the value to a caller-specified FILE * or fd
* let the caller read the value one chunk or line at a time to
transform it (e.g., to escape special characters).
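Roughly (names invented on the spot), that could translate to
something like:

    off_t kvstore_length(const char *key);

    /* Write the whole value to an already-open file descriptor. */
    int kvstore_dump(const char *key, int fd);

    /* Read the value a chunk at a time, e.g. to escape it on the way out. */
    struct kvstore_reader *kvstore_open(const char *key);
    ssize_t kvstore_read(struct kvstore_reader *r, void *buf, size_t len);
    void kvstore_close(struct kvstore_reader *r);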
Is there prior art that this could mimic or reuse (so we can learn
from others' mistakes and make sure the API feels familiar)?
Hope that helps,
Jonathan
* Re: Proposed design of fast-export helper
From: Ramkumar Ramachandra @ 2011-04-08 5:33 UTC
To: Jonathan Nieder
Cc: Git Mailing List, Junio C Hamano, David Barr, Sverre Rabbelier
Hi Jonathan,
Jonathan Nieder writes:
> Ramkumar Ramachandra wrote:
> > The other two kinds of `<dataref>` that exporters can produce are:
> > 1. A mark reference (`:<idnum>`) set by a prior `blob` command
> > 2. A full 40-byte SHA-1 of an existing Git blob object.
>
> The above is very git-specific --- arbitrary foreign vcs-es are
> unlikely to all use 40-byte hashes as <dataref>. So far I've been
> assuming that a <dataref> is sufficiently "nice" (not containing
> spaces, NULs, quotation marks, or newlines nor starting with a colon).
>
> It would be better to come up with a more formal rule and document it.
Actually, we need to tighten this <dataref> thing before we build
anything else -- it's a nightmare to handle a stream that refers to the
same blob using a mark the first time, the SHA1 the second time, and
an MD5 the third time. How is our store supposed to know how to
index and retrieve blobs?
Next step: We should find out all the things <dataref> can currently
be, by looking at existing frontend implementations. Then, we should
tighten the spec so that it doesn't clobber any of those uses.
Also, we should find a way to let the backend know "how" to
index/retrieve a blob -- this is only straightforward in the case of marks.
> I assume the delimited format works as in fast-import's "data" command
> (and only supports blobs ending with LF)?
Yes. This is actually quite ugly to support -- we should probably
drop support for it.
Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 2c2ea12..1fb71f7 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -826,8 +826,8 @@ of the next line, even if `<raw>` did not end with an `LF`.
Delimited format::
A delimiter string is used to mark the end of the data.
fast-import will compute the length by searching for the delimiter.
- This format is primarily useful for testing and is not
- recommended for real data.
+ This format should only be used for testing; other
+ backends are not required to support it.
+
....
'data' SP '<<' <delim> LF
> > fetch_blob_mark and fetch_blob_sha1 can then be used to fetch blobs
> > by mark or SHA1. Fetching a blob by mark should be O(1), while
> > locating a blob by SHA1 will require a bisect of sorts: slightly
> > better than O(log n).
>
> http://fanf.livejournal.com/101174.html
Right, but this discussion is now useless, since keys can be just
about anything.
> > How the library works
>
> I wonder if it would be sensible to make it run as a separate process.
> The upside: writing to and from pipes is easy in a variety of
> programming languages (including the shell), even easier than calling
> C code. So in particular that would make testing it easier. But
> performance considerations might outweigh that.
Performance and portability considerations. Calling semantics will
probably be highly inelegant too, since full-blown bi-directional
communication is necessary.
> I also wonder if it is possible or makes sense to make the API less
> git-specific. If the buffers were in-memory, something like
>
> set(key, value);
> value = get(key);
>
> would do. Since they are not, maybe something vaguely like
>
> FILE *f = kvstore_fopen(key, O_WRONLY);
> fwrite(value, sz, 1, f);
> kvstore_fclose(f);
>
> FILE *f = kvstore_fopen(key, O_RDONLY);
> strbuf_fread(&value, SIZE_MAX, f);
> kvstore_fclose(f);
I don't like this. The caller should not have to know whether
blobs are persisted in memory or on disk. When there are a few small
frequently-used blobs, the key-value store might decide to keep them
in memory, and we should allow for this kind of optimization.
> would be something to aim for. For the getter case, fmemopen is
> portable (in case one wants to just put the value in memory) but
> fopencookie (in case one doesn't) is not, so the idea does not work as
> nicely as one might like. And it's not quite the right abstraction
> --- for a fast-import backend, I suppose the operations needed are:
>
> * get length
> * dump the value to a caller-specified FILE * or fd
> * let the caller read the value one chunk or line at a time to
> transform it (e.g., to escape special characters).
>
> Is there prior art that this could mimic or reuse (so we can learn
> from others' mistakes and make sure the API feels familiar)?
Kyoto Cabinet, or just about any key-value store for that matter. All
the prior discussion related to SHA1 is moot then, because the key can
be just about anything: the only option we have is to implement the
index as a data structure with a very high fanout, like a B+
tree. Obviously, this will be less efficient than a store which keys
everything using a fixed 20-byte SHA1 -- how much speed are we willing
to trade off for the sake of this simplicity?
-- Ram
* Re: Proposed design of fast-export helper
From: Jonathan Nieder @ 2011-04-08 6:47 UTC
To: Ramkumar Ramachandra
Cc: Git Mailing List, Junio C Hamano, David Barr, Sverre Rabbelier
Hi again,
Ramkumar Ramachandra wrote:
> Next step: We should find out all the things <dataref> can currently
> be, by looking at existing frontend implementations.
There are only a few places a <dataref> can come from.
* marks (":5"-style datarefs and from the marks file)
* output from the "ls" or "cat-blob" command
* out-of-band knowledge by the frontend about this specific backend
(e.g., "I know git fast-import supports git-style blob names, so I
will use them").
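To make that concrete: a mark is something the frontend itself
defines, e.g.

    blob
    mark :5
    data 4
    foo

    M 100644 :5 foo.txt

whereas an "ls" response hands a <dataref> chosen by the backend back
to the frontend, roughly (going by the fast-import documentation)

    <mode> SP ('blob' | 'tree' | 'commit') SP <dataref> HT <path> LF

so anything the backend prints there has to remain usable as a
<dataref> later in the stream.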
> Then, we should
> tighten the spec so that it doesn't clobber any of those uses.
> Also, we should find a way to let the backend know "how" to
> index/retrieve a blob -- this is only straightforward in the case of marks.
If this key/value store is specifically for use with fast-import
backends, I'd prefer it just deal with marks. Caching responses to
blob lookups that use a backend-specific <dataref> format is a
different problem.
>> I assume the delimited format works as in fast-import's "data" command
>> (and only supports blobs ending with LF)?
>
> Yes. This is actually quite ugly to support -- we should probably
> drop support for it.
>
> Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
>
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index 2c2ea12..1fb71f7 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -826,8 +826,8 @@ of the next line, even if `<raw>` did not end with an `LF`.
> Delimited format::
> A delimiter string is used to mark the end of the data.
> fast-import will compute the length by searching for the delimiter.
> - This format is primarily useful for testing and is not
> - recommended for real data.
> + This format should only be used for testing; other
> + backends are not required to support it.
That would mean git's fast-import test suite couldn't be used with
arbitrary fast-import backends. The delimited format is also very
convenient for testing "by hand". I'm not convinced it's that hard to
support.
> Performance and portability considerations. Calling semantics will
> probably be highly inelegant too, since full-blown bi-directional
> communication is necessary.
Takes commands on stdin, writes responses to stdout. Seems kind of
typical for helper programs, but you are right to mention that shell
scripts do not deal so well with that when the helper process needs
to handle multiple requests.
e.g., see /usr/share/debconf/confmodule.
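For example (the syntax here is entirely made up), a session might
look like:

    request:  put :5 6
              hello
    reply:    ok
    request:  get :5
    reply:    6
              hello

which is trivial to drive from C, but awkward from a shell script that
has to keep both pipes open across many requests.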
> Jonathan Nieder writes:
>> FILE *f = kvstore_fopen(key, O_WRONLY);
>> fwrite(value, sz, 1, f);
>> kvstore_fclose(f);
>>
>> FILE *f = kvstore_fopen(key, O_RDONLY);
>> strbuf_fread(&value, SIZE_MAX, f);
>> kvstore_fclose(f);
>
> I don't like this. The caller should not have to know whether
> blobs are persisted in memory or on disk. When there are a few small
> frequently-used blobs, the key-value store might decide to keep them
> in memory, and we should allow for this kind of optimization.
A FILE * can have a pipe underlying it, or a memory area (with
fmemopen). But I agree that using FILE * here would be a bad API
after all, for other reasons (the FILE * operations are just not quite
right).
>> Is there prior art that this could mimic or reuse (so we can learn
>> from others' mistakes and make sure the API feels familiar)?
>
> Kyoto Cabinet
Yikes, its API is complicated. :) Does Kyoto Cabinet support values
longer than a size_t can describe? Maybe that's not worth supporting
after all (I guess it would be nice to have an example of a 4 GiB file
in version control to motivate it first).
> Obviously, this will be less efficient than a store which keys
> everything using a fixed 20-byte SHA1
Sure, sticking to fixed-length, 20-byte keys could be reasonable if
that's what the application requires. Is this for caching or for
avoiding an internal marks table?