From: Michael J Gruber <git@drmicha.warpmail.net>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/3] fast textconv
Date: Sun, 28 Mar 2010 18:09:35 +0200
Message-ID: <4BAF7F3F.5020604@drmicha.warpmail.net>
In-Reply-To: <20100328145301.GA26213@coredump.intra.peff.net>
Jeff King venit, vidit, dixit 28.03.2010 16:53:
> The normal textconv procedure is to dump the binary file to a tempfile
> (optionally using a working tree file if available), then run the
> textconv helper to produce a textual version on stdout. This is a very
> convenient interface, as helpers don't need to be aware of git at all
> and many standard commands can be used without wrappers.
>
> Unfortunately, it can be slow for large binary files. We spool the file
> to disk before invoking the textconv helper, so the helper has no way to
> do any optimizations. For example, the helper may need only part of the
> file (e.g., when showing metadata at the beginning of a media file), or
> it may implement a caching scheme to avoid repeating expensive
> conversions.
>
> This series introduces a "fast textconv", which does not automatically
> spool a tempfile, but instead gives the helper program the sha1 of the
> blob to be converted.
>
> Here are some timings from my photo repository, on a commit with 37
> JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
> My textconv helper is a perl script that dumps the exif tags, and
> implements its own caching scheme.
>
> $ time git show >/dev/null ;# before patch
> real 0m13.818s
> user 0m12.137s
> sys 0m1.552s
>
> $ time git show >/dev/null ;# after patch, first run
> real 0m15.076s
> user 0m13.321s
> sys 0m1.772s
>
> $ time git show >/dev/null ;# after patch, subsequent runs
> real 0m2.502s
> user 0m1.820s
> sys 0m0.592s
>
> So you can see a 5.5x speedup. The first run is a little bit slower,
> presumably due to the extra git-cat-file calls by the helper.
>
> The speedup is purely from caching; I am not using the "we only need to
> read the first part of the file" optimization. My files are only a few
> megabytes. Probably that would be more useful for people storing files
> in the hundreds of megabytes, where a full cat-file will cause a lot of
> unwanted I/O.
>
> There are two things I'm still not 100% happy with:
>
> 1. 2.5 seconds is still a little slower than I would like. The slowness
> comes from the fact that my helper is written in perl, and therefore
> perl gets invoked for each diff. I could try collecting all of the
> to-be-textconv'd files at the beginning of the diff process and just
> invoking the helper once. But that means we need to store the
> results in core, and they could potentially be long (in my case,
> they are only a few hundred bytes, but somebody could potentially be
> textconv'ing large documents).
>
> 2. It is up to the helper to implement a caching layer. This offers a
> lot of flexibility, but it means each helper must implement its own.
> It also means we have to run the helper even for a cache hit, which
> causes slowness.
>
> An alternative would be for git to support textconv caching
> natively, probably by using the notes mechanism to map blob sha1's
> to their textconv'd contents. But that opens a whole can of worms
> with how the cache is managed. If I change my textconv helper to
> produce different results, how do I invalidate the cache? Should it
> happen automatically if I change the contents of
> diff.$method.textconv? Or do I need to do it manually (you will
> still need to do it manually if, e.g., you upgrade your textconv
> helper. Git can't know about that). How do I evict entries if the
> cache gets too large when notes are stored as a history?
Really, "Notes!" was my first thought even before reading 2. Happy to
have found a like mind :)
This would still need a mechanism where the conv helper gets the blob's
SHA1 - hey, it's there in your patch...
How about:
Set fasttextconv=notestextconv
notestextconv does the following:
- If $sha1 has a note in refs/notes/bikeshed, display it.
- If not, create one and then display it.
In fact, the creation could be done using the textconv setting!
Pruning the cache is done by deleting the refs/notes/bikeshed ref, or
by truncating its DAG (filter-branch...).
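To make the lookup-or-create step concrete, here is a rough sketch of
what such a notestextconv helper could look like. It is only an
illustration under assumptions, not part of the series: the function
names (notestextconv, run_textconv, read_note, write_note) are made up
for this example, and how the helper actually receives the blob's sha1
is whatever the fasttextconv patch defines. Only the git commands
themselves (git notes show/add, git cat-file blob) are existing ones.

```python
import subprocess
import tempfile

NOTES_REF = "refs/notes/bikeshed"  # ref name taken from the proposal above


def git(*args, stdin=None):
    """Run git and return (exit status, stdout as bytes)."""
    proc = subprocess.run(["git", *args], input=stdin,
                          stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    return proc.returncode, proc.stdout


def read_note(sha1):
    """Return the cached text for sha1, or None if it has no note yet."""
    status, out = git("notes", f"--ref={NOTES_REF}", "show", sha1)
    return out if status == 0 else None


def write_note(sha1, text):
    """Store text as the note for sha1; "-F -" reads the note from stdin.

    Caveat: git may normalize whitespace in the stored note, so the cached
    text is not guaranteed to be byte-identical to the converter output.
    """
    git("notes", f"--ref={NOTES_REF}", "add", "-f", "-F", "-", sha1,
        stdin=text)


def run_textconv(sha1, textconv_cmd):
    """Spool the blob to a tempfile and run the ordinary textconv command.

    textconv_cmd is an argument list; this wiring is an assumption made
    for the sketch, not something the patch series prescribes.
    """
    _, blob = git("cat-file", "blob", sha1)
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(blob)
        tmp.flush()
        result = subprocess.run([*textconv_cmd, tmp.name],
                                stdout=subprocess.PIPE)
    return result.stdout


def notestextconv(sha1, convert, read_note=read_note, write_note=write_note):
    """Return the text for sha1, running convert only on a cache miss."""
    text = read_note(sha1)
    if text is None:
        text = convert(sha1)
        write_note(sha1, text)
    return text
```

A caller would do something like
notestextconv(sha1, lambda s: run_textconv(s, ["my-exif-dump"])), where
my-exif-dump stands in for whatever the textconv setting names. Dropping
the whole cache then amounts to "git update-ref -d refs/notes/bikeshed".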
Cheers,
Michael