git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: [WIP PATCH] Manual rename correction
Date: Thu, 2 Aug 2012 18:58:44 -0400
Message-ID: <20120802225843.GA29208@sigill.intra.peff.net>
In-Reply-To: <7v6290di1m.fsf@alter.siamese.dyndns.org>

On Thu, Aug 02, 2012 at 03:51:17PM -0700, Junio C Hamano wrote:

> > On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:
> > ...
> >> When you move porn/0001.jpg in the preimage to naughty/00001.jpg in
> >> the postimage, they both can hit "*.jpg contentid=jpeg" line in the
> >> top-level .gitattributes file, and the contentid driver for jpeg type
> >> may strip exif and hash the remainder bits in the image to come up
> >> with a token you can use in a similar way as object ID is used in
> >> the exact rename detection phase.
> >> 
> >> Just thinking aloud.
> >
> > Ah, I see. That still feels like way too specific a use case to me. A
> > much more general use case to me would be a contentid driver which
> > splits the file into multiple chunks (which can be concatenated to
> > arrive at the original content), and marks chunks as "OK to delta" or
> > "not able to delta".  In other words, a content-specific version of the
> > bup-style splitting that people have proposed.
> >
> > Assuming we split a jpeg into its EXIF bits (+delta) and its image bits
> > (-delta), then you could do a fast rename or pack-objects comparison
> > between two such files (in fact, with chunked object storage,
> > pack-objects can avoid looking at the image parts at all).
> >
> > However, it may be the case that such "smart" splitting is not
> > necessary, as stupid and generic bup-style splitting may be enough. I
> > really need to start playing with the patches you wrote last year that
> > started in that direction.
> 
> I wasn't interested in "packing split object representation",
> actually.  The idea was still within the context of "rename".

But it would work for rename, too. If you want to compare two files, the
driver would give you back { sha1_exif (+delta), sha1_image (-delta) }
for each file. You know the size of each chunk and the size of the total
file.

Then you would just compare sha1_image for each entry. If they match,
then you have a lower bound on similarity of image_chunk_size /
total_size. If they don't, then you have an upper bound on similarity of
1-(image_chunk_size/total_size). In the former case, you can get the
exact similarity by doing a real delta on the sha1_exif content. In the
latter case, you can either exit early (if you are already below the
similarity threshold, which is likely), or possibly do the delta on the
sha1_exif content to get an exact value.
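
To make the bounds concrete, here is a minimal sketch in Python. It is not
code from git: the Chunked record, similarity_bounds() and can_exit_early()
are hypothetical names, and it assumes similarity is normalized by the larger
of the two file sizes and that a "-delta" chunk contributes nothing unless its
sha1 matches exactly.

```python
from typing import NamedTuple


class Chunked(NamedTuple):
    """Hypothetical driver output for one file: two chunks with sizes."""
    exif_sha1: str    # "+delta" chunk: worth running a real delta on
    exif_size: int
    image_sha1: str   # "-delta" chunk: compared by sha1 only
    image_size: int

    @property
    def total_size(self) -> int:
        return self.exif_size + self.image_size


def similarity_bounds(a: Chunked, b: Chunked) -> tuple:
    """Return (lower, upper) similarity bounds from sha1s and sizes alone."""
    larger = max(a.total_size, b.total_size)
    if larger == 0:
        return (1.0, 1.0)  # two empty files are identical
    # At most this many EXIF bytes could be shared between the two files.
    max_shared_exif = min(a.exif_size, b.exif_size)
    if a.image_sha1 == b.image_sha1:
        # Image chunks are identical, so those bytes are certainly shared.
        lower = a.image_size / larger
        upper = (a.image_size + max_shared_exif) / larger
    else:
        # Image chunks differ, so only EXIF bytes can count as shared.
        lower = 0.0
        upper = max_shared_exif / larger
    return (lower, upper)


def can_exit_early(a: Chunked, b: Chunked, threshold: float = 0.5) -> bool:
    """True if the pair cannot reach the threshold, so no delta is needed."""
    _, upper = similarity_bounds(a, b)
    return upper < threshold


old = Chunked("exif-1", 100, "image-1", 900)
moved = Chunked("exif-2", 100, "image-1", 900)     # same image, new EXIF
other = Chunked("exif-1", 100, "image-2", 900)     # same EXIF, other image

print(similarity_bounds(old, moved))
print(similarity_bounds(old, other), can_exit_early(old, other))
```

Only when the bounds straddle the threshold, or when an exact score is
wanted, would the delta on the sha1_exif content have to be computed.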

But either way, you never had to do a direct comparison between the big
image data; you only needed to know the sha1s. And as a bonus, if you
did want to cache results, you can have an O(# of blobs) cache of the
sha1s of the chunked form (because the information is immutable
for a given sha1 and content driver). Whereas by caching the result of
estimate_similarity, our worst-case cache is the square of that (because
we are storing sha1 pairs).
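
The difference in cache size can be sketched as follows. This is only an
illustration: chunk_ids() is a hypothetical stand-in for a content driver
(it splits at a fixed offset), and blob_sha1() hashes the raw bytes without
the object header that git itself adds.

```python
import hashlib
from itertools import combinations


def blob_sha1(content: bytes) -> str:
    # Plain SHA-1 of the bytes; an assumption made for brevity.
    return hashlib.sha1(content).hexdigest()


def chunk_ids(content: bytes, split_at: int) -> tuple:
    """Hypothetical driver: split at a fixed offset and hash each part."""
    head, tail = content[:split_at], content[split_at:]
    return ((blob_sha1(head), len(head)), (blob_sha1(tail), len(tail)))


def build_chunk_cache(blobs, split_at=4) -> dict:
    # One entry per blob: blob sha1 -> chunk ids. An entry never goes stale,
    # because a given sha1 and driver always yield the same chunks.
    return {blob_sha1(c): chunk_ids(c, split_at) for c in blobs}


def pair_cache_keys(blobs) -> list:
    # One key per unordered pair of blobs, as a cache of pairwise
    # similarity scores would need in the worst case.
    ids = sorted({blob_sha1(c) for c in blobs})
    return list(combinations(ids, 2))


blobs = [b"exifAAAA", b"exifBBBB", b"EXIFAAAA", b"exifCCCC"]
chunk_cache = build_chunk_cache(blobs)
pair_keys = pair_cache_keys(blobs)

# The per-blob cache grows linearly, the pair cache quadratically.
print(len(chunk_cache), len(pair_keys))
```

With n blobs the per-blob cache holds n entries, while the pair cache can
hold up to n*(n-1)/2.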

-Peff

