From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Barret Rhoden <brho@google.com>
Cc: git@vger.kernel.org, "Michael Platings" <michael@platin.gs>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"David Kastrup" <dak@gnu.org>, "Jeff King" <peff@peff.net>,
"Jeff Smith" <whydoubt@gmail.com>,
"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
"Junio C Hamano" <gitster@pobox.com>,
"René Scharfe" <l.s.r@web.de>,
"Stefan Beller" <stefanbeller@gmail.com>
Subject: Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines
Date: Thu, 4 Apr 2019 18:37:07 +0200
Message-ID: <20190404163707.GP32732@szeder.dev>
In-Reply-To: <20190403160207.149174-7-brho@google.com>
On Wed, Apr 03, 2019 at 12:02:07PM -0400, Barret Rhoden wrote:
> diff --git a/blame.c b/blame.c
> index c06cbd906658..50511a300059 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -915,27 +915,109 @@ static int are_lines_adjacent(struct blame_line_tracker *first,
> first->s_lno + 1 == second->s_lno;
> }
>
> -/*
> - * This cheap heuristic assigns lines in the chunk to their relative location in
> - * the parent's chunk. Any additional lines are left with the target.
> +/* https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel */
> +static int bitcount(uint32_t v)
> +{
> + v = v - ((v >> 1) & 0x55555555u);
> + v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
> + return (((v + (v >> 4)) & 0xf0f0f0fu) * 0x1010101u) >> 24;
> +}
> +
> +#define FINGERPRINT_LENGTH (8 * 256)
> +#define FINGERPRINT_THRESHOLD 1
> +/* This is just a bitset indicating which byte pairs are present.
> + * e.g. the string "good goo" has pairs "go", "oo", "od", "d ", " g"
> + * String similarity is calculated as a bitwise or and counting the set bits.
> + *
> + * TODO for the string lengths we typically deal with, this would probably be
> + * implemented more efficiently with a set data structure.
> */
> +struct fingerprint {
> + uint32_t bits[FINGERPRINT_LENGTH];
> +};
> +
> +static void get_fingerprint(struct fingerprint *result, const char *line_begin,
> + const char *line_end)
> +{
> + for (const char *p = line_begin; p + 1 < line_end; ++p) {
We still stick to C89, which doesn't allow declaring a variable in a
for loop's initializer. Please declare the loop variable as a regular
local variable at the top of the enclosing block. This also applies to
the several 'for (int i = 0; ...)' loops in the functions below.
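For illustration, a minimal sketch of the C89-compatible form. The
helper name count_byte_pairs() is hypothetical and not part of the
patch; it only mirrors the loop bounds used in get_fingerprint():

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: counts the adjacent
 * byte pairs in [line_begin, line_end) using the same loop bounds
 * as get_fingerprint() in the quoted patch.
 */
static size_t count_byte_pairs(const char *line_begin, const char *line_end)
{
	const char *p;	/* declared up front, not inside the for statement */
	size_t nr = 0;

	for (p = line_begin; p + 1 < line_end; p++)
		nr++;
	return nr;
}
```

The same pattern would apply to the 'int i' and 'int j' counters: one
declaration at the top of the function, and a plain assignment in the
for statement.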
> + unsigned int c = tolower(*p) | (tolower(*(p + 1)) << 8);
> +
> + result->bits[c >> 5] |= 1u << (c & 0x1f);
> + }
> +}
> +
> +static int fingerprint_similarity(const struct fingerprint *a,
> + const struct fingerprint *b)
> +{
> + int intersection = 0;
> +
> + for (int i = 0; i < FINGERPRINT_LENGTH; ++i)
> + intersection += bitcount(a->bits[i] & b->bits[i]);
> + return intersection;
> +}
> +
> +static void get_chunk_fingerprints(struct fingerprint *fingerprints,
> + const char *content,
> + const int *line_starts,
> + long chunk_start,
> + long chunk_length)
> +{
> + line_starts += chunk_start;
> + for (int i = 0; i != chunk_length; ++i) {
> + const char *linestart = content + line_starts[i];
> + const char *lineend = content + line_starts[i + 1];
> +
> + get_fingerprint(fingerprints + i, linestart, lineend);
> + }
> +}
> +
> static void guess_line_blames(struct blame_entry *e,
> struct blame_origin *parent,
> struct blame_origin *target,
> int offset, int delta,
> struct blame_line_tracker *line_blames)
> {
> + struct fingerprint *fp_parent, *fp_target;
> int nr_parent_lines = e->num_lines - delta;
>
> + fp_parent = xcalloc(sizeof(struct fingerprint), nr_parent_lines);
> + fp_target = xcalloc(sizeof(struct fingerprint), e->num_lines);
> +
> + get_chunk_fingerprints(fp_parent, parent->file.ptr,
> + parent->line_starts,
> + e->s_lno + offset, nr_parent_lines);
> + get_chunk_fingerprints(fp_target, target->file.ptr,
> + target->line_starts,
> + e->s_lno, e->num_lines);
> +
> for (int i = 0; i < e->num_lines; i++) {
> - if (i < nr_parent_lines) {
> + int best_sim_val = FINGERPRINT_THRESHOLD;
> + int best_sim_idx = -1;
> + int sim;
> +
> + for (int j = 0; j < nr_parent_lines; j++) {
> + sim = fingerprint_similarity(&fp_target[i],
> + &fp_parent[j]);
> + if (sim < best_sim_val)
> + continue;
> + /* Break ties with the closest-to-target line number */
> + if (sim == best_sim_val && best_sim_idx != -1 &&
> + abs(best_sim_idx - i) < abs(j - i))
> + continue;
> + best_sim_val = sim;
> + best_sim_idx = j;
> + }
> + if (best_sim_idx >= 0) {
> line_blames[i].is_parent = 1;
> - line_blames[i].s_lno = e->s_lno + i + offset;
> + line_blames[i].s_lno = e->s_lno + offset + best_sim_idx;
> } else {
> line_blames[i].is_parent = 0;
> line_blames[i].s_lno = e->s_lno + i;
> }
> }
> +
> + free(fp_parent);
> + free(fp_target);
> }
>
> /*
> --
> 2.21.0.392.gf8f6787159e-goog
>
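To make the heuristic easier to discuss, here is a stand-alone sketch
of the byte-pair fingerprint described in the quoted comment, written
with C89-style declarations. It is an illustration under assumptions,
not the patch itself: the names fp_sketch, fp_fill(), fp_similarity()
and popcount32() are made up, the bit counting uses a simple
clear-lowest-bit loop rather than the parallel bitcount() from the
patch, and bytes are cast to unsigned char before tolower(). Note that
the quoted comment says "bitwise or", while the quoted code intersects
the two bitsets with '&'; the sketch follows the code.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* 65536 possible byte pairs, 32 bits per word */
#define FP_WORDS (8 * 256)

struct fp_sketch {
	uint32_t bits[FP_WORDS];
};

/* Simple popcount: clears the lowest set bit until none remain. */
static int popcount32(uint32_t v)
{
	int nr = 0;

	while (v) {
		v &= v - 1;
		nr++;
	}
	return nr;
}

/* Set one bit per distinct, case-folded byte pair in [begin, end). */
static void fp_fill(struct fp_sketch *fp, const char *begin, const char *end)
{
	const char *p;

	memset(fp, 0, sizeof(*fp));
	for (p = begin; p + 1 < end; p++) {
		/* cast first so that bytes >= 0x80 are valid tolower() input */
		unsigned int lo = (unsigned int)tolower((unsigned char)p[0]);
		unsigned int hi = (unsigned int)tolower((unsigned char)p[1]);
		unsigned int c = lo | (hi << 8);

		fp->bits[c >> 5] |= (uint32_t)1 << (c & 0x1f);
	}
}

/* Similarity is the number of byte pairs present in both lines. */
static int fp_similarity(const struct fp_sketch *a, const struct fp_sketch *b)
{
	int i, shared = 0;

	for (i = 0; i < FP_WORDS; i++)
		shared += popcount32(a->bits[i] & b->bits[i]);
	return shared;
}
```

With this sketch, "good goo" sets five bits ("go", "oo", "od", "d ",
" g"), and comparing it against "GOOD" yields a similarity of 3, since
"go", "oo" and "od" are shared after case folding.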