From: Junio C Hamano <gitster@pobox.com>
To: Kirill Smelkov <kirr@mns.spb.ru>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/2] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Date: Fri, 14 Feb 2014 09:37:00 -0800 [thread overview]
Message-ID: <xmqqppmpiojn.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <20140214121529.GB3416@tugrik.mns.mnsspb.ru> (Kirill Smelkov's message of "Fri, 14 Feb 2014 16:15:29 +0400")
Kirill Smelkov <kirr@mns.spb.ru> writes:
> ---- 8< ----
> From: Kirill Smelkov <kirr@mns.spb.ru>
> Subject: [PATCH v2 1/2] tree-diff: rework diff_tree() to generate diffs for
> multiparent cases as well
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
The last three do not belong here. Only From, Date and Subject are
relevant and taken as in-body headers.
For that matter, as the From and Subject above are exactly the same
as what you have on the e-mail header, you do not want any of them.
I'll strip them out here, so no need to resend. Thanks.
> Previously diff_tree(), which is now named __diff_tree_sha1(), was
That name with two leading underscores is a rather unfortunate,
especially for a function that is not a file scope static. No way
to rename this to something more sensible?
> That impedance mismatch *hurts* *performance* *badly* for generating
> combined diffs - in c839f1bd (combine-diff: optimize combine_diff_path
Please avoid referring to a commit that is not in 'master' by its
object name. It can be reworked later and get a different name.
> That slowness comes from the fact that currently, while generating
> combined diff, a lot of time is spent computing diff(commit,commit^2)
> just to only then intersect that huge diff to almost small set of files
> from diff(commit,commit^1).
Good observation.
> |a| |b| a < b -> a ∉ B -> D(A,B) += +a a↓
> |-| |-| a > b -> b ∉ A -> D(A,B) += -b b↓
> | | | | a = b -> investigate δ(a,b) a↓ b↓
In both the "n-parallel" and "diff-tree", when an entry 'a' is a
tree, I take this "D(A,B) += +a" to mean (recursively) adding all
the paths within 'a' to the result as addition. Sounds sensible.
> D(A,B)
>
> is by definition the same as combined diff
>
> D(A,[B]),
>
> so if we could rework the code for common case and make it be not slower
> for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
> multiparent diff tree-walker would greatly benefit generating
> combine-diff.
OK.
> What we do is as follows:
>
> 1) diff tree-walker __diff_tree_sha1() is internally reworked to be
> a paths generator (new name diff_tree_paths()), with each generated path
> being `struct combine_diff_path` with info for path, new sha1,mode and for
> every parent which sha1,mode it was in it.
>
> 2) From that info, we can still generate usual diff queue with
> struct diff_filepairs, via "exporting" generated
> combine_diff_path, if we know we run for nparent=1 case.
> (see emit_diff() which is now named emit_diff_p0only())
s/p0/first_parent_/; perhaps?
> 3) In order for diff_can_quit_early(), which checks
>
> DIFF_OPT_TST(opt, HAS_CHANGES))
>
> to work, that exporting have to be happening not in bulk, but
> incrementally, one diff path at a time.
Good thinking.
> Some notes(*):
>
> 1) For loops,
>
> i = 0; do { ... } while (++i < nparent);
>
> is used instead of
>
> for (i = 0; i < nparent; ++i)
> ...
>
> because for the former case, the compiler have to emit additional
> prologue code which checks for i >= nparent case before entering the
> loop.
>
> As we require nparent must be >0, that additional overhead
> conflicts with the "runs not slower for nparent=1 case than before"
> goal.
Unfortunate. I'd rather see us stick to more readable and familiar
form for maintainability if this were not measurable.
> 2) alloca(), for small arrays, is used for the same reason - if we change
> it to xmalloc()/free() the timings get worse
Do you see any use of it outside compat/?
I thought we specifically avoid alloca() for portability. Also we
do not use variable-length-arrays on the stack either, I think.
> 3) For every parent tree, we need to keep a tag, whether entry from that
> parent equals to entry from minimal parent. For performance reasons I'm
> keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.
Unfortunate, but I do not see another place to keep this
information offhand (nor implement this approach without keeping
that piece of information).
> P.S. and combined diff is not some exotic/for-play-only stuff - for
No need to convince us about that ;-)
> example for a program I write to represent Git archives as readonly
> filesystem, there is initial scan with
>
> `git log --reverse --raw --no-abbrev --no-renames -c`
>
> to extract log of what was created/changed when, as a result building a
> map
>
> {} sha1 -> in which commit (and date) a content was added
>
> that `-c` means also show combined diff for merges, and without them, if
> a merge is non-trivial (merges changes from two parents with both having
> separate changes to a file), or an evil one, the map will not be full,
> i.e. some valid sha1 would be absent from it.
>
> That case was my initial motivation for combined diffs speedup.
I wonder if this machinery can be reused for "log -m" as well (or
perhaps you do that already?). After all, by performing a single
parallel scan, you are gathering all the necessary information to
let you pretend that you did N pairwise diff-tree.
> diff --git a/tree-diff.c b/tree-diff.c
> index ab61a0a..2b7c991 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -7,6 +7,25 @@
> #include "tree.h"
>
> /*
> + * internal mode marker, saying a tree entry != entry of tp[imin]
> + * (see __diff_tree_paths for what it means there)
> + *
> + * it *must* not overlap with any valid modes, and we will update/use/emit
> + * entry for diff only with it unset. Only non-overlapping to valid modes is
> + * required, because mode in tree_desc, comes here canonicalized via
> + * canon_mode().
> + *
> + * the definition assumes unsigned is at least 32 bits.
> + */
> +#define S_IFXMIN_NEQ 0x80000000
To allow better coordination across multiple codepaths that deal
with modes, I am wondering if this should be defined in cache.h
where made-up S_FIGITLINK and S_IFINVALID are defined (note the
comment that is there, as well).
> +static struct combine_diff_path *__diff_tree_paths(
> + struct combine_diff_path *p, const unsigned char *sha1,
> + const unsigned char **parents_sha1, int nparent,
> + struct strbuf *base, struct diff_options *opt);
Most of our code do not name private helper functions with leading
underscores.
I do like the direction this is going, but it looks to me that
"struct combine_diff" is now misnamed, because it no longer is about
combined diff. You are introducing a good framework for n-way diff,
and producing combined diff (i.e. -c or --cc) is now merely one way
to use that framework. We may want to clean these names up after
this series settles---perhaps "struct nway_diff" or something.
> +
> +/*
> * Compare two tree entries, taking into account only path/S_ISDIR(mode),
> * but not their sha1's.
> *
> @@ -33,72 +52,152 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
> }
>
>
> -/* convert path, t1/t2 -> opt->diff_*() callbacks */
> -static void emit_diff(struct diff_options *opt, struct strbuf *path,
> - struct tree_desc *t1, struct tree_desc *t2)
> +/*
> + * convert path -> opt->diff_*() callbacks
> + *
> + * emits diff to parent0 only.
Please call that "first parent".
> + */
"Returns 0 to tell the caller that we are done with p and it can be
freed" or something?
> +static int emit_diff_p0only(struct diff_options *opt, struct combine_diff_path *p)
> {
> ...
> +
> + return 0; /* = no need to keep allocated combine_diff_path */
Curious; what is that equal sign in the comment?
next prev parent reply other threads:[~2014-02-14 17:37 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-13 14:02 [PATCH 0/2] Multiparent diff tree-walker + combine-diff speedup Kirill Smelkov
2014-02-13 14:02 ` [PATCH 1/2] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well Kirill Smelkov
2014-02-13 19:51 ` Junio C Hamano
2014-02-14 12:15 ` Kirill Smelkov
2014-02-14 17:37 ` Junio C Hamano [this message]
2014-02-16 8:08 ` Kirill Smelkov
2014-02-18 21:29 ` Junio C Hamano
2014-02-13 14:02 ` [PATCH 2/2] combine-diff: speed it up, by using multiparent diff tree-walker directly Kirill Smelkov
2014-02-13 19:55 ` Junio C Hamano
2014-02-14 12:16 ` Kirill Smelkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqppmpiojn.fsf@gitster.dls.corp.google.com \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=kirr@mns.spb.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.