Re: [RFC] Possible idea for GSoC 2020

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Narebski <jnareb@gmail.com>
To: Abhishek Kumar <abhishekkumar8222@gmail.com>
Cc: git@vger.kernel.org,
	Christian Couder <christian.couder@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Heba Waly <heba.waly@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Emily Shaffer <emilyshaffer@google.com>,
	Abhishek Kumar <abhishekkumar8222@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Tue, 17 Mar 2020 19:05:37 +0100	[thread overview]
Message-ID: <86sgi66emm.fsf@gmail.com> (raw)
In-Reply-To: <CAHk66ftx=XTSeVcPe09yA9KMDpHwiKFLKa62cCBFufjeenAbaQ@mail.gmail.com> (Abhishek Kumar's message of "Tue, 17 Mar 2020 22:30:00 +0530")

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>> Having such a complicated two-dimensional system would need to
>> justify itself by being measurably faster than that one-dimensional
>> system in these example commands.
>>
>> [...]
>>
>> My _prediction_ is that the two-dimensional system will be more
>> complicated to write and use, and will not have any measurable
>> difference. I'd be happy to be wrong, but I also would not send
>> anyone down this direction only to find out I'm right and that
>> effort was wasted.
>
> Agreed. I have been through the papers of the involved variants and on graphs
> comparable to some of the largest git repositories, the performance improves by
> fifty nanoseconds for a random query.

I would recommend extending results for other types of large graphs to
the commit graphs with care.  The characteristics of those graphs are
quite different from characteristics of commit graph: they usually are
scale-free graphs, with low maximum level, and low connectivity: the
probability of two random nodes being connected in order of 10^-3 or
10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99

The last one, called R-ratio, means that testing on random query
actually tests mainly negative-cut filters.  That is why some papers
provide either separate numbers for negative and for positive queries,
or separate numbers for random and for balanced queries.

> Additionally:
> 1. They require significantly more space per commit.

This depends on the type of the algorithm: is it Label-Only (answering
reachability queries without consulting graph), or Label+Graph
(augmented online search algorithms):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78

The CDAT chunk in current version of commit-graph format takes H+16
bytes per commit (where H is the size of object id hash).  From those
H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
reachability label: the topological level aka generation number.
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45

The proposed min-post interval label would take 8 bytes per commit, that
is 4 bytes per single number in interval.  That is not much, provided
that we get visible performance improvements for at least some often
used git commands.

> 2. They require significantly more preprocessing time.

This again depends on the type of algorithm: Label-Only or Label+G.

In the case of min-post interval labels, they can be computed together
with generation number, during the same commit-graph walk.  The amount
of calculations required to compute min-post interval is not much.
Therefore I think it would be not unreasonable cost.

>> My recommendation is that a GSoC student update the
>> generation number to "v2" based on the definition you made in [1].
>> That proposal is also more likely to be effective in Git because
>> it makes use of extra heuristic information (commit date) to
>> assist the types of algorithms we care about.
>
>> In that case, the "difficult" part is moving the "generation"
>> member of struct commit into a slab before making it a 64-bit
>> value. (This is likely necessary for your plan, anyway.) Updating
>> the generation number to v2 is relatively straight-forward after
>> that, as someone can follow all places that reference or compute
>> generation numbers and apply a diff
>
> Thanks for the recommendation. Reading about how this fits in more
> with REU on the other thread, I too agree that updating generation
> number to use corrected commit date would be more appropriate for a GSoC
> project.

You can read about monotonically offset corrected commit date for
example on my slides, from page 64 (the problem) to 75 (references):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64

I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps
tomorrow when I will have more free time.

https://git.github.io/SoC-2020-Ideas/

In the meantime I will refer to what is written at the top of it:

> **Students:** Please consider these ideas as starting points for
> generating proposals. We are also more than happy to receive proposals
> for other ideas related to Git. Make sure you have read the “Note
> about refactoring projects versus projects that implement new
> features” in the general application information page though.

Best,
-- 
Jakub Narębski

next prev parent reply	other threads:[~2020-03-17 18:05 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-17 17:00 [RFC] Possible idea for GSoC 2020 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski [this message]
     [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski
  -- strict thread matches above, loose matches on Subject: below --
2020-03-17 18:00 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
2020-03-13 17:30 Abhishek Kumar
2020-03-10 14:50 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56   ` Jakub Narebski
2020-03-15 14:26     ` Jakub Narebski
2020-03-17 12:24       ` Kaartic Sivaraam
2020-03-17 12:39         ` Kaartic Sivaraam
2020-03-17 14:22         ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30   ` Jakub Narebski
2020-03-11 21:48     ` Christian Couder
2020-03-12 12:17       ` Jakub Narebski
2020-03-12 13:08         ` Jakub Narebski
2020-03-13 10:59           ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34   ` Jakub Narebski
2020-03-15 18:57     ` Philip Oakley
2020-03-15 21:14       ` Jakub Narebski
2020-03-16 14:47         ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17  3:13   ` Jakub Narebski
2020-03-17  7:24     ` Christian Couder
2020-03-17 11:49       ` Derrick Stolee
2020-03-17 14:18       ` Jakub Narebski
2020-03-17 17:04         ` Christian Couder
2020-03-18 13:55           ` Jakub Narebski
2020-03-18 15:25             ` Derrick Stolee
2020-03-19 12:52               ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86sgi66emm.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=heba.waly@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.