Re: [GSoC] Designing a faster index format

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: Thomas Rast <trast@student.ethz.ch>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>, <git@vger.kernel.org>
Subject: Re: [GSoC] Designing a faster index format
Date: Mon, 26 Mar 2012 14:14:23 -0700	[thread overview]
Message-ID: <7vd37znjyo.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <2A61C352-5C71-4EDF-9DBE-01CC09AE03A5@gmail.com> (Thomas Gummerer's message of "Mon, 26 Mar 2012 22:35:07 +0200")

Thomas Gummerer <t.gummerer@gmail.com> writes:

> The proposed solution is to rework the index, in order to make it possible to
> have a complexity of O(log(n)), where n is the number of index entries, for 
> simple operations like adding files. More complex operations shall also
> benefit from the new index format, although a complexity of O(log(n)) might not
> be reached, while keeping the index format easy enough to parse for .git reading
> programs. The amount of data for which that the hash will be computed however
> shall be log(n). N is the number of entries in the index.

Where does log(N) come from, and is that a hard requirement?

Rephrasing your problem description, you want to address the inefficiency
that we have to write the full 1m entries to a new file in order to update
only one file when the index tracks 1m paths.  

Wouldn't the goal be more in line with that problem statement if you
instead aim to make the cost proportional to the number of entries that
are updated, regardless of how many entries exist in the index?

> In phase one the pysical index format shall be replaced with the new index 
> format, which does neither require a full recalculation of the sha-1 hash nor a
> full rewrite of the index to the disk. The new index format will be mapped to 
> the current in-memory structure, which will only be changed in phase two. For 
> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
> index format, which should avoid cache-tree invalidations.

It is unclear what you meant by the last sentence.  The cache-tree
invalidation is a critical part of the machinery that allows later
write-tree to reuse the already computed subtree object names, without
having to recompute subtree object names until they are really necessary
(i.e. when writing a tree out).  By "avoiding" it, are you going to make
write-tree always recompute all the subtree object names?  Or are you
planning to keep the cached tree information always up to date by
recomputing subtree object names and keeping them in the index even when
they are not yet needed?  If the latter, how would you handle when a part
of the index contains unmerged entries?

Right now, the code that updates the in-core index records "Is the in-core
index clean, or modified?" bit and nothing else.  Without updating the
in-core structure and the codepaths that access it, how is it possible for
your phase I to selectively write out only the modified (either added,
updated or removed) part of it?  In other words, I do not think it is
realistic to expect that the core parts to stay oblivious to the new index
semantics during the "phase one".

> -- Timeline --
> 24/04 - 12/05: Getting familiar with the old index

It might make more sense to write the "proposed solution" *after* this
step is done; otherwise you wouldn't even know the problems you are trying
to solve.  That may mean that this part of the schedule may need to be
shortened or started way before the beginning of GSoC.

next prev parent reply	other threads:[~2012-03-26 21:14 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-20 21:17 [GSoC] Designing a faster index format Thomas Gummerer
2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
2012-03-21  9:22   ` Thomas Gummerer
2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
     [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
2012-03-21 11:18         ` Nguyen Thai Ngoc Duy
2012-03-21 12:51       ` Thomas Rast
2012-03-21 15:43         ` Thomas Gummerer
2012-03-21 16:19           ` Thomas Rast
2012-03-22 22:51             ` Thomas Gummerer
2012-03-23 10:10               ` Thomas Rast
2012-03-25  1:28                 ` Thomas Gummerer
2012-03-26 20:35                 ` Thomas Gummerer
2012-03-26 21:14                   ` Junio C Hamano [this message]
2012-03-27 11:08                     ` Thomas Gummerer
2012-03-27 11:47                   ` Thomas Rast
2012-03-29 15:21                     ` Thomas Gummerer
2012-03-29 21:06                       ` Junio C Hamano
2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
2012-04-02 21:02                           ` Thomas Gummerer
2012-04-03  8:51                             ` Michael Haggerty
2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
2012-04-03 19:07                               ` Thomas Gummerer
2012-04-03 20:15                                 ` david
2012-04-04 20:05                                   ` Thomas Gummerer
2012-04-05 14:39                                     ` Noel Grandin
2012-04-05 21:49                                     ` Thomas Rast
2012-04-06  3:22                                       ` Shawn Pearce
2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
2012-04-06 15:24                                           ` Thomas Rast
2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
2012-04-06 17:13                                               ` Shawn Pearce
2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
2012-04-06 17:56                                                   ` Shawn Pearce
     [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
     [not found]                                         ` <83571955-9256-4032-9182-FA9062D28B9D@gmail.com>
     [not found]                                           ` <8D2805A4-9C5F-43A9-B3ED-0DB77341A03C@gmail.com>
2012-04-19 10:49                                             ` Nguyen Thai Ngoc Duy
     [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
2012-04-20 20:02                                               ` Jeff King
2012-04-05 10:43                                 ` Michael Haggerty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vd37znjyo.fsf@alter.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    --cc=t.gummerer@gmail.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).