Re: [GSoC] Designing a faster index format

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Gummerer <italyhockeyfeed@gmail.com>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Thomas Rast <trast@student.ethz.ch>,
	git@vger.kernel.org
Subject: Re: [GSoC] Designing a faster index format
Date: Thu, 05 Apr 2012 12:43:37 +0200	[thread overview]
Message-ID: <4F7D7759.6020603@alum.mit.edu> (raw)
In-Reply-To: <C15BAB9A-EAFA-4EA4-85B2-0E0C5FF473E9@gmail.com>

On 04/03/2012 09:07 PM, Thomas Gummerer wrote:
> On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote:
>> On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
>>> - Append-only data structure
>>> [...]
>>> To make sure the index isn't corrupted, without calculating the sha1 hash for
>>> the whole index file every time something is changed, the hash is always
>>> calculated for the whole index when merging, but when only a single entry is 
>>> changed the sha-1 hash is only calculated for the last change. This will 
>>> increase the cost for reading the index to log(n) + k * log(k) where n is the 
>>> number of entries in the sorted part of the index and k is the number of entries
>>> in the unsorted part of the index, which will have to be merged with the rest 
>>> of the index.
>>
>> I don't understand this analysis of the reading time.  I suppose you are
>> assuming that you want to read the status of a single file.  But in that
>> case, it is enough to find the entry in the old index (O(log(n))
>> assuming some sort of tree structure) plus do a linear scan through the
>> unsorted entries (i.e., O(k), not O(k log(k))).
> 
> The current way git operates it always reads the whole index, making it necessary
> to merge the unsorted entries with the sorted part. Thinking about it it would even
> be O(k log(n)), because the appended part is unsorted.
> O(log(n)) + O(k) would be the complexity for loading only a single entry from the
> index.

I was confused because in your original mail you seemed to claim that
reading the whole sorted part of the index scales like O(log(n)), where
it certainly scales at least like O(n).

To read the whole index if using an append-only data structure, I would
do the following:

1. Read the file header to find where the addenda begin: one seek plus O(1).
2. Read the addenda in order (I assume each addendum to be sorted on
disk), and merge-sort the addenda together, discarding the earlier of
any duplicates: one seek plus O(k) I/O plus O(k log k) computation (this
is the worst case, if each addendum contains a single file).
3. Read the sorted part of the file in order, while merging it together
with the combined addenda: one seek plus O(n) I/O plus O(n + k) computation.

Total: 3 seeks plus O(n+k) I/O plus O(n + k log(k)) computation.

Whereas for a B-tree, it is hard to estimate the complexity because you
have provided very little detail about how you want to lay the data
structure out on disk.  But presumably the number of seeks will be
significantly larger.  And if you are not careful, the number of seeks
will approach the number of nodes in the index O(n) or perhaps the
number of added nodes (not files!) which could go something like O(k
log(k)).

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

     prev parent reply	other threads:[~2012-04-05 10:43 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-20 21:17 [GSoC] Designing a faster index format Thomas Gummerer
2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
2012-03-21  9:22   ` Thomas Gummerer
2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
     [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
2012-03-21 11:18         ` Nguyen Thai Ngoc Duy
2012-03-21 12:51       ` Thomas Rast
2012-03-21 15:43         ` Thomas Gummerer
2012-03-21 16:19           ` Thomas Rast
2012-03-22 22:51             ` Thomas Gummerer
2012-03-23 10:10               ` Thomas Rast
2012-03-25  1:28                 ` Thomas Gummerer
2012-03-26 20:35                 ` Thomas Gummerer
2012-03-26 21:14                   ` Junio C Hamano
2012-03-27 11:08                     ` Thomas Gummerer
2012-03-27 11:47                   ` Thomas Rast
2012-03-29 15:21                     ` Thomas Gummerer
2012-03-29 21:06                       ` Junio C Hamano
2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
2012-04-02 21:02                           ` Thomas Gummerer
2012-04-03  8:51                             ` Michael Haggerty
2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
2012-04-03 19:07                               ` Thomas Gummerer
2012-04-03 20:15                                 ` david
2012-04-04 20:05                                   ` Thomas Gummerer
2012-04-05 14:39                                     ` Noel Grandin
2012-04-05 21:49                                     ` Thomas Rast
2012-04-06  3:22                                       ` Shawn Pearce
2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
2012-04-06 15:24                                           ` Thomas Rast
2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
2012-04-06 17:13                                               ` Shawn Pearce
2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
2012-04-06 17:56                                                   ` Shawn Pearce
     [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
     [not found]                                         ` <83571955-9256-4032-9182-FA9062D28B9D@gmail.com>
     [not found]                                           ` <8D2805A4-9C5F-43A9-B3ED-0DB77341A03C@gmail.com>
2012-04-19 10:49                                             ` Nguyen Thai Ngoc Duy
     [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
2012-04-20 20:02                                               ` Jeff King
2012-04-05 10:43                                 ` Michael Haggerty [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F7D7759.6020603@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=italyhockeyfeed@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).