Clarifications on the "faster index format" project


From: Thomas Rast <trast@inf.ethz.ch>
To: Thomas Gummerer <t.gummerer@gmail.com>,
	elton sky <eltonsky9404@gmail.com>,
	Calvin Deutschbein <deutschbeinc@gmail.com>,
	Mauricio Galindo <up.mauricio.g@gmail.com>
Cc: git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Jan Krüger" <jk@jk.gs>, "Tay Ray Chuan" <rctay89@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Shawn Pearce" <spearce@spearce.org>
Subject: Clarifications on the "faster index format" project
Date: Wed, 28 Mar 2012 02:36:49 +0200
Message-ID: <878vilr272.fsf@thomas.inf.ethz.ch>

Dear students,

The proposal "Designing a faster index format" has attracted quite a bit
more attention than we expected.  We would like to emphasize that it is
not an easy project.

I myself found out about most of the points listed here only through
interaction with students.  My apologies for this.  Take them as
a (non-exhaustive!) list of concerns.  The more of them you can address
in your proposal, the better.

Issues with the original project idea:

- Further timings have shown that even on humongous indexes such as
  Webkit's (roughly 25MB), git-add is still pretty fast.  It is, however,
  a good baseline for what index-changing commands need to do.  Judge
  from your own timings and state which case you want to optimize.

- The notation is confusing.  Asymptotics noted in the proposal should
  distinguish the variables "# index entries" (n), "# changed entries",
  and depending on the application also "size of index file" (though
  this is fairly tightly coupled with n) or other variables.
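
For the "judge from your own timings" point above, a small harness along
these lines may be enough to get started.  It is only a sketch: the
helper name time_command is made up for this mail, and the command you
pass in (and the repository you run it in) is up to you.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=5, cwd=None):
    """Return the best wall-clock time in seconds over `repeats` runs of cmd.

    Hypothetical helper, not part of git.  Taking the minimum reduces
    noise from other processes and from a cold disk cache.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is
        # not silently counted as a fast one.
        subprocess.run(cmd, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in command so that the sketch runs anywhere; replace it with the
# git command you care about and set cwd to a large repository.
baseline = time_command([sys.executable, "-c", "pass"], repeats=3)
print("best of 3: %.4f s" % baseline)
```

In a real measurement you would pass something like
["git", "add", "some/file"] (the path is a placeholder) together with
cwd pointing at the repository, and report the result next to the
number of index entries and the number of changed entries.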

Some difficulties that we expect:

- Without prior knowledge of the index, it is hard to see how a new
  format must be designed to stay within the requirements.

- The index format is closely tied to a lot of git's core code.  The
  main work of git-ls-files, git-status, git-read-tree, git-add,
  git-diff etc. directly accesses the in-memory index structure.  This
  means that changing the in-memory structure is a lot of work.

- Changing the in-memory structure also makes conversion between the old
  and new format more difficult.

- Writing out only the changed entries would save a lot of time.
  However there are many code paths that change/add/remove index
  entries, and they must all record what to write in some way.

- Cutting down the amount of verification going on is very tempting, but
  needs careful design to keep the chances of a bit error propagating
  all the way into the repository small.
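
To illustrate the point about recording what to write, here is a toy
sketch of an in-memory index in which every mutating code path goes
through a method that notes the change.  The class TrackedIndex and all
of its method names are invented for this mail; git's actual in-memory
structure is accessed directly from many places, which is exactly why
this is hard to retrofit.

```python
class TrackedIndex:
    """Toy in-memory index (path -> entry data) with change tracking.

    Hypothetical illustration only; it does not mirror git's structures.
    """

    def __init__(self, entries=None):
        self._entries = dict(entries or {})
        self._dirty = set()    # paths added or modified since last write
        self._removed = set()  # paths removed since last write

    def set(self, path, data):
        # Every add/modify must pass through here to be recorded.
        self._entries[path] = data
        self._removed.discard(path)
        self._dirty.add(path)

    def remove(self, path):
        if path in self._entries:
            del self._entries[path]
            self._dirty.discard(path)
            # Simplification: also recorded for paths that were added
            # after the last write and so never reached the file.
            self._removed.add(path)

    def get(self, path):
        return self._entries.get(path)

    def pending_changes(self):
        """Return (changed entries sorted by path, sorted removed paths)."""
        changed = [(p, self._entries[p]) for p in sorted(self._dirty)]
        return changed, sorted(self._removed)

    def mark_written(self):
        # Call after the changes have safely reached the index file.
        self._dirty.clear()
        self._removed.clear()
```

A writer would then only have to deal with the result of
pending_changes(), whose size depends on the number of changed entries
rather than on n.  Any code path that modifies entries without going
through set() or remove() silently breaks this, so all of them would
have to be found and converted.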

There are many tradeoffs to be made, which must be evaluated carefully:

- If the current lock-rewrite-rename scheme remains in place, any change
  only affects the in-core work done.  If the lock scheme is changed,
  there are many different cases of corruption/partial writes that must
  be handled.

- There may be ways to reduce the data written (and thus checksummed) at
  the cost of extra work.  Similarly, a more complex data structure may
  or may not pay off depending on the extra space taken (or saved) and
  extra bookkeeping done.

- Some improvements would be possible by using techniques from database
  libraries.  However, this either means an external dependency or a lot
  of extra work.  It may also come with extra startup costs.

- Using database libraries may be a deal-breaker for git's own
  portability or the other readers (libgit2 and jgit, mainly).
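
For those not familiar with the lock-rewrite-rename scheme mentioned
above, it can be sketched roughly as follows.  This is an illustration
under simplifying assumptions, not git's implementation: the function
names are made up, the payload is an opaque byte string, and the only
verification is a trailing SHA-1 over the whole content.

```python
import hashlib
import os


def write_with_lock(path, payload):
    """Write payload plus a trailing SHA-1 to path via lock and rename.

    Hypothetical helper: creates path + ".lock" exclusively, writes the
    complete new content there, then renames it over the old file.
    """
    lock_path = path + ".lock"
    # O_EXCL makes creation fail if another writer holds the lock.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            # The whole payload is checksummed, so the cost grows with
            # the file size even if only one entry changed.
            f.write(hashlib.sha1(payload).digest())
        # Readers see either the old file or the complete new one.
        os.replace(lock_path, path)
    except BaseException:
        # On failure, drop our lock; the old file stays untouched.
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise


def read_verified(path):
    """Read path and check the trailing SHA-1 before returning the payload."""
    with open(path, "rb") as f:
        data = f.read()
    payload, digest = data[:-20], data[-20:]
    if hashlib.sha1(payload).digest() != digest:
        raise ValueError("checksum mismatch in " + path)
    return payload
```

The attraction of this scheme is that a crash in the middle of writing
leaves at worst a stale lock file and never a half-written index.  A
format that updates the file in place gives up that property and has to
detect and recover from partial writes by other means.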

The format should cope with requirements that are not clearly specified
at this time.  For example:

- The existing code assumes that iterating over the whole index is not a
  problem.  For possible future work, the format should allow quickly
  iterating over only a selected part of the index (known, e.g., from an
  inotify daemon or a pathspec).
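
As a rough idea of what iterating over only part of the index could
mean: index entries are sorted by path, so the entries below one
directory form a contiguous range that a binary search can locate.  The
sketch below works on a plain sorted list of path strings; the function
name entries_under is made up, and it assumes ASCII paths so that
Python's string order agrees with a byte-wise sort.

```python
import bisect


def entries_under(sorted_paths, directory):
    """Return the paths in sorted_paths that lie below directory.

    Hypothetical helper.  sorted_paths must already be sorted, and
    directory must be a non-empty relative path such as "builtin".
    """
    prefix = directory.rstrip("/") + "/"
    lo = bisect.bisect_left(sorted_paths, prefix)
    # "0" is the character right after "/", so replacing the trailing
    # "/" with "0" gives a string that sorts after every path that
    # starts with the prefix and before anything that follows them.
    hi = bisect.bisect_left(sorted_paths, prefix[:-1] + "0")
    return sorted_paths[lo:hi]


paths = sorted([
    "Makefile",
    "builtin.h",
    "builtin/add.c",
    "builtin/commit.c",
    "cache.h",
])
print(entries_under(paths, "builtin"))
```

This takes two binary searches plus the size of the result, but only
because the whole list is already in memory.  The harder question for a
new format is whether such a range can be located in the file without
reading and verifying everything that precedes it.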

Finally, I have a request not related to project difficulty: please Cc the
proposed mentors for discussions on the respective projects :-)

Thanks for reading,

Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
