From: Thomas Rast <trast@inf.ethz.ch>
To: Thomas Gummerer <t.gummerer@gmail.com>,
elton sky <eltonsky9404@gmail.com>,
Calvin Deutschbein <deutschbeinc@gmail.com>,
Mauricio Galindo <up.mauricio.g@gmail.com>
Cc: git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>,
"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
"Jan Krüger" <jk@jk.gs>, "Tay Ray Chuan" <rctay89@gmail.com>,
"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
"Shawn Pearce" <spearce@spearce.org>
Subject: Clarifications on the "faster index format" project
Date: Wed, 28 Mar 2012 02:36:49 +0200 [thread overview]
Message-ID: <878vilr272.fsf@thomas.inf.ethz.ch> (raw)
Dear students,
The proposal "Designing a faster index format" has attracted quite a bit
more attention that we expected. We would like to emphasize that it is
not an easy project.
I have myself only found out about most of the points I am listing here
through interaction with students. My apologies for this. Take them as
a (non-exhaustive!) list of concerns. The more of them you can address
in your proposal, the better.
Issues with the original project idea:
- Further timings have shown that even on humongous indexes such as
Webkit's (roughly 25MB), git-add is still pretty fast. It is a good
baseline for what index-changing commands need to do however. Judge
from your own timings and state what case you want to optimize.
- The notation is confusing. Asymptotics noted in the proposal should
distinguish the variables "# index entries" (n), "# changed entries",
and depending on the application also "size of index file" (though
this is fairly tightly coupled with n) or other variables.
Some hardnesses that we expect:
- Without prior knowledge of the index, it is hard to see how a new
format must be designed to stay within the requirements.
- The index format is closely tied to a lot of core git's code. The
main work of git-ls-files, git-status, git-read-tree, git-add,
git-diff etc. all directly accesses the index memory structure. This
means that changing the in-memory structure is a lot of work.
- Changing the in-memory structure also makes conversion between the old
and new format more difficult.
- Writing out only the changed entries would save a lot of time.
However there are many code paths that change/add/remove index
entries, and they must all record what to write in some way.
- Cutting down the amount of verification going on is very tempting, but
needs careful design to keep the chances of a bit error propagating
all the way into the repository small.
There are many tradeoffs to be made, which must be evaluated carefully:
- If the current lock-rewrite-rename scheme remains in place, any change
only affects the in-core work done. If the lock scheme is changed,
there are many different cases of corruption/partial writes that must
be handled.
- There may be ways to reduce the data written (and thus checksummed) at
the cost of extra work. Similarly, a more complex data structure may
or may not pay off depending on the extra space taken (or saved) and
extra bookkeeping done.
- Some improvements would be possible by using techniques from database
libraries. However, this either means an external dependency or a lot
of extra work. It may also come with extra startup costs.
- Using database libraries may be a deal-breaker for git's own
portability or the other readers (libgit2 and jgit, mainly).
The format should cope with requirements that are not clearly specified
at this time. For example:
- The existing code also assumes that iterating over an index is not a
problem. For possible future work, the format should allow for
iterating over only a select part of the index (known, e.g., from an
inotify daemon or a pathspec) quickly.
Finally I have a request not related to project hardness: please Cc the
proposed mentors for discussions on the respective projects :-)
Thanks for reading,
Thomas
--
Thomas Rast
trast@{inf,student}.ethz.ch
reply other threads:[~2012-03-28 0:37 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=878vilr272.fsf@thomas.inf.ethz.ch \
--to=trast@inf.ethz.ch \
--cc=deutschbeinc@gmail.com \
--cc=eltonsky9404@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jk@jk.gs \
--cc=jnareb@gmail.com \
--cc=pclouds@gmail.com \
--cc=peff@peff.net \
--cc=rctay89@gmail.com \
--cc=spearce@spearce.org \
--cc=t.gummerer@gmail.com \
--cc=up.mauricio.g@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).