* Clarifications on the "faster index format" project
@ 2012-03-28 0:36 Thomas Rast
0 siblings, 0 replies; only message in thread
From: Thomas Rast @ 2012-03-28 0:36 UTC (permalink / raw)
To: Thomas Gummerer, elton sky, Calvin Deutschbein, Mauricio Galindo
Cc: git, Junio C Hamano, Nguyễn Thái Ngọc Duy,
Jan Krüger, Tay Ray Chuan, Jakub Narebski, Jeff King,
Shawn Pearce
Dear students,
The proposal "Designing a faster index format" has attracted quite a bit
more attention that we expected. We would like to emphasize that it is
not an easy project.
I have myself only found out about most of the points I am listing here
through interaction with students. My apologies for this. Take them as
a (non-exhaustive!) list of concerns. The more of them you can address
in your proposal, the better.
Issues with the original project idea:
- Further timings have shown that even on humongous indexes such as
Webkit's (roughly 25MB), git-add is still pretty fast. It is a good
baseline for what index-changing commands need to do however. Judge
from your own timings and state what case you want to optimize.
- The notation is confusing. Asymptotics noted in the proposal should
distinguish the variables "# index entries" (n), "# changed entries",
and depending on the application also "size of index file" (though
this is fairly tightly coupled with n) or other variables.
Some hardnesses that we expect:
- Without prior knowledge of the index, it is hard to see how a new
format must be designed to stay within the requirements.
- The index format is closely tied to a lot of core git's code. The
main work of git-ls-files, git-status, git-read-tree, git-add,
git-diff etc. all directly accesses the index memory structure. This
means that changing the in-memory structure is a lot of work.
- Changing the in-memory structure also makes conversion between the old
and new format more difficult.
- Writing out only the changed entries would save a lot of time.
However there are many code paths that change/add/remove index
entries, and they must all record what to write in some way.
- Cutting down the amount of verification going on is very tempting, but
needs careful design to keep the chances of a bit error propagating
all the way into the repository small.
There are many tradeoffs to be made, which must be evaluated carefully:
- If the current lock-rewrite-rename scheme remains in place, any change
only affects the in-core work done. If the lock scheme is changed,
there are many different cases of corruption/partial writes that must
be handled.
- There may be ways to reduce the data written (and thus checksummed) at
the cost of extra work. Similarly, a more complex data structure may
or may not pay off depending on the extra space taken (or saved) and
extra bookkeeping done.
- Some improvements would be possible by using techniques from database
libraries. However, this either means an external dependency or a lot
of extra work. It may also come with extra startup costs.
- Using database libraries may be a deal-breaker for git's own
portability or the other readers (libgit2 and jgit, mainly).
The format should cope with requirements that are not clearly specified
at this time. For example:
- The existing code also assumes that iterating over an index is not a
problem. For possible future work, the format should allow for
iterating over only a select part of the index (known, e.g., from an
inotify daemon or a pathspec) quickly.
Finally I have a request not related to project hardness: please Cc the
proposed mentors for discussions on the respective projects :-)
Thanks for reading,
Thomas
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2012-03-28 0:37 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-28 0:36 Clarifications on the "faster index format" project Thomas Rast
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).