[GSoC] Designing a faster index format - Progress report week 14

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [GSoC] Designing a faster index format - Progress report week 14
@ 2012-07-23 19:08 Thomas Gummerer
  2012-07-24  1:23 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gummerer @ 2012-07-23 19:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, mhagger, pclouds, trast, Robin Rosenberg,
	Tomas Carnecky

== Work done in the previous 13 weeks ==

- Definition of a tentative index file v5 format [1]. This differs
  from the proposal in making it possible to bisect the directory
  entries and file entries, to do a binary search. The exact bits
  for each section were also defined. To further compress the index,
  along with prefix compression, the stat data is hashed, since
  it's only used for equality comparison, but the plain data is
  never used.
  Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
  and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
  format v5. [2] The converter reads the index from a git repository,
  can output parts of the index (header, index entries as in
  git ls-files --debug, cache tree as in test-dump-cache-tree, or
  the reuc data). Then it writes the v5 index file format to
  .git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
  reader has mainly the purpose to show the algorithm used to read
  the index lexicographically sorted after the full name which is
  required by the current internal memory format. Big thanks for
  reviewing this code and giving me advice on refactoring goes
  to Michael Haggerty.
- Read the on-disk index file format and translate it to the current
  in memory format. This doesn't include reading any of the current
  extensions, which are now part of the main index. The code again
  is on github. [4] Thanks for reviewing the first steps to Thomas
  Rast.
- Read the cache-tree data (formerly an extension, now it's integrated
  with the rest of the directory data) from the new ondisk format.
  There are still a few optimizations to do in this algorithm.
- Started implementing the API (suggested by Duy), but it's still
  in the very early stages. There is one commit for this on GitHub [1],
  but it's a very early work in progress.
- Started implementing the writer, which extracts the directories from
  the in-memory format, and writes the header and the directories to
  disk. The algorithm uses a hash-table instead of a simple list,
  to avoid many corner cases.
- Implemented writing the file block to disk, and basic tests from the
  test suite are running fine, not including tests that require
  conflicted data or the cache-tree to work, which both are not
  implemented yet.
- Started implementing a patch to introduce a ce_namelen field in
  struct cache_entry and drop the name length from the flags. [5]
  Thanks to Junio, Duy and Thomas for reviews and suggestions for
  improving it.
- Implemented the cache-tree and conflict data writing to the
  index-v5 file.

== Work done in the last week ==

- Fixed a few bugs in cache-tree and conflict writing in my index-v5
  code.
- I visited Thomas in Zuerich this weekend, as some of you may
  have noticed on the mailing list, thanks a lot for the
  hospitality. And thanks also to Tomas for the company on saturday
  night.
- With the help of Thomas I fixed the resolve undo data writing,
  and a couple more bugs in my code, so that the test suite passes
  with INDEX_VERSION_DEFAULT = 5.
- We added a POC, for partial loading in git grep. This is still a
  pretty hacky implementation, but it demonstrates pretty well
  how much can be gained. Here are the timings Thomas posted on
  IRC yesterday. The improvements of ls-files are not drastic
  compared to index-v4, but git greps in subdirs benefit a lot
  from partial loading.

  Test                                      this tree
  -----------------------------------------------------------
  0002.2: v[23]: ls-files                   0.13(0.11+0.02)
  0002.3: v[23]: grep nonexistent -- subdir 0.12(0.10+0.02)
  0002.5: v4: ls-files                      0.11(0.09+0.01)
  0002.6: v4: grep nonexistent -- subdir    0.10(0.08+0.02)
  0002.8: v5: ls-files                      0.10(0.07+0.02)
  0002.9: v5: grep nonexistent -- subdir    0.01(0.00+0.00)

- All changes again are in my repository on github, in case someone
  wants to try them out. [4]

== Outlook for the next week ==

- Refactoring of the code. There are still some code smells in the
  current code, which need to be refactored.
- There are also still some possible optimizations for the code,
  which will be included.
- I'll also bring the reader up to date again, and make it possible
  for it to update a single index entry, to test the implementation 
  of rereading a entry when the crc is wrong (for the future partial
  write, both writer and reader side still have to be implemented).

[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
[4] https://github.com/tgummerer/git/tree/index-v5
[5] http://thread.gmane.org/gmane.comp.version-control.git/200997

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSoC] Designing a faster index format - Progress report week 14
  2012-07-23 19:08 [GSoC] Designing a faster index format - Progress report week 14 Thomas Gummerer
@ 2012-07-24  1:23 ` Nguyen Thai Ngoc Duy
  2012-07-24  7:52   ` Thomas Gummerer
  0 siblings, 1 reply; 3+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-07-24  1:23 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: git, Junio C Hamano, mhagger, trast, Robin Rosenberg,
	Tomas Carnecky

On Tue, Jul 24, 2012 at 2:08 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> - We added a POC, for partial loading in git grep. This is still a
>   pretty hacky implementation, but it demonstrates pretty well
>   how much can be gained. Here are the timings Thomas posted on
>   IRC yesterday. The improvements of ls-files are not drastic
>   compared to index-v4, but git greps in subdirs benefit a lot
>   from partial loading.
>
>   Test                                      this tree
>   -----------------------------------------------------------
>   0002.2: v[23]: ls-files                   0.13(0.11+0.02)
>   0002.3: v[23]: grep nonexistent -- subdir 0.12(0.10+0.02)
>   0002.5: v4: ls-files                      0.11(0.09+0.01)
>   0002.6: v4: grep nonexistent -- subdir    0.10(0.08+0.02)
>   0002.8: v5: ls-files                      0.10(0.07+0.02)
>   0002.9: v5: grep nonexistent -- subdir    0.01(0.00+0.00)
>

Is ls-files improvement not drastic because you do not limit subdir
like grep? I see Thomas Rast put similar partial loading hack to
ls-files.c so I assume it can partial load too. Is partial loading
still fast with when a lot of unmerged entries are present?
-- 
Duy

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSoC] Designing a faster index format - Progress report week 14
  2012-07-24  1:23 ` Nguyen Thai Ngoc Duy
@ 2012-07-24  7:52   ` Thomas Gummerer
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Gummerer @ 2012-07-24  7:52 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: git, Junio C Hamano, mhagger, trast, Robin Rosenberg,
	Tomas Carnecky

On 07/24, Nguyen Thai Ngoc Duy wrote:
> On Tue, Jul 24, 2012 at 2:08 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> Is ls-files improvement not drastic because you do not limit subdir
> like grep? I see Thomas Rast put similar partial loading hack to
> ls-files.c so I assume it can partial load too. Is partial loading
> still fast with when a lot of unmerged entries are present?
> -- 
> Duy

Yes, exactly, the ls-files was to show the overall performance of the
index for a full read. The improvement when limiting it to a subdir
is about the same as with git grep. Here are some more timings. I'm
using update-index instead of ls-files in this tests, with a
--force-rewrite option I added which writes the index even if there
are no changes, to test the performance of both the reader and the
writer.

Test                                        this tree      
-----------------------------------------------------------
0002.2: v[23]: update-index                 0.29(0.21+0.06)
0002.3: v[23]: grep nonexistent -- subdir   0.13(0.12+0.01)
0002.5: v4: update-index                    0.26(0.20+0.05)
0002.6: v4: grep nonexistent -- subdir      0.11(0.08+0.02)
0002.7: v4: ls-files -- subdir              0.10(0.07+0.02)
0002.9: v5: update-index                    0.19(0.11+0.07)
0002.10: v5: grep nonexistent -- subdir     0.01(0.00+0.00)
0002.11: v5: ls-files -- subdir             0.01(0.00+0.00)

Partial loading is still fast with unmerged entries, since we only
need to load files that belong to a specific directory there too.
I've created about 15,000 conflicts on the webkit repository, and
got the following times:

Test                                        this tree      
-----------------------------------------------------------
0002.2: v[23]: update-index                 0.30(0.18+0.10)
0002.3: v[23]: grep nonexistent -- subdir   0.13(0.09+0.04)
0002.5: v4: update-index                    0.26(0.22+0.04)
0002.6: v4: grep nonexistent -- subdir      0.11(0.07+0.03)
0002.7: v4: ls-files -- subdir              0.10(0.09+0.01)
0002.9: v5: update-index                    0.21(0.16+0.05)
0002.10: v5: grep nonexistent -- subdir     0.01(0.00+0.00)
0002.11: v5: ls-files -- subdir             0.01(0.00+0.00)

I could create more conflicts (~180,000 files are  in the index),
but I think 15,000 already is a number that's very unlikely to
be reached.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2012-07-24  7:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-07-23 19:08 [GSoC] Designing a faster index format - Progress report week 14 Thomas Gummerer
2012-07-24  1:23 ` Nguyen Thai Ngoc Duy
2012-07-24  7:52   ` Thomas Gummerer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).