[GSoC] Designing a faster index format

* [GSoC] Designing a faster index format - Progress report
@ 2012-05-23 12:21 Thomas Gummerer
  2012-05-24 20:01 ` Thomas Rast
  2012-05-25 11:31 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 13+ messages in thread
From: Thomas Gummerer @ 2012-05-23 12:21 UTC (permalink / raw)
  To: git; +Cc: trast, gitster, pclouds, mhagger



As Thomas Rast suggested yesterday on IRC, I'll give you a quick
overview of the work that has already been done in my GSoC project.


== Work done in the past 5 weeks ==

- Definition of a tentative index file v5 format [1]. This differs
  from the proposal in that the directory entries and file entries
  can be bisected, which allows a binary search. The exact bits for
  each section were also defined. To compress the index further,
  in addition to prefix compression, the stat data is stored only
  as a hash, since it is only ever used for comparison and the
  plain data is never needed.
  Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
  and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
  format v5. [2] The converter reads the index from a git repository
  and can output parts of the index (header, index entries as in
  git ls-files --debug, cache tree as in test-dump-cache-tree, or
  the reuc data). It then writes the v5 index file format to
  .git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
  reader's main purpose is to show the algorithm used to read the
  index entries in lexicographic order of their full names, which
  is required by the current in-memory format. Big thanks go to
  Michael Haggerty for reviewing this code and giving me advice on
  refactoring.
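
As a rough illustration of the stat-data point above: if the stat fields
are only ever compared against a fresh lstat() result, a fixed-size digest
of them is enough to detect a change. The sketch below is an assumption
made for illustration, not the draft format itself: struct stat_fields,
stat_digest() and stat_changed() are hypothetical names, and FNV-1a merely
stands in for whatever checksum the v5 draft specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the stat fields an index entry cares about. */
struct stat_fields {
	uint32_t ctime_sec, mtime_sec;
	uint32_t dev, ino, uid, gid, size;
};

/* Mix one 32-bit value into an FNV-1a hash, most significant byte first. */
static uint32_t fnv1a_u32(uint32_t hash, uint32_t value)
{
	int shift;

	for (shift = 24; shift >= 0; shift -= 8) {
		hash ^= (value >> shift) & 0xff;
		hash *= 16777619u;	/* 32-bit FNV prime */
	}
	return hash;
}

/* Digest of all stat fields; this is what would be stored on disk. */
static uint32_t stat_digest(const struct stat_fields *st)
{
	uint32_t h = 2166136261u;	/* 32-bit FNV offset basis */

	assert(st != NULL);
	h = fnv1a_u32(h, st->ctime_sec);
	h = fnv1a_u32(h, st->mtime_sec);
	h = fnv1a_u32(h, st->dev);
	h = fnv1a_u32(h, st->ino);
	h = fnv1a_u32(h, st->uid);
	h = fnv1a_u32(h, st->gid);
	h = fnv1a_u32(h, st->size);
	return h;
}

/* Nonzero if the current stat data no longer matches the stored digest. */
static int stat_changed(uint32_t stored_digest, const struct stat_fields *now)
{
	return stored_digest != stat_digest(now);
}
```

The trade-off is that the individual fields can no longer be recovered
from the index, which is acceptable only because nothing reads them back.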

== Outlook for the next week ==

- Start working on actual git code
- Read the header of the new format

[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-23 12:21 [GSoC] Designing a faster index format - Progress report Thomas Gummerer
@ 2012-05-24 20:01 ` Thomas Rast
  2012-05-24 20:57   ` Junio C Hamano
  2012-05-25 11:31 ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 13+ messages in thread
From: Thomas Rast @ 2012-05-24 20:01 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, pclouds, mhagger

Thomas asked for feedback on IRC, and Michael requested that I put it in
email for posterity, so here goes.

This is referring to 0c9214a from git://github.com/tgummerer/git.git,
where Thomas put the very early beginnings of wiring v5 parsing code
into read_index_from().

* On the high-level history making side of things: You structured your
  commits by "steps of implementation", but that sometimes does not give
  very useful intermediate steps. For example, now your index-v5~1 just
  prints "Header verified". That means it cannot be meaningfully tested,
  or judged by tests, or some such.

  That's ok as long as you are still drafting up things, but from
  experience the most workable approach is refactor-patch-improve (or
  sometimes patch-refactor-improve).  That is, you proceed in up to
  three steps, each spelled as one or more commits:

  - (Optionally, but in this case crucially) refactor the existing code
    into a shape where your feature patches become simpler.  This should
    be a true refactoring, i.e., not changing any observable behavior.

  - Optionally fix any bugs you encounter along the way.

  - Patch in your shiny new feature.

  (Tests should of course go with the last two steps.)

  This tends to make the feature patches simpler, and thus easier to
  review.  For a long project like yours, you may end up having several
  feature patches or even several iterations of this approach.

  For a great example of just this happening, consider Junio's index-v4
  series:

  - Preparations:
    varint: make it available outside the context of pack
    cache.h: hide on-disk index details
    read-cache.c: allow unaligned mapping of the index file
    read-cache.c: make create_from_disk() report number of bytes it consumed

  - Bugfix (ok, it only makes the error better)
    read-cache.c: report the header version we do not understand

  - Refactoring:
    read-cache.c: move code to copy ondisk to incore cache to a helper function
    read-cache.c: move code to copy incore to ondisk cache to a helper function

  - The feature itself:
    read-cache.c: read prefix-compressed names in index on-disk version v4
    read-cache.c: write prefix-compressed names in the index

  - The supporting mini-features required to use it:
    update-index: upgrade/downgrade on-disk index version
    unpack-trees: preserve the index file version of original

  - The docs:
    index-v4: document the entry format

  Ok, you can already see that the reality is not quite as simple as my
  initial explanations...

* The code structure itself:

  I like the split you made in the verify_hdr_* family, like so:

    -static int verify_hdr(struct cache_header *hdr, unsigned long size)
    +static int verify_hdr_version(struct cache_header *hdr, unsigned long size)
     {
    -       git_SHA_CTX c;
    -       unsigned char sha1[20];
            int hdr_version;

            if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
                    return error("bad signature");
            hdr_version = ntohl(hdr->hdr_version);
    -       if (hdr_version < 2 || 4 < hdr_version)
    +       if (hdr_version < 2 || 5 < hdr_version)
                    return error("bad index version %d", hdr_version);
    +       return 0;
    +}
    +
    +static int verify_hdr(struct cache_header *hdr, unsigned long size)
    +{
    +       git_SHA_CTX c;
    +       unsigned char sha1[20];
    +
            git_SHA1_Init(&c);
            git_SHA1_Update(&c, hdr, size - 20);
            git_SHA1_Final(sha1, &c);

  since it decouples version checking from the checksumming.  (Without
  the change to make 5 an acceptable version, this could actually go in
  a refactoring patch.)

  On the other hand the read_index_from() changes show none of that
  "function modularity".  Your results would be much better if you first
  refactored the meat of read_index_from to read_index_v2_from(),
  so that read_index_from() is essentially:

    mmap();
    verify_hdr();
    read_index_v2_from(...);
    munmap();

  Then, in another patch, wire in your shiny new read_index_v5_from()
  function.  Then the change required to read_index_from() will be a
  two-liner, which makes it *much* easier to see what is going on,
  especially if you compare with your f50c747b:

     read-cache.c | 118 ++++++++++++++++++++++++++++++++++--------------------

  80% of that is just reindentation because you are wrapping things in a
  version test!

* There's little code at this point, but I like the general direction
  you are taking: for now, start with a flat array of index entries and
  slurp everything in there, like the v3 code does.  Later patches can
  then improve on that to use the features of the new format.

* The struct{} layout should be such that it is not a pure coincidence
  that hdr_version has the same position in all structures.  You
  basically have two options to achieve this:

  - You write it as a concatenation of two structs, such as

      struct cache_version_header { unsigned hdr_signature, hdr_version; };
      struct cache_header_v2      { unsigned hdr_entries; };

    and similarly for v5.  The code then just reads the first struct,
    and dispatches to read the second one depending on the version.

  - You write it as a struct-within-a-struct, so that the generic header
    becomes a part of the version-specific ones, like

      struct cache_header_v2 {
        struct cache_version_header foo;
        unsigned hdr_entries;
      };

    Finding a good substitute for "foo" is left as an exercise for the
    reader.

* Finally, on style: Your indentation is off in parts, please use a
  tab-width of 8 spaces and use tabs for the leading indentation.
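
To make the two structural suggestions above concrete (a thin
read_index_from() that only dispatches, and a generic header embedded as
the first member of each version-specific header), here is a minimal
sketch. It is a toy, not git's read-cache.c: struct index_sketch, the v5
field names hdr_ndir/hdr_nfile and the member name "common" are
assumptions, the fields are kept in host byte order (the real code needs
ntohl()), and mmap() and checksum verification are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIGNATURE 0x44495243u	/* "DIRC", as in CACHE_SIGNATURE */

/* Generic part shared by every index version. */
struct cache_version_header {
	uint32_t hdr_signature;
	uint32_t hdr_version;
};

struct cache_header_v2 {
	struct cache_version_header common;	/* must stay the first member */
	uint32_t hdr_entries;
};

/* Hypothetical v5 header; the real fields are defined by the draft spec. */
struct cache_header_v5 {
	struct cache_version_header common;	/* must stay the first member */
	uint32_t hdr_ndir;
	uint32_t hdr_nfile;
};

/* Toy stand-in for struct index_state. */
struct index_sketch {
	unsigned version;
	unsigned nr;
};

static int read_index_v2_from(struct index_sketch *istate, const void *hdr)
{
	const struct cache_header_v2 *h = hdr;

	istate->nr = h->hdr_entries;	/* real code would parse entries here */
	return 0;
}

static int read_index_v5_from(struct index_sketch *istate, const void *hdr)
{
	const struct cache_header_v5 *h = hdr;

	istate->nr = h->hdr_nfile;	/* real code would parse entries here */
	return 0;
}

/* Reads only the shared prefix, then hands off to the per-version reader. */
static int read_index_from(struct index_sketch *istate, const void *hdr)
{
	const struct cache_version_header *common = hdr;

	assert(istate != NULL && hdr != NULL);
	if (common->hdr_signature != SKETCH_SIGNATURE)
		return -1;
	istate->version = common->hdr_version;
	switch (common->hdr_version) {
	case 2:
	case 3:
	case 4:
		return read_index_v2_from(istate, hdr);
	case 5:
		return read_index_v5_from(istate, hdr);
	default:
		return -1;
	}
}
```

With this shape, supporting a further version means adding one case label
and one reader function, and hdr_version sits at the same offset in every
header by construction rather than by coincidence.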

Cheers,
Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-24 20:01 ` Thomas Rast
@ 2012-05-24 20:57   ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2012-05-24 20:57 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, pclouds, mhagger

Thomas Rast <trast@student.ethz.ch> writes:

> Thomas asked for feedback on IRC, and Michael requested that I put it in
> email for posterity, so here goes.

I agree with everything you said in your review.

As to the example index-v4 series, it needs to be stressed that the series
did *not* happen in the order it was presented.  While the series was
being written, it looked pretty much like the "Ok, now we can read the
header but nothing else works" kind of "implementation steps".  Only very
few people can write a series in the right presentation order from the
beginning.

Thanks for a nice summary of what is happening.

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-23 12:21 [GSoC] Designing a faster index format - Progress report Thomas Gummerer
  2012-05-24 20:01 ` Thomas Rast
@ 2012-05-25 11:31 ` Nguyen Thai Ngoc Duy
  2012-05-25 20:15   ` Thomas Gummerer
  1 sibling, 1 reply; 13+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-25 11:31 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger

On Wed, May 23, 2012 at 7:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> == Outlook for the next week ==
>
> - Start working on actual git code
> - Read the header of the new format

I know it's out of scope, but it would be great if you could make
ls-files read the new index format directly. Having something that
actually works will ensure we don't overlook anything in the new format.
We can then learn from ls-files lesson (especially how to handle both
new/old format) and come up with api/in-core structures for the rest
of git later.
-- 
Duy

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-25 11:31 ` Nguyen Thai Ngoc Duy
@ 2012-05-25 20:15   ` Thomas Gummerer
  2012-05-26  4:09     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gummerer @ 2012-05-25 20:15 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, trast, gitster, mhagger

On 05/25, Nguyen Thai Ngoc Duy wrote:
> On Wed, May 23, 2012 at 7:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > == Outlook for the next week ==
> >
> > - Start working on actual git code
> > - Read the header of the new format
> 
> I know it's out of scope, but it would be great if you could make
> ls-files read the new index format directly. Having something that
> actually works will ensure we don't overlook anything in the new format.
> We can then learn from ls-files lesson (especially how to handle both
> new/old format) and come up with api/in-core structures for the rest
> of git later.

Thanks for your suggestion. How did you think this should be done?
Writing an extra function in ls-files, just for outputting? I don't
think it is necessary to write an extra function, since the result
from the read_index_from function in read-cache is used for that
anyway. Or did you have something different in mind, that I'm missing
here?

--
Thomas

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-25 20:15   ` Thomas Gummerer
@ 2012-05-26  4:09     ` Nguyen Thai Ngoc Duy
  2012-05-27  9:04       ` Thomas Gummerer
  0 siblings, 1 reply; 13+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-26  4:09 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger

On Sat, May 26, 2012 at 3:15 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> On 05/25, Nguyen Thai Ngoc Duy wrote:
>> On Wed, May 23, 2012 at 7:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> > == Outlook for the next week ==
>> >
>> > - Start working on actual git code
>> > - Read the header of the new format
>>
>> I know it's out of scope, but it would be great if you could make
>> ls-files read the new index format directly. Having something that
>> actually works will ensure we don't overlook anything in the new format.
>> We can then learn from ls-files lesson (especially how to handle both
>> new/old format) and come up with api/in-core structures for the rest
>> of git later.
>
> Thanks for your suggestion. How did you think this should be done?
> Writing an extra function in ls-files, just for outputting? I don't
> think it is necessary to write an extra function, since the result
> from the read_index_from function in read-cache is used for that
> anyway. Or did you have something different in mind, that I'm missing
> here?

No, read_index_from would go through the normal tree->list conversion.
What I'd like to see is what it looks like when a command accesses
index v5 directly in tree form, taking all advantages that tree-form
provides, and how we should deal with old index versions while still
supporting index v5 (without losing tree advantages)
-- 
Duy

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-26  4:09     ` Nguyen Thai Ngoc Duy
@ 2012-05-27  9:04       ` Thomas Gummerer
  2012-05-27  9:27         ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gummerer @ 2012-05-27  9:04 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, trast, gitster, mhagger



On 05/26, Nguyen Thai Ngoc Duy wrote:
> On Sat, May 26, 2012 at 3:15 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > On 05/25, Nguyen Thai Ngoc Duy wrote:
> >> On Wed, May 23, 2012 at 7:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> >> > == Outlook for the next week ==
> >> >
> >> > - Start working on actual git code
> >> > - Read the header of the new format
> >>
> >> I know it's out of scope, but it would be great if you could make
> >> ls-files read the new index format directly. Having something that
> >> actually works will ensure we don't overlook anything in the new format.
> >> We can then learn from ls-files lesson (especially how to handle both
> >> new/old format) and come up with api/in-core structures for the rest
> >> of git later.
> >
> > Thanks for your suggestion. How did you think this should be done?
> > Writing an extra function in ls-files, just for outputting? I don't
> > think it is necessary to write an extra function, since the result
> > from the read_index_from function in read-cache is used for that
> > anyway. Or did you have something different in mind, that I'm missing
> > here?
> 
> No, read_index_from would go through the normal tree->list conversion.
> What I'd like to see is what it looks like when a command accesses
> index v5 directly in tree form, taking all advantages that tree-form
> provides, and how we should deal with old index versions while still
> supporting index v5 (without losing tree advantages)

Ah ok, thanks for the clarification, I understand what you meant now.
I think however, that it's not very beneficial to do this conversion
now. git ls-files needs the whole index file anyway, so it's probably
not a very good test.

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-27  9:04       ` Thomas Gummerer
@ 2012-05-27  9:27         ` Junio C Hamano
  2012-05-27 12:23           ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2012-05-27  9:27 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, git, trast, mhagger

Thomas Gummerer <t.gummerer@gmail.com> writes:

>> No, read_index_from would go through the normal tree->list conversion.
>> What I'd like to see is what it looks like when a command accesses
>> index v5 directly in tree form, taking all advantages that tree-form
>> provides, and how we should deal with old index versions while still
>> supporting index v5 (without losing tree advantages)
>
> Ah ok, thanks for the clarification, I understand what you meant now.
> I think however, that it's not very beneficial to do this conversion
> now. git ls-files needs the whole index file anyway, so it's probably
> not a very good test.

Think about "git ls-files t/" and "git ls-files -u".  

The former obviously does *not* have to look at the whole thing, even
though the current code assumes the in-core data structure that has the
whole thing in a flat array.  IIRC, you had unmerged entries tucked at the
end outside the main index data, so the latter is also an interesting
demonstration of how wonderful the new data format could be.

Unlike other commands like status and diff that may need to look at things
other than the index, the core functionality of ls-files is purely about
the index.  I do not understand why you think it is not a good test case.
If an updated index structure cannot even improve ls-files, there is no
hope it can improve other more complex commands that need to walk the
index and something else in parallel.
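
To illustrate why "git ls-files t/" need not look at the whole index: when
the names are sorted bytewise, every entry under a directory prefix forms
one contiguous run, and the start of that run can be found by binary
search. The sketch below works on a plain sorted array of strings; the
names lower_bound() and prefix_range() are hypothetical, and the actual
v5 layout (separate directory and file sections) would allow the search to
run over directory entries instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First position in sorted paths[] whose name is >= key (bytewise). */
static size_t lower_bound(const char *const *paths, size_t n, const char *key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (strcmp(paths[mid], key) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Find the half-open range [*first, *end) of names that start with
 * prefix (e.g. "t/").  Only the matching entries are visited after the
 * binary search; everything before and after the run is skipped.
 */
static void prefix_range(const char *const *paths, size_t n,
			 const char *prefix, size_t *first, size_t *end)
{
	size_t plen = strlen(prefix);
	size_t i;

	assert(first != NULL && end != NULL);
	i = lower_bound(paths, n, prefix);
	*first = i;
	while (i < n && strncmp(paths[i], prefix, plen) == 0)
		i++;
	*end = i;
}
```

The cost is a logarithmic number of comparisons plus the size of the
result, instead of a pass over every entry in the index.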

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-27  9:27         ` Junio C Hamano
@ 2012-05-27 12:23           ` Nguyen Thai Ngoc Duy
  2012-05-28  8:26             ` Thomas Gummerer
  2012-05-29 13:29             ` Thomas Rast
  0 siblings, 2 replies; 13+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-27 12:23 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Junio C Hamano, git, trast, mhagger

On Sun, May 27, 2012 at 4:27 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>>> No, read_index_from would go through the normal tree->list conversion.
>>> What I'd like to see is what it looks like when a command accesses
>>> index v5 directly in tree form, taking all advantages that tree-form
>>> provides, and how we should deal with old index versions while still
>>> supporting index v5 (without losing tree advantages)
>>
>> Ah ok, thanks for the clarification, I understand what you meant now.
>> I think however, that it's not very beneficial to do this conversion
>> now. git ls-files needs the whole index file anyway, so it's probably
>> not a very good test.
>
> Think about "git ls-files t/" and "git ls-files -u".

Or harder things like "ls-files -- 't/*.sh'"

> The former obviously does *not* have to look at the whole thing, even
> though the current code assumes the in-core data structure that has the
> whole thing in a flat array.  IIRC, you had unmerged entries tucked at the
> end outside the main index data, so the latter is also an interesting
> demonstration of how wonderful the new data format could be.

and "ls-files -uc" can show how you combine unmerged entries back.
There's also an entry existence check deep in "ls-files -o" where you
can show how good bsearch on trees is, though that might be going too
far for an experiment because the call chain is really deep, way
outside ls-files.c:

show_files (builtin/ls-files.c)
 fill_directory (dir.c)
  read_directory
   read_directory_recursive
    treat_path
     treat_one_path
      treat_directory
       directory_exists_in_index
        cache_pos_name (read-cache.c)

I just want to make sure that by exercising the new format with some
real problems, we are certain we don't overlook anything in designing
the format (or else that it can be fixed before the format is finalized).
-- 
Duy

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-27 12:23           ` Nguyen Thai Ngoc Duy
@ 2012-05-28  8:26             ` Thomas Gummerer
  2012-05-29 13:29             ` Thomas Rast
  1 sibling, 0 replies; 13+ messages in thread
From: Thomas Gummerer @ 2012-05-28  8:26 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Junio C Hamano, git, trast, mhagger

On 05/27, Nguyen Thai Ngoc Duy wrote:
> On Sun, May 27, 2012 at 4:27 PM, Junio C Hamano <gitster@pobox.com> wrote:
> > Thomas Gummerer <t.gummerer@gmail.com> writes:
> >
> >>> No, read_index_from would go through the normal tree->list conversion.
> >>> What I'd like to see is what it looks like when a command accesses
> >>> index v5 directly in tree form, taking all advantages that tree-form
> >>> provides, and how we should deal with old index versions while still
> >>> supporting index v5 (without losing tree advantages)
> >>
> >> Ah ok, thanks for the clarification, I understand what you meant now.
> >> I think however, that it's not very beneficial to do this conversion
> >> now. git ls-files needs the whole index file anyway, so it's probably
> >> not a very good test.
> >
> > Think about "git ls-files t/" and "git ls-files -u".
> 
> Or harder things like "ls-files -- 't/*.sh'"
> 
> > The former obviously does *not* have to look at the whole thing, even
> > though the current code assumes the in-core data structure that has the
> > whole thing in a flat array.  IIRC, you had unmerged entries tucked at the
> > end outside the main index data, so the latter is also an interesting
> > demonstration of how wonderful the new data format could be.
> 
> and "ls-files -uc" can show how you combine unmerged entries back.
> There's also entry existence check deep in "ls-files -o" that you can
> show how good bsearch on trees is, though that might be going too far
> for an experiment because the call chain is really deep, way outside
> ls-files.c:
> 
> show_files (builtin/ls-files.c)
>  fill_directory (dir.c)
>   read_directory
>    read_directory_recursive
>     treat_path
>      treat_one_path
>       treat_directory
>        directory_exists_in_index
>         cache_pos_name (read-cache.c)
> 
> I just want to make sure that by exercising the new format with some
> real problems, we are certain we don't overlook anything in designing
> the format (or else could be fixed before finalizing it).

Ok, that makes sense. I just thought of git ls-files alone, for which
it wouldn't make a lot of sense. I'll try implementing this as the
next step.

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-27 12:23           ` Nguyen Thai Ngoc Duy
  2012-05-28  8:26             ` Thomas Gummerer
@ 2012-05-29 13:29             ` Thomas Rast
  2012-05-29 13:43               ` Nguyen Thai Ngoc Duy
  2012-05-29 18:33               ` Junio C Hamano
  1 sibling, 2 replies; 13+ messages in thread
From: Thomas Rast @ 2012-05-29 13:29 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Junio C Hamano, git, mhagger, Nguyen Thai Ngoc Duy

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Sun, May 27, 2012 at 4:27 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Thomas Gummerer <t.gummerer@gmail.com> writes:
>>
>>> Ah ok, thanks for the clarification, I understand what you meant now.
>>> I think however, that it's not very beneficial to do this conversion
>>> now. git ls-files needs the whole index file anyway, so it's probably
>>> not a very good test.
>>
>> Think about "git ls-files t/" and "git ls-files -u".
>
> Or harder things like "ls-files -- 't/*.sh'"
>
>> The former obviously does *not* have to look at the whole thing, even
>> though the current code assumes the in-core data structure that has the
>> whole thing in a flat array.  IIRC, you had unmerged entries tucked at the
>> end outside the main index data, so the latter is also an interesting
>> demonstration of how wonderful the new data format could be.
>
> and "ls-files -uc" can show how you combine unmerged entries back.
> There's also entry existence check deep in "ls-files -o" that you can
> show how good bsearch on trees is, though that might be going too far
> for an experiment because the call chain is really deep, way outside
> ls-files.c:
>
> show_files (builtin/ls-files.c)
>  fill_directory (dir.c)
>   read_directory
>    read_directory_recursive
>     treat_path
>      treat_one_path
>       treat_directory
>        directory_exists_in_index
>         cache_pos_name (read-cache.c)
>
> I just want to make sure that by exercising the new format with some
> real problems, we are certain we don't overlook anything in designing
> the format (or else could be fixed before finalizing it).

I envision an index API that more strictly controls access to the index.
Right now the API consists largely of read_index, write_index and the
flat the_index->cache array of entries.  Eventually it will have to be a
family of calls that support the v5 format, and boil down to suitable
wrappers for older ones.  For example (just tossing up ideas):

  index_open(struct index_state *index, int fd):
    initialization, checking, leaves the "real" data fields empty

  index_load_filtered(..., const char **pathspec):
    load everything needed to satisfy queries filtered by 'pathspec'

  index_for_each_entry(..., void (*callback)(struct cache_entry *ent)):
    like the current hand-rolled looping

  index_for_each_entry_filtered(..., void (*callback)(struct cache_entry *ent), char **pathspec):
    ditto but for a pathspec lookup

etc.
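
A minimal sketch of what the filtered iteration call could look like, for
discussion only. Everything here is an assumption layered on the idea
above: struct entry_sketch and struct index_view are toy stand-ins for
struct cache_entry and struct index_state, the extra "void *data" argument
is added so the callback can carry state, and matches_pathspec() does
plain prefix matching where real pathspecs also handle globs. The loop is
over a flat array, which is what a wrapper for the older formats would do;
a v5 backend could skip whole directories instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for struct cache_entry and struct index_state. */
struct entry_sketch {
	const char *name;
};

struct index_view {
	const struct entry_sketch *cache;
	size_t cache_nr;
};

typedef void (*entry_callback)(const struct entry_sketch *ent, void *data);

/* Prefix-only matching; a NULL or empty pathspec matches everything. */
static int matches_pathspec(const char *name, const char *const *pathspec)
{
	size_t i;

	if (!pathspec || !pathspec[0])
		return 1;
	for (i = 0; pathspec[i]; i++)
		if (strncmp(name, pathspec[i], strlen(pathspec[i])) == 0)
			return 1;
	return 0;
}

/* Call cb for every entry that matches pathspec, in index order. */
static void index_for_each_entry_filtered(const struct index_view *index,
					  entry_callback cb, void *data,
					  const char *const *pathspec)
{
	size_t i;

	assert(index != NULL && cb != NULL);
	for (i = 0; i < index->cache_nr; i++)
		if (matches_pathspec(index->cache[i].name, pathspec))
			cb(&index->cache[i], data);
}

/* Example callback: count the entries that were visited. */
static void count_entry(const struct entry_sketch *ent, void *data)
{
	(void)ent;
	++*(size_t *)data;
}
```

Because callers never touch the array directly, the same call could later
be backed by a partially loaded v5 index without changing them.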

Then I will twist Duy's words to mean that you should make git-ls-files
the poster child of this new API for development and profiling purposes
:-)

Actually converting the rest of the git code base to such an API is too
big an undertaking for the summer, so please don't stray on that path.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-29 13:29             ` Thomas Rast
@ 2012-05-29 13:43               ` Nguyen Thai Ngoc Duy
  2012-05-29 18:33               ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-29 13:43 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, Junio C Hamano, git, mhagger

On Tue, May 29, 2012 at 8:29 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> I envision an index API that more strictly controls access to the index.
> Right now the API consists largely of read_index, write_index and the
> flat the_index->cache array of entries.  Eventually it will have to be a
> family of calls that support the v5 format, and boil down to suitable
> wrappers for older ones.  For example (just tossing up ideas):
>
>  index_open(struct index_state *index, int fd):
>    initialization, checking, leaves the "real" data fields empty
>
>  index_load_filtered(..., const char **pathspec):
>    load everything needed to satisfy queries filtered by 'pathspec'
>
>  index_for_each_entry(..., void (*callback)(struct cache_entry *ent)):
>    like the current hand-rolled looping
>
>  index_for_each_entry_filtered(..., void (*callback)(struct cache_entry *ent), char **pathspec):
>    ditto but for a pathspec lookup
>
> etc.

I'm leaning towards a readdir-like interface with filter flags (e.g.
staged entries only, unmerged only, all...) for serial access and
something like stat() for a one-file check. That may be enough for the
ls-files conversion. With a readdir-like interface where entries are
sorted, I can help add pathspec support to tree_entry_interesting().
This function can be used by the new ls-files to filter pathspec
instead of the list-based pathspec_matches().
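
For comparison with the callback style, a readdir-like interface with
filter flags might look roughly like the sketch below. All names
(struct index_iter, index_iter_init(), index_iter_next(), the
INDEX_ITER_* flags) are hypothetical, and treating "staged" as "stage 0"
and "unmerged" as "stage 1 to 3" is an assumption about what the flags
would mean.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter flags for serial access. */
enum {
	INDEX_ITER_STAGED   = 1 << 0,	/* stage 0 entries */
	INDEX_ITER_UNMERGED = 1 << 1,	/* stage 1 to 3 entries */
	INDEX_ITER_ALL      = INDEX_ITER_STAGED | INDEX_ITER_UNMERGED
};

/* Toy entry: a name plus its merge stage. */
struct iter_entry {
	const char *name;
	int stage;
};

/* Iterator state, analogous to a DIR handle. */
struct index_iter {
	const struct iter_entry *entries;
	size_t nr, pos;
	unsigned flags;
};

static void index_iter_init(struct index_iter *it,
			    const struct iter_entry *entries, size_t nr,
			    unsigned flags)
{
	assert(it != NULL);
	it->entries = entries;
	it->nr = nr;
	it->pos = 0;
	it->flags = flags;
}

/* Like readdir(): return the next matching entry, or NULL at the end. */
static const struct iter_entry *index_iter_next(struct index_iter *it)
{
	assert(it != NULL);
	while (it->pos < it->nr) {
		const struct iter_entry *e = &it->entries[it->pos++];
		unsigned kind = e->stage ? INDEX_ITER_UNMERGED
					 : INDEX_ITER_STAGED;

		if (it->flags & kind)
			return e;
	}
	return NULL;
}
```

The caller drives the loop, so it can stop early or interleave the walk
with another sorted source, which a callback interface makes harder.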

One thing to consider is, will this new API work with old versions too
(i.e. can conversion be hidden behind the scene without significant
performance loss)? If it does, great we only have to maintain one API,
the future is bright. Otherwise, I don't know..

> Actually converting the rest of the git code base to such an API is too
> big an undertaking for the summer, so please don't stray on that path.

Agreed.
-- 
Duy

* Re: [GSoC] Designing a faster index format - Progress report
  2012-05-29 13:29             ` Thomas Rast
  2012-05-29 13:43               ` Nguyen Thai Ngoc Duy
@ 2012-05-29 18:33               ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2012-05-29 18:33 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, mhagger, Nguyen Thai Ngoc Duy

Thomas Rast <trast@student.ethz.ch> writes:

> Then I will twist Duy's words to mean that you should make git-ls-files
> the poster child of this new API for development and profiling purposes
> :-)

Exactly.

> Actually converting the rest of the git code base to such an API is too
> big an undertaking for the summer, so please don't stray on that path.

Didn't I say that this topic is too big for a GSoC task _way_ before
GSoC organization application started?  Without a meaningful portion
of the codebase using the newly proposed data structure and giving
demonstrably better performance figures, it is very hard to justify
that the project completed successfully at the end of the summer.
