Re: Re Question About Sorting the Index

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jon Forrest <nobozo@gmail.com>
To: K Jayatheerth <jayatheerthkulkarni2005@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Re Question About Sorting the Index
Date: Sat, 17 May 2025 10:20:29 -0700	[thread overview]
Message-ID: <49226f81-08df-4d35-861d-eca76fffa449@gmail.com> (raw)
In-Reply-To: <20250517034625.9100-1-jayatheerthkulkarni2005@gmail.com>

First of all, thanks for the very thoughtful response.

On 5/16/25 8:46 PM, K Jayatheerth wrote:

> Correct the index maintains `struct cache_entry` entries in sorted
> order by the file name. This is essential for fast lookup, diffing, and
> pathspec-based operations.

Somehow I hadn't realized that the index was also stored in
memory. However, my concerns about index lookups where each
entry in the index, in memory or on disk, isn't the same size
still apply. How can you do a binary search if you don't know
where the middle of the index is? Maybe I'd understand if I
had studied the source code in more detail.

> Not exactly. Git keeps the index in memory as a sorted array, so adding
> a new entry doesn't require a full resort just a binary search to find
> the right insertion point. Only when the index is written to disk does
> it serialize the in-memory array, which is already sorted.

I see. However, my comment above still applies.

>> If so, this seems like a lot of
>> work, especially since not all the entries
>> are the same size.
> 
> That's true in principle, but in practice, the memory layout of
> `cache_entry` objects and memory mapping makes it quite efficient,
> especially since typical index sizes are modest.

I guess you're right since the Git universe isn't clamoring
about poor performance when accessing the index.

> you're describing a secondary structure
> like a sparse index or log-structured merge pattern. It has its appeal,
> particularly for large repositories with high-churn working trees.

Right. I would think it would also add some degree of safety since
the index file would only be rewritten when garbage collection is
done. The "index index" would be much more volatile but since it
could be easily and quickly recreated any time, there would be no added
danger.

> many Git operations rely on the index being sorted. Reads would
> have to scan or use your "index index", which introduces more I/O and
> complexity.

I'm not convinced it would result in more I/O. The "index index" file
would be written more often, true, but each entry in it is much smaller
than the in the regular index.

Maybe a better approach would only keep the "index index" in memory.
This would result in less memory being used, with access to entries
in the regular index being fast since the "index index" entries
contain enough data to do random access to the desired entry
in the on-disk index.

On the third hand, all this might be unnecessary since the way
Git currently works is good enough for most (maybe all) people
most (maybe all) of the time.

> This assumes laziness in cleanup, which is reasonable for append-only
> systems. But Git today avoids keeping dead entries around for clarity
> and correctness (especially under concurrent access).

If Postgres can do it, so could Git.

> Agreed but it's worth noting that sorting a relatively small
> in-memory structure (as Git does now) is often cheaper than
> maintaining two files in sync (your full index and index index).

With my in memory "index index", this would become easier.

> You’re not missing much in fact, Git has features like the "split index",
> "untracked cache", and "index v4" that address similar performance issues
> through other means. Your idea would likely help in edge cases (very large
> repos, massive parallelism), but the added complexity and I/O overhead of
> maintaining multiple files likely outweighs the benefits for the common case.

That is indeed the question. I admittedly have no data either about how
large a repo would have to be for a change like this to have any effect,
or how many such repos exist. I notice that the currrent Git repo has
4638 entries, which I suppose really isn't all that large.

Thanks again,
Jon

next prev parent reply	other threads:[~2025-05-17 17:20 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-16 23:43 Question About Sorting the Index Jon Forrest
2025-05-17  3:46 ` Re " K Jayatheerth
2025-05-17 17:20   ` Jon Forrest [this message]
2025-05-17 18:36 ` Junio C Hamano
2025-05-17 18:48   ` Jon Forrest
2025-05-18  5:13     ` Elijah Newren
2025-05-18 15:13       ` Jon Forrest
2025-05-27 16:38   ` Jon Forrest
2025-05-28  2:34     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49226f81-08df-4d35-861d-eca76fffa449@gmail.com \
    --to=nobozo@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jayatheerthkulkarni2005@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).