Question About Sorting the Index

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Question About Sorting the Index
@ 2025-05-16 23:43 Jon Forrest
  2025-05-17  3:46 ` Re " K Jayatheerth
  2025-05-17 18:36 ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Jon Forrest @ 2025-05-16 23:43 UTC (permalink / raw)
  To: git

I've learned that entries in the index file "are
sorted in ascending order on the name field".

Am I right in thinking that this means that
every time a file is added to the index by
running "git add" the whole index file must
be resorted? If so, this seems like a lot of
work, especially since not all the entries
are the same size.

Has any thought been made about improving this,
such as perhaps having an "index index"? This
would be a separate file that contains the name
field of each entry, the location of where the entry
starts in the index, and the length of the entry.
I'll call this a partial index entry.
The "index index" would also be sorted by the name field.

With this approach, running "git add" would simply
append a full index entry to the index, and
append the partial entry to the "index index", which
would then be sorted. The full index would not be
sorted. I'm guessing this is the common path.

To delete a file from the index, I'd propose adding an
"deleted" bit to the full cache entry. When "git rm --cached"
is run, 2 things would happen:

1) The "deleted" bit would be turned on in the full index
entry for the file. The index itself will not be sorted.
Every so often, perhaps when "git fsck" is run, these
entries could be deleted. The full index won't have
to be resorted when this happens because it won't be
assumed to be in sorted order any longer.

2) The "index index" would be modified by removing the
partial entry for the file. This could be done by
writing the partial entries up to the entry being
deleted, and then the entries following. No sort would
be necessary because the "index index" is already sorted.

One drawback of this approach would be that since the "index index"
entries also won't be the same length, sorting it will still require
extra work. However, this wouldn't be any harder then sorting the full
index, and a lot less data wouldn't have to be moved around.

All this is so simple that I suspect that it's been considered before.
Am I missing something?

Cordially,
Jon Forrest

P.S. I'm trying to read the Git source code to get a better handle
on what actually goes on in the index but this is taking some time.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re Question About Sorting the Index
  2025-05-16 23:43 Question About Sorting the Index Jon Forrest
@ 2025-05-17  3:46 ` K Jayatheerth
  2025-05-17 17:20   ` Jon Forrest
  2025-05-17 18:36 ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: K Jayatheerth @ 2025-05-17  3:46 UTC (permalink / raw)
  To: nobozo; +Cc: git

> I've learned that entries in the index file "are
> sorted in ascending order on the name field".

Correct the index maintains `struct cache_entry` entries in sorted
order by the file name. This is essential for fast lookup, diffing, and
pathspec-based operations.

> Am I right in thinking that this means that
> every time a file is added to the index by
> running "git add" the whole index file must
> be resorted?

Not exactly. Git keeps the index in memory as a sorted array, so adding
a new entry doesn't require a full resort just a binary search to find
the right insertion point. Only when the index is written to disk does
it serialize the in-memory array, which is already sorted.

> If so, this seems like a lot of
> work, especially since not all the entries
> are the same size.

That's true in principle, but in practice, the memory layout of
`cache_entry` objects and memory mapping makes it quite efficient,
especially since typical index sizes are modest.

> Has any thought been made about improving this,
> such as perhaps having an "index index"? This
> would be a separate file that contains the name
> field of each entry, the location of where the entry
> starts in the index, and the length of the entry.
> I'll call this a partial index entry.

Interesting idea effectively you're describing a secondary structure
like a sparse index or log-structured merge pattern. It has its appeal,
particularly for large repositories with high-churn working trees.

> With this approach, running "git add" would simply
> append a full index entry to the index, and
> append the partial entry to the "index index", which
> would then be sorted. The full index would not be
> sorted. I'm guessing this is the common path.

This would reduce write amplification for `git add`, but it comes at a
cost: many Git operations rely on the index being sorted. Reads would
have to scan or use your "index index", which introduces more I/O and
complexity.

> To delete a file from the index, I'd propose adding an
> "deleted" bit to the full cache entry. When "git rm --cached"
> is run, 2 things would happen:
>
> 1) The "deleted" bit would be turned on in the full index
> entry for the file.

This assumes laziness in cleanup, which is reasonable for append-only
systems. But Git today avoids keeping dead entries around for clarity
and correctness (especially under concurrent access).

> 2) The "index index" would be modified by removing the
> partial entry for the file.

This makes sense for maintaining the primary lookup structure.

> One drawback of this approach would be that since the "index index"
> entries also won't be the same length, sorting it will still require
> extra work. However, this wouldn't be any harder then sorting the full
> index, and a lot less data wouldn't have to be moved around.

Agreed but it's worth noting that sorting a relatively small
in-memory structure (as Git does now) is often cheaper than
maintaining two files in sync (your full index and index index).

> All this is so simple that I suspect that it's been considered before.
> Am I missing something?

You’re not missing much in fact, Git has features like the "split index",
"untracked cache", and "index v4" that address similar performance issues
through other means. Your idea would likely help in edge cases (very large
repos, massive parallelism), but the added complexity and I/O overhead of
maintaining multiple files likely outweighs the benefits for the common case.

> P.S. I'm trying to read the Git source code to get a better handle
> on what actually goes on in the index but this is taking some time.

You’re in good company. The index code lives mostly in `read-cache.c`.
Definitely a dense but rewarding part of the Git source tree to explore.

Maybe you can start from functions like read_index_from() do_write_index() 
discard_index(). I think that would be an amazing start.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Re Question About Sorting the Index
  2025-05-17  3:46 ` Re " K Jayatheerth
@ 2025-05-17 17:20   ` Jon Forrest
  0 siblings, 0 replies; 9+ messages in thread
From: Jon Forrest @ 2025-05-17 17:20 UTC (permalink / raw)
  To: K Jayatheerth; +Cc: git

First of all, thanks for the very thoughtful response.

On 5/16/25 8:46 PM, K Jayatheerth wrote:

> Correct the index maintains `struct cache_entry` entries in sorted
> order by the file name. This is essential for fast lookup, diffing, and
> pathspec-based operations.

Somehow I hadn't realized that the index was also stored in
memory. However, my concerns about index lookups where each
entry in the index, in memory or on disk, isn't the same size
still apply. How can you do a binary search if you don't know
where the middle of the index is? Maybe I'd understand if I
had studied the source code in more detail.

> Not exactly. Git keeps the index in memory as a sorted array, so adding
> a new entry doesn't require a full resort just a binary search to find
> the right insertion point. Only when the index is written to disk does
> it serialize the in-memory array, which is already sorted.

I see. However, my comment above still applies.

>> If so, this seems like a lot of
>> work, especially since not all the entries
>> are the same size.
> 
> That's true in principle, but in practice, the memory layout of
> `cache_entry` objects and memory mapping makes it quite efficient,
> especially since typical index sizes are modest.

I guess you're right since the Git universe isn't clamoring
about poor performance when accessing the index.

> you're describing a secondary structure
> like a sparse index or log-structured merge pattern. It has its appeal,
> particularly for large repositories with high-churn working trees.

Right. I would think it would also add some degree of safety since
the index file would only be rewritten when garbage collection is
done. The "index index" would be much more volatile but since it
could be easily and quickly recreated any time, there would be no added
danger.

> many Git operations rely on the index being sorted. Reads would
> have to scan or use your "index index", which introduces more I/O and
> complexity.

I'm not convinced it would result in more I/O. The "index index" file
would be written more often, true, but each entry in it is much smaller
than the in the regular index.

Maybe a better approach would only keep the "index index" in memory.
This would result in less memory being used, with access to entries
in the regular index being fast since the "index index" entries
contain enough data to do random access to the desired entry
in the on-disk index.

On the third hand, all this might be unnecessary since the way
Git currently works is good enough for most (maybe all) people
most (maybe all) of the time.

> This assumes laziness in cleanup, which is reasonable for append-only
> systems. But Git today avoids keeping dead entries around for clarity
> and correctness (especially under concurrent access).

If Postgres can do it, so could Git.

> Agreed but it's worth noting that sorting a relatively small
> in-memory structure (as Git does now) is often cheaper than
> maintaining two files in sync (your full index and index index).

With my in memory "index index", this would become easier.

> You’re not missing much in fact, Git has features like the "split index",
> "untracked cache", and "index v4" that address similar performance issues
> through other means. Your idea would likely help in edge cases (very large
> repos, massive parallelism), but the added complexity and I/O overhead of
> maintaining multiple files likely outweighs the benefits for the common case.

That is indeed the question. I admittedly have no data either about how
large a repo would have to be for a change like this to have any effect,
or how many such repos exist. I notice that the currrent Git repo has
4638 entries, which I suppose really isn't all that large.

Thanks again,
Jon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-16 23:43 Question About Sorting the Index Jon Forrest
  2025-05-17  3:46 ` Re " K Jayatheerth
@ 2025-05-17 18:36 ` Junio C Hamano
  2025-05-17 18:48   ` Jon Forrest
  2025-05-27 16:38   ` Jon Forrest
  1 sibling, 2 replies; 9+ messages in thread
From: Junio C Hamano @ 2025-05-17 18:36 UTC (permalink / raw)
  To: Jon Forrest; +Cc: git

Jon Forrest <nobozo@gmail.com> writes:

> P.S. I'm trying to read the Git source code to get a better handle
> on what actually goes on in the index but this is taking some time.

Depending on the style of the learner, I often recommend reading the
very initial revision of Git, i.e.  e83c5163 (Initial revision of
"git", the information manager from hell, 2005-04-07), to quickly
get a feel of what various pieces there are and how they fit
together, by doing

    $ git checkout -b initial e83c5163316f89bfb

This would give you a mere 1244 lines spread across 11 files, which
is something that can be read from cover to cover in a single
sitting and see how various data structures relate to each other and
interact.  In the past 20 years, we of course have added features
and auxiliary data structures, and the various details of the
implementation have changed, but the really core part of the concept
haven't drifted too far from the original.

For example, the fact that the index is first read into core, each
entry is represented as a cache_entry in-core structure, and the
code accesses them via an array active_cache[], and that array is
sorted per pathnames, haven't changed.  In the 3-4 months that
followed that initial revision, we added higher-stage entries that
are used to represent a merge in progress (together with sorting
rules for them), and later we added prefix-compression for the
pathnames, but the basic structure of the index subsystem hasn't
changed all that much over the years.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-17 18:36 ` Junio C Hamano
@ 2025-05-17 18:48   ` Jon Forrest
  2025-05-18  5:13     ` Elijah Newren
  2025-05-27 16:38   ` Jon Forrest
  1 sibling, 1 reply; 9+ messages in thread
From: Jon Forrest @ 2025-05-17 18:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On 5/17/25 11:36 AM, Junio C Hamano wrote:
> Jon Forrest <nobozo@gmail.com> writes:
> 
>> P.S. I'm trying to read the Git source code to get a better handle
>> on what actually goes on in the index but this is taking some time.
> 
> Depending on the style of the learner, I often recommend reading the
> very initial revision of Git, i.e.  e83c5163 (Initial revision of
> "git", the information manager from hell, 2005-04-07), to quickly
> get a feel of what various pieces there are and how they fit
> together, by doing
> 
>      $ git checkout -b initial e83c5163316f89bfb

Thanks for the suggestion. I'll do that.

Meanwhile, do you see any merit to my idea?

Cordially,
Jon Forrest


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-17 18:48   ` Jon Forrest
@ 2025-05-18  5:13     ` Elijah Newren
  2025-05-18 15:13       ` Jon Forrest
  0 siblings, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2025-05-18  5:13 UTC (permalink / raw)
  To: Jon Forrest; +Cc: Junio C Hamano, git

On Sat, May 17, 2025 at 11:48 AM Jon Forrest <nobozo@gmail.com> wrote:
>
> On 5/17/25 11:36 AM, Junio C Hamano wrote:
> > Jon Forrest <nobozo@gmail.com> writes:
> >
> >> P.S. I'm trying to read the Git source code to get a better handle
> >> on what actually goes on in the index but this is taking some time.
> >
> > Depending on the style of the learner, I often recommend reading the
> > very initial revision of Git, i.e.  e83c5163 (Initial revision of
> > "git", the information manager from hell, 2005-04-07), to quickly
> > get a feel of what various pieces there are and how they fit
> > together, by doing
> >
> >      $ git checkout -b initial e83c5163316f89bfb
>
> Thanks for the suggestion. I'll do that.
>
> Meanwhile, do you see any merit to my idea?

Isn't the idea essentially the split index we already have?  (See the
"SPLIT INDEX" section of the git-update-index manual.)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-18  5:13     ` Elijah Newren
@ 2025-05-18 15:13       ` Jon Forrest
  0 siblings, 0 replies; 9+ messages in thread
From: Jon Forrest @ 2025-05-18 15:13 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Junio C Hamano, git

On 5/17/25 10:13 PM, Elijah Newren wrote:

> Isn't the idea essentially the split index we already have?  (See the
> "SPLIT INDEX" section of the git-update-index manual.)

I admittedly hadn't read that before. Yes, it's similar.
But, I think I better become more familiar with this man
page before I take up any more of this list's time.

One observation - wouldn't things be nice if the index entries
were (somehow) the same length.

Jon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-17 18:36 ` Junio C Hamano
  2025-05-17 18:48   ` Jon Forrest
@ 2025-05-27 16:38   ` Jon Forrest
  2025-05-28  2:34     ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Jon Forrest @ 2025-05-27 16:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 5/17/25 11:36 AM, Junio C Hamano wrote:

> For example, the fact that the index is first read into core, each
> entry is represented as a cache_entry in-core structure, and the
> code accesses them via an array active_cache[], and that array is
> sorted per pathnames, haven't changed.  

I had a thought. What if the in-memory cache were stored in a hash,
where the pathname is the key? That way nothing would have to be
sorted in order to lookup a particular file.

The on-disk index could be in any order.

I don't know how the overhead of creating the hash when a
git program starts compares to that of creating the
cache_entry struct and then later doing the sorting.
This seems like the key question.

Jon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Question About Sorting the Index
  2025-05-27 16:38   ` Jon Forrest
@ 2025-05-28  2:34     ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2025-05-28  2:34 UTC (permalink / raw)
  To: Jon Forrest; +Cc: git

Jon Forrest <nobozo@gmail.com> writes:

> On 5/17/25 11:36 AM, Junio C Hamano wrote:
>
>> For example, the fact that the index is first read into core, each
>> entry is represented as a cache_entry in-core structure, and the
>> code accesses them via an array active_cache[], and that array is
>> sorted per pathnames, haven't changed.  
>
> I had a thought. What if the in-memory cache were stored in a hash,
> where the pathname is the key? That way nothing would have to be
> sorted in order to lookup a particular file.

The index must be in sorted order in order to allow a set of tree
objects written out of it.  Hash may be good for looking up, but it
is not the best data structure for stable and efficient enumeration.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-05-28  2:34 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-16 23:43 Question About Sorting the Index Jon Forrest
2025-05-17  3:46 ` Re " K Jayatheerth
2025-05-17 17:20   ` Jon Forrest
2025-05-17 18:36 ` Junio C Hamano
2025-05-17 18:48   ` Jon Forrest
2025-05-18  5:13     ` Elijah Newren
2025-05-18 15:13       ` Jon Forrest
2025-05-27 16:38   ` Jon Forrest
2025-05-28  2:34     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).