* [PATCH] packfile: avoid access(3p) calls for missing packs @ 2025-05-16 8:55 Patrick Steinhardt 2025-05-16 18:34 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-16 8:55 UTC (permalink / raw) To: git; +Cc: Taylor Blau The function `add_packed_git()` adds a packfile to the known list of packs in case it exists. In case the packfile doesn't look like a pack though, or in case it doesn't exist, the function will simply return a `NULL` pointer. The ordering in which those checks are done is somewhat curious though: we first perform a couple of access(3p) syscalls for auxiliary files like ".keep" before we determine whether the ".pack" file itself exists. And if we see that the ".pack" file does not exist, we bail out and thus effectively discard all the information. This means that the access(3p) calls were a complete waste of time. The reason why we do this is likely because we reuse `p->pack_name` to derive the other files' names, as well, so that we only have to allocate this buffer once. As such, we have to compute the packfile name last so that it doesn't get overwritten by the other names. All of this likely doesn't matter in the general case: Git shouldn't end up looking too many nonexistent packfiles anyway. But there are edge cases where it _does_ matter. One such edge case that we have hit in our production systems was a stale "multi-pack-index" file that contained references to nonexistent packfiles. As there is no negative lookup cache this causes us to repeatedly look up the same packfiles via the multi-pack index, only to realize that they don't exist. This translated into hundreds of thousands of syscalls to access(3p) and stat(3p). While the issue was entirely self-made because the multi-pack index should have been regenerated, we can still reduce the number of syscalls by 75% in the case of nonexistent packfiles by reordering these calls. This requires us to restore the final packfile name after the access(3p) calls. But given that this is a mere memory copy it is very unlikely to matter in comparison to the four syscalls we performed beforehand. Another mitigation would be to introduce a negative lookup cache so that we don't try to add missing packfiles repeatedly. But that refactoring would be a bit more involved for dubious gains, so for now the author has decided against it. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- Hi, this patch addresses an issue we have recently seen in our production systems due to a stale MIDX. The MIDX contained entries for packfiles that didn't exist anymore, which caused Git to repeatedly look up those packfiles. Each missing packfile resulted in four repeated syscalls: three access(3p) calls to check for supporting data structures, and one call to stat(3p) to check for the packfile itself. The first three calls are essentially wasted though when the stat(3p) call itself fails, which is being fixed by this patch. I doubt that the patch matters in almost any repository, but given that the refactoring is trivial I thought to submit the patch regardless of that. Another step would be to introduce a negative lookup cache -- but that would be a bit more involved, so I decided against it for now as I don't want to introduce complexity for dubious gains. Thanks! Patrick --- packfile.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/packfile.c b/packfile.c index d91016f1c7f..870d48bd949 100644 --- a/packfile.c +++ b/packfile.c @@ -737,6 +737,12 @@ struct packed_git *add_packed_git(struct repository *r, const char *path, p = alloc_packed_git(r, alloc); memcpy(p->pack_name, path, path_len); + xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack"); + if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) { + free(p); + return NULL; + } + xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep"); if (!access(p->pack_name, F_OK)) p->pack_keep = 1; @@ -749,11 +755,8 @@ struct packed_git *add_packed_git(struct repository *r, const char *path, if (!access(p->pack_name, F_OK)) p->is_cruft = 1; + /* Restore the final packfile name. */ xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack"); - if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) { - free(p); - return NULL; - } /* ok, it looks sane as far as we can check without * actually mapping the pack file. --- base-commit: 1a8a4971cc6c179c4dd711f4a7f5d7178f4b3ab7 change-id: 20250516-pks-pack-avoid-stats-on-missing-8e3b75755cf0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-16 8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt @ 2025-05-16 18:34 ` Junio C Hamano 2025-05-19 6:52 ` Jeff King 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt 2 siblings, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2025-05-16 18:34 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Taylor Blau Patrick Steinhardt <ps@pks.im> writes: > The function `add_packed_git()` adds a packfile to the known list of > packs in case it exists. In case the packfile doesn't look like a pack > though, or in case it doesn't exist, the function will simply return a > `NULL` pointer. > > The ordering in which those checks are done is somewhat curious though: > we first perform a couple of access(3p) syscalls for auxiliary files > like ".keep" before we determine whether the ".pack" file itself exists. > And if we see that the ".pack" file does not exist, we bail out and thus > effectively discard all the information. This means that the access(3p) > calls were a complete waste of time. > > The reason why we do this is likely because we reuse `p->pack_name` to > derive the other files' names, as well, so that we only have to allocate > this buffer once. As such, we have to compute the packfile name last so > that it doesn't get overwritten by the other names. I vaguely recall that the order of checks between .idx and .pack were deliberately chosen, as we do not want to add and attempt to use .pack file before its associated .idx file is ready since without the latter the former is unusable at runtime. Side note: It may have been more important back when the name of the packfile was a hash of names of the objects in the pack, which meant that when you repack a quiescent and fully repacked repository twice with different parameters, you could have ended up with a packfile whose name is the same but the contents as a bytestream (hence the object offset) are different, making a stale .idx not pointing into the correct position in the new .pack file. These days the name of the packfile is based on the contents of the pack as a bytestream, so we no longer suffer in such a scenario. But I wouldn't be surprised if checks for other auxiliary files were added later to come next to .idx not .pack, exactly because .pack is the primary thing. > All of this likely doesn't matter in the general case: Git shouldn't end > up looking too many nonexistent packfiles anyway. In any case, we shouldn't attempt to use .pack when its associated files are missing. It is not about nonexistent but about incomplete (including "in the process of being written"). > But there are edge > cases where it _does_ matter. One such edge case that we have hit in our > production systems was a stale "multi-pack-index" file that contained > references to nonexistent packfiles. As there is no negative lookup > cache this causes us to repeatedly look up the same packfiles via the > multi-pack index, only to realize that they don't exist. This translated > into hundreds of thousands of syscalls to access(3p) and stat(3p). Ouch. Multi-pack index is a more recent invention, and I am not surprised if such bugs still lurk. And as you hinted, a solution may be a negative "known-not-to-exist" list, but given one such stale midx, would we try to open the same missing packfiles over and over within the same process? > while the issue was entirely self-made because the multi-pack index > should have been regenerated, we can still reduce the number of syscalls > by 75% in the case of nonexistent packfiles by reordering these calls. That sounds more like a band-aid, if we still do the remaining 25% that we somehow know would be unnecessary. Note that I do not know if the ordering that I think was originally introduced for correctness still matters for correctness. The first paragraph of this message was written without consulting "git log", lore archive, or even the current state of the source, but from vague memory. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-16 18:34 ` Junio C Hamano @ 2025-05-19 6:52 ` Jeff King 2025-05-19 15:46 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Jeff King @ 2025-05-19 6:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Taylor Blau On Fri, May 16, 2025 at 11:34:10AM -0700, Junio C Hamano wrote: > > The reason why we do this is likely because we reuse `p->pack_name` to > > derive the other files' names, as well, so that we only have to allocate > > this buffer once. As such, we have to compute the packfile name last so > > that it doesn't get overwritten by the other names. > > I vaguely recall that the order of checks between .idx and .pack > were deliberately chosen, as we do not want to add and attempt to > use .pack file before its associated .idx file is ready since > without the latter the former is unusable at runtime. It is definitely intentional and important that we discover packs by their .idx file first. The writing side uses the creation of the .idx file (via atomic rename) as the final step in making a pack accessible. Before that, a reader could see a .pack without its various accessory files (and so not realize it has a .keep file, for example). > Side note: It may have been more important back when the name of > the packfile was a hash of names of the objects in the pack, > which meant that when you repack a quiescent and fully repacked > repository twice with different parameters, you could have ended > up with a packfile whose name is the same but the contents as a > bytestream (hence the object offset) are different, making a > stale .idx not pointing into the correct position in the new > .pack file. These days the name of the packfile is based on the > contents of the pack as a bytestream, so we no longer suffer > in such a scenario. I don't think reading the .idx file first was sufficient to protect us there, since the .pack could be replaced racily after the reader opened the .idx. I don't recall if we were subject to that race back then (and it just didn't come up enough to worry about), or if repack would throw away an identically-named file. Anyway, as you note, it's no longer an issue. But I think that is mostly orthogonal to how the auxiliary files are handled. I think Patrick's guess is correct that the order we have is simply because it was most convenient to end up with ".pack" in the name. > > All of this likely doesn't matter in the general case: Git shouldn't end > > up looking too many nonexistent packfiles anyway. > > In any case, we shouldn't attempt to use .pack when its associated > files are missing. It is not about nonexistent but about incomplete > (including "in the process of being written"). This shouldn't be a race for files being written. The .idx is written last, so the .pack should always be there. And the midx is written after that, so again, the pack should always be there. At least for newly written packs. We might see the opposite: packs going away because they were repacked (and perhaps even a new midx generated). In that case you might see a race like: - process R opens the existing midx, which mentions pack P (but does not yet open P) - process W repacks, deleting P and generating a new midx with P' - process R wants object X; the midx claims it is in pack P, so it tries to open that but fails What should happen here is that R will fail to find the object at all, should hit reprepare_packed_git(), and will then pick up the new pack. I don't recall offhand whether reprepare_packed_git() will open the new midx. If it doesn't, we'd still find the object via the actual pack .idx, but we may end up consulting the now-stale midx over and over (so we'll get the right answer, but there may be some room for optimization). > > while the issue was entirely self-made because the multi-pack index > > should have been regenerated, we can still reduce the number of syscalls > > by 75% in the case of nonexistent packfiles by reordering these calls. > > That sounds more like a band-aid, if we still do the remaining 25% > that we somehow know would be unnecessary. Yeah, that was my gut reaction, too. add_packed_git() is not optimized[1] at all. But it shouldn't have to be, because it's meant to be called once per process. Even with the re-ordering, we'd still make a bunch of pointless stat() calls for the .pack file we know is not there. The code in prepare_midx_pack() converts a numeric pack id (referenced inside the midx) into a "struct packed_git" pointer, caching the results in multi_pack_index->packs. That field holds NULL for "we have not looked it up yet" or a valid pointer to a packed_git. It probably needs to hold a third state: "we tried and failed". Something like this (large untested) patch: diff --git a/midx.c b/midx.c index 3d0015f782..354b1f886c 100644 --- a/midx.c +++ b/midx.c @@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m) munmap((unsigned char *)m->data, m->data_len); for (i = 0; i < m->num_packs; i++) { - if (m->packs[i]) + if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) m->packs[i]->multi_pack_index = 0; } FREE_AND_NULL(m->packs); @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) + return 1; if (m->packs[pack_int_id]) return 0; @@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, strbuf_release(&pack_name); strbuf_release(&key); - if (!p) + if (!p) { + m->packs[pack_int_id] = (void *)(intptr_t)-1; return 1; + } p->multi_pack_index = 1; m->packs[pack_int_id] = p; -Peff [1] There probably are optimization opportunities in add_packed_git(). I don't think re-ordering will help much in the common case that we actually do have the pack. But really, most callers do not care about these auxiliary files at all! We could simply skip them during the initial setup and lazy-load them via accessor functions. I _think_ that should be OK with respect to races. For a newly added pack, we know they will always be in place before the matching .idx file (per the logic I outlined above). For a pack that goes away, we might racily fail to see its auxiliary file. But that is mostly true now (we might see its .idx, and then the .promisor file is deleted before we call access()). It does increase the size of that window, though (and in particular lets it happen even if the pack has actually been opened). I'm not sure how much that would matter in practice. OTOH, I'm not sure that saving a few access() calls would be all that meaningful, unless you have a zillion packs. And if you have a zillion packs, you're likely to get much bigger speedups in other areas by repacking them anyway. So it hasn't really been an area I've pursued before. ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-19 6:52 ` Jeff King @ 2025-05-19 15:46 ` Junio C Hamano 2025-05-20 6:45 ` Patrick Steinhardt 2025-05-23 1:02 ` Taylor Blau 2 siblings, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2025-05-19 15:46 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Steinhardt, git, Taylor Blau Jeff King <peff@peff.net> writes: > The code in prepare_midx_pack() converts a numeric pack id (referenced > inside the midx) into a "struct packed_git" pointer, caching the results > in multi_pack_index->packs. That field holds NULL for "we have not > looked it up yet" or a valid pointer to a packed_git. It probably needs > to hold a third state: "we tried and failed". Yeah, this was exactly why I asked "are we having repeated failures in the same process", and I am happy with this direction. > Something like this (large untested) patch: > > diff --git a/midx.c b/midx.c > index 3d0015f782..354b1f886c 100644 > --- a/midx.c > +++ b/midx.c > @@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m) > munmap((unsigned char *)m->data, m->data_len); > > for (i = 0; i < m->num_packs; i++) { > - if (m->packs[i]) > + if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) > m->packs[i]->multi_pack_index = 0; > } > FREE_AND_NULL(m->packs); > @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > pack_int_id = midx_for_pack(&m, pack_int_id); > > + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > + return 1; > if (m->packs[pack_int_id]) > return 0; > > @@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > strbuf_release(&pack_name); > strbuf_release(&key); > > - if (!p) > + if (!p) { > + m->packs[pack_int_id] = (void *)(intptr_t)-1; > return 1; > + } > > p->multi_pack_index = 1; > m->packs[pack_int_id] = p; ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-19 6:52 ` Jeff King 2025-05-19 15:46 ` Junio C Hamano @ 2025-05-20 6:45 ` Patrick Steinhardt 2025-05-22 5:28 ` Jeff King 2025-05-23 1:02 ` Taylor Blau 2 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-20 6:45 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git, Taylor Blau On Mon, May 19, 2025 at 02:52:21AM -0400, Jeff King wrote: > On Fri, May 16, 2025 at 11:34:10AM -0700, Junio C Hamano wrote: > > > > The reason why we do this is likely because we reuse `p->pack_name` to > > > derive the other files' names, as well, so that we only have to allocate > > > this buffer once. As such, we have to compute the packfile name last so > > > that it doesn't get overwritten by the other names. > > > > I vaguely recall that the order of checks between .idx and .pack > > were deliberately chosen, as we do not want to add and attempt to > > use .pack file before its associated .idx file is ready since > > without the latter the former is unusable at runtime. > > It is definitely intentional and important that we discover packs by > their .idx file first. The writing side uses the creation of the .idx > file (via atomic rename) as the final step in making a pack accessible. > Before that, a reader could see a .pack without its various accessory > files (and so not realize it has a .keep file, for example). Yup. But that order doesn't change anyway in this patch. What changes is only the order for auxiliary and non-mandatory ".keep", ".promisor" and ".mtimes" files. > > Side note: It may have been more important back when the name of > > the packfile was a hash of names of the objects in the pack, > > which meant that when you repack a quiescent and fully repacked > > repository twice with different parameters, you could have ended > > up with a packfile whose name is the same but the contents as a > > bytestream (hence the object offset) are different, making a > > stale .idx not pointing into the correct position in the new > > .pack file. These days the name of the packfile is based on the > > contents of the pack as a bytestream, so we no longer suffer > > in such a scenario. > > I don't think reading the .idx file first was sufficient to protect us > there, since the .pack could be replaced racily after the reader opened > the .idx. I don't recall if we were subject to that race back then (and > it just didn't come up enough to worry about), or if repack would throw > away an identically-named file. Anyway, as you note, it's no longer an > issue. > > But I think that is mostly orthogonal to how the auxiliary files are > handled. I think Patrick's guess is correct that the order we have is > simply because it was most convenient to end up with ".pack" in the > name. Taking a step back though: do we always ensure the order in which we those auxiliary files are created and deleted? If we know that those are always: - Moved into place before their ".idx" file. - Deleted after deleting their ".pack" file. Then reordering may cause us to race indeed so that we see the packfile, but miss the auxiliary data structures. `unlink_pack_path()` does seem to ensure the latter property, as both ".idx" and ".pack" are deleted first. I'm not quite sure about the former, but it seems like we also do this. So I think that the proposed patch is wrong. There should definitely be a comment in that function though to say that this order is intentional and not merely an optimization. [snip] > > > while the issue was entirely self-made because the multi-pack index > > > should have been regenerated, we can still reduce the number of syscalls > > > by 75% in the case of nonexistent packfiles by reordering these calls. > > > > That sounds more like a band-aid, if we still do the remaining 25% > > that we somehow know would be unnecessary. > > Yeah, that was my gut reaction, too. add_packed_git() is not > optimized[1] at all. But it shouldn't have to be, because it's meant to > be called once per process. Even with the re-ordering, we'd still make a > bunch of pointless stat() calls for the .pack file we know is not there. It certainly is, as mentioned in the cover letter. It was a supposedly cheap win as I didn't want to introduce a negative lookup cache for an edge case like this. But the patch below looks simple enough -- I feared it would be a lot more involved. > The code in prepare_midx_pack() converts a numeric pack id (referenced > inside the midx) into a "struct packed_git" pointer, caching the results > in multi_pack_index->packs. That field holds NULL for "we have not > looked it up yet" or a valid pointer to a packed_git. It probably needs > to hold a third state: "we tried and failed". > > Something like this (large untested) patch: > > diff --git a/midx.c b/midx.c > index 3d0015f782..354b1f886c 100644 > --- a/midx.c > +++ b/midx.c > @@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m) > munmap((unsigned char *)m->data, m->data_len); > > for (i = 0; i < m->num_packs; i++) { > - if (m->packs[i]) > + if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) > m->packs[i]->multi_pack_index = 0; > } > FREE_AND_NULL(m->packs); > @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > pack_int_id = midx_for_pack(&m, pack_int_id); > > + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > + return 1; > if (m->packs[pack_int_id]) > return 0; > > @@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > strbuf_release(&pack_name); > strbuf_release(&key); > > - if (!p) > + if (!p) { > + m->packs[pack_int_id] = (void *)(intptr_t)-1; > return 1; > + } > > p->multi_pack_index = 1; > m->packs[pack_int_id] = p; > > -Peff > > [1] There probably are optimization opportunities in add_packed_git(). I > don't think re-ordering will help much in the common case that we > actually do have the pack. But really, most callers do not care > about these auxiliary files at all! We could simply skip them during > the initial setup and lazy-load them via accessor functions. > > I _think_ that should be OK with respect to races. For a newly added > pack, we know they will always be in place before the matching .idx > file (per the logic I outlined above). For a pack that goes away, we > might racily fail to see its auxiliary file. But that is mostly true > now (we might see its .idx, and then the .promisor file is deleted > before we call access()). It does increase the size of that window, > though (and in particular lets it happen even if the pack has > actually been opened). I'm not sure it would be okay, as mentioned above. The current ordering ensures that we always see auxiliary data structures in case the ".pack" file exists. If we started to cache then that wouldn't be the case anymore. > I'm not sure how much that would matter in practice. OTOH, I'm not > sure that saving a few access() calls would be all that meaningful, > unless you have a zillion packs. And if you have a zillion packs, > you're likely to get much bigger speedups in other areas by > repacking them anyway. So it hasn't really been an area I've > pursued before. It probably doesn't in the general case at all. Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-20 6:45 ` Patrick Steinhardt @ 2025-05-22 5:28 ` Jeff King 0 siblings, 0 replies; 53+ messages in thread From: Jeff King @ 2025-05-22 5:28 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Junio C Hamano, git, Taylor Blau On Tue, May 20, 2025 at 08:45:06AM +0200, Patrick Steinhardt wrote: > Taking a step back though: do we always ensure the order in which we > those auxiliary files are created and deleted? If we know that those are > always: > > - Moved into place before their ".idx" file. > > - Deleted after deleting their ".pack" file. > > Then reordering may cause us to race indeed so that we see the packfile, > but miss the auxiliary data structures. `unlink_pack_path()` does seem > to ensure the latter property, as both ".idx" and ".pack" are deleted > first. I'm not quite sure about the former, but it seems like we also do > this. > > So I think that the proposed patch is wrong. There should definitely be > a comment in that function though to say that this order is intentional > and not merely an optimization. Hmm, interesting. Yes, I agree that deleting the .pack before any other auxiliary files should catch the race, given the ordering on the read side in add_packed_git(). But I don't think we ever discussed that ordering or its implications. So I assumed all bets were off with respect to racing. However, it seems we lucked into doing the correct thing by virtue of adding to the end of the list in unlink_pack_path() repeatedly (though I think if you dig all the way back into git-repack.sh, when there were just .keep files, there were periods where we did it in the wrong order). But if we are doing it in the best order, whether correct or not, I agree we should not disrupt that. > > [1] There probably are optimization opportunities in add_packed_git(). I > > don't think re-ordering will help much in the common case that we > > actually do have the pack. But really, most callers do not care > > about these auxiliary files at all! We could simply skip them during > > the initial setup and lazy-load them via accessor functions. > > > > I _think_ that should be OK with respect to races. For a newly added > > pack, we know they will always be in place before the matching .idx > > file (per the logic I outlined above). For a pack that goes away, we > > might racily fail to see its auxiliary file. But that is mostly true > > now (we might see its .idx, and then the .promisor file is deleted > > before we call access()). It does increase the size of that window, > > though (and in particular lets it happen even if the pack has > > actually been opened). > > I'm not sure it would be okay, as mentioned above. The current ordering > ensures that we always see auxiliary data structures in case the ".pack" > file exists. If we started to cache then that wouldn't be the case > anymore. Yeah, you're right. I thought it was already racy, but it luckily is not. It does still feel unfortunate that we have to spend syscalls loading information that won't be used by most commands, but it may not be a measurable performance issue. I think if we did want to lazy-load that information, we'd probably need to do an extra stat() of .pack afterwards to check for the deletion race (so repack and pack-objects would pay the penalty, but most other commands would not). Anyway, that can be explored another day. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-19 6:52 ` Jeff King 2025-05-19 15:46 ` Junio C Hamano 2025-05-20 6:45 ` Patrick Steinhardt @ 2025-05-23 1:02 ` Taylor Blau 2025-05-23 2:03 ` Jeff King 2 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-23 1:02 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Patrick Steinhardt, git On Mon, May 19, 2025 at 02:52:21AM -0400, Jeff King wrote: > On Fri, May 16, 2025 at 11:34:10AM -0700, Junio C Hamano wrote: > > > > The reason why we do this is likely because we reuse `p->pack_name` to > > > derive the other files' names, as well, so that we only have to allocate > > > this buffer once. As such, we have to compute the packfile name last so > > > that it doesn't get overwritten by the other names. > > > > I vaguely recall that the order of checks between .idx and .pack > > were deliberately chosen, as we do not want to add and attempt to > > use .pack file before its associated .idx file is ready since > > without the latter the former is unusable at runtime. > > It is definitely intentional and important that we discover packs by > their .idx file first. The writing side uses the creation of the .idx > file (via atomic rename) as the final step in making a pack accessible. > Before that, a reader could see a .pack without its various accessory > files (and so not realize it has a .keep file, for example). Yup. Or seeing a pack before its .bitmap is written, in which case any clones or fetches unfortunate enough to take place after the pack was read but before the .bitmap was written would be significantly slower than they otherwise would have been with bitmaps. > > > All of this likely doesn't matter in the general case: Git shouldn't end > > > up looking too many nonexistent packfiles anyway. > > > > In any case, we shouldn't attempt to use .pack when its associated > > files are missing. It is not about nonexistent but about incomplete > > (including "in the process of being written"). > > This shouldn't be a race for files being written. The .idx is written > last, so the .pack should always be there. And the midx is written after > that, so again, the pack should always be there. At least for newly > written packs. > > We might see the opposite: packs going away because they were repacked > (and perhaps even a new midx generated). In that case you might see > a race like: > > - process R opens the existing midx, which mentions pack P (but does > not yet open P) > > - process W repacks, deleting P and generating a new midx with P' > > - process R wants object X; the midx claims it is in pack P, so it > tries to open that but fails > > What should happen here is that R will fail to find the object at all, > should hit reprepare_packed_git(), and will then pick up the new pack. > > I don't recall offhand whether reprepare_packed_git() will open the new > midx. If it doesn't, we'd still find the object via the actual pack > .idx, but we may end up consulting the now-stale midx over and over (so > we'll get the right answer, but there may be some room for > optimization). This all sounds right to me. And we do end up loading a new MIDX via reprepare_packed_git(): that function calls prepare_packed_git() (which doesn't immediately return, since we just zero'd out the r->objects->packed_git_initialized flag). We then enumerate the ODBs, calling prepare_multi_pack_index_one() and prepare_packed_git_one() for each. From there we end up in load_multi_pack_index(), which gets a freshened view of the MIDX file. If all goes well there, then prepare_multi_pack_index_one() assigns it to r->objects->multi_pack_index as expected. > The code in prepare_midx_pack() converts a numeric pack id (referenced > inside the midx) into a "struct packed_git" pointer, caching the results > in multi_pack_index->packs. That field holds NULL for "we have not > looked it up yet" or a valid pointer to a packed_git. It probably needs > to hold a third state: "we tried and failed". Yup, that seems like a good direction to me. > [1] There probably are optimization opportunities in add_packed_git(). I > don't think re-ordering will help much in the common case that we > actually do have the pack. But really, most callers do not care > about these auxiliary files at all! We could simply skip them during > the initial setup and lazy-load them via accessor functions. Nice idea. I initially though, "that's not right, we definitely need to populate the flag bits, e.g., p->is_cruft". But that was before I read "and lazy-load them via accessor functions". That turns those fields into a tri-state, which is a nice way to avoid this altogether. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] packfile: avoid access(3p) calls for missing packs 2025-05-23 1:02 ` Taylor Blau @ 2025-05-23 2:03 ` Jeff King 0 siblings, 0 replies; 53+ messages in thread From: Jeff King @ 2025-05-23 2:03 UTC (permalink / raw) To: Taylor Blau; +Cc: Junio C Hamano, Patrick Steinhardt, git On Thu, May 22, 2025 at 09:02:06PM -0400, Taylor Blau wrote: > > I don't recall offhand whether reprepare_packed_git() will open the new > > midx. If it doesn't, we'd still find the object via the actual pack > > .idx, but we may end up consulting the now-stale midx over and over (so > > we'll get the right answer, but there may be some room for > > optimization). > > This all sounds right to me. And we do end up loading a new MIDX via > reprepare_packed_git(): that function calls prepare_packed_git() (which > doesn't immediately return, since we just zero'd out the > r->objects->packed_git_initialized flag). We then enumerate the ODBs, > calling prepare_multi_pack_index_one() and prepare_packed_git_one() for > each. > > From there we end up in load_multi_pack_index(), which gets a freshened > view of the MIDX file. If all goes well there, then > prepare_multi_pack_index_one() assigns it to > r->objects->multi_pack_index as expected. One thing that puzzled me here is how prepare_multi_pack_index_one() deals with an existing r->objects->multi_pack_index: m = load_multi_pack_index(r, object_dir, local); if (m) { struct multi_pack_index *mp = r->objects->multi_pack_index; if (mp) { m->next = mp->next; mp->next = m; } else r->objects->multi_pack_index = m; return 1; } I don't see reprepare_packed_git() zero-ing out the existing midx pointer. So would we hit that "if (mp)" block, in which case we seem to create a loop of next pointers? But the old midx would still be the first one in the loop. Would we still consult it? Or am I just reading this totally wrong? -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs 2025-05-16 8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt 2025-05-16 18:34 ` Junio C Hamano @ 2025-05-20 9:53 ` Patrick Steinhardt 2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt ` (2 more replies) 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt 2 siblings, 3 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-20 9:53 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano Hi, this patch addresses an issue we have recently seen in our production systems due to a stale MIDX. The MIDX contained entries for packfiles that didn't exist anymore, which caused Git to repeatedly look up those packfiles. Each missing packfile resulted in four repeated syscalls: three access(3p) calls to check for supporting data structures, and one call to stat(3p) to check for the packfile itself. The first three calls are essentially wasted though when the stat(3p) call itself fails, which is being fixed by this patch. I doubt that the patch matters in almost any repository, but given that the refactoring is trivial I thought to submit the patch regardless of that. Another step would be to introduce a negative lookup cache -- but that would be a bit more involved, so I decided against it for now as I don't want to introduce complexity for dubious gains. Changes in v2: - Drop the patch that reorders syscalls and add a comment explaining why the order is important. - Add a negative lookup cache for indexed packfiles. - Link to v1: https://lore.kernel.org/r/20250516-pks-pack-avoid-stats-on-missing-v1-1-e2ef4d8798a3@pks.im Thanks! Patrick --- Patrick Steinhardt (2): packfile: explain ordering of how we look up auxiliary pack files midx: stop repeatedly looking up nonexistent packfiles midx.c | 10 ++++++++-- packfile.c | 11 +++++++++++ 2 files changed, 19 insertions(+), 2 deletions(-) Range-diff versus v1: 1: 31ffb3712ca < -: ----------- packfile: avoid access(3p) calls for missing packs -: ----------- > 1: 6125b84389d packfile: explain ordering of how we look up auxiliary pack files -: ----------- > 2: 8cb82a771c0 midx: stop repeatedly looking up nonexistent packfiles --- base-commit: 1a8a4971cc6c179c4dd711f4a7f5d7178f4b3ab7 change-id: 20250516-pks-pack-avoid-stats-on-missing-8e3b75755cf0 ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt @ 2025-05-20 9:53 ` Patrick Steinhardt 2025-05-23 1:03 ` Taylor Blau 2025-05-20 9:53 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt 2025-05-21 13:24 ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano 2 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-20 9:53 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano When adding a packfile to an object database we perform four syscalls: - Three calls to access(3p) are done to check for auxiliary data structures. - One call to stat(3p) is done to check for the ".pack" itself. One curious bit is that we perform the access(3p) calls before checking for the packfile itself, but if the packfile doesn't exist we discard all results. The access(3p) calls are thus essentially wasted, so one may be triggered to reorder those calls so that we can short-circuit the other syscalls in case the packfile does not exist. The order in which we look up files is quite important though to help avoid races: - When installing a packfile we move auxiliary data structures into place before we install the ".idx" file. - When deleting a packfile we first delete the ".idx" and ".pack" files before deleting auxiliary data structures. As such, to avoid any races with concurrently created or deleted packs we need to make sure that we _first_ read auxiliary data structures before we read the corresponding ".idx" or ".pack" file. Otherwise it may easily happen that we return a populated but misclassified pack. Add a comment to `add_packed_git()` to make future readers aware of this ordering requirement. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- packfile.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/packfile.c b/packfile.c index d91016f1c7f..933036e2606 100644 --- a/packfile.c +++ b/packfile.c @@ -737,6 +737,17 @@ struct packed_git *add_packed_git(struct repository *r, const char *path, p = alloc_packed_git(r, alloc); memcpy(p->pack_name, path, path_len); + /* + * Note that we have to check auxiliary data structures before we check + * for the ".pack" file to exist to avoid races with a packfile that is + * in the process of being deleted. The ".pack" file is unlinked before + * its auxiliary data structures, so we know that we either get a + * consistent snapshot of all data structures or that we'll fail to + * stat(3p) the packfile itself and thus return `NULL`. + * + * As such, we cannot bail out before the access(3p) calls in case the + * packfile doesn't exist without doing two stat(3p) calls for it. + */ xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep"); if (!access(p->pack_name, F_OK)) p->pack_keep = 1; -- 2.49.0.1151.ga128411c76.dirty ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files 2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt @ 2025-05-23 1:03 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-23 1:03 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano On Tue, May 20, 2025 at 11:53:09AM +0200, Patrick Steinhardt wrote: > Add a comment to `add_packed_git()` to make future readers aware of this > ordering requirement. Thanks, I am really glad that you added this comment to aide future readers. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt 2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt @ 2025-05-20 9:53 ` Patrick Steinhardt 2025-05-22 5:32 ` Jeff King 2025-05-21 13:24 ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano 2 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-20 9:53 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano The multi-pack index acts as a cache across a set of packfiles so that we can quickly look up which of those packfiles contains a given object. As such, the multi-pack index naturally needs to be updated every time one of the packfiles goes away, or otherwise the multi-pack index has grown stale. A stale multi-pack index should be handled gracefully by Git though, and in fact it is: if the indexed pack cannot be found we simply ignore it and eventually we fall back to doing the object lookup by just iterating through all packs, even if those aren't indexed. But while this fallback works, it has one significant downside: we don't cache the fact that a pack has vanished. This leads to us repeatedly trying to look up the same pack only to realize that it (still) doesn't exist. This issue can be easily demonstrated by creating a repository with a stale multi-pack index and a couple of objects. We do so by creating a repository with two packfiles, both of which are indexed by the multi-pack index, and then repack those two packfiles. Note that we have to move the multi-pack-index before doing the final repack, as Git knows to delete it otherwise. $ git init repo $ cd repo/ $ git config set maintenance.auto false $ for i in $(seq 1000); do printf "%d-original" $i >file-$i; done $ git add . $ git commit -moriginal $ git repack -dl $ for i in $(seq 1000); do printf "%d-modified" $i >file-$i; done $ git commit -a -mmodified $ git repack -dl $ git multi-pack-index write $ mv .git/objects/pack/multi-pack-index . $ git repack -Adl $ mv multi-pack-index .git/objects/pack/ Commands that cause a lot of objects lookups will now repeatedly invoke `add_packed_git()`, which leads to three failed access(3p) calls as well as one failed stat(3p) call. The following strace for example is done for `git log --patch` in the above repository: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 74.67 0.024693 1 18038 18031 access 25.33 0.008378 1 6045 6017 newfstatat ------ ----------- ----------- --------- --------- ---------------- 100.00 0.033071 1 24083 24048 total Fix the issue by introducing a negative lookup cache for indexed packs. This cache works by simply storing an invalid pointer for a missing pack when `prepare_midx_pack()` fails to look up the pack. Most users of the `packs` array don't need to be adjusted, either, as they all know to call `prepare_midx_pack()` before accessing the array. With this change in place we can now see a significantly reduced number of syscalls: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 73.58 0.000323 5 60 28 newfstatat 26.42 0.000116 5 23 16 access ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000439 5 83 44 total Furthermore, this change also results in a speedup: Benchmark 1: git log --patch (revision = HEAD~) Time (mean ± σ): 50.4 ms ± 2.5 ms [User: 22.0 ms, System: 24.4 ms] Range (min … max): 45.4 ms … 54.9 ms 53 runs Benchmark 2: git log --patch (revision = HEAD) Time (mean ± σ): 12.7 ms ± 0.4 ms [User: 11.1 ms, System: 1.6 ms] Range (min … max): 12.4 ms … 15.0 ms 191 runs Summary git log --patch (revision = HEAD) ran 3.96 ± 0.22 times faster than git log --patch (revision = HEAD~) In the end, it should in theory never be necessary to have this negative lookup cache given that we know to update the multi-pack index together with repacks. But as the change is quite contained and as the speedup can be significant as demonstrated above, it does feel sensible to have the negative lookup cache regardless. Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Patrick Steinhardt <ps@pks.im> --- midx.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/midx.c b/midx.c index 3d0015f7828..fbce88bd463 100644 --- a/midx.c +++ b/midx.c @@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m) munmap((unsigned char *)m->data, m->data_len); for (i = 0; i < m->num_packs; i++) { - if (m->packs[i]) + if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) m->packs[i]->multi_pack_index = 0; } FREE_AND_NULL(m->packs); @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) + return 1; if (m->packs[pack_int_id]) return 0; @@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, strbuf_release(&pack_name); strbuf_release(&key); - if (!p) + if (!p) { + m->packs[pack_int_id] = (void *)(intptr_t)-1; return 1; + } p->multi_pack_index = 1; m->packs[pack_int_id] = p; @@ -495,6 +499,8 @@ struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id) { uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[local_pack_int_id] == (void *)(intptr_t)-1) + return NULL; return m->packs[local_pack_int_id]; } -- 2.49.0.1151.ga128411c76.dirty ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-20 9:53 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt @ 2025-05-22 5:32 ` Jeff King 2025-05-22 15:47 ` Junio C Hamano 2025-05-23 1:31 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau 0 siblings, 2 replies; 53+ messages in thread From: Jeff King @ 2025-05-22 5:32 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Taylor Blau, Junio C Hamano On Tue, May 20, 2025 at 11:53:10AM +0200, Patrick Steinhardt wrote: > @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > pack_int_id = midx_for_pack(&m, pack_int_id); > > + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > + return 1; > if (m->packs[pack_int_id]) > return 0; I did wonder while writing this if we might be able to hide the magic number and gross casting inside a constant or macro. I think just: #define MIDX_PACK_ERROR ((void *)(intptr_t)-1) would be enough? Though... > @@ -495,6 +499,8 @@ struct packed_git *nth_midxed_pack(struct multi_pack_index *m, > uint32_t pack_int_id) > { > uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); > + if (m->packs[local_pack_int_id] == (void *)(intptr_t)-1) > + return NULL; > return m->packs[local_pack_int_id]; Yuck, yet another spot that needs to be aware of the new tri-state value. One alternative is using an auxiliary array to cache the errors, and then only the lookup function needs to care. Like: diff --git a/midx.c b/midx.c index c1adff4404..df71ead50b 100644 --- a/midx.c +++ b/midx.c @@ -186,6 +186,7 @@ static struct multi_pack_index *load_multi_pack_index_one(struct repository *r, CALLOC_ARRAY(m->pack_names, m->num_packs); CALLOC_ARRAY(m->packs, m->num_packs); + CALLOC_ARRAY(m->pack_err, m->num_packs); cur_pack_name = (const char *)m->chunk_pack_names; for (i = 0; i < m->num_packs; i++) { @@ -408,6 +409,7 @@ void close_midx(struct multi_pack_index *m) if (m->packs[i]) m->packs[i]->multi_pack_index = 0; } + FREE_AND_NULL(m->pack_errs); FREE_AND_NULL(m->packs); FREE_AND_NULL(m->pack_names); free(m); @@ -460,6 +462,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, if (m->packs[pack_int_id]) return 0; + if (m->pack_errs[pack_int_id]) + return 1; strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, m->pack_names[pack_int_id]); @@ -482,8 +486,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, strbuf_release(&pack_name); strbuf_release(&key); - if (!p) + if (!p) { + m->pack_errs[pack_int_id] = 1; return 1; + } p->multi_pack_index = 1; m->packs[pack_int_id] = p; You could even lazy-malloc the extra array if you wanted to optimize the common no-errors case, but I'm not sure it's a big deal. -Peff ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-22 5:32 ` Jeff King @ 2025-05-22 15:47 ` Junio C Hamano 2025-05-22 16:59 ` Jeff King 2025-05-23 1:31 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau 1 sibling, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2025-05-22 15:47 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Steinhardt, git, Taylor Blau Jeff King <peff@peff.net> writes: > @@ -460,6 +462,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > if (m->packs[pack_int_id]) > return 0; > + if (m->pack_errs[pack_int_id]) > + return 1; > > strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, > m->pack_names[pack_int_id]); > @@ -482,8 +486,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > strbuf_release(&pack_name); > strbuf_release(&key); > > - if (!p) > + if (!p) { > + m->pack_errs[pack_int_id] = 1; > return 1; > + } > > p->multi_pack_index = 1; > m->packs[pack_int_id] = p; > > You could even lazy-malloc the extra array if you wanted to optimize the > common no-errors case, but I'm not sure it's a big deal. So the idea is to leave m->packs[] for unused pack NULL instead of magic value, so that users of that array do not need to check? I think that is a lot safer than the magic "we know this fails" value that existing callers that have long trusted that a non-NULL .packs[] element is a valid pack. By the way, I suspect I am not reading the code correctly, but I am not sure what fill_midx_entry() does with a failed case. midx_for_object(&m, pos); pack_int_id = nth_midxed_pack_int_id(m, pos); if (prepare_midx_pack(r, m, pack_int_id)) return 0; With or without cached failure, this should return 0 when and only when m->packs[pack_int_id] is a usable pack. But what about the access on the next line? p = m->packs[pack_int_id - m->num_packs_in_base]; Do we have any guarantee that we called prepare_midx_pack() for the pack at (pack_int_id - m->num_packs_in_base)th slot? Can p be NULL here? And with the magic "we know this fails" value, can p be that magic value? /* * We are about to tell the caller where they can locate the * requested object. We better make sure the packfile is * still here and can be accessed before supplying that * answer, as it may have been deleted since the MIDX was * loaded! */ if (!is_pack_valid(p)) return 0; That value is passed to is_pack_valid(), which I do not think even takes NULL. Puzzled. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-22 15:47 ` Junio C Hamano @ 2025-05-22 16:59 ` Jeff King 2025-05-22 18:44 ` Junio C Hamano 2025-05-23 1:22 ` Taylor Blau 0 siblings, 2 replies; 53+ messages in thread From: Jeff King @ 2025-05-22 16:59 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Taylor Blau On Thu, May 22, 2025 at 08:47:58AM -0700, Junio C Hamano wrote: > So the idea is to leave m->packs[] for unused pack NULL instead of > magic value, so that users of that array do not need to check? > > I think that is a lot safer than the magic "we know this fails" > value that existing callers that have long trusted that a non-NULL > .packs[] element is a valid pack. Right. It would protect any further users of that assumption that we missed. > By the way, I suspect I am not reading the code correctly, but I am > not sure what fill_midx_entry() does with a failed case. > > midx_for_object(&m, pos); > pack_int_id = nth_midxed_pack_int_id(m, pos); > > if (prepare_midx_pack(r, m, pack_int_id)) > return 0; > > With or without cached failure, this should return 0 when and only > when m->packs[pack_int_id] is a usable pack. But what about the > access on the next line? I think there's a subtlety here with incremental midx's, in that a pack id can be "global" within the whole midx chain, or a local index into a specific chain element's list of packs. In fill_midx_entry(), I think we get such a global id back from nth_midxed_pack_int_id(). And then when we hand that to prepare_midx_pack(), it is converted into a local midx/id pair with: pack_int_id = midx_for_pack(&m, pack_int_id); where the end of that function is the same: return pack_int_id - m->num_packs_in_base; you saw elsewhere. So at that point we have a local midx and index, which is what prepare_midx_pack() fills via the m->packs[pack_int_id] field. So when you say "only when m->packs[pack_int_id] is a usable pack", you are talking about the local m/pack_int_id within that function. But back in the caller... > p = m->packs[pack_int_id - m->num_packs_in_base]; > > Do we have any guarantee that we called prepare_midx_pack() for > the pack at (pack_int_id - m->num_packs_in_base)th slot? Can p > be NULL here? And with the magic "we know this fails" value, can p > be that magic value? Our pack_int_id is the global one, so it needs to be adjusted. But this pack pointer we access is the same one that was filled (or not) by prepare_midx_pack(). So it cannot be NULL or the magic "fails" value, because prepare_midx_pack() returned 0. So I think this code is fine. One thing that did puzzle me: in prepare_midx_pack() we not only adjust the pack_int_id, but we may walk back through the midx chain to find the correct multi_pack_index struct. Wouldn't the caller need to do the same? The answer is that it does. The midx_for_object() call in fill_midx_entry() does that same walk, storing the result in its local "m" variable. So the walk backwards in prepare_midx_pack() is superfluous for this particular caller, who we know is already handing us the desired multi_pack_index struct, and it could just do: pack_int_id -= m->num_packs_in_base; rather than calling midx_for_pack(). But the same is not necessarily true for other callers, so we should continue calling that function. I suspect this would all be a bit more obvious if prepare_midx_pack() simply returned the pack pointer, avoiding the need for callers to look at m->packs at all (and making it a true cache, internally only to prepare_midx_pack()). Looking at other callers of prepare_midx_pack(): - in fill_packs_from_midx(), we do not adjust our "m" to match the index. But that is OK, because we adjust our local index (which we get by iterating from 0 to m->num_packs) to a global index when calling the function: if (prepare_midx_pack(ctx->repo, m, m->num_packs_in_base + i)) ... open_pack_index(m->packs[i]); which is fine. - I'm less sure of the call in expire_midx_packs(). It iterates over num_packs in the same way, but does: if (prepare_midx_pack(r, m, i)) continue; and then looks at m->packs[i]. That would be wrong if "m" is not the first item in the chain. Ah, I see. Earlier we do: if (m->base_midx) die(_("cannot expire packs from an incremental multi-pack-index")); so we know that the global and local ids are equivalent in this instance (since the "base" midx . Still seems a bit fragile. - There's a similar case in midx-write.c:want_included_pack(). That one seems to have the same local/global confusion, but I do not obviously see anything preventing it from being fed a non-base midx. So it might possibly be buggy? Likewise fill_included_packs_batch() in the same file. In both cases I think if prepare_midx_pack() returned a pointer, we could just use it directly. - In nth_bitmapped_pack(), we call midx_for_pack() ourselves to get the right midx. That's good, but getting a pack pointer from prepare_midx_pack() wouldn't help us, because we still look at the midx struct for other reasons. So I dunno. It might be a useful refactor, but it doesn't make the problem go away entirely. I'm suspicious of those calls in midx-write.c, but Taylor can probably say more about whether they're wrong or if there's some less-obvious reason we'd only see a base midx there. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-22 16:59 ` Jeff King @ 2025-05-22 18:44 ` Junio C Hamano 2025-05-23 1:22 ` Taylor Blau 1 sibling, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2025-05-22 18:44 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Steinhardt, git, Taylor Blau Jeff King <peff@peff.net> writes: > But back in the caller... > >> p = m->packs[pack_int_id - m->num_packs_in_base]; >> ... > > Our pack_int_id is the global one, so it needs to be adjusted. But this > pack pointer we access is the same one that was filled (or not) by > prepare_midx_pack(). So it cannot be NULL or the magic "fails" value, > because prepare_midx_pack() returned 0. > > So I think this code is fine. Ahh, I missed that call to midx_for_pack() in prepare_midx_pack() that modifies pack_int_id variable. Of course, it did not help that the implementation detail of adjusting by m->num_packs_in_base which is done in midx_for_pack() is not abstracted out and the caller needs to do the same as above. That confused me. Thanks for an explanation. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-22 16:59 ` Jeff King 2025-05-22 18:44 ` Junio C Hamano @ 2025-05-23 1:22 ` Taylor Blau 2025-05-23 2:08 ` Jeff King 1 sibling, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-23 1:22 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Patrick Steinhardt, git On Thu, May 22, 2025 at 12:59:24PM -0400, Jeff King wrote: > > By the way, I suspect I am not reading the code correctly, but I am > > not sure what fill_midx_entry() does with a failed case. > > > > midx_for_object(&m, pos); > > pack_int_id = nth_midxed_pack_int_id(m, pos); > > > > if (prepare_midx_pack(r, m, pack_int_id)) > > return 0; > > > > With or without cached failure, this should return 0 when and only > > when m->packs[pack_int_id] is a usable pack. But what about the > > access on the next line? > > I think there's a subtlety here with incremental midx's, in that a pack > id can be "global" within the whole midx chain, or a local index into a > specific chain element's list of packs. If you'll indulge me in a bit of pedantry for a moment, I would add that the pack_int_id is *always* global within the whole MIDX chain. It only happens to additionally be the correct local index when m->num_packs_in_base is zero. The thing here: m->packs[pack_int_id - m->num_packs_in_base]; ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ is indexing into the current layer's pack array, which only contains the packs at that layer, so the index there is no longer really a pack_int_id at all. But... > In fill_midx_entry(), I think we get such a global id back from > nth_midxed_pack_int_id(). And then when we hand that to > prepare_midx_pack(), it is converted into a local midx/id pair with: > > pack_int_id = midx_for_pack(&m, pack_int_id); ...this confusion is really my fault, since the thing midx_for_pack() is returning isn't really a pack_int_id in the classic sense at all. I think I chose this name to avoid making the patch noisy, and to avoid the possibility of accidentally using the global identifier when you meant to use the local index. (As an aside, I think I recall you suggesting a while ago that it might be interesting to define "global" things with a different type than "local" ones to prevent this sort of confusion. That would allow us to keep both "pack_int_id" and the return value of "midx_for_pack()" in scope at the same time, without the possibility of using one when you meant to use the other.) > where the end of that function is the same: > > return pack_int_id - m->num_packs_in_base; > > you saw elsewhere. So at that point we have a local midx and index, > which is what prepare_midx_pack() fills via the m->packs[pack_int_id] > field. > > So when you say "only when m->packs[pack_int_id] is a usable pack", you > are talking about the local m/pack_int_id within that function. > > But back in the caller... > > > p = m->packs[pack_int_id - m->num_packs_in_base]; > > > > Do we have any guarantee that we called prepare_midx_pack() for > > the pack at (pack_int_id - m->num_packs_in_base)th slot? Can p > > be NULL here? And with the magic "we know this fails" value, can p > > be that magic value? > > Our pack_int_id is the global one, so it needs to be adjusted. But this > pack pointer we access is the same one that was filled (or not) by > prepare_midx_pack(). So it cannot be NULL or the magic "fails" value, > because prepare_midx_pack() returned 0. > > So I think this code is fine. Thanks for the analysis. > One thing that did puzzle me: in prepare_midx_pack() we not only adjust > the pack_int_id, but we may walk back through the midx chain to find the > correct multi_pack_index struct. Wouldn't the caller need to do the > same? > > The answer is that it does. The midx_for_object() call in > fill_midx_entry() does that same walk, storing the result in its local > "m" variable. So the walk backwards in prepare_midx_pack() is > superfluous for this particular caller, who we know is already handing > us the desired multi_pack_index struct, and it could just do: > > pack_int_id -= m->num_packs_in_base; Right. You could imagine that after getting the pack_int_id back from nth_midxed_pack_int_id(), we could do: midx_for_pack(&m, pack_int_id); , which would be a noop because of the chain invariant that we never duplicate objects or packs anywhere in the chain. (IOW, if you know object X is in pack Y, and you move your MIDX pointer to the layer containing X, you are guaranteed to be able to find Y in that layer's array of packs.) > rather than calling midx_for_pack(). But the same is not necessarily > true for other callers, so we should continue calling that function. ...exactly ;-). > I suspect this would all be a bit more obvious if prepare_midx_pack() > simply returned the pack pointer, avoiding the need for callers to look > at m->packs at all (and making it a true cache, internally only to > prepare_midx_pack()). > > Looking at other callers of prepare_midx_pack(): > > - in fill_packs_from_midx(), we do not adjust our "m" to match the > index. But that is OK, because we adjust our local index (which we > get by iterating from 0 to m->num_packs) to a global index when > calling the function: > > if (prepare_midx_pack(ctx->repo, m, m->num_packs_in_base + i)) > ... > open_pack_index(m->packs[i]); > > which is fine. > > - I'm less sure of the call in expire_midx_packs(). It iterates over > num_packs in the same way, but does: > > if (prepare_midx_pack(r, m, i)) > continue; > > and then looks at m->packs[i]. That would be wrong if "m" is not the > first item in the chain. Ah, I see. Earlier we do: > > if (m->base_midx) > die(_("cannot expire packs from an incremental multi-pack-index")); > > so we know that the global and local ids are equivalent in this > instance (since the "base" midx . Still seems a bit fragile. Yeah, I think you can only meaningfully expire packs from the most recent layer, which is why I added that guard. I agree it is still fragile, though. > - There's a similar case in midx-write.c:want_included_pack(). That > one seems to have the same local/global confusion, but I do not > obviously see anything preventing it from being fed a non-base midx. > So it might possibly be buggy? Yeah, this spot is definitely broken. At minimum it would need something like: --- 8< --- diff --git a/midx-write.c b/midx-write.c index 0897cbd829..54a04f7b75 100644 --- a/midx-write.c +++ b/midx-write.c @@ -1634,9 +1634,10 @@ static int want_included_pack(struct repository *r, uint32_t pack_int_id) { struct packed_git *p; + midx_for_pack(&m, pack_int_id); if (prepare_midx_pack(r, m, pack_int_id)) return 0; - p = m->packs[pack_int_id]; + p = m->packs[pack_int_id - m->num_packs_in_base]; if (!pack_kept_objects && p->pack_keep) return 0; if (p->is_cruft) --- >8 --- > Likewise fill_included_packs_batch() in the same file. I think this one is actually OK for the same reason as the expire_midx_packs() case. Its sole caller in midx_repack() has: if (m->base_midx) die(_("cannot repack an incremental multi-pack-index")); , so we are OK there. We might want to add an ASSERT() in fill_included_packs_batch() to make it clearer, though. > In both cases I think if prepare_midx_pack() returned a pointer, we > could just use it directly. Agreed. Thanks, Taylor ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-23 1:22 ` Taylor Blau @ 2025-05-23 2:08 ` Jeff King 2025-05-23 17:46 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Jeff King @ 2025-05-23 2:08 UTC (permalink / raw) To: Taylor Blau; +Cc: Junio C Hamano, Patrick Steinhardt, git On Thu, May 22, 2025 at 09:22:23PM -0400, Taylor Blau wrote: > > I think there's a subtlety here with incremental midx's, in that a pack > > id can be "global" within the whole midx chain, or a local index into a > > specific chain element's list of packs. > > If you'll indulge me in a bit of pedantry for a moment, I would add that > the pack_int_id is *always* global within the whole MIDX chain. It only > happens to additionally be the correct local index when > m->num_packs_in_base is zero. Yeah, I agree with that... > The thing here: > > m->packs[pack_int_id - m->num_packs_in_base]; > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > is indexing into the current layer's pack array, which only contains the > packs at that layer, so the index there is no longer really a > pack_int_id at all. ...and this... > > In fill_midx_entry(), I think we get such a global id back from > > nth_midxed_pack_int_id(). And then when we hand that to > > prepare_midx_pack(), it is converted into a local midx/id pair with: > > > > pack_int_id = midx_for_pack(&m, pack_int_id); > > ...this confusion is really my fault, since the thing midx_for_pack() is > returning isn't really a pack_int_id in the classic sense at all. I > think I chose this name to avoid making the patch noisy, and to avoid > the possibility of accidentally using the global identifier when you > meant to use the local index. ...but yeah, this is the part that got me. ;) I think elsewhere you called it local_pack_int_id, in nth_midxed_pack(). Which made things more clear (though still calls it a pack_int_id ;) ). > (As an aside, I think I recall you suggesting a while ago that it might > be interesting to define "global" things with a different type than > "local" ones to prevent this sort of confusion. That would allow us to > keep both "pack_int_id" and the return value of "midx_for_pack()" in > scope at the same time, without the possibility of using one when you > meant to use the other.) Yeah. The trouble is that it becomes awkward in C, since the language will happily intermix two integer typedefs. So you have to wrap them in a struct and access the struct fields everywhere. > > - There's a similar case in midx-write.c:want_included_pack(). That > > one seems to have the same local/global confusion, but I do not > > obviously see anything preventing it from being fed a non-base midx. > > So it might possibly be buggy? > > Yeah, this spot is definitely broken. At minimum it would need something > like: > > --- 8< --- > diff --git a/midx-write.c b/midx-write.c > index 0897cbd829..54a04f7b75 100644 > --- a/midx-write.c > +++ b/midx-write.c > @@ -1634,9 +1634,10 @@ static int want_included_pack(struct repository *r, > uint32_t pack_int_id) > { > struct packed_git *p; > + midx_for_pack(&m, pack_int_id); > if (prepare_midx_pack(r, m, pack_int_id)) > return 0; > - p = m->packs[pack_int_id]; > + p = m->packs[pack_int_id - m->num_packs_in_base]; > if (!pack_kept_objects && p->pack_keep) > return 0; > if (p->is_cruft) > --- >8 --- Yep, that's exactly what I was envisioning. I guess it's probably possible to trigger in practice by writing a new midx based on an existing incremental state. I'll let you figure that part out. :) > > Likewise fill_included_packs_batch() in the same file. > > I think this one is actually OK for the same reason as the > expire_midx_packs() case. Its sole caller in midx_repack() has: > > if (m->base_midx) > die(_("cannot repack an incremental multi-pack-index")); > > , so we are OK there. We might want to add an ASSERT() in > fill_included_packs_batch() to make it clearer, though. Ah, good. Yes, I agree that an assertion would be a nice bit of documentation/safety. Though for these cases I think they only care about the pack (not the containing midx), so changing the return type of prepare_midx_pack() might be an even easier way to get that safety. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-23 2:08 ` Jeff King @ 2025-05-23 17:46 ` Taylor Blau 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-23 17:46 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Patrick Steinhardt, git On Thu, May 22, 2025 at 10:08:20PM -0400, Jeff King wrote: > > (As an aside, I think I recall you suggesting a while ago that it might > > be interesting to define "global" things with a different type than > > "local" ones to prevent this sort of confusion. That would allow us to > > keep both "pack_int_id" and the return value of "midx_for_pack()" in > > scope at the same time, without the possibility of using one when you > > meant to use the other.) > > Yeah. The trouble is that it becomes awkward in C, since the language > will happily intermix two integer typedefs. So you have to wrap them in > a struct and access the struct fields everywhere. Yup :-<. > > > - There's a similar case in midx-write.c:want_included_pack(). That > > > one seems to have the same local/global confusion, but I do not > > > obviously see anything preventing it from being fed a non-base midx. > > > So it might possibly be buggy? > > > > Yeah, this spot is definitely broken. At minimum it would need something > > like: > > > > --- 8< --- > > diff --git a/midx-write.c b/midx-write.c > > index 0897cbd829..54a04f7b75 100644 > > --- a/midx-write.c > > +++ b/midx-write.c > > @@ -1634,9 +1634,10 @@ static int want_included_pack(struct repository *r, > > uint32_t pack_int_id) > > { > > struct packed_git *p; > > + midx_for_pack(&m, pack_int_id); > > if (prepare_midx_pack(r, m, pack_int_id)) > > return 0; > > - p = m->packs[pack_int_id]; > > + p = m->packs[pack_int_id - m->num_packs_in_base]; > > if (!pack_kept_objects && p->pack_keep) > > return 0; > > if (p->is_cruft) > > --- >8 --- > > Yep, that's exactly what I was envisioning. I guess it's probably > possible to trigger in practice by writing a new midx based on an > existing incremental state. I'll let you figure that part out. :) Hmm. I thought that this spot was broken last night, but looking at it again today I think that it is actually OK. I started writing an analysis of why in response to your email, and then decided to throw it away since those details really should live in the patch message. Here's what I came up with: --- 8< --- Subject: [PATCH] midx-write.c: guard against incremental MIDXs in want_included_pack() The function want_included_pack() is used to determine whether or not a the packfile corresponding to some given pack_int_id should be included in a 'git multi-pack-index repack' operation. This spot looks like it would be broken, particularly in: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; , when pack_int_id is greater than m->num_pack_in_base (supposing that m->num_packs_in_base is non-zero, or equivalently that m->base_midx is non-NULL). Suppose we have two MIDXs in an incremental MIDX chain, each having two packs: - M0 = {packs: [P0, P1], base_midx: NULL} - M1 = {packs: [P2, P3], base_midx: M0} noting that each pack is identified by its global pack_int_id within the chain. So if you do something like: want_included_pack(the_repository, M1, pack_kept_objects, 2); that function will call prepare_midx_pack(), which is smart enough to realize that the pack of interest is in the current layer (M1), and knows how to adjust its global pack_int_id into an index into the current layer's 'packs' array. But the following line: p = m->packs[pack_int_id]; /* m is M1, pack_int_id is 2 */ looks broken, since each layer of the MIDX only maintains an array of the packs stored within that layer, and 'm' wasn't adjusted back to point at M1->base_midx (M0). The right thing to do would be: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; /* open-code midx.c::midx_for_pack(), which is private */ while (m && pack_int_id < m->num_packs_in_base) m = m->base_midx; if (!m) BUG("broken midx?"); if (pack_int_id >= m->num_packs + m->num_packs_in_base) BUG("broken pack_int_id?"); p = m->packs[pack_int_id - m->num_packs_in_base]; But that would be overkill, since this function never deals with incremental MIDXs having more than one layer! To see why, observe that want_included_pack() has two callers: - midx-write.c::fill_included_packs_all() - midx-write.c::fill_included_packs_batch() and those functions' sole caller is in midx-write.c::midx_repack(), which dispatches a call to one or the other depending on whether or not the batch_size is non-zero. And at the very top of midx_repack(), we have a guard against non-trivial incremental MIDX chains: if (m->base_midx) die(_("cannot repack an incremental multi-pack-index")); So want_included_pack() is OK becuase it will never encounter a situation where it has to chase backwards through the '->base_midx' pointer. But that is not immediately clear from reading the code, and is too fragile for my comfort. Make this more clear by adding an ASSERT() to the above effect. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/midx-write.c b/midx-write.c index 0897cbd829..991c42d1dc 100644 --- a/midx-write.c +++ b/midx-write.c @@ -1634,6 +1634,9 @@ static int want_included_pack(struct repository *r, uint32_t pack_int_id) { struct packed_git *p; + + ASSERT(m && !m->base_midx); + if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; @@ -1653,6 +1656,8 @@ static void fill_included_packs_all(struct repository *r, uint32_t i; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); for (i = 0; i < m->num_packs; i++) { @@ -1673,6 +1678,8 @@ static void fill_included_packs_batch(struct repository *r, struct repack_info *pack_info; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + CALLOC_ARRAY(pack_info, m->num_packs); repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); -- 2.49.0.221.g485f5f8636.dirty --- >8 --- > > > Likewise fill_included_packs_batch() in the same file. > > > > I think this one is actually OK for the same reason as the > > expire_midx_packs() case. Its sole caller in midx_repack() has: > > > > if (m->base_midx) > > die(_("cannot repack an incremental multi-pack-index")); > > > > , so we are OK there. We might want to add an ASSERT() in > > fill_included_packs_batch() to make it clearer, though. > > Ah, good. Yes, I agree that an assertion would be a nice bit of > documentation/safety. Though for these cases I think they only care > about the pack (not the containing midx), so changing the return type of > prepare_midx_pack() might be an even easier way to get that safety. Ironically, the same argument applies for this function (and the _all() variant) as well ;-). Thanks, Taylor ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics 2025-05-23 17:46 ` Taylor Blau @ 2025-05-25 18:41 ` Taylor Blau 2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau ` (5 more replies) 0 siblings, 6 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:41 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano [This brief series is based on ps/midx-negative-packfile-cache]. This series implements a suggestion from this thread to have prepare_midx_pack() return a pointer to packed_git structure identified by the given pack_int_id. This convention makes converting a pack_int_id into a bona-fide pack significantly less error-prone, particularly when dealing with incremental MIDXs. Along the way, I noticed a couple of opportunities for clean-up and light refactorings, which make up the first four patches. In the future, I think we could take this further by: - applying the same pattern to the 'pack_names' array, and - renaming both arrays something like 'packs_' or similar to indicate that they are "private" and should not be accessed outside of midx.c. We might be able to go even further than that by making the entire structure opaque to external callers by defining it within midx.c itself. We don't get there in this series, but it sets us on the right direction. Thanks in advance for your review! Taylor Blau (5): pack-bitmap.c: fix broken warning() when missing MIDX'd pack midx-write.c: guard against incremental MIDXs in want_included_pack() midx-write.c: simplify fill_packs_from_midx() midx-write.c: extract inner loop from fill_packs_from_midx() midx: return a `packed_git` pointer from `prepare_midx_pack()` midx-write.c | 114 +++++++++++++++++++++++++++++++------------------- midx.c | 81 ++++++++++++++++++----------------- midx.h | 4 +- pack-bitmap.c | 4 +- 4 files changed, 115 insertions(+), 88 deletions(-) base-commit: 73fe0882eec06f7b9ea021fe062a32cd1731e574 -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau @ 2025-05-25 18:41 ` Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-25 18:41 ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau ` (4 subsequent siblings) 5 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:41 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano In commit 44f9fd6496 (pack-bitmap.c: check preferred pack validity when opening MIDX bitmap, 2022-05-24) we started opening all packs contained within a MIDX when loading its corresponding bitmap. However, if a pack is missing, then we will emit a warning like: warning: could not open pack pack-$HASH.pack Later on commit f31a17cea5 (pack-bitmap.c: open and store incremental bitmap layers, 2025-03-20) updated this code to work with incremental MIDX bitmaps, but did not adjust the index into the 'pack_names' field. So if there is a pack in an incremental MIDX chain with a pack in a MIDX layer with a non-zero number of packs in its base layer(s) (in other words, any MIDX layer outside of the first one) that cannot be loaded, we will do an out-of-bounds lookup. Adjust the lookup into the 'pack_names' array by the number of packs in the base to prevent a potential SIGSEGV here. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- pack-bitmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index b9f1d86604..99c4927e9c 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { warning(_("could not open pack %s"), - bitmap_git->midx->pack_names[i]); + bitmap_git->midx->pack_names[i - bitmap_git->midx->num_packs_in_base]); goto cleanup; } } -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack 2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau @ 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:00 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-26 7:23 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano On Sun, May 25, 2025 at 02:41:51PM -0400, Taylor Blau wrote: > diff --git a/pack-bitmap.c b/pack-bitmap.c > index b9f1d86604..99c4927e9c 100644 > --- a/pack-bitmap.c > +++ b/pack-bitmap.c > @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, > for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { > if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { > warning(_("could not open pack %s"), > - bitmap_git->midx->pack_names[i]); > + bitmap_git->midx->pack_names[i - bitmap_git->midx->num_packs_in_base]); Doesn't this cause a negative array index though in the case where `prepare_midx_pack()` returns an error for any `i` smaller than the number of packs in base? Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack 2025-05-26 7:23 ` Patrick Steinhardt @ 2025-05-28 2:00 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 2:00 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano On Mon, May 26, 2025 at 09:23:48AM +0200, Patrick Steinhardt wrote: > On Sun, May 25, 2025 at 02:41:51PM -0400, Taylor Blau wrote: > > diff --git a/pack-bitmap.c b/pack-bitmap.c > > index b9f1d86604..99c4927e9c 100644 > > --- a/pack-bitmap.c > > +++ b/pack-bitmap.c > > @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, > > for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { > > if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { > > warning(_("could not open pack %s"), > > - bitmap_git->midx->pack_names[i]); > > + bitmap_git->midx->pack_names[i - bitmap_git->midx->num_packs_in_base]); > > Doesn't this cause a negative array index though in the case where > `prepare_midx_pack()` returns an error for any `i` smaller than the > number of packs in base? Nice catch, yeah, this is definitely broken when the value i is smaller than m->num_packs_in_base. This is another spot that would benefit from similar treatment where callers (outside of midx.c and *maybe* midx-write.c) access the pack_names array through a function liked nth_midxed_pack_name() or similar. I'll adjust this and send a new round. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau @ 2025-05-25 18:41 ` Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-25 18:41 ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau ` (3 subsequent siblings) 5 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:41 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The function want_included_pack() is used to determine whether or not a the packfile corresponding to some given pack_int_id should be included in a 'git multi-pack-index repack' operation. This spot looks like it would be broken, particularly in: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; , when pack_int_id is greater than m->num_pack_in_base (supposing that m->num_packs_in_base is non-zero, or equivalently that m->base_midx is non-NULL). Suppose we have two MIDXs in an incremental MIDX chain, each having two packs: - M0 = {packs: [P0, P1], base_midx: NULL} - M1 = {packs: [P2, P3], base_midx: M0} noting that each pack is identified by its global pack_int_id within the chain. So if you do something like: want_included_pack(the_repository, M1, pack_kept_objects, 2); that function will call prepare_midx_pack(), which is smart enough to realize that the pack of interest is in the current layer (M1), and knows how to adjust its global pack_int_id into an index into the current layer's 'packs' array. But the following line: p = m->packs[pack_int_id]; /* m is M1, pack_int_id is 2 */ looks broken, since each layer of the MIDX only maintains an array of the packs stored within that layer, and 'm' wasn't adjusted back to point at M1->base_midx (M0). The right thing to do would be: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; /* open-code midx.c::midx_for_pack(), which is private */ while (m && pack_int_id < m->num_packs_in_base) m = m->base_midx; if (!m) BUG("broken midx?"); if (pack_int_id >= m->num_packs + m->num_packs_in_base) BUG("broken pack_int_id?"); p = m->packs[pack_int_id - m->num_packs_in_base]; But that would be overkill, since this function never deals with incremental MIDXs having more than one layer! To see why, observe that want_included_pack() has two callers: - midx-write.c::fill_included_packs_all() - midx-write.c::fill_included_packs_batch() and those functions' sole caller is in midx-write.c::midx_repack(), which dispatches a call to one or the other depending on whether or not the batch_size is non-zero. And at the very top of midx_repack(), we have a guard against non-trivial incremental MIDX chains: if (m->base_midx) die(_("cannot repack an incremental multi-pack-index")); So want_included_pack() is OK becuase it will never encounter a situation where it has to chase backwards through the '->base_midx' pointer. But that is not immediately clear from reading the code, and is too fragile for my comfort. Make this more clear by adding an ASSERT() to the above effect. Apply the same treatment to each of the fill_included_packs-related functions as well, since those are deceptively OK by the same reasoning. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/midx-write.c b/midx-write.c index dd3b3070e5..e4a3830d45 100644 --- a/midx-write.c +++ b/midx-write.c @@ -1636,6 +1636,9 @@ static int want_included_pack(struct repository *r, uint32_t pack_int_id) { struct packed_git *p; + + ASSERT(m && !m->base_midx); + if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; @@ -1655,6 +1658,8 @@ static void fill_included_packs_all(struct repository *r, uint32_t i; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); for (i = 0; i < m->num_packs; i++) { @@ -1675,6 +1680,8 @@ static void fill_included_packs_batch(struct repository *r, struct repack_info *pack_info; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + CALLOC_ARRAY(pack_info, m->num_packs); repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() 2025-05-25 18:41 ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau @ 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:08 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-26 7:23 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano On Sun, May 25, 2025 at 02:41:54PM -0400, Taylor Blau wrote: [snip] > So want_included_pack() is OK becuase it will never encounter a s/becuase/because/ > situation where it has to chase backwards through the '->base_midx' > pointer. But that is not immediately clear from reading the code, and is > too fragile for my comfort. Make this more clear by adding an ASSERT() > to the above effect. > > Apply the same treatment to each of the fill_included_packs-related > functions as well, since those are deceptively OK by the same reasoning. Ok. Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() 2025-05-26 7:23 ` Patrick Steinhardt @ 2025-05-28 2:08 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 2:08 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano On Mon, May 26, 2025 at 09:23:56AM +0200, Patrick Steinhardt wrote: > On Sun, May 25, 2025 at 02:41:54PM -0400, Taylor Blau wrote: > [snip] > > So want_included_pack() is OK becuase it will never encounter a > > s/becuase/because/ Thanks for spotting. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau 2025-05-25 18:41 ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau @ 2025-05-25 18:41 ` Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-25 18:42 ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau ` (2 subsequent siblings) 5 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:41 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano When enumerating packs within fill_packs_from_midx(), we open the pack's index iff we are either writing a reverse index, have a preferred pack (by name) or both. Let's simplify this into a single case by setting the MIDX_WRITE_REV_INDEX flag bit when we have a preferred_pack_name. This is a little bit of a shortcut to reduce the line length in the loop below. But it sets us up nicely to extract the inner loop of this function out into its own function, where we will no longer have to pass the preferred_pack_name. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/midx-write.c b/midx-write.c index e4a3830d45..7802a4b694 100644 --- a/midx-write.c +++ b/midx-write.c @@ -943,6 +943,19 @@ static int fill_packs_from_midx(struct write_midx_context *ctx, { struct multi_pack_index *m; + if (preferred_pack_name) { + /* + * If a preferred pack is specified, need to have + * packed_git's loaded to ensure the chosen preferred + * pack has a non-zero object count. + * + * Trick ourselves into thinking that we're writing a + * reverse index in this case in order to open up the + * pack index file. + */ + flags |= MIDX_WRITE_REV_INDEX; + } + for (m = ctx->m; m; m = m->base_midx) { uint32_t i; @@ -953,13 +966,8 @@ static int fill_packs_from_midx(struct write_midx_context *ctx, * If generating a reverse index, need to have * packed_git's loaded to compare their * mtimes and object count. - * - * If a preferred pack is specified, need to - * have packed_git's loaded to ensure the chosen - * preferred pack has a non-zero object count. */ - if (flags & MIDX_WRITE_REV_INDEX || - preferred_pack_name) { + if (flags & MIDX_WRITE_REV_INDEX) { if (prepare_midx_pack(ctx->repo, m, m->num_packs_in_base + i)) { error(_("could not load pack")); -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() 2025-05-25 18:41 ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau @ 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:15 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-26 7:23 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano On Sun, May 25, 2025 at 02:41:57PM -0400, Taylor Blau wrote: > diff --git a/midx-write.c b/midx-write.c > index e4a3830d45..7802a4b694 100644 > --- a/midx-write.c > +++ b/midx-write.c > @@ -943,6 +943,19 @@ static int fill_packs_from_midx(struct write_midx_context *ctx, > { > struct multi_pack_index *m; > > + if (preferred_pack_name) { > + /* > + * If a preferred pack is specified, need to have > + * packed_git's loaded to ensure the chosen preferred > + * pack has a non-zero object count. > + * > + * Trick ourselves into thinking that we're writing a > + * reverse index in this case in order to open up the > + * pack index file. > + */ > + flags |= MIDX_WRITE_REV_INDEX; > + } This change feels a bit weird to me. Sure, it does allow us to pull out the loop in the subsequent patch. But honestly, that makes this workaround even weirder in that we now set unrelated flags in some function and expect a different function to only honor it in order to open the packfile. Shouldn't we instead have a separate flag for the new function that tells it whether or not it is supposed to prepare the pack? Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() 2025-05-26 7:23 ` Patrick Steinhardt @ 2025-05-28 2:15 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 2:15 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano On Mon, May 26, 2025 at 09:23:59AM +0200, Patrick Steinhardt wrote: > On Sun, May 25, 2025 at 02:41:57PM -0400, Taylor Blau wrote: > > diff --git a/midx-write.c b/midx-write.c > > index e4a3830d45..7802a4b694 100644 > > --- a/midx-write.c > > +++ b/midx-write.c > > @@ -943,6 +943,19 @@ static int fill_packs_from_midx(struct write_midx_context *ctx, > > { > > struct multi_pack_index *m; > > > > + if (preferred_pack_name) { > > + /* > > + * If a preferred pack is specified, need to have > > + * packed_git's loaded to ensure the chosen preferred > > + * pack has a non-zero object count. > > + * > > + * Trick ourselves into thinking that we're writing a > > + * reverse index in this case in order to open up the > > + * pack index file. > > + */ > > + flags |= MIDX_WRITE_REV_INDEX; > > + } > > This change feels a bit weird to me. Sure, it does allow us to pull out > the loop in the subsequent patch. But honestly, that makes this > workaround even weirder in that we now set unrelated flags in some > function and expect a different function to only honor it in order to > open the packfile. > > Shouldn't we instead have a separate flag for the new function that > tells it whether or not it is supposed to prepare the pack? Yeah, I like that much better, thanks! Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau ` (2 preceding siblings ...) 2025-05-25 18:41 ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau @ 2025-05-25 18:42 ` Taylor Blau 2025-05-25 18:42 ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau 5 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:42 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The function fill_packs_from_midx() does relatively little, but ends up in a doubly-nested loop because we're enumerating each pack within each layer of the incremental MIDX chain. Let's de-dent the inner loop of fill_packs_from_midx() by extracting its contents into a separate function, and calling that. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 62 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 36 insertions(+), 26 deletions(-) diff --git a/midx-write.c b/midx-write.c index 7802a4b694..2afb67d728 100644 --- a/midx-write.c +++ b/midx-write.c @@ -938,6 +938,38 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r, return result; } +static int fill_packs_from_midx_1(struct write_midx_context *ctx, + struct multi_pack_index *m, + uint32_t flags) +{ + for (uint32_t i = 0; i < m->num_packs; i++) { + ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); + + /* + * If generating a reverse index, need to have + * packed_git's loaded to compare their + * mtimes and object count. + */ + if (flags & MIDX_WRITE_REV_INDEX) { + if (prepare_midx_pack(ctx->repo, m, + m->num_packs_in_base + i)) { + error(_("could not load pack")); + return 1; + } + + if (open_pack_index(m->packs[i])) + die(_("could not open index for %s"), + m->packs[i]->pack_name); + } + + fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], + m->pack_names[i], + m->num_packs_in_base + i); + } + + return 0; +} + static int fill_packs_from_midx(struct write_midx_context *ctx, const char *preferred_pack_name, uint32_t flags) { @@ -957,33 +989,11 @@ static int fill_packs_from_midx(struct write_midx_context *ctx, } for (m = ctx->m; m; m = m->base_midx) { - uint32_t i; - - for (i = 0; i < m->num_packs; i++) { - ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); - - /* - * If generating a reverse index, need to have - * packed_git's loaded to compare their - * mtimes and object count. - */ - if (flags & MIDX_WRITE_REV_INDEX) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { - error(_("could not load pack")); - return 1; - } - - if (open_pack_index(m->packs[i])) - die(_("could not open index for %s"), - m->packs[i]->pack_name); - } - - fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], - m->pack_names[i], - m->num_packs_in_base + i); - } + int ret = fill_packs_from_midx_1(ctx, m, flags); + if (ret) + return ret; } + return 0; } -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau ` (3 preceding siblings ...) 2025-05-25 18:42 ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau @ 2025-05-25 18:42 ` Taylor Blau 2025-05-26 7:24 ` Patrick Steinhardt 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau 5 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-25 18:42 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The prepare_midx_pack() function is designed to convert a MIDX-specific pack_int_id for a given pack into a pointer into an actual `packed_git` structure. In general, these calls look something like: struct packed_git *p; if (prepare_midx_pack(the_repository, some_midx, some_pack_int_id)) die("could not load pack xyz"); p = some_midx->packs[some_pack_int_id]; , and in a pre-incremental MIDX world, this pattern works well. However, in a post-incremental MIDX world, this pattern is a little more prone to errors. These errors can happen when the given 'pack_int_id' is not usable as an index into the 'm->packs' array. And this happens in all layers but the bottom-most one in an incremental MIDX chain. Each layer stores only the packs that are local to that layer of the chain, and offsets them by the total number of packs in the base MIDX(s). But there is other awkwardness here. Thinking back to the above snippet, suppose that the pack with ID 'some_pack_int_id' is in a layer in the middle of the MIDX chain. Then it is still invalid to do: p = some_midx->packs[some_pack_int_id - some_midx->num_packs_in_base]; , becuase the top-most layer (here 'some_midx') may not even have that pack! So we would have to chase down the '->base_midx' pointer in order to get the correct result. midx.c has a helper to do this (called 'midx_for_pack()'), but it is meant only for internal use. That means that a full, working version of the above adjusted to handle incremental MIDXs would look something like: struct packed_git *p; if (prepare_midx_pack(the_repository, some_midx, some_pack_int_id)) die("could not load pack xyz"); while (m && pack_int_id < m->num_packs_in_base) m = m->base_midx; if (!m) BUG("broken midx?"); if (pack_int_id >= m->num_packs + m->num_packs_in_base) BUG("broken pack_int_id?"); p = m->packs[pack_int_id - m->num_packs_in_base]; , which is far too verbose to access a single pack by its pack_int_id in a MIDX chain. Let's instead have prepare_midx_pack() return a pointer to the packed_git structure itself, hiding the above as an implementation detail of prepare_midx_pack(). This patch turns the above snippet into: struct packed_git *p = prepare_midx_pack(the_repository, some_midx, some_pack_int_id); if (!p) die("could not load pack xyz"); making it far easier and less error-prone to access packs by their pack_int_id in a MIDX chain. (In the future, we may want to consider similar treatment for, e.g., the pack_names array. Likewise, it might make sense to rename the "packs" member of the MIDX structure to suggest that it shouldn't be accessed directly outside of midx.c.) Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 37 +++++++++++------------ midx.c | 81 +++++++++++++++++++++++++-------------------------- midx.h | 4 ++- pack-bitmap.c | 2 +- 4 files changed, 63 insertions(+), 61 deletions(-) diff --git a/midx-write.c b/midx-write.c index 2afb67d728..e8dc771665 100644 --- a/midx-write.c +++ b/midx-write.c @@ -943,6 +943,8 @@ static int fill_packs_from_midx_1(struct write_midx_context *ctx, uint32_t flags) { for (uint32_t i = 0; i < m->num_packs; i++) { + struct packed_git *p = NULL; + ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); /* @@ -951,18 +953,19 @@ static int fill_packs_from_midx_1(struct write_midx_context *ctx, * mtimes and object count. */ if (flags & MIDX_WRITE_REV_INDEX) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { + p = prepare_midx_pack(ctx->repo, m, + m->num_packs_in_base + i); + if (!p) { error(_("could not load pack")); return 1; } - if (open_pack_index(m->packs[i])) + if (open_pack_index(p)) die(_("could not open index for %s"), - m->packs[i]->pack_name); + p->pack_name); } - fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], + fill_pack_info(&ctx->info[ctx->nr++], p, m->pack_names[i], m->num_packs_in_base + i); } @@ -1596,20 +1599,19 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla _("Finding and deleting unreferenced packfiles"), m->num_packs); for (i = 0; i < m->num_packs; i++) { + struct packed_git *p; char *pack_name; display_progress(progress, i + 1); if (count[i]) continue; - if (prepare_midx_pack(r, m, i)) + p = prepare_midx_pack(r, m, i); + if (!p || p->pack_keep || p->is_cruft) continue; - if (m->packs[i]->pack_keep || m->packs[i]->is_cruft) - continue; - - pack_name = xstrdup(m->packs[i]->pack_name); - close_pack(m->packs[i]); + pack_name = xstrdup(p->pack_name); + close_pack(p); string_list_insert(&packs_to_drop, m->pack_names[i]); unlink_pack_path(pack_name, 0); @@ -1657,9 +1659,9 @@ static int want_included_pack(struct repository *r, ASSERT(m && !m->base_midx); - if (prepare_midx_pack(r, m, pack_int_id)) + p = prepare_midx_pack(r, m, pack_int_id); + if (!p) return 0; - p = m->packs[pack_int_id]; if (!pack_kept_objects && p->pack_keep) return 0; if (p->is_cruft) @@ -1705,12 +1707,11 @@ static void fill_included_packs_batch(struct repository *r, repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); for (i = 0; i < m->num_packs; i++) { + struct packed_git *p = prepare_midx_pack(r, m, i); + pack_info[i].pack_int_id = i; - - if (prepare_midx_pack(r, m, i)) - continue; - - pack_info[i].mtime = m->packs[i]->mtime; + if (p) + pack_info[i].mtime = p->mtime; } for (i = 0; i < m->num_objects; i++) { diff --git a/midx.c b/midx.c index fbce88bd46..f7f509cf46 100644 --- a/midx.c +++ b/midx.c @@ -449,50 +449,48 @@ static uint32_t midx_for_pack(struct multi_pack_index **_m, return pack_int_id - m->num_packs_in_base; } -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, - uint32_t pack_int_id) +struct packed_git *prepare_midx_pack(struct repository *r, + struct multi_pack_index *m, + uint32_t pack_int_id) { - struct strbuf pack_name = STRBUF_INIT; - struct strbuf key = STRBUF_INIT; - struct packed_git *p; + uint32_t pack_pos = midx_for_pack(&m, pack_int_id); - pack_int_id = midx_for_pack(&m, pack_int_id); + if (!m->packs[pack_pos]) { + struct strbuf pack_name = STRBUF_INIT; + struct strbuf key = STRBUF_INIT; + struct packed_git *p; - if (m->packs[pack_int_id] == (void *)(intptr_t)-1) - return 1; - if (m->packs[pack_int_id]) - return 0; + strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, + m->pack_names[pack_pos]); - strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, - m->pack_names[pack_int_id]); - - /* pack_map holds the ".pack" name, but we have the .idx */ - strbuf_addbuf(&key, &pack_name); - strbuf_strip_suffix(&key, ".idx"); - strbuf_addstr(&key, ".pack"); - p = hashmap_get_entry_from_hash(&r->objects->pack_map, - strhash(key.buf), key.buf, - struct packed_git, packmap_ent); - if (!p) { - p = add_packed_git(r, pack_name.buf, pack_name.len, m->local); - if (p) { - install_packed_git(r, p); - list_add_tail(&p->mru, &r->objects->packed_git_mru); + /* pack_map holds the ".pack" name, but we have the .idx */ + strbuf_addbuf(&key, &pack_name); + strbuf_strip_suffix(&key, ".idx"); + strbuf_addstr(&key, ".pack"); + p = hashmap_get_entry_from_hash(&r->objects->pack_map, + strhash(key.buf), key.buf, + struct packed_git, packmap_ent); + if (!p) { + p = add_packed_git(r, pack_name.buf, pack_name.len, + m->local); + if (p) { + install_packed_git(r, p); + list_add_tail(&p->mru, + &r->objects->packed_git_mru); + } } - } - strbuf_release(&pack_name); - strbuf_release(&key); + strbuf_release(&pack_name); + strbuf_release(&key); - if (!p) { - m->packs[pack_int_id] = (void *)(intptr_t)-1; - return 1; + m->packs[pack_pos] = p ? p : (void *)(intptr_t)-1; + if (p) + p->multi_pack_index = 1; } - p->multi_pack_index = 1; - m->packs[pack_int_id] = p; - - return 0; + if (m->packs[pack_pos] == (void *)(intptr_t)-1) + return NULL; + return m->packs[pack_pos]; } struct packed_git *nth_midxed_pack(struct multi_pack_index *m, @@ -514,10 +512,11 @@ int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, if (!m->chunk_bitmapped_packs) return error(_("MIDX does not contain the BTMP chunk")); - if (prepare_midx_pack(r, m, pack_int_id)) - return error(_("could not load bitmapped pack %"PRIu32), pack_int_id); + bp->p = prepare_midx_pack(r, m, pack_int_id); + if (!bp->p) + return error(_("could not load bitmapped pack %"PRIu32), + pack_int_id); - bp->p = m->packs[local_pack_int_id]; bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs + MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * local_pack_int_id); bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs + @@ -614,9 +613,9 @@ int fill_midx_entry(struct repository *r, midx_for_object(&m, pos); pack_int_id = nth_midxed_pack_int_id(m, pos); - if (prepare_midx_pack(r, m, pack_int_id)) + p = prepare_midx_pack(r, m, pack_int_id); + if (!p) return 0; - p = m->packs[pack_int_id - m->num_packs_in_base]; /* * We are about to tell the caller where they can locate the @@ -917,7 +916,7 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag _("Looking for referenced packfiles"), m->num_packs + m->num_packs_in_base); for (i = 0; i < m->num_packs + m->num_packs_in_base; i++) { - if (prepare_midx_pack(r, m, i)) + if (!prepare_midx_pack(r, m, i)) midx_report("failed to load pack in position %d", i); display_progress(progress, i + 1); diff --git a/midx.h b/midx.h index 9d1374cbd5..612999fe6a 100644 --- a/midx.h +++ b/midx.h @@ -104,7 +104,9 @@ void get_split_midx_filename_ext(const struct git_hash_algo *hash_algo, struct multi_pack_index *load_multi_pack_index(struct repository *r, const char *object_dir, int local); -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); +struct packed_git *prepare_midx_pack(struct repository *r, + struct multi_pack_index *m, + uint32_t pack_int_id); struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id); int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, diff --git a/pack-bitmap.c b/pack-bitmap.c index 99c4927e9c..648b9c0212 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -488,7 +488,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, } for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { - if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { + if (!prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { warning(_("could not open pack %s"), bitmap_git->midx->pack_names[i - bitmap_git->midx->num_packs_in_base]); goto cleanup; -- 2.49.0.641.gb9c9c4c3bd ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-25 18:42 ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau @ 2025-05-26 7:24 ` Patrick Steinhardt 2025-05-28 2:18 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-26 7:24 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano On Sun, May 25, 2025 at 02:42:03PM -0400, Taylor Blau wrote: > diff --git a/midx.c b/midx.c > index fbce88bd46..f7f509cf46 100644 > --- a/midx.c > +++ b/midx.c > @@ -449,50 +449,48 @@ static uint32_t midx_for_pack(struct multi_pack_index **_m, > return pack_int_id - m->num_packs_in_base; > } > > -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > - uint32_t pack_int_id) > +struct packed_git *prepare_midx_pack(struct repository *r, > + struct multi_pack_index *m, > + uint32_t pack_int_id) > { > - struct strbuf pack_name = STRBUF_INIT; > - struct strbuf key = STRBUF_INIT; > - struct packed_git *p; > + uint32_t pack_pos = midx_for_pack(&m, pack_int_id); > > - pack_int_id = midx_for_pack(&m, pack_int_id); > + if (!m->packs[pack_pos]) { > + struct strbuf pack_name = STRBUF_INIT; > + struct strbuf key = STRBUF_INIT; > + struct packed_git *p; > > - if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > - return 1; Ah, so this series builds on top of my patch that introduces the negative lookup cache? That wasn't quite clear to me and makes it a bit hard to iterate on my patch now. Could I suggest that you maybe include that patch as part of this series so that those can be iterated on in tandem? Overall I think that this change is quite sensible and hides away at least some of the complexity. Thanks! Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-26 7:24 ` Patrick Steinhardt @ 2025-05-28 2:18 ` Taylor Blau 2025-05-28 11:53 ` Patrick Steinhardt 0 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-28 2:18 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano On Mon, May 26, 2025 at 09:24:03AM +0200, Patrick Steinhardt wrote: > On Sun, May 25, 2025 at 02:42:03PM -0400, Taylor Blau wrote: > > diff --git a/midx.c b/midx.c > > index fbce88bd46..f7f509cf46 100644 > > --- a/midx.c > > +++ b/midx.c > > @@ -449,50 +449,48 @@ static uint32_t midx_for_pack(struct multi_pack_index **_m, > > return pack_int_id - m->num_packs_in_base; > > } > > > > -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > - uint32_t pack_int_id) > > +struct packed_git *prepare_midx_pack(struct repository *r, > > + struct multi_pack_index *m, > > + uint32_t pack_int_id) > > { > > - struct strbuf pack_name = STRBUF_INIT; > > - struct strbuf key = STRBUF_INIT; > > - struct packed_git *p; > > + uint32_t pack_pos = midx_for_pack(&m, pack_int_id); > > > > - pack_int_id = midx_for_pack(&m, pack_int_id); > > + if (!m->packs[pack_pos]) { > > + struct strbuf pack_name = STRBUF_INIT; > > + struct strbuf key = STRBUF_INIT; > > + struct packed_git *p; > > > > - if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > > - return 1; > > Ah, so this series builds on top of my patch that introduces the > negative lookup cache? That wasn't quite clear to me and makes it a bit > hard to iterate on my patch now. Eek, sorry for the confusion. I mentioned it in the beginning of my cover letter as well as the (less visible) "base-commit" identifier. Is there another spot where I could have highlighted the dependency more clearly? > Could I suggest that you maybe include that patch as part of this > series so that those can be iterated on in tandem? I could, however your branch is already tentatively marked for 'next', and Junio applied this topic to a branch based on yours as I had suggested. So we could change it, but I think it would be more of a hassle for Junio, so I'd rather avoid doing so. Are there changes that you want to make on top of your patch? > Overall I think that this change is quite sensible and hides away at > least some of the complexity. Thanks! Much appreciated :-). Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-28 2:18 ` Taylor Blau @ 2025-05-28 11:53 ` Patrick Steinhardt 0 siblings, 0 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-28 11:53 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano On Tue, May 27, 2025 at 10:18:06PM -0400, Taylor Blau wrote: > On Mon, May 26, 2025 at 09:24:03AM +0200, Patrick Steinhardt wrote: > > On Sun, May 25, 2025 at 02:42:03PM -0400, Taylor Blau wrote: > > > diff --git a/midx.c b/midx.c > > > index fbce88bd46..f7f509cf46 100644 > > > --- a/midx.c > > > +++ b/midx.c > > > @@ -449,50 +449,48 @@ static uint32_t midx_for_pack(struct multi_pack_index **_m, > > > return pack_int_id - m->num_packs_in_base; > > > } > > > > > > -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > > - uint32_t pack_int_id) > > > +struct packed_git *prepare_midx_pack(struct repository *r, > > > + struct multi_pack_index *m, > > > + uint32_t pack_int_id) > > > { > > > - struct strbuf pack_name = STRBUF_INIT; > > > - struct strbuf key = STRBUF_INIT; > > > - struct packed_git *p; > > > + uint32_t pack_pos = midx_for_pack(&m, pack_int_id); > > > > > > - pack_int_id = midx_for_pack(&m, pack_int_id); > > > + if (!m->packs[pack_pos]) { > > > + struct strbuf pack_name = STRBUF_INIT; > > > + struct strbuf key = STRBUF_INIT; > > > + struct packed_git *p; > > > > > > - if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > > > - return 1; > > > > Ah, so this series builds on top of my patch that introduces the > > negative lookup cache? That wasn't quite clear to me and makes it a bit > > hard to iterate on my patch now. > > Eek, sorry for the confusion. I mentioned it in the beginning of my > cover letter as well as the (less visible) "base-commit" identifier. Is > there another spot where I could have highlighted the dependency more > clearly? Oh, it was right at the top of that mail. Maybe the square brackets have made me skip it? No idea. In any case it was present, I just happened to read over it. > > Could I suggest that you maybe include that patch as part of this > > series so that those can be iterated on in tandem? > > I could, however your branch is already tentatively marked for 'next', > and Junio applied this topic to a branch based on yours as I had > suggested. So we could change it, but I think it would be more of a > hassle for Junio, so I'd rather avoid doing so. > > Are there changes that you want to make on top of your patch? The only change was that I wanted to add is the `(void *)(intptr_t)-1` part with a macro, as proposed by Peff. You know, it should be trivial to adapt to that change, so I'll just send out a new version now, which should then probably be merged down to `next` soonish anyway. Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau ` (4 preceding siblings ...) 2025-05-25 18:42 ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau @ 2025-05-28 22:58 ` Taylor Blau 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau ` (3 more replies) 5 siblings, 4 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 22:58 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano This series has been updated to be based on v3 of ps/midx-negative-packfile-cache, which is 1f34bf3e08 (midx: stop repeatedly looking up nonexistent packfiles, 2025-05-28) at the time of writing. There are a couple of minor changes from last time, including: - using the new MIDX_PACK_ERROR macro instead of casting '-1' to a void pointer - converting some direct m->pack_names lookups (outside of midx.c and midx-write.c with one exception) to use a similar pattern to access the nth pack name As usual, a range-diff against v1 is below for convenience. Thanks again for your review! Taylor Blau (4): midx: access pack names through `nth_midxed_pack_name()` midx-write.c: guard against incremental MIDXs in want_included_pack() midx-write.c: extract inner loop from fill_packs_from_midx() midx: return a `packed_git` pointer from `prepare_midx_pack()` midx-write.c | 101 +++++++++++++++++++++----------------- midx.c | 88 +++++++++++++++++---------------- midx.h | 6 ++- pack-bitmap.c | 6 +-- t/helper/test-read-midx.c | 7 +-- 5 files changed, 116 insertions(+), 92 deletions(-) Range-diff against v1: 1: a7082dc7f0 < -: ---------- packfile: explain ordering of how we look up auxiliary pack files 2: 73fe0882ee < -: ---------- midx: stop repeatedly looking up nonexistent packfiles 3: ad7295b11b < -: ---------- pack-bitmap.c: fix broken warning() when missing MIDX'd pack 5: 5d97b706e1 ! 1: d3508d3cfb midx-write.c: simplify fill_packs_from_midx() @@ Metadata Author: Taylor Blau <me@ttaylorr.com> ## Commit message ## - midx-write.c: simplify fill_packs_from_midx() + midx: access pack names through `nth_midxed_pack_name()` - When enumerating packs within fill_packs_from_midx(), we open the pack's - index iff we are either writing a reverse index, have a preferred pack - (by name) or both. + Accessing a MIDX's 'pack_names' array is somewhat error-prone when + dealing with incremental MIDX chains, where the (global) pack_int_id for + some pack may differ from the containing layer's index for that pack. - Let's simplify this into a single case by setting the - MIDX_WRITE_REV_INDEX flag bit when we have a preferred_pack_name. This - is a little bit of a shortcut to reduce the line length in the loop - below. But it sets us up nicely to extract the inner loop of this - function out into its own function, where we will no longer have to pass - the preferred_pack_name. + Introduce `nth_midxed_pack_name()` in an effort to reduce a common + source of errors by discouraging external callers from accessing a + layer's `pack_names` array directly. Signed-off-by: Taylor Blau <me@ttaylorr.com> - ## midx-write.c ## -@@ midx-write.c: static int fill_packs_from_midx(struct write_midx_context *ctx, - { - struct multi_pack_index *m; + ## midx.c ## +@@ midx.c: struct packed_git *nth_midxed_pack(struct multi_pack_index *m, + return m->packs[local_pack_int_id]; + } -+ if (preferred_pack_name) { -+ /* -+ * If a preferred pack is specified, need to have -+ * packed_git's loaded to ensure the chosen preferred -+ * pack has a non-zero object count. -+ * -+ * Trick ourselves into thinking that we're writing a -+ * reverse index in this case in order to open up the -+ * pack index file. -+ */ -+ flags |= MIDX_WRITE_REV_INDEX; -+ } ++const char *nth_midxed_pack_name(struct multi_pack_index *m, ++ uint32_t pack_int_id) ++{ ++ uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); ++ return m->pack_names[local_pack_int_id]; ++} + - for (m = ctx->m; m; m = m->base_midx) { - uint32_t i; + #define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t)) -@@ midx-write.c: static int fill_packs_from_midx(struct write_midx_context *ctx, - * If generating a reverse index, need to have - * packed_git's loaded to compare their - * mtimes and object count. -- * -- * If a preferred pack is specified, need to -- * have packed_git's loaded to ensure the chosen -- * preferred pack has a non-zero object count. - */ -- if (flags & MIDX_WRITE_REV_INDEX || -- preferred_pack_name) { -+ if (flags & MIDX_WRITE_REV_INDEX) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { - error(_("could not load pack")); + int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, + + ## midx.h ## +@@ midx.h: struct multi_pack_index *load_multi_pack_index(struct repository *r, + int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); + struct packed_git *nth_midxed_pack(struct multi_pack_index *m, + uint32_t pack_int_id); ++const char *nth_midxed_pack_name(struct multi_pack_index *m, ++ uint32_t pack_int_id); + int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, + struct bitmapped_pack *bp, uint32_t pack_int_id); + int bsearch_one_midx(const struct object_id *oid, struct multi_pack_index *m, + + ## pack-bitmap.c ## +@@ pack-bitmap.c: static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, + for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { + if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { + warning(_("could not open pack %s"), +- bitmap_git->midx->pack_names[i]); ++ nth_midxed_pack_name(bitmap_git->midx, i)); + goto cleanup; + } + } +@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git, + struct bitmapped_pack pack; + if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) { + warning(_("unable to load pack: '%s', disabling pack-reuse"), +- bitmap_git->midx->pack_names[i]); ++ nth_midxed_pack_name(bitmap_git->midx, i)); + free(packs); + return; + } + + ## t/helper/test-read-midx.c ## +@@ t/helper/test-read-midx.c: static int read_midx_file(const char *object_dir, const char *checksum, + printf("\nnum_objects: %d\n", m->num_objects); + + printf("packs:\n"); +- for (i = 0; i < m->num_packs; i++) +- printf("%s\n", m->pack_names[i]); ++ for (i = m->num_packs_in_base; i < m->num_packs + m->num_packs_in_base; ++ i++) ++ printf("%s\n", nth_midxed_pack_name(m, i)); + + printf("object-dir: %s\n", m->object_dir); + +@@ t/helper/test-read-midx.c: static int read_midx_preferred_pack(const char *object_dir) + return 1; + } + +- printf("%s\n", midx->pack_names[preferred_pack]); ++ printf("%s\n", nth_midxed_pack_name(midx, preferred_pack)); + close_midx(midx); + return 0; + } 4: d2f9645aa1 ! 2: cf3ab81a08 midx-write.c: guard against incremental MIDXs in want_included_pack() @@ Commit message if (m->base_midx) die(_("cannot repack an incremental multi-pack-index")); - So want_included_pack() is OK becuase it will never encounter a + So want_included_pack() is OK because it will never encounter a situation where it has to chase backwards through the '->base_midx' pointer. But that is not immediately clear from reading the code, and is too fragile for my comfort. Make this more clear by adding an ASSERT() 6: a4080c12df ! 3: e54988bfea midx-write.c: extract inner loop from fill_packs_from_midx() @@ midx-write.c: static struct multi_pack_index *lookup_multi_pack_index(struct rep +static int fill_packs_from_midx_1(struct write_midx_context *ctx, + struct multi_pack_index *m, -+ uint32_t flags) ++ int prepare_packs) +{ + for (uint32_t i = 0; i < m->num_packs; i++) { -+ ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); -+ + /* + * If generating a reverse index, need to have + * packed_git's loaded to compare their + * mtimes and object count. + */ -+ if (flags & MIDX_WRITE_REV_INDEX) { ++ if (prepare_packs) { + if (prepare_midx_pack(ctx->repo, m, + m->num_packs_in_base + i)) { + error(_("could not load pack")); @@ midx-write.c: static struct multi_pack_index *lookup_multi_pack_index(struct rep static int fill_packs_from_midx(struct write_midx_context *ctx, const char *preferred_pack_name, uint32_t flags) { -@@ midx-write.c: static int fill_packs_from_midx(struct write_midx_context *ctx, - } + struct multi_pack_index *m; ++ int prepare_packs; ++ ++ /* ++ * If generating a reverse index, need to have packed_git's ++ * loaded to compare their mtimes and object count. ++ */ ++ prepare_packs = !!(flags & MIDX_WRITE_REV_INDEX || preferred_pack_name); for (m = ctx->m; m; m = m->base_midx) { - uint32_t i; @@ midx-write.c: static int fill_packs_from_midx(struct write_midx_context *ctx, - * If generating a reverse index, need to have - * packed_git's loaded to compare their - * mtimes and object count. +- * +- * If a preferred pack is specified, need to +- * have packed_git's loaded to ensure the chosen +- * preferred pack has a non-zero object count. - */ -- if (flags & MIDX_WRITE_REV_INDEX) { +- if (flags & MIDX_WRITE_REV_INDEX || +- preferred_pack_name) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { - error(_("could not load pack")); @@ midx-write.c: static int fill_packs_from_midx(struct write_midx_context *ctx, - m->pack_names[i], - m->num_packs_in_base + i); - } -+ int ret = fill_packs_from_midx_1(ctx, m, flags); ++ int ret = fill_packs_from_midx_1(ctx, m, prepare_packs); + if (ret) + return ret; } 7: 80699bb3ee ! 4: e3e21db673 midx: return a `packed_git` pointer from `prepare_midx_pack()` @@ Commit message ## midx-write.c ## @@ midx-write.c: static int fill_packs_from_midx_1(struct write_midx_context *ctx, - uint32_t flags) + int prepare_packs) { for (uint32_t i = 0; i < m->num_packs; i++) { +- /* +- * If generating a reverse index, need to have +- * packed_git's loaded to compare their +- * mtimes and object count. +- */ + struct packed_git *p = NULL; + - ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); - - /* -@@ midx-write.c: static int fill_packs_from_midx_1(struct write_midx_context *ctx, - * mtimes and object count. - */ - if (flags & MIDX_WRITE_REV_INDEX) { ++ ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); + if (prepare_packs) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { + p = prepare_midx_pack(ctx->repo, m, @@ midx-write.c: static int fill_packs_from_midx_1(struct write_midx_context *ctx, } - fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], -+ fill_pack_info(&ctx->info[ctx->nr++], p, - m->pack_names[i], +- m->pack_names[i], ++ fill_pack_info(&ctx->info[ctx->nr++], p, m->pack_names[i], m->num_packs_in_base + i); } + @@ midx-write.c: int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla _("Finding and deleting unreferenced packfiles"), m->num_packs); @@ midx.c: static uint32_t midx_for_pack(struct multi_pack_index **_m, + struct strbuf key = STRBUF_INIT; + struct packed_git *p; -- if (m->packs[pack_int_id] == (void *)(intptr_t)-1) +- if (m->packs[pack_int_id] == MIDX_PACK_ERROR) - return 1; - if (m->packs[pack_int_id]) - return 0; @@ midx.c: static uint32_t midx_for_pack(struct multi_pack_index **_m, + strbuf_release(&key); - if (!p) { -- m->packs[pack_int_id] = (void *)(intptr_t)-1; +- m->packs[pack_int_id] = MIDX_PACK_ERROR; - return 1; -+ m->packs[pack_pos] = p ? p : (void *)(intptr_t)-1; ++ m->packs[pack_pos] = p ? p : MIDX_PACK_ERROR; + if (p) + p->multi_pack_index = 1; } @@ midx.c: static uint32_t midx_for_pack(struct multi_pack_index **_m, - m->packs[pack_int_id] = p; - - return 0; -+ if (m->packs[pack_pos] == (void *)(intptr_t)-1) ++ if (m->packs[pack_pos] == MIDX_PACK_ERROR) + return NULL; + return m->packs[pack_pos]; } @@ midx.h: void get_split_midx_filename_ext(const struct git_hash_algo *hash_algo, + uint32_t pack_int_id); struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id); - int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, + const char *nth_midxed_pack_name(struct multi_pack_index *m, ## pack-bitmap.c ## @@ pack-bitmap.c: static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, @@ pack-bitmap.c: static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, - if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { + if (!prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { warning(_("could not open pack %s"), - bitmap_git->midx->pack_names[i - bitmap_git->midx->num_packs_in_base]); + nth_midxed_pack_name(bitmap_git->midx, i)); goto cleanup; base-commit: 1f34bf3e082741e053d25b76a0ffe31d9d967594 -- 2.49.0.640.ga4de40e6a8 ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau @ 2025-05-28 22:59 ` Taylor Blau 2025-05-29 20:47 ` Junio C Hamano 2025-05-29 20:51 ` Junio C Hamano 2025-05-28 22:59 ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau ` (2 subsequent siblings) 3 siblings, 2 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 22:59 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano Accessing a MIDX's 'pack_names' array is somewhat error-prone when dealing with incremental MIDX chains, where the (global) pack_int_id for some pack may differ from the containing layer's index for that pack. Introduce `nth_midxed_pack_name()` in an effort to reduce a common source of errors by discouraging external callers from accessing a layer's `pack_names` array directly. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx.c | 7 +++++++ midx.h | 2 ++ pack-bitmap.c | 4 ++-- t/helper/test-read-midx.c | 7 ++++--- 4 files changed, 15 insertions(+), 5 deletions(-) diff --git a/midx.c b/midx.c index cd6e766ce2..6705e77881 100644 --- a/midx.c +++ b/midx.c @@ -506,6 +506,13 @@ struct packed_git *nth_midxed_pack(struct multi_pack_index *m, return m->packs[local_pack_int_id]; } +const char *nth_midxed_pack_name(struct multi_pack_index *m, + uint32_t pack_int_id) +{ + uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); + return m->pack_names[local_pack_int_id]; +} + #define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t)) int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, diff --git a/midx.h b/midx.h index 9d1374cbd5..0fb490f4d4 100644 --- a/midx.h +++ b/midx.h @@ -107,6 +107,8 @@ struct multi_pack_index *load_multi_pack_index(struct repository *r, int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id); +const char *nth_midxed_pack_name(struct multi_pack_index *m, + uint32_t pack_int_id); int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, struct bitmapped_pack *bp, uint32_t pack_int_id); int bsearch_one_midx(const struct object_id *oid, struct multi_pack_index *m, diff --git a/pack-bitmap.c b/pack-bitmap.c index b9f1d86604..8ddc150778 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { warning(_("could not open pack %s"), - bitmap_git->midx->pack_names[i]); + nth_midxed_pack_name(bitmap_git->midx, i)); goto cleanup; } } @@ -2469,7 +2469,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git, struct bitmapped_pack pack; if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) { warning(_("unable to load pack: '%s', disabling pack-reuse"), - bitmap_git->midx->pack_names[i]); + nth_midxed_pack_name(bitmap_git->midx, i)); free(packs); return; } diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c index ac81390899..fbed0f6919 100644 --- a/t/helper/test-read-midx.c +++ b/t/helper/test-read-midx.c @@ -53,8 +53,9 @@ static int read_midx_file(const char *object_dir, const char *checksum, printf("\nnum_objects: %d\n", m->num_objects); printf("packs:\n"); - for (i = 0; i < m->num_packs; i++) - printf("%s\n", m->pack_names[i]); + for (i = m->num_packs_in_base; i < m->num_packs + m->num_packs_in_base; + i++) + printf("%s\n", nth_midxed_pack_name(m, i)); printf("object-dir: %s\n", m->object_dir); @@ -108,7 +109,7 @@ static int read_midx_preferred_pack(const char *object_dir) return 1; } - printf("%s\n", midx->pack_names[preferred_pack]); + printf("%s\n", nth_midxed_pack_name(midx, preferred_pack)); close_midx(midx); return 0; } -- 2.49.0.640.ga4de40e6a8 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau @ 2025-05-29 20:47 ` Junio C Hamano 2025-06-03 22:22 ` Taylor Blau 2025-05-29 20:51 ` Junio C Hamano 1 sibling, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2025-05-29 20:47 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Patrick Steinhardt, Jeff King Taylor Blau <me@ttaylorr.com> writes: > Accessing a MIDX's 'pack_names' array is somewhat error-prone when > dealing with incremental MIDX chains, where the (global) pack_int_id for > some pack may differ from the containing layer's index for that pack. > > Introduce `nth_midxed_pack_name()` in an effort to reduce a common > source of errors by discouraging external callers from accessing a > layer's `pack_names` array directly. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > midx.c | 7 +++++++ > midx.h | 2 ++ > pack-bitmap.c | 4 ++-- > t/helper/test-read-midx.c | 7 ++++--- > 4 files changed, 15 insertions(+), 5 deletions(-) Hmph, I am not sure if an accessor really makes it harder to make mistakes, but I'd expect it to be mechanical rewrite from a[n] to fn(a, n)? > +const char *nth_midxed_pack_name(struct multi_pack_index *m, > + uint32_t pack_int_id) > +{ > + uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); > + return m->pack_names[local_pack_int_id]; > +} OK, midx_for_pack() takes a pack_int_id, finds the midx that contains the pack (by updating the 'm' via its pointer arg), and turns pack_int_id into local offset into m->pack_names[] array, and returns that string. > diff --git a/pack-bitmap.c b/pack-bitmap.c > index b9f1d86604..8ddc150778 100644 > --- a/pack-bitmap.c > +++ b/pack-bitmap.c > @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, > for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { > if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { > warning(_("could not open pack %s"), > - bitmap_git->midx->pack_names[i]); > + nth_midxed_pack_name(bitmap_git->midx, i)); This loop runs from 0 to (num_packs + num_packs_in_base). I understand if it runs from num_packs_in_base to (num_packs + num_packs_in_base), iterating only on this layer, but probably this just tries to open everything (i.e. in addition to num_packs we have, we know num_packs_in_base packs are there in our base layer(s), so we iterate from 0 to that number). The updated code converts the global 'i', which runs from 0 to "everything under us" num_packs + num_packs_in_base, to corresponding layer midx plus offset in it, so it looks good, but then, is the original reference to bitmap_git->midx->pack_names[i] even correct? If we have a base, i can run larger than bitmap_git->midx->num_packs, which is the size of the array bitmap_git->midx->pack_names[]. Or, unlike how the proposed log message portrayed this change as (i.e. code clean up), does this patch fix real bugs that manifest only when midx files are chained? > @@ -2469,7 +2469,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git, > struct bitmapped_pack pack; > if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) { > warning(_("unable to load pack: '%s', disabling pack-reuse"), > - bitmap_git->midx->pack_names[i]); > + nth_midxed_pack_name(bitmap_git->midx, i)); > free(packs); > return; > } Similar to the above, this is also in a loop that runs from 0 to num_packs+num_packs_in_base. Is the array access to find the name for the error message in the original even correct when midx are chained? > diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c > index ac81390899..fbed0f6919 100644 > --- a/t/helper/test-read-midx.c > +++ b/t/helper/test-read-midx.c > @@ -53,8 +53,9 @@ static int read_midx_file(const char *object_dir, const char *checksum, > printf("\nnum_objects: %d\n", m->num_objects); > > printf("packs:\n"); > - for (i = 0; i < m->num_packs; i++) > - printf("%s\n", m->pack_names[i]); > + for (i = m->num_packs_in_base; i < m->num_packs + m->num_packs_in_base; > + i++) > + printf("%s\n", nth_midxed_pack_name(m, i)); OK. This used to iterate from 0 to num_packs using the local offset. Now it iterates from num_packs_in_base to num_packs_in_base+num_packs, meaning we iterate over packs in the given midx. No change in behaviour, as accesses to m->pack_names[i] using the local offset in the original was correct, and the updated code iterates using the global offset. This is not a bugfix but is a code cleanup. > @@ -108,7 +109,7 @@ static int read_midx_preferred_pack(const char *object_dir) > return 1; > } > > - printf("%s\n", midx->pack_names[preferred_pack]); > + printf("%s\n", nth_midxed_pack_name(midx, preferred_pack)); Again, is the original buggy when midx are chained? > close_midx(midx); > return 0; > } ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` 2025-05-29 20:47 ` Junio C Hamano @ 2025-06-03 22:22 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-06-03 22:22 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Jeff King On Thu, May 29, 2025 at 01:47:44PM -0700, Junio C Hamano wrote: > > diff --git a/pack-bitmap.c b/pack-bitmap.c > > index b9f1d86604..8ddc150778 100644 > > --- a/pack-bitmap.c > > +++ b/pack-bitmap.c > > @@ -490,7 +490,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, > > for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { > > if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { > > warning(_("could not open pack %s"), > > - bitmap_git->midx->pack_names[i]); > > + nth_midxed_pack_name(bitmap_git->midx, i)); > > This loop runs from 0 to (num_packs + num_packs_in_base). I > understand if it runs from num_packs_in_base to (num_packs + > num_packs_in_base), iterating only on this layer, but probably this > just tries to open everything (i.e. in addition to num_packs we > have, we know num_packs_in_base packs are there in our base layer(s), > so we iterate from 0 to that number). > > The updated code converts the global 'i', which runs from 0 to > "everything under us" num_packs + num_packs_in_base, to > corresponding layer midx plus offset in it, so it looks good, but > then, is the original reference to bitmap_git->midx->pack_names[i] > even correct? If we have a base, i can run larger than > bitmap_git->midx->num_packs, which is the size of the array > bitmap_git->midx->pack_names[]. > > Or, unlike how the proposed log message portrayed this change as > (i.e. code clean up), does this patch fix real bugs that manifest > only when midx files are chained? Right; the original code was buggy if we had a failure opening a MIDX'd pack outside of the base layer in an incremental MIDX bitmap. Reading the proposed log message again, I see what you're saying. I am happy to clarify that this is indeed a bugfix, not just a cleanup. > > @@ -2469,7 +2469,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git, > > struct bitmapped_pack pack; > > if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) { > > warning(_("unable to load pack: '%s', disabling pack-reuse"), > > - bitmap_git->midx->pack_names[i]); > > + nth_midxed_pack_name(bitmap_git->midx, i)); > > free(packs); > > return; > > } > > Similar to the above, this is also in a loop that runs from 0 to > num_packs+num_packs_in_base. Is the array access to find the name > for the error message in the original even correct when midx are > chained? Right; this spot suffers from the same bug as the previous hunk. > > diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c > > index ac81390899..fbed0f6919 100644 > > --- a/t/helper/test-read-midx.c > > +++ b/t/helper/test-read-midx.c > > @@ -53,8 +53,9 @@ static int read_midx_file(const char *object_dir, const char *checksum, > > printf("\nnum_objects: %d\n", m->num_objects); > > > > printf("packs:\n"); > > - for (i = 0; i < m->num_packs; i++) > > - printf("%s\n", m->pack_names[i]); > > + for (i = m->num_packs_in_base; i < m->num_packs + m->num_packs_in_base; > > + i++) > > + printf("%s\n", nth_midxed_pack_name(m, i)); > > OK. This used to iterate from 0 to num_packs using the local > offset. Now it iterates from num_packs_in_base to num_packs_in_base+num_packs, > meaning we iterate over packs in the given midx. No change in > behaviour, as accesses to m->pack_names[i] using the local offset in > the original was correct, and the updated code iterates using the > global offset. This is not a bugfix but is a code cleanup. Right. > > @@ -108,7 +109,7 @@ static int read_midx_preferred_pack(const char *object_dir) > > return 1; > > } > > > > - printf("%s\n", midx->pack_names[preferred_pack]); > > + printf("%s\n", nth_midxed_pack_name(midx, preferred_pack)); > > Again, is the original buggy when midx are chained? This is also a bugfix. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau 2025-05-29 20:47 ` Junio C Hamano @ 2025-05-29 20:51 ` Junio C Hamano 2025-06-03 22:23 ` Taylor Blau 1 sibling, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2025-05-29 20:51 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Patrick Steinhardt, Jeff King Taylor Blau <me@ttaylorr.com> writes: > Accessing a MIDX's 'pack_names' array is somewhat error-prone when > dealing with incremental MIDX chains, where the (global) pack_int_id for > some pack may differ from the containing layer's index for that pack. > > Introduce `nth_midxed_pack_name()` in an effort to reduce a common > source of errors by discouraging external callers from accessing a > layer's `pack_names` array directly. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > midx.c | 7 +++++++ > midx.h | 2 ++ > pack-bitmap.c | 4 ++-- > t/helper/test-read-midx.c | 7 ++++--- > 4 files changed, 15 insertions(+), 5 deletions(-) One thing I forgot to ask. Should we expect that $ git grep -E -e '(\.|->)pack_names\[' to give hits only from the implementation of nth_midxed_pack_name()? > +const char *nth_midxed_pack_name(struct multi_pack_index *m, > + uint32_t pack_int_id) > +{ > + uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); > + return m->pack_names[local_pack_int_id]; > +} ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` 2025-05-29 20:51 ` Junio C Hamano @ 2025-06-03 22:23 ` Taylor Blau 0 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-06-03 22:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Jeff King On Thu, May 29, 2025 at 01:51:03PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > Accessing a MIDX's 'pack_names' array is somewhat error-prone when > > dealing with incremental MIDX chains, where the (global) pack_int_id for > > some pack may differ from the containing layer's index for that pack. > > > > Introduce `nth_midxed_pack_name()` in an effort to reduce a common > > source of errors by discouraging external callers from accessing a > > layer's `pack_names` array directly. > > > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > > --- > > midx.c | 7 +++++++ > > midx.h | 2 ++ > > pack-bitmap.c | 4 ++-- > > t/helper/test-read-midx.c | 7 ++++--- > > 4 files changed, 15 insertions(+), 5 deletions(-) > > One thing I forgot to ask. Should we expect that > > $ git grep -E -e '(\.|->)pack_names\[' > > to give hits only from the implementation of nth_midxed_pack_name()? I waffled on this a bit... and I think the answer is "almost". Certainly callers outside of midx.c or midx-write.c should not be touching "pack_names" directly. But there are a couple of spots in midx-write.c that are really more convenient to directly look at the pack_names array rather than going through the accessor. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau @ 2025-05-28 22:59 ` Taylor Blau 2025-05-28 22:59 ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau 2025-05-28 22:59 ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau 3 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 22:59 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The function want_included_pack() is used to determine whether or not a the packfile corresponding to some given pack_int_id should be included in a 'git multi-pack-index repack' operation. This spot looks like it would be broken, particularly in: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; , when pack_int_id is greater than m->num_pack_in_base (supposing that m->num_packs_in_base is non-zero, or equivalently that m->base_midx is non-NULL). Suppose we have two MIDXs in an incremental MIDX chain, each having two packs: - M0 = {packs: [P0, P1], base_midx: NULL} - M1 = {packs: [P2, P3], base_midx: M0} noting that each pack is identified by its global pack_int_id within the chain. So if you do something like: want_included_pack(the_repository, M1, pack_kept_objects, 2); that function will call prepare_midx_pack(), which is smart enough to realize that the pack of interest is in the current layer (M1), and knows how to adjust its global pack_int_id into an index into the current layer's 'packs' array. But the following line: p = m->packs[pack_int_id]; /* m is M1, pack_int_id is 2 */ looks broken, since each layer of the MIDX only maintains an array of the packs stored within that layer, and 'm' wasn't adjusted back to point at M1->base_midx (M0). The right thing to do would be: struct packed_git *p; if (prepare_midx_pack(r, m, pack_int_id)) return 0; /* open-code midx.c::midx_for_pack(), which is private */ while (m && pack_int_id < m->num_packs_in_base) m = m->base_midx; if (!m) BUG("broken midx?"); if (pack_int_id >= m->num_packs + m->num_packs_in_base) BUG("broken pack_int_id?"); p = m->packs[pack_int_id - m->num_packs_in_base]; But that would be overkill, since this function never deals with incremental MIDXs having more than one layer! To see why, observe that want_included_pack() has two callers: - midx-write.c::fill_included_packs_all() - midx-write.c::fill_included_packs_batch() and those functions' sole caller is in midx-write.c::midx_repack(), which dispatches a call to one or the other depending on whether or not the batch_size is non-zero. And at the very top of midx_repack(), we have a guard against non-trivial incremental MIDX chains: if (m->base_midx) die(_("cannot repack an incremental multi-pack-index")); So want_included_pack() is OK because it will never encounter a situation where it has to chase backwards through the '->base_midx' pointer. But that is not immediately clear from reading the code, and is too fragile for my comfort. Make this more clear by adding an ASSERT() to the above effect. Apply the same treatment to each of the fill_included_packs-related functions as well, since those are deceptively OK by the same reasoning. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/midx-write.c b/midx-write.c index dd3b3070e5..e4a3830d45 100644 --- a/midx-write.c +++ b/midx-write.c @@ -1636,6 +1636,9 @@ static int want_included_pack(struct repository *r, uint32_t pack_int_id) { struct packed_git *p; + + ASSERT(m && !m->base_midx); + if (prepare_midx_pack(r, m, pack_int_id)) return 0; p = m->packs[pack_int_id]; @@ -1655,6 +1658,8 @@ static void fill_included_packs_all(struct repository *r, uint32_t i; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); for (i = 0; i < m->num_packs; i++) { @@ -1675,6 +1680,8 @@ static void fill_included_packs_batch(struct repository *r, struct repack_info *pack_info; int pack_kept_objects = 0; + ASSERT(m && !m->base_midx); + CALLOC_ARRAY(pack_info, m->num_packs); repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); -- 2.49.0.640.ga4de40e6a8 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau 2025-05-28 22:59 ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau @ 2025-05-28 22:59 ` Taylor Blau 2025-05-28 22:59 ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau 3 siblings, 0 replies; 53+ messages in thread From: Taylor Blau @ 2025-05-28 22:59 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The function fill_packs_from_midx() does relatively little, but ends up in a doubly-nested loop because we're enumerating each pack within each layer of the incremental MIDX chain. Let's de-dent the inner loop of fill_packs_from_midx() by extracting its contents into a separate function, and calling that. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 72 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 41 insertions(+), 31 deletions(-) diff --git a/midx-write.c b/midx-write.c index e4a3830d45..ca2384e291 100644 --- a/midx-write.c +++ b/midx-write.c @@ -938,44 +938,54 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r, return result; } +static int fill_packs_from_midx_1(struct write_midx_context *ctx, + struct multi_pack_index *m, + int prepare_packs) +{ + for (uint32_t i = 0; i < m->num_packs; i++) { + /* + * If generating a reverse index, need to have + * packed_git's loaded to compare their + * mtimes and object count. + */ + if (prepare_packs) { + if (prepare_midx_pack(ctx->repo, m, + m->num_packs_in_base + i)) { + error(_("could not load pack")); + return 1; + } + + if (open_pack_index(m->packs[i])) + die(_("could not open index for %s"), + m->packs[i]->pack_name); + } + + fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], + m->pack_names[i], + m->num_packs_in_base + i); + } + + return 0; +} + static int fill_packs_from_midx(struct write_midx_context *ctx, const char *preferred_pack_name, uint32_t flags) { struct multi_pack_index *m; + int prepare_packs; + + /* + * If generating a reverse index, need to have packed_git's + * loaded to compare their mtimes and object count. + */ + prepare_packs = !!(flags & MIDX_WRITE_REV_INDEX || preferred_pack_name); for (m = ctx->m; m; m = m->base_midx) { - uint32_t i; - - for (i = 0; i < m->num_packs; i++) { - ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); - - /* - * If generating a reverse index, need to have - * packed_git's loaded to compare their - * mtimes and object count. - * - * If a preferred pack is specified, need to - * have packed_git's loaded to ensure the chosen - * preferred pack has a non-zero object count. - */ - if (flags & MIDX_WRITE_REV_INDEX || - preferred_pack_name) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { - error(_("could not load pack")); - return 1; - } - - if (open_pack_index(m->packs[i])) - die(_("could not open index for %s"), - m->packs[i]->pack_name); - } - - fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], - m->pack_names[i], - m->num_packs_in_base + i); - } + int ret = fill_packs_from_midx_1(ctx, m, prepare_packs); + if (ret) + return ret; } + return 0; } -- 2.49.0.640.ga4de40e6a8 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau ` (2 preceding siblings ...) 2025-05-28 22:59 ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau @ 2025-05-28 22:59 ` Taylor Blau 2025-05-30 6:50 ` Jeff King 3 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-28 22:59 UTC (permalink / raw) To: git; +Cc: Patrick Steinhardt, Jeff King, Junio C Hamano The prepare_midx_pack() function is designed to convert a MIDX-specific pack_int_id for a given pack into a pointer into an actual `packed_git` structure. In general, these calls look something like: struct packed_git *p; if (prepare_midx_pack(the_repository, some_midx, some_pack_int_id)) die("could not load pack xyz"); p = some_midx->packs[some_pack_int_id]; , and in a pre-incremental MIDX world, this pattern works well. However, in a post-incremental MIDX world, this pattern is a little more prone to errors. These errors can happen when the given 'pack_int_id' is not usable as an index into the 'm->packs' array. And this happens in all layers but the bottom-most one in an incremental MIDX chain. Each layer stores only the packs that are local to that layer of the chain, and offsets them by the total number of packs in the base MIDX(s). But there is other awkwardness here. Thinking back to the above snippet, suppose that the pack with ID 'some_pack_int_id' is in a layer in the middle of the MIDX chain. Then it is still invalid to do: p = some_midx->packs[some_pack_int_id - some_midx->num_packs_in_base]; , becuase the top-most layer (here 'some_midx') may not even have that pack! So we would have to chase down the '->base_midx' pointer in order to get the correct result. midx.c has a helper to do this (called 'midx_for_pack()'), but it is meant only for internal use. That means that a full, working version of the above adjusted to handle incremental MIDXs would look something like: struct packed_git *p; if (prepare_midx_pack(the_repository, some_midx, some_pack_int_id)) die("could not load pack xyz"); while (m && pack_int_id < m->num_packs_in_base) m = m->base_midx; if (!m) BUG("broken midx?"); if (pack_int_id >= m->num_packs + m->num_packs_in_base) BUG("broken pack_int_id?"); p = m->packs[pack_int_id - m->num_packs_in_base]; , which is far too verbose to access a single pack by its pack_int_id in a MIDX chain. Let's instead have prepare_midx_pack() return a pointer to the packed_git structure itself, hiding the above as an implementation detail of prepare_midx_pack(). This patch turns the above snippet into: struct packed_git *p = prepare_midx_pack(the_repository, some_midx, some_pack_int_id); if (!p) die("could not load pack xyz"); making it far easier and less error-prone to access packs by their pack_int_id in a MIDX chain. (In the future, we may want to consider similar treatment for, e.g., the pack_names array. Likewise, it might make sense to rename the "packs" member of the MIDX structure to suggest that it shouldn't be accessed directly outside of midx.c.) Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx-write.c | 44 +++++++++++++--------------- midx.c | 81 +++++++++++++++++++++++++-------------------------- midx.h | 4 ++- pack-bitmap.c | 2 +- 4 files changed, 64 insertions(+), 67 deletions(-) diff --git a/midx-write.c b/midx-write.c index ca2384e291..fc74be813d 100644 --- a/midx-write.c +++ b/midx-write.c @@ -943,25 +943,23 @@ static int fill_packs_from_midx_1(struct write_midx_context *ctx, int prepare_packs) { for (uint32_t i = 0; i < m->num_packs; i++) { - /* - * If generating a reverse index, need to have - * packed_git's loaded to compare their - * mtimes and object count. - */ + struct packed_git *p = NULL; + + ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); if (prepare_packs) { - if (prepare_midx_pack(ctx->repo, m, - m->num_packs_in_base + i)) { + p = prepare_midx_pack(ctx->repo, m, + m->num_packs_in_base + i); + if (!p) { error(_("could not load pack")); return 1; } - if (open_pack_index(m->packs[i])) + if (open_pack_index(p)) die(_("could not open index for %s"), - m->packs[i]->pack_name); + p->pack_name); } - fill_pack_info(&ctx->info[ctx->nr++], m->packs[i], - m->pack_names[i], + fill_pack_info(&ctx->info[ctx->nr++], p, m->pack_names[i], m->num_packs_in_base + i); } @@ -1588,20 +1586,19 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla _("Finding and deleting unreferenced packfiles"), m->num_packs); for (i = 0; i < m->num_packs; i++) { + struct packed_git *p; char *pack_name; display_progress(progress, i + 1); if (count[i]) continue; - if (prepare_midx_pack(r, m, i)) + p = prepare_midx_pack(r, m, i); + if (!p || p->pack_keep || p->is_cruft) continue; - if (m->packs[i]->pack_keep || m->packs[i]->is_cruft) - continue; - - pack_name = xstrdup(m->packs[i]->pack_name); - close_pack(m->packs[i]); + pack_name = xstrdup(p->pack_name); + close_pack(p); string_list_insert(&packs_to_drop, m->pack_names[i]); unlink_pack_path(pack_name, 0); @@ -1649,9 +1646,9 @@ static int want_included_pack(struct repository *r, ASSERT(m && !m->base_midx); - if (prepare_midx_pack(r, m, pack_int_id)) + p = prepare_midx_pack(r, m, pack_int_id); + if (!p) return 0; - p = m->packs[pack_int_id]; if (!pack_kept_objects && p->pack_keep) return 0; if (p->is_cruft) @@ -1697,12 +1694,11 @@ static void fill_included_packs_batch(struct repository *r, repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects); for (i = 0; i < m->num_packs; i++) { + struct packed_git *p = prepare_midx_pack(r, m, i); + pack_info[i].pack_int_id = i; - - if (prepare_midx_pack(r, m, i)) - continue; - - pack_info[i].mtime = m->packs[i]->mtime; + if (p) + pack_info[i].mtime = p->mtime; } for (i = 0; i < m->num_objects; i++) { diff --git a/midx.c b/midx.c index 6705e77881..a2e7a3ec0e 100644 --- a/midx.c +++ b/midx.c @@ -451,50 +451,48 @@ static uint32_t midx_for_pack(struct multi_pack_index **_m, return pack_int_id - m->num_packs_in_base; } -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, - uint32_t pack_int_id) +struct packed_git *prepare_midx_pack(struct repository *r, + struct multi_pack_index *m, + uint32_t pack_int_id) { - struct strbuf pack_name = STRBUF_INIT; - struct strbuf key = STRBUF_INIT; - struct packed_git *p; + uint32_t pack_pos = midx_for_pack(&m, pack_int_id); - pack_int_id = midx_for_pack(&m, pack_int_id); + if (!m->packs[pack_pos]) { + struct strbuf pack_name = STRBUF_INIT; + struct strbuf key = STRBUF_INIT; + struct packed_git *p; - if (m->packs[pack_int_id] == MIDX_PACK_ERROR) - return 1; - if (m->packs[pack_int_id]) - return 0; + strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, + m->pack_names[pack_pos]); - strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, - m->pack_names[pack_int_id]); - - /* pack_map holds the ".pack" name, but we have the .idx */ - strbuf_addbuf(&key, &pack_name); - strbuf_strip_suffix(&key, ".idx"); - strbuf_addstr(&key, ".pack"); - p = hashmap_get_entry_from_hash(&r->objects->pack_map, - strhash(key.buf), key.buf, - struct packed_git, packmap_ent); - if (!p) { - p = add_packed_git(r, pack_name.buf, pack_name.len, m->local); - if (p) { - install_packed_git(r, p); - list_add_tail(&p->mru, &r->objects->packed_git_mru); + /* pack_map holds the ".pack" name, but we have the .idx */ + strbuf_addbuf(&key, &pack_name); + strbuf_strip_suffix(&key, ".idx"); + strbuf_addstr(&key, ".pack"); + p = hashmap_get_entry_from_hash(&r->objects->pack_map, + strhash(key.buf), key.buf, + struct packed_git, packmap_ent); + if (!p) { + p = add_packed_git(r, pack_name.buf, pack_name.len, + m->local); + if (p) { + install_packed_git(r, p); + list_add_tail(&p->mru, + &r->objects->packed_git_mru); + } } - } - strbuf_release(&pack_name); - strbuf_release(&key); + strbuf_release(&pack_name); + strbuf_release(&key); - if (!p) { - m->packs[pack_int_id] = MIDX_PACK_ERROR; - return 1; + m->packs[pack_pos] = p ? p : MIDX_PACK_ERROR; + if (p) + p->multi_pack_index = 1; } - p->multi_pack_index = 1; - m->packs[pack_int_id] = p; - - return 0; + if (m->packs[pack_pos] == MIDX_PACK_ERROR) + return NULL; + return m->packs[pack_pos]; } struct packed_git *nth_midxed_pack(struct multi_pack_index *m, @@ -523,10 +521,11 @@ int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m, if (!m->chunk_bitmapped_packs) return error(_("MIDX does not contain the BTMP chunk")); - if (prepare_midx_pack(r, m, pack_int_id)) - return error(_("could not load bitmapped pack %"PRIu32), pack_int_id); + bp->p = prepare_midx_pack(r, m, pack_int_id); + if (!bp->p) + return error(_("could not load bitmapped pack %"PRIu32), + pack_int_id); - bp->p = m->packs[local_pack_int_id]; bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs + MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * local_pack_int_id); bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs + @@ -623,9 +622,9 @@ int fill_midx_entry(struct repository *r, midx_for_object(&m, pos); pack_int_id = nth_midxed_pack_int_id(m, pos); - if (prepare_midx_pack(r, m, pack_int_id)) + p = prepare_midx_pack(r, m, pack_int_id); + if (!p) return 0; - p = m->packs[pack_int_id - m->num_packs_in_base]; /* * We are about to tell the caller where they can locate the @@ -926,7 +925,7 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag _("Looking for referenced packfiles"), m->num_packs + m->num_packs_in_base); for (i = 0; i < m->num_packs + m->num_packs_in_base; i++) { - if (prepare_midx_pack(r, m, i)) + if (!prepare_midx_pack(r, m, i)) midx_report("failed to load pack in position %d", i); display_progress(progress, i + 1); diff --git a/midx.h b/midx.h index 0fb490f4d4..4ac05b8234 100644 --- a/midx.h +++ b/midx.h @@ -104,7 +104,9 @@ void get_split_midx_filename_ext(const struct git_hash_algo *hash_algo, struct multi_pack_index *load_multi_pack_index(struct repository *r, const char *object_dir, int local); -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); +struct packed_git *prepare_midx_pack(struct repository *r, + struct multi_pack_index *m, + uint32_t pack_int_id); struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id); const char *nth_midxed_pack_name(struct multi_pack_index *m, diff --git a/pack-bitmap.c b/pack-bitmap.c index 8ddc150778..1f72e605d4 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -488,7 +488,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git, } for (i = 0; i < bitmap_git->midx->num_packs + bitmap_git->midx->num_packs_in_base; i++) { - if (prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { + if (!prepare_midx_pack(bitmap_repo(bitmap_git), bitmap_git->midx, i)) { warning(_("could not open pack %s"), nth_midxed_pack_name(bitmap_git->midx, i)); goto cleanup; -- 2.49.0.640.ga4de40e6a8 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-28 22:59 ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau @ 2025-05-30 6:50 ` Jeff King 2025-06-03 22:27 ` Taylor Blau 0 siblings, 1 reply; 53+ messages in thread From: Jeff King @ 2025-05-30 6:50 UTC (permalink / raw) To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano On Wed, May 28, 2025 at 06:59:09PM -0400, Taylor Blau wrote: > Let's instead have prepare_midx_pack() return a pointer to the > packed_git structure itself, hiding the above as an implementation > detail of prepare_midx_pack(). This patch turns the above snippet into: > > struct packed_git *p = prepare_midx_pack(the_repository, some_midx, > some_pack_int_id); > if (!p) > die("could not load pack xyz"); > > making it far easier and less error-prone to access packs by their > pack_int_id in a MIDX chain. So obviously I like this direction, but a few small comments: > (In the future, we may want to consider similar treatment for, e.g., the > pack_names array. Likewise, it might make sense to rename the "packs" > member of the MIDX structure to suggest that it shouldn't be accessed > directly outside of midx.c.) Is this note still valid for v2? It looks like patch 1 adds nth_midxed_pack_name() and tries to use it everywhere. > @@ -1649,9 +1646,9 @@ static int want_included_pack(struct repository *r, > > ASSERT(m && !m->base_midx); > > - if (prepare_midx_pack(r, m, pack_int_id)) > + p = prepare_midx_pack(r, m, pack_int_id); > + if (!p) > return 0; > - p = m->packs[pack_int_id]; > if (!pack_kept_objects && p->pack_keep) > return 0; > if (p->is_cruft) The ASSERT() in the context is from earlier in the series. But do we need it once we have this patch? We no longer look at pack_int_id except to pass it to prepare_midx_pack(), which handles non-base midx slices just fine. So we could loosen the assertion now. Or we could wait for later when somebody wants/needs to do so, but I'm not sure how easy they would find it to dig in the history. They would find the commit that added the ASSERT(), but may not realize that this later commit made it OK to loosen. I didn't check the other ASSERT() spots from that earlier patch (IIRC, some of them may actually look use the pack_int_id for other things, and wouldn't be ready for non-base slices). > -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > - uint32_t pack_int_id) > +struct packed_git *prepare_midx_pack(struct repository *r, > + struct multi_pack_index *m, > + uint32_t pack_int_id) We used to return "1" for failure and "0" for success. Now we're reversed: we return NULL for failure and non-zero for success. So code like: if (prepare_midx_pack(...)) return error("yikes"); needs to be updated, but the compiler won't help us because it is happy to convert both an int and a pointer into a boolean check. Should we rename the function to make sure we catch any callers for topics in flight? I'd have thought we could call it nth_midxed_pack(), but that seems to exist already, with the caveat that it never prepares the pack, but only serves what's in the cache. I wonder if we could simply replace that with what prepare_midx_pack() does, but it may be more conservative to leave the two separate. So I guess nth_midxed_pack_load() or something. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-05-30 6:50 ` Jeff King @ 2025-06-03 22:27 ` Taylor Blau 2025-08-28 23:25 ` Junio C Hamano 0 siblings, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-06-03 22:27 UTC (permalink / raw) To: Jeff King; +Cc: git, Patrick Steinhardt, Junio C Hamano On Fri, May 30, 2025 at 02:50:34AM -0400, Jeff King wrote: > On Wed, May 28, 2025 at 06:59:09PM -0400, Taylor Blau wrote: > > > Let's instead have prepare_midx_pack() return a pointer to the > > packed_git structure itself, hiding the above as an implementation > > detail of prepare_midx_pack(). This patch turns the above snippet into: > > > > struct packed_git *p = prepare_midx_pack(the_repository, some_midx, > > some_pack_int_id); > > if (!p) > > die("could not load pack xyz"); > > > > making it far easier and less error-prone to access packs by their > > pack_int_id in a MIDX chain. > > So obviously I like this direction, but a few small comments: > > > (In the future, we may want to consider similar treatment for, e.g., the > > pack_names array. Likewise, it might make sense to rename the "packs" > > member of the MIDX structure to suggest that it shouldn't be accessed > > directly outside of midx.c.) > > Is this note still valid for v2? It looks like patch 1 adds > nth_midxed_pack_name() and tries to use it everywhere. Yeah, we should get rid of this. I had written it before I wrote what is now the first patch in this series, and neglected to remove it before sending out the latest round. > > @@ -1649,9 +1646,9 @@ static int want_included_pack(struct repository *r, > > > > ASSERT(m && !m->base_midx); > > > > - if (prepare_midx_pack(r, m, pack_int_id)) > > + p = prepare_midx_pack(r, m, pack_int_id); > > + if (!p) > > return 0; > > - p = m->packs[pack_int_id]; > > if (!pack_kept_objects && p->pack_keep) > > return 0; > > if (p->is_cruft) > > The ASSERT() in the context is from earlier in the series. But do we > need it once we have this patch? We no longer look at pack_int_id except > to pass it to prepare_midx_pack(), which handles non-base midx slices > just fine. > > So we could loosen the assertion now. Or we could wait for later when > somebody wants/needs to do so, but I'm not sure how easy they would find > it to dig in the history. They would find the commit that added the > ASSERT(), but may not realize that this later commit made it OK to > loosen. > > I didn't check the other ASSERT() spots from that earlier patch (IIRC, > some of them may actually look use the pack_int_id for other things, and > wouldn't be ready for non-base slices). We could loosen the assertion here, but part of me likes keeping it as a self-documenting note that this function is only intended to be used for non-incremental MIDXs. > > -int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > - uint32_t pack_int_id) > > +struct packed_git *prepare_midx_pack(struct repository *r, > > + struct multi_pack_index *m, > > + uint32_t pack_int_id) > > We used to return "1" for failure and "0" for success. Now we're > reversed: we return NULL for failure and non-zero for success. > > So code like: > > if (prepare_midx_pack(...)) > return error("yikes"); > > needs to be updated, but the compiler won't help us because it is happy > to convert both an int and a pointer into a boolean check. > > Should we rename the function to make sure we catch any callers for > topics in flight? > > I'd have thought we could call it nth_midxed_pack(), but that seems to > exist already, with the caveat that it never prepares the pack, but only > serves what's in the cache. I wonder if we could simply replace that > with what prepare_midx_pack() does, but it may be more conservative to > leave the two separate. So I guess nth_midxed_pack_load() or something. In general there aren't a ton of in-flight changes in the MIDX code at any given time, so I think we could get away without renaming it. But I don't mind erring on the side of caution here, either. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` 2025-06-03 22:27 ` Taylor Blau @ 2025-08-28 23:25 ` Junio C Hamano 0 siblings, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2025-08-28 23:25 UTC (permalink / raw) To: Taylor Blau; +Cc: Jeff King, git, Patrick Steinhardt Taylor Blau <me@ttaylorr.com> writes: > On Fri, May 30, 2025 at 02:50:34AM -0400, Jeff King wrote: >> ... >> Is this note still valid for v2? It looks like patch 1 adds >> nth_midxed_pack_name() and tries to use it everywhere. > > Yeah, we should get rid of this. I had written it before I wrote what is > now the first patch in this series, and neglected to remove it before > sending out the latest round. > ... >> I'd have thought we could call it nth_midxed_pack(), but that seems to >> exist already, with the caveat that it never prepares the pack, but only >> serves what's in the cache. I wonder if we could simply replace that >> with what prepare_midx_pack() does, but it may be more conservative to >> leave the two separate. So I guess nth_midxed_pack_load() or something. > > In general there aren't a ton of in-flight changes in the MIDX code at > any given time, so I think we could get away without renaming it. But I > don't mind erring on the side of caution here, either. The topic went dark after this message. Are there still unresolved issues, or do we want to get it unstuck? Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-22 5:32 ` Jeff King 2025-05-22 15:47 ` Junio C Hamano @ 2025-05-23 1:31 ` Taylor Blau 2025-05-23 2:18 ` Jeff King 1 sibling, 1 reply; 53+ messages in thread From: Taylor Blau @ 2025-05-23 1:31 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Steinhardt, git, Junio C Hamano On Thu, May 22, 2025 at 01:32:35AM -0400, Jeff King wrote: > On Tue, May 20, 2025 at 11:53:10AM +0200, Patrick Steinhardt wrote: > > > @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, > > > > pack_int_id = midx_for_pack(&m, pack_int_id); > > > > + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) > > + return 1; > > if (m->packs[pack_int_id]) > > return 0; > > I did wonder while writing this if we might be able to hide the magic > number and gross casting inside a constant or macro. I think just: > > #define MIDX_PACK_ERROR ((void *)(intptr_t)-1) > > would be enough? > > Though... I agree with the longer-term goal of having prepare_midx_pack() just return a pointer to a struct packed_git. But in the meantime, I do think having a #define for the "oops, I tried to load this packfile and it was broken" case is a good idea. > > @@ -495,6 +499,8 @@ struct packed_git *nth_midxed_pack(struct multi_pack_index *m, > > uint32_t pack_int_id) > > { > > uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); > > + if (m->packs[local_pack_int_id] == (void *)(intptr_t)-1) > > + return NULL; > > return m->packs[local_pack_int_id]; > > Yuck, yet another spot that needs to be aware of the new tri-state > value. One alternative is using an auxiliary array to cache the errors, > and then only the lookup function needs to care. Like: I like this direction, though I dislike having a separate array that we need to keep in sync with m->packs. It might be nice to have an array like: struct { struct packed_git *p; unsigned err:1; } *packs; , which would allow you to keep the error state next to the packed_git itself. I wonder if changing the signature to: int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id, struct packed_git **p_out); would be a good idea. It allows you to pass garbage input (like a non-existent pack_int_id) and get a useful error back. It also allows you to pass a pack_int_id that is valid, but cannot be loaded and get a useful error back via the return value. But I think without actually trying it and seeing what the fallout looks like, it's hard to say whether or not the above is a step in the right direction. Thanks, Taylor ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-23 1:31 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau @ 2025-05-23 2:18 ` Jeff King 0 siblings, 0 replies; 53+ messages in thread From: Jeff King @ 2025-05-23 2:18 UTC (permalink / raw) To: Taylor Blau; +Cc: Patrick Steinhardt, git, Junio C Hamano On Thu, May 22, 2025 at 09:31:48PM -0400, Taylor Blau wrote: > > Yuck, yet another spot that needs to be aware of the new tri-state > > value. One alternative is using an auxiliary array to cache the errors, > > and then only the lookup function needs to care. Like: > > I like this direction, though I dislike having a separate array that we > need to keep in sync with m->packs. It might be nice to have an array > like: > > struct { > struct packed_git *p; > unsigned err:1; > } *packs; > > , which would allow you to keep the error state next to the packed_git > itself. In general, yes, I think array-of-struct is better than struct-of-array. In this particular case the latter is not too bad because the management is all handled centrally in the midx constructor. The downside of doing array-of-struct as you propose is that every site that uses it will need to be modified. But that is at least a one-time pain and not an ongoing maintenance burden (unlike the "magic" value approach). > I wonder if changing the signature to: > > int prepare_midx_pack(struct repository *r, > struct multi_pack_index *m, > uint32_t pack_int_id, > struct packed_git **p_out); > > would be a good idea. It allows you to pass garbage input (like a > non-existent pack_int_id) and get a useful error back. It also allows > you to pass a pack_int_id that is valid, but cannot be loaded and get a > useful error back via the return value. Would the return value be a richer set of values than the current success/fail? If not, then I think just returning the pack pointer does that fine. I was also tempted to suggest that it should take a "struct multi_pack_index **", to return the matching midx as an out-parameter. That would "just work" for callers that want to look at the surrounding midx, too. But maybe it gets weird for ones that are (correctly) expecting to find the pack within the same midx. I.e., code like this: for (i = 0; i < m->num_packs; i++) { if (prepare_midx_pack(r, m, i + m->num_packs_in_base)) die(...); /* do something with m->pack[i] */ } is correct now, and we _shouldn't_ ever need to switch to a different "m" inside prepare_midx_pack(). But if we ever did, propagating that up to the caller would be mighty confusing. So perhaps better to leave it as-is, and let the caller explicitly do midx_for_pack() or whatever to find the right "m" as necessary. > But I think without actually trying it and seeing what the fallout looks > like, it's hard to say whether or not the above is a step in the right > direction. Yup. All of this can wait, too. Patrick's series is fixing its own localized issue (however he wants to structure the extra bit of storage). Most of what we're talking about here is future-proofing so it can happen at our leisure. I think the only other actual bug is the want_included_pack() one discussed in the other part of the thread. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt 2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt 2025-05-20 9:53 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt @ 2025-05-21 13:24 ` Junio C Hamano 2 siblings, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2025-05-21 13:24 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Taylor Blau, Jeff King Patrick Steinhardt <ps@pks.im> writes: > Changes in v2: > - Drop the patch that reorders syscalls and add a comment explaining > why the order is important. Much nicer. > - Add a negative lookup cache for indexed packfiles. Looking very good. Thanks, queued. > - Link to v1: https://lore.kernel.org/r/20250516-pks-pack-avoid-stats-on-missing-v1-1-e2ef4d8798a3@pks.im > > Thanks! > > Patrick > > --- > Patrick Steinhardt (2): > packfile: explain ordering of how we look up auxiliary pack files > midx: stop repeatedly looking up nonexistent packfiles > > midx.c | 10 ++++++++-- > packfile.c | 11 +++++++++++ > 2 files changed, 19 insertions(+), 2 deletions(-) > > Range-diff versus v1: > > 1: 31ffb3712ca < -: ----------- packfile: avoid access(3p) calls for missing packs > -: ----------- > 1: 6125b84389d packfile: explain ordering of how we look up auxiliary pack files > -: ----------- > 2: 8cb82a771c0 midx: stop repeatedly looking up nonexistent packfiles > > --- > base-commit: 1a8a4971cc6c179c4dd711f4a7f5d7178f4b3ab7 > change-id: 20250516-pks-pack-avoid-stats-on-missing-8e3b75755cf0 ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs 2025-05-16 8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt 2025-05-16 18:34 ` Junio C Hamano 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt @ 2025-05-28 12:24 ` Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt ` (2 more replies) 2 siblings, 3 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-28 12:24 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano Hi, this patch addresses an issue we have recently seen in our production systems due to a stale MIDX. The MIDX contained entries for packfiles that didn't exist anymore, which caused Git to repeatedly look up those packfiles. Each missing packfile resulted in four repeated syscalls: three access(3p) calls to check for supporting data structures, and one call to stat(3p) to check for the packfile itself. The first three calls are essentially wasted though when the stat(3p) call itself fails, which is being fixed by this patch. I doubt that the patch matters in almost any repository, but given that the refactoring is trivial I thought to submit the patch regardless of that. Another step would be to introduce a negative lookup cache -- but that would be a bit more involved, so I decided against it for now as I don't want to introduce complexity for dubious gains. Changes in v2: - Drop the patch that reorders syscalls and add a comment explaining why the order is important. - Add a negative lookup cache for indexed packfiles. - Link to v1: https://lore.kernel.org/r/20250516-pks-pack-avoid-stats-on-missing-v1-1-e2ef4d8798a3@pks.im Changes in v3: - Use a macro to hide away the `(void *)(intptr_t)-1` magic. - Link to v2: https://lore.kernel.org/r/20250520-pks-pack-avoid-stats-on-missing-v2-0-333c5217fb05@pks.im Thanks! Patrick --- Patrick Steinhardt (2): packfile: explain ordering of how we look up auxiliary pack files midx: stop repeatedly looking up nonexistent packfiles midx.c | 12 ++++++++++-- packfile.c | 11 +++++++++++ 2 files changed, 21 insertions(+), 2 deletions(-) Range-diff versus v2: 1: cf601a63e98 = 1: c4087674967 packfile: explain ordering of how we look up auxiliary pack files 2: e3108f7ce48 ! 2: 2fa98231464 midx: stop repeatedly looking up nonexistent packfiles @@ Commit message Signed-off-by: Patrick Steinhardt <ps@pks.im> ## midx.c ## +@@ + #include "pack-bitmap.h" + #include "pack-revindex.h" + ++#define MIDX_PACK_ERROR ((void *)(intptr_t)-1) ++ + int midx_checksum_valid(struct multi_pack_index *m); + void clear_midx_files_ext(const char *object_dir, const char *ext, + const char *keep_hash); @@ midx.c: void close_midx(struct multi_pack_index *m) munmap((unsigned char *)m->data, m->data_len); for (i = 0; i < m->num_packs; i++) { - if (m->packs[i]) -+ if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) ++ if (m->packs[i] && m->packs[i] != MIDX_PACK_ERROR) m->packs[i]->multi_pack_index = 0; } FREE_AND_NULL(m->packs); @@ midx.c: int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, pack_int_id = midx_for_pack(&m, pack_int_id); -+ if (m->packs[pack_int_id] == (void *)(intptr_t)-1) ++ if (m->packs[pack_int_id] == MIDX_PACK_ERROR) + return 1; if (m->packs[pack_int_id]) return 0; @@ midx.c: int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, - if (!p) + if (!p) { -+ m->packs[pack_int_id] = (void *)(intptr_t)-1; ++ m->packs[pack_int_id] = MIDX_PACK_ERROR; return 1; + } @@ midx.c: struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id) { uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); -+ if (m->packs[local_pack_int_id] == (void *)(intptr_t)-1) ++ if (m->packs[local_pack_int_id] == MIDX_PACK_ERROR) + return NULL; return m->packs[local_pack_int_id]; } --- base-commit: 1a8a4971cc6c179c4dd711f4a7f5d7178f4b3ab7 change-id: 20250516-pks-pack-avoid-stats-on-missing-8e3b75755cf0 ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt @ 2025-05-28 12:24 ` Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt 2025-05-30 6:27 ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King 2 siblings, 0 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-28 12:24 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano When adding a packfile to an object database we perform four syscalls: - Three calls to access(3p) are done to check for auxiliary data structures. - One call to stat(3p) is done to check for the ".pack" itself. One curious bit is that we perform the access(3p) calls before checking for the packfile itself, but if the packfile doesn't exist we discard all results. The access(3p) calls are thus essentially wasted, so one may be triggered to reorder those calls so that we can short-circuit the other syscalls in case the packfile does not exist. The order in which we look up files is quite important though to help avoid races: - When installing a packfile we move auxiliary data structures into place before we install the ".idx" file. - When deleting a packfile we first delete the ".idx" and ".pack" files before deleting auxiliary data structures. As such, to avoid any races with concurrently created or deleted packs we need to make sure that we _first_ read auxiliary data structures before we read the corresponding ".idx" or ".pack" file. Otherwise it may easily happen that we return a populated but misclassified pack. Add a comment to `add_packed_git()` to make future readers aware of this ordering requirement. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- packfile.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/packfile.c b/packfile.c index d91016f1c7f..933036e2606 100644 --- a/packfile.c +++ b/packfile.c @@ -737,6 +737,17 @@ struct packed_git *add_packed_git(struct repository *r, const char *path, p = alloc_packed_git(r, alloc); memcpy(p->pack_name, path, path_len); + /* + * Note that we have to check auxiliary data structures before we check + * for the ".pack" file to exist to avoid races with a packfile that is + * in the process of being deleted. The ".pack" file is unlinked before + * its auxiliary data structures, so we know that we either get a + * consistent snapshot of all data structures or that we'll fail to + * stat(3p) the packfile itself and thus return `NULL`. + * + * As such, we cannot bail out before the access(3p) calls in case the + * packfile doesn't exist without doing two stat(3p) calls for it. + */ xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep"); if (!access(p->pack_name, F_OK)) p->pack_keep = 1; -- 2.49.0.1266.g31b7d2e469.dirty ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt @ 2025-05-28 12:24 ` Patrick Steinhardt 2025-05-30 6:27 ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King 2 siblings, 0 replies; 53+ messages in thread From: Patrick Steinhardt @ 2025-05-28 12:24 UTC (permalink / raw) To: git; +Cc: Taylor Blau, Jeff King, Junio C Hamano The multi-pack index acts as a cache across a set of packfiles so that we can quickly look up which of those packfiles contains a given object. As such, the multi-pack index naturally needs to be updated every time one of the packfiles goes away, or otherwise the multi-pack index has grown stale. A stale multi-pack index should be handled gracefully by Git though, and in fact it is: if the indexed pack cannot be found we simply ignore it and eventually we fall back to doing the object lookup by just iterating through all packs, even if those aren't indexed. But while this fallback works, it has one significant downside: we don't cache the fact that a pack has vanished. This leads to us repeatedly trying to look up the same pack only to realize that it (still) doesn't exist. This issue can be easily demonstrated by creating a repository with a stale multi-pack index and a couple of objects. We do so by creating a repository with two packfiles, both of which are indexed by the multi-pack index, and then repack those two packfiles. Note that we have to move the multi-pack-index before doing the final repack, as Git knows to delete it otherwise. $ git init repo $ cd repo/ $ git config set maintenance.auto false $ for i in $(seq 1000); do printf "%d-original" $i >file-$i; done $ git add . $ git commit -moriginal $ git repack -dl $ for i in $(seq 1000); do printf "%d-modified" $i >file-$i; done $ git commit -a -mmodified $ git repack -dl $ git multi-pack-index write $ mv .git/objects/pack/multi-pack-index . $ git repack -Adl $ mv multi-pack-index .git/objects/pack/ Commands that cause a lot of objects lookups will now repeatedly invoke `add_packed_git()`, which leads to three failed access(3p) calls as well as one failed stat(3p) call. The following strace for example is done for `git log --patch` in the above repository: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 74.67 0.024693 1 18038 18031 access 25.33 0.008378 1 6045 6017 newfstatat ------ ----------- ----------- --------- --------- ---------------- 100.00 0.033071 1 24083 24048 total Fix the issue by introducing a negative lookup cache for indexed packs. This cache works by simply storing an invalid pointer for a missing pack when `prepare_midx_pack()` fails to look up the pack. Most users of the `packs` array don't need to be adjusted, either, as they all know to call `prepare_midx_pack()` before accessing the array. With this change in place we can now see a significantly reduced number of syscalls: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 73.58 0.000323 5 60 28 newfstatat 26.42 0.000116 5 23 16 access ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000439 5 83 44 total Furthermore, this change also results in a speedup: Benchmark 1: git log --patch (revision = HEAD~) Time (mean ± σ): 50.4 ms ± 2.5 ms [User: 22.0 ms, System: 24.4 ms] Range (min … max): 45.4 ms … 54.9 ms 53 runs Benchmark 2: git log --patch (revision = HEAD) Time (mean ± σ): 12.7 ms ± 0.4 ms [User: 11.1 ms, System: 1.6 ms] Range (min … max): 12.4 ms … 15.0 ms 191 runs Summary git log --patch (revision = HEAD) ran 3.96 ± 0.22 times faster than git log --patch (revision = HEAD~) In the end, it should in theory never be necessary to have this negative lookup cache given that we know to update the multi-pack index together with repacks. But as the change is quite contained and as the speedup can be significant as demonstrated above, it does feel sensible to have the negative lookup cache regardless. Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Patrick Steinhardt <ps@pks.im> --- midx.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/midx.c b/midx.c index 3d0015f7828..cd6e766ce2b 100644 --- a/midx.c +++ b/midx.c @@ -13,6 +13,8 @@ #include "pack-bitmap.h" #include "pack-revindex.h" +#define MIDX_PACK_ERROR ((void *)(intptr_t)-1) + int midx_checksum_valid(struct multi_pack_index *m); void clear_midx_files_ext(const char *object_dir, const char *ext, const char *keep_hash); @@ -405,7 +407,7 @@ void close_midx(struct multi_pack_index *m) munmap((unsigned char *)m->data, m->data_len); for (i = 0; i < m->num_packs; i++) { - if (m->packs[i]) + if (m->packs[i] && m->packs[i] != MIDX_PACK_ERROR) m->packs[i]->multi_pack_index = 0; } FREE_AND_NULL(m->packs); @@ -458,6 +460,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[pack_int_id] == MIDX_PACK_ERROR) + return 1; if (m->packs[pack_int_id]) return 0; @@ -482,8 +486,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, strbuf_release(&pack_name); strbuf_release(&key); - if (!p) + if (!p) { + m->packs[pack_int_id] = MIDX_PACK_ERROR; return 1; + } p->multi_pack_index = 1; m->packs[pack_int_id] = p; @@ -495,6 +501,8 @@ struct packed_git *nth_midxed_pack(struct multi_pack_index *m, uint32_t pack_int_id) { uint32_t local_pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[local_pack_int_id] == MIDX_PACK_ERROR) + return NULL; return m->packs[local_pack_int_id]; } -- 2.49.0.1266.g31b7d2e469.dirty ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt @ 2025-05-30 6:27 ` Jeff King 2 siblings, 0 replies; 53+ messages in thread From: Jeff King @ 2025-05-30 6:27 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Taylor Blau, Junio C Hamano On Wed, May 28, 2025 at 02:24:09PM +0200, Patrick Steinhardt wrote: > Changes in v3: > - Use a macro to hide away the `(void *)(intptr_t)-1` magic. > - Link to v2: https://lore.kernel.org/r/20250520-pks-pack-avoid-stats-on-missing-v2-0-333c5217fb05@pks.im Thanks, this looks good to me. -Peff ^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2025-08-28 23:25 UTC | newest] Thread overview: 53+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-05-16 8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt 2025-05-16 18:34 ` Junio C Hamano 2025-05-19 6:52 ` Jeff King 2025-05-19 15:46 ` Junio C Hamano 2025-05-20 6:45 ` Patrick Steinhardt 2025-05-22 5:28 ` Jeff King 2025-05-23 1:02 ` Taylor Blau 2025-05-23 2:03 ` Jeff King 2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt 2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt 2025-05-23 1:03 ` Taylor Blau 2025-05-20 9:53 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt 2025-05-22 5:32 ` Jeff King 2025-05-22 15:47 ` Junio C Hamano 2025-05-22 16:59 ` Jeff King 2025-05-22 18:44 ` Junio C Hamano 2025-05-23 1:22 ` Taylor Blau 2025-05-23 2:08 ` Jeff King 2025-05-23 17:46 ` Taylor Blau 2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:00 ` Taylor Blau 2025-05-25 18:41 ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:08 ` Taylor Blau 2025-05-25 18:41 ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau 2025-05-26 7:23 ` Patrick Steinhardt 2025-05-28 2:15 ` Taylor Blau 2025-05-25 18:42 ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau 2025-05-25 18:42 ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau 2025-05-26 7:24 ` Patrick Steinhardt 2025-05-28 2:18 ` Taylor Blau 2025-05-28 11:53 ` Patrick Steinhardt 2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau 2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau 2025-05-29 20:47 ` Junio C Hamano 2025-06-03 22:22 ` Taylor Blau 2025-05-29 20:51 ` Junio C Hamano 2025-06-03 22:23 ` Taylor Blau 2025-05-28 22:59 ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau 2025-05-28 22:59 ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau 2025-05-28 22:59 ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau 2025-05-30 6:50 ` Jeff King 2025-06-03 22:27 ` Taylor Blau 2025-08-28 23:25 ` Junio C Hamano 2025-05-23 1:31 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau 2025-05-23 2:18 ` Jeff King 2025-05-21 13:24 ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano 2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt 2025-05-28 12:24 ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt 2025-05-30 6:27 ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).