git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org,  Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH] packfile: avoid access(3p) calls for missing packs
Date: Fri, 16 May 2025 11:34:10 -0700	[thread overview]
Message-ID: <xmqq7c2gv1zx.fsf@gitster.g> (raw)
In-Reply-To: <20250516-pks-pack-avoid-stats-on-missing-v1-1-e2ef4d8798a3@pks.im> (Patrick Steinhardt's message of "Fri, 16 May 2025 10:55:10 +0200")

Patrick Steinhardt <ps@pks.im> writes:

> The function `add_packed_git()` adds a packfile to the known list of
> packs in case it exists. In case the packfile doesn't look like a pack
> though, or in case it doesn't exist, the function will simply return a
> `NULL` pointer.
>
> The ordering in which those checks are done is somewhat curious though:
> we first perform a couple of access(3p) syscalls for auxiliary files
> like ".keep" before we determine whether the ".pack" file itself exists.
> And if we see that the ".pack" file does not exist, we bail out and thus
> effectively discard all the information. This means that the access(3p)
> calls were a complete waste of time.
>
> The reason why we do this is likely because we reuse `p->pack_name` to
> derive the other files' names, as well, so that we only have to allocate
> this buffer once. As such, we have to compute the packfile name last so
> that it doesn't get overwritten by the other names.

I vaguely recall that the order of checks between .idx and .pack
were deliberately chosen, as we do not want to add and attempt to
use .pack file before its associated .idx file is ready since
without the latter the former is unusable at runtime.  

    Side note: It may have been more important back when the name of
    the packfile was a hash of names of the objects in the pack,
    which meant that when you repack a quiescent and fully repacked
    repository twice with different parameters, you could have ended
    up with a packfile whose name is the same but the contents as a
    bytestream (hence the object offset) are different, making a
    stale .idx not pointing into the correct position in the new
    .pack file.  These days the name of the packfile is based on the
    contents of the pack as a bytestream, so we no longer suffer
    in such a scenario.

But I wouldn't be surprised if checks for other auxiliary files were
added later to come next to .idx not .pack, exactly because .pack is
the primary thing.

> All of this likely doesn't matter in the general case: Git shouldn't end
> up looking too many nonexistent packfiles anyway.

In any case, we shouldn't attempt to use .pack when its associated
files are missing.  It is not about nonexistent but about incomplete
(including "in the process of being written").

> But there are edge
> cases where it _does_ matter. One such edge case that we have hit in our
> production systems was a stale "multi-pack-index" file that contained
> references to nonexistent packfiles. As there is no negative lookup
> cache this causes us to repeatedly look up the same packfiles via the
> multi-pack index, only to realize that they don't exist. This translated
> into hundreds of thousands of syscalls to access(3p) and stat(3p).

Ouch.  Multi-pack index is a more recent invention, and I am not
surprised if such bugs still lurk.  And as you hinted, a solution
may be a negative "known-not-to-exist" list, but given one such
stale midx, would we try to open the same missing packfiles over and
over within the same process?

> while the issue was entirely self-made because the multi-pack index
> should have been regenerated, we can still reduce the number of syscalls
> by 75% in the case of nonexistent packfiles by reordering these calls.

That sounds more like a band-aid, if we still do the remaining 25%
that we somehow know would be unnecessary.

Note that I do not know if the ordering that I think was originally
introduced for correctness still matters for correctness.  The first
paragraph of this message was written without consulting "git log",
lore archive, or even the current state of the source, but from
vague memory.


  reply	other threads:[~2025-05-16 18:34 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-16  8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt
2025-05-16 18:34 ` Junio C Hamano [this message]
2025-05-19  6:52   ` Jeff King
2025-05-19 15:46     ` Junio C Hamano
2025-05-20  6:45     ` Patrick Steinhardt
2025-05-22  5:28       ` Jeff King
2025-05-23  1:02     ` Taylor Blau
2025-05-23  2:03       ` Jeff King
2025-05-20  9:53 ` [PATCH v2 0/2] " Patrick Steinhardt
2025-05-20  9:53   ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-23  1:03     ` Taylor Blau
2025-05-20  9:53   ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-22  5:32     ` Jeff King
2025-05-22 15:47       ` Junio C Hamano
2025-05-22 16:59         ` Jeff King
2025-05-22 18:44           ` Junio C Hamano
2025-05-23  1:22           ` Taylor Blau
2025-05-23  2:08             ` Jeff King
2025-05-23 17:46               ` Taylor Blau
2025-05-25 18:41                 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-25 18:41                   ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:00                       ` Taylor Blau
2025-05-25 18:41                   ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:08                       ` Taylor Blau
2025-05-25 18:41                   ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:15                       ` Taylor Blau
2025-05-25 18:42                   ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-25 18:42                   ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-26  7:24                     ` Patrick Steinhardt
2025-05-28  2:18                       ` Taylor Blau
2025-05-28 11:53                         ` Patrick Steinhardt
2025-05-28 22:58                   ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau
2025-05-29 20:47                       ` Junio C Hamano
2025-06-03 22:22                         ` Taylor Blau
2025-05-29 20:51                       ` Junio C Hamano
2025-06-03 22:23                         ` Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-30  6:50                       ` Jeff King
2025-06-03 22:27                         ` Taylor Blau
2025-08-28 23:25                           ` Junio C Hamano
2025-05-23  1:31       ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau
2025-05-23  2:18         ` Jeff King
2025-05-21 13:24   ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano
2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt
2025-05-28 12:24   ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-28 12:24   ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-30  6:27   ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq7c2gv1zx.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).