git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: shejialuo <shejialuo@gmail.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	Collin Funk <collin.funk1@gmail.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] wrapper: Fix a errno discrepancy on NetBSD.
Date: Sat, 3 May 2025 11:49:28 -0400	[thread overview]
Message-ID: <20250503154928.GA3412@coredump.intra.peff.net> (raw)
In-Reply-To: <aBYvMjtGjzEhKg4s@ArchLinux>

On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:

> > PS I notice that this same function reads the whole packed-refs file
> >    into a strbuf. That may be a problem, as they can grow pretty big in
> >    extreme cases (e.g., GitHub's fork networks easily got into the
> >    gigabytes, as it was every ref of every fork). We usually mmap it.
> >    Not related to this discussion, but just something I noticed while
> >    reading the function.
> 
> Peff, thanks for notifying me. I want to know more background.
> Initially, the reason why I don't use `mmap` is that when checking the
> ref consistency, we usually don't need to share the "packed-refs"
> content for multiple processes via `mmap`.

You're not sharing with other processes running fsck, but you'd be
sharing the memory with all of the other processes using that
packed-refs file for normal lookups.

But even if it's shared with nobody, reading it all into memory is
strictly worse than just mmap (since the data is getting copied into the
new allocation).

> I don't know how Github executes "git fsck" for the forked repositories.
> Is there any regular tasks for "git fsck"? And would "packed-refs" file
> be shared for all these repositories?

I don't know offhand how often GitHub runs fsck in an automated way
these days. Or even how big packed-refs files get, for that matter.

The specific case I'm thinking of for GitHub is that each fork network
has a master "network.git" repo that stores the objects for all of the
forks (which point to it via their objects/info/alternates files).  That
network.git repo doesn't technically need to have all of the refs all
the time, but in practice it wants to know about them for reachability
during repacking, etc.

So it has something like "refs/remotes/<fork_id>/heads/master", and so
on, copying the whole refs/* namespace of each fork. If you look at,
say, torvalds/linux, the refs data for a single fork is probably ~30k or
so (based on looking at what's in a clone). And there are ~55k forks. So
that's around 1.5G. Not a deal-breaker to allocate (keeping in mind they
have pretty beefy systems), but enough that mmap is probably better.

I'm also sure that's not the worst case. It has a lot of forks but the
ref namespace is not that huge compared to some other projects (and it's
the product of the two that is the problem).

> If above is the case, I agree that we should reuse the logic of
> "load_contents" to enhance. But I don't know whether we need to do this
> in the first place.

I think you can skip the stat validity bits. In theory you can also skip
the mmap_strategy stuff, but I guess it might mean that "fsck" could
block other writers on Windows temporarily (though we wouldn't plan to
hold it open long, the way the normal reader does).

The other gotcha is that the result won't be NUL-terminated, but it
looks like the helper functions already take an "eof" pointer to avoid
looking past the end of what was read.

-Peff

  reply	other threads:[~2025-05-03 15:49 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-02 23:33 [PATCH] wrapper: Fix a errno discrepancy on NetBSD Collin Funk
2025-05-03  0:57 ` brian m. carlson
2025-05-03  1:05   ` Junio C Hamano
2025-05-03  4:21     ` Collin Funk
2025-05-03 17:32       ` Junio C Hamano
2025-05-03  3:48   ` Collin Funk
2025-05-03 13:31   ` Jeff King
2025-05-03 14:58     ` shejialuo
2025-05-03 15:49       ` Jeff King [this message]
2025-05-05  6:39         ` Patrick Steinhardt
2025-05-05 12:17           ` shejialuo
2025-05-03 18:56     ` Collin Funk
2025-05-05 15:43     ` Junio C Hamano
2025-05-05 18:03       ` Jeff King
2025-05-06 13:43         ` shejialuo
2025-05-06 22:58         ` Junio C Hamano
2025-05-03  4:16 ` [PATCH v2] wrapper: NetBSD gives EFTYPE and FreeBSD gives EMFILE where POSIX uses ELOOP Collin Funk
2025-05-03 15:45   ` brian m. carlson
2025-05-03 18:44     ` Collin Funk
2025-05-05  6:43   ` Patrick Steinhardt
2025-05-05 20:41     ` Junio C Hamano
2025-05-06  1:16       ` Collin Funk
2025-05-06 13:23         ` Patrick Steinhardt
2025-05-06  1:08   ` [PATCH v3] " Collin Funk
2025-05-06 13:24     ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250503154928.GA3412@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=collin.funk1@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=shejialuo@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).