git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: shejialuo <shejialuo@gmail.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: Jeff King <peff@peff.net>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	Collin Funk <collin.funk1@gmail.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] wrapper: Fix a errno discrepancy on NetBSD.
Date: Mon, 5 May 2025 20:17:19 +0800	[thread overview]
Message-ID: <aBisTxORH95BgLIT@ArchLinux> (raw)
In-Reply-To: <aBhdH9jWpnpbkPHn@pks.im>

On Mon, May 05, 2025 at 08:39:27AM +0200, Patrick Steinhardt wrote:
> On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> > On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
> > 
> > > > PS I notice that this same function reads the whole packed-refs file
> > > >    into a strbuf. That may be a problem, as they can grow pretty big in
> > > >    extreme cases (e.g., GitHub's fork networks easily got into the
> > > >    gigabytes, as it was every ref of every fork). We usually mmap it.
> > > >    Not related to this discussion, but just something I noticed while
> > > >    reading the function.
> > > 
> > > Peff, thanks for notifying me. I want to know more background.
> > > Initially, the reason why I don't use `mmap` is that when checking the
> > > ref consistency, we usually don't need to share the "packed-refs"
> > > content for multiple processes via `mmap`.
> > 
> > You're not sharing with other processes running fsck, but you'd be
> > sharing the memory with all of the other processes using that
> > packed-refs file for normal lookups.
> > 
> > But even if it's shared with nobody, reading it all into memory is
> > strictly worse than just mmap (since the data is getting copied into the
> > new allocation).
> > 
> > > I don't know how Github executes "git fsck" for the forked repositories.
> > > Is there any regular tasks for "git fsck"? And would "packed-refs" file
> > > be shared for all these repositories?
> > 
> > I don't know offhand how often GitHub runs fsck in an automated way
> > these days. Or even how big packed-refs files get, for that matter.
> 
> They typically are at most a couple of megabytes, but there certainly
> are outliers. For as at GitLab.com, the vast majority (>99%) of such
> files is less than 50MB and typically even less than 5MB.
> 
> > The specific case I'm thinking of for GitHub is that each fork network
> > has a master "network.git" repo that stores the objects for all of the
> > forks (which point to it via their objects/info/alternates files).  That
> > network.git repo doesn't technically need to have all of the refs all
> > the time, but in practice it wants to know about them for reachability
> > during repacking, etc.
> > 
> > So it has something like "refs/remotes/<fork_id>/heads/master", and so
> > on, copying the whole refs/* namespace of each fork. If you look at,
> > say, torvalds/linux, the refs data for a single fork is probably ~30k or
> > so (based on looking at what's in a clone). And there are ~55k forks. So
> > that's around 1.5G. Not a deal-breaker to allocate (keeping in mind they
> > have pretty beefy systems), but enough that mmap is probably better.
> > 
> > I'm also sure that's not the worst case. It has a lot of forks but the
> > ref namespace is not that huge compared to some other projects (and it's
> > the product of the two that is the problem).
> 
> Yeah, the interesting case is always the outliers. One of the worst
> offenders we have at GitLab.com is our own "gitlab-org/gitlab"
> repository. This particular repository has a "packed-refs" file that is
> around 2GB in size.
> 
> So I think refactoring this code to use `mmap()` would probably make
> sense.
> 

Thank Peff and Patrick for the information. I will send a patch later.

> Patrick

  reply	other threads:[~2025-05-05 12:16 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-02 23:33 [PATCH] wrapper: Fix a errno discrepancy on NetBSD Collin Funk
2025-05-03  0:57 ` brian m. carlson
2025-05-03  1:05   ` Junio C Hamano
2025-05-03  4:21     ` Collin Funk
2025-05-03 17:32       ` Junio C Hamano
2025-05-03  3:48   ` Collin Funk
2025-05-03 13:31   ` Jeff King
2025-05-03 14:58     ` shejialuo
2025-05-03 15:49       ` Jeff King
2025-05-05  6:39         ` Patrick Steinhardt
2025-05-05 12:17           ` shejialuo [this message]
2025-05-03 18:56     ` Collin Funk
2025-05-05 15:43     ` Junio C Hamano
2025-05-05 18:03       ` Jeff King
2025-05-06 13:43         ` shejialuo
2025-05-06 22:58         ` Junio C Hamano
2025-05-03  4:16 ` [PATCH v2] wrapper: NetBSD gives EFTYPE and FreeBSD gives EMFILE where POSIX uses ELOOP Collin Funk
2025-05-03 15:45   ` brian m. carlson
2025-05-03 18:44     ` Collin Funk
2025-05-05  6:43   ` Patrick Steinhardt
2025-05-05 20:41     ` Junio C Hamano
2025-05-06  1:16       ` Collin Funk
2025-05-06 13:23         ` Patrick Steinhardt
2025-05-06  1:08   ` [PATCH v3] " Collin Funk
2025-05-06 13:24     ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBisTxORH95BgLIT@ArchLinux \
    --to=shejialuo@gmail.com \
    --cc=collin.funk1@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).