All of lore.kernel.org
 help / color / mirror / Atom feed
From: shejialuo <shejialuo@gmail.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: Jeff King <peff@peff.net>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	Collin Funk <collin.funk1@gmail.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] wrapper: Fix a errno discrepancy on NetBSD.
Date: Mon, 5 May 2025 20:17:19 +0800	[thread overview]
Message-ID: <aBisTxORH95BgLIT@ArchLinux> (raw)
In-Reply-To: <aBhdH9jWpnpbkPHn@pks.im>

On Mon, May 05, 2025 at 08:39:27AM +0200, Patrick Steinhardt wrote:
> On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> > On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
> > 
> > > > PS I notice that this same function reads the whole packed-refs file
> > > >    into a strbuf. That may be a problem, as they can grow pretty big in
> > > >    extreme cases (e.g., GitHub's fork networks easily got into the
> > > >    gigabytes, as it was every ref of every fork). We usually mmap it.
> > > >    Not related to this discussion, but just something I noticed while
> > > >    reading the function.
> > > 
> > > Peff, thanks for notifying me. I want to know more background.
> > > Initially, the reason why I don't use `mmap` is that when checking the
> > > ref consistency, we usually don't need to share the "packed-refs"
> > > content for multiple processes via `mmap`.
> > 
> > You're not sharing with other processes running fsck, but you'd be
> > sharing the memory with all of the other processes using that
> > packed-refs file for normal lookups.
> > 
> > But even if it's shared with nobody, reading it all into memory is
> > strictly worse than just mmap (since the data is getting copied into the
> > new allocation).
> > 
> > > I don't know how Github executes "git fsck" for the forked repositories.
> > > Is there any regular tasks for "git fsck"? And would "packed-refs" file
> > > be shared for all these repositories?
> > 
> > I don't know offhand how often GitHub runs fsck in an automated way
> > these days. Or even how big packed-refs files get, for that matter.
> 
> They typically are at most a couple of megabytes, but there certainly
> are outliers. For as at GitLab.com, the vast majority (>99%) of such
> files is less than 50MB and typically even less than 5MB.
> 
> > The specific case I'm thinking of for GitHub is that each fork network
> > has a master "network.git" repo that stores the objects for all of the
> > forks (which point to it via their objects/info/alternates files).  That
> > network.git repo doesn't technically need to have all of the refs all
> > the time, but in practice it wants to know about them for reachability
> > during repacking, etc.
> > 
> > So it has something like "refs/remotes/<fork_id>/heads/master", and so
> > on, copying the whole refs/* namespace of each fork. If you look at,
> > say, torvalds/linux, the refs data for a single fork is probably ~30k or
> > so (based on looking at what's in a clone). And there are ~55k forks. So
> > that's around 1.5G. Not a deal-breaker to allocate (keeping in mind they
> > have pretty beefy systems), but enough that mmap is probably better.
> > 
> > I'm also sure that's not the worst case. It has a lot of forks but the
> > ref namespace is not that huge compared to some other projects (and it's
> > the product of the two that is the problem).
> 
> Yeah, the interesting case is always the outliers. One of the worst
> offenders we have at GitLab.com is our own "gitlab-org/gitlab"
> repository. This particular repository has a "packed-refs" file that is
> around 2GB in size.
> 
> So I think refactoring this code to use `mmap()` would probably make
> sense.
> 

Thank Peff and Patrick for the information. I will send a patch later.

> Patrick

  reply	other threads:[~2025-05-05 12:16 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-02 23:33 [PATCH] wrapper: Fix a errno discrepancy on NetBSD Collin Funk
2025-05-03  0:57 ` brian m. carlson
2025-05-03  1:05   ` Junio C Hamano
2025-05-03  4:21     ` Collin Funk
2025-05-03 17:32       ` Junio C Hamano
2025-05-03  3:48   ` Collin Funk
2025-05-03 13:31   ` Jeff King
2025-05-03 14:58     ` shejialuo
2025-05-03 15:49       ` Jeff King
2025-05-05  6:39         ` Patrick Steinhardt
2025-05-05 12:17           ` shejialuo [this message]
2025-05-03 18:56     ` Collin Funk
2025-05-05 15:43     ` Junio C Hamano
2025-05-05 18:03       ` Jeff King
2025-05-06 13:43         ` shejialuo
2025-05-06 22:58         ` Junio C Hamano
2025-05-03  4:16 ` [PATCH v2] wrapper: NetBSD gives EFTYPE and FreeBSD gives EMFILE where POSIX uses ELOOP Collin Funk
2025-05-03 15:45   ` brian m. carlson
2025-05-03 18:44     ` Collin Funk
2025-05-05  6:43   ` Patrick Steinhardt
2025-05-05 20:41     ` Junio C Hamano
2025-05-06  1:16       ` Collin Funk
2025-05-06 13:23         ` Patrick Steinhardt
2025-05-06  1:08   ` [PATCH v3] " Collin Funk
2025-05-06 13:24     ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBisTxORH95BgLIT@ArchLinux \
    --to=shejialuo@gmail.com \
    --cc=collin.funk1@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.