All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: mhagger@alum.mit.edu
Cc: Git Mailing List <git@vger.kernel.org>,
	Lars Schneider <larsxschneider@gmail.com>
Subject: Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
Date: Fri, 16 Mar 2018 21:01:42 +0100	[thread overview]
Message-ID: <87370zeqmx.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <CAMy9T_FaOdLP482YZcMX16mpy_EgM0ok1GKg45rE=X+HTGxSiQ@mail.gmail.com>


On Fri, Mar 16 2018, Michael Haggerty jotted:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the respository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
>
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
> bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
> names or file paths, huge commit messages, etc.
>
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

This is a very useful tool. I've been using it to get insight into some
bad repositories.

Suggestion for a thing to add to it, I don't have the time on the Go
tuits:

One thing that can make repositories very pathological is if the ratio
of trees to commits is too low.

I was dealing with a repo the other day that had several thousand files
all in the same root directory, and no subdirectories.

This meant that doing `git log -- <file>` was very expensive. I wrote a
bit about this on this related ticket the other day:
https://gitlab.com/gitlab-org/gitlab-ce/issues/42104#note_54933512

But it's not something where you can just say having more trees is
better, because on the other end of the spectrume we can imagine a repo
like linux.git where each file like COPYING instead exists at
C/O/P/Y/I/N/G, that would also be pathological.

It would be very interesting to do some tests to see what the optimal
value would be.

I also suspect it's not really about the commit / tree ratio, but that
you have some reasonable amount of nested trees per file, *and* that
changes to them are reasonably spread out. I.e. it doesn't help if you
have a doc/ and a src/ directory if 99% of your commits change src/, and
if you're doing 'git log -- src/something.c'.

Which is all a very long-winded way of saying that I don't know what the
general rule is, but I have some suspicions, but having all your files
in the root is definitely bad.

  reply	other threads:[~2018-03-16 20:02 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-16 15:28 [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository Michael Haggerty
2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason [this message]
2018-03-16 21:29   ` Jeff King
2018-03-18 19:06     ` Michael Haggerty
2018-03-21 16:02 ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87370zeqmx.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=larsxschneider@gmail.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.