From: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
To: Chris Packham <judge.packham@gmail.com>
Cc: git@vger.kernel.org, weigelt@metux.de, spearce@spearce.org,
jrnieder@gmail.com, Matthieu.Moy@grenoble-inp.fr,
raa.lkml@gmail.com, Junio C Hamano <gitster@pobox.com>
Subject: Re: [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
Date: Wed, 6 Oct 2010 06:52:54 +0700
Message-ID: <AANLkTimb2n4oaEBBr8RJnv4C5xoD-shP7DiDFf+Tcfde@mail.gmail.com>
In-Reply-To: <4CAB4FC4.4030002@gmail.com>
2010/10/5 Chris Packham <judge.packham@gmail.com>:
> On 05/10/10 06:00, Nguyễn Thái Ngọc Duy wrote:
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>> I wanted to write a more detailed description, per command. It would
>> serve as guidance for people with special repos, and as TODOs for Git
>> developers. But analyzing each command seems like a lot of work.
>>
>> Instead I wrote this text to warn users where performance may degrade,
>> and to point them to features that might help. Did I miss anything?
>>
>> There were discussions in the past on keeping large files out-of-repo,
>> with symlinks to them in-repo. That sounds like good advice, doesn't it?
>>
>> Documentation/git.txt | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/git.txt b/Documentation/git.txt
>> index dd57bdc..8408923 100644
>> --- a/Documentation/git.txt
>> +++ b/Documentation/git.txt
>> @@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
>> for a given pathname. These stages are used to hold the various
>> unmerged version of a file when a merge is in progress.
>>
>> +Performance concerns
>> +--------------------
>> +
>> +Git is written with performance in mind and it works extremely well
>> +with its typical repositories (i.e. source code repositories, with
>> +a moderate number of small text files, possibly with long history).
>> +Non-typical repositories (huge number of files, or very large
>> +files...) may experience performance degradation. This section describes
I probably should have written "experience mild performance degradation".
>> +how Git behaves in such repositories and how to reduce impact.
>
> How huge is "huge" and how large is "large"? From previous threads on
> this list I'm guessing "large" is files bigger than physical RAM. I've
A file that takes up a significant portion of RAM is enough to start
swapping. There's also a hard limit imposed by mmap(): a file cannot be
larger than the available address space (2-3G on x86, much larger on x86_64).
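One rough way to see whether a working tree has files in that range is to
list everything at or above a size threshold. This is only a sketch: the
large_files helper and its 1 GiB default are assumptions made for
illustration, not limits taken from Git itself.

```shell
#!/bin/sh
# Sketch: list working-tree files at or above a size threshold.
# large_files and the 1 GiB default are hypothetical, not part of Git.
large_files() {
    dir=$1
    min_bytes=${2:-1073741824}
    # "-size +Nc" matches files strictly larger than N bytes, so N is
    # min_bytes - 1. The .git directory is pruned from the search.
    find "$dir" -path "$dir/.git" -prune -o \
        -type f -size +"$((min_bytes - 1))"c -print
}

# Demo on a scratch directory, with the threshold lowered to 64 KiB.
demo=$(mktemp -d)
printf 'small\n' > "$demo/small.txt"
dd if=/dev/zero of="$demo/big.bin" bs=1024 count=64 2>/dev/null
found=$(large_files "$demo" 65536)
echo "$found"
rm -rf "$demo"
```

On a real tree, something like `large_files . $((512 * 1024 * 1024))` would
list candidates to keep an eye on.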
> not really run into a situation where a huge number of files causes
> performance problems.
gentoo-x86 has ~100k files. Cold-cache time is definitely long. Even
with a hot cache, a full cache refresh may take half a second or so, as
far as I remember. It depends on many factors, so I don't think I can
draw a clear limit.
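For anyone who wants numbers for their own tree, a minimal sketch of
measuring it follows. It assumes "full cache refresh" corresponds to
`git update-index --refresh`, builds a small scratch repository (200 files,
an arbitrary number) so it can run anywhere, and only has one-second timer
resolution, so it is meaningful only on much larger trees.

```shell
#!/bin/sh
# Sketch: count tracked files and time an index refresh.
# Assumption: "full cache refresh" means "git update-index --refresh".
repo=$(mktemp -d)
git init -q "$repo"

# Populate the scratch repository with 200 small files.
i=0
while [ "$i" -lt 200 ]; do
    echo "$i" > "$repo/f$i"
    i=$((i + 1))
done
git -C "$repo" add .

# Number of entries in the index (tr strips padding some wc versions add).
n=$(git -C "$repo" ls-files | wc -l | tr -d ' ')

# date +%s only has one-second resolution.
start=$(date +%s)
git -C "$repo" update-index -q --refresh
end=$(date +%s)
echo "$n tracked files, refresh took about $((end - start))s"
```

Pointing `repo` at an existing large working tree, and skipping the setup
part, gives the hot-cache figure for that tree.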
>
> Maybe there should be a distinction of where a user might see
> performance problems e.g. initial clone, subsequent fetches, commit,
> checkout or diff.
>
>> +
>> +For repositories with really long history, you may want to work on
>> +a shallow clone of it (see linkgit:git-clone[1], option '--depth').
>> +A shallow repository does not contain full history, so it may consume
>> +less disk space and network bandwidth. On the other hand, you cannot
>> +fetch from it. And obviously you cannot look further back than what
>> +it has in history (you can deepen history though).
>
> You might want to mention git clone --reference and the
> .git/objects/info/alternates for those concerned with disk usage.
Thanks
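A minimal sketch of the --reference setup, for illustration only. The
repository names (origin, first, second) and the commit identity are made
up, and everything happens in a scratch directory.

```shell
#!/bin/sh
# Sketch: a second clone borrows objects from an existing local clone.
# All names and the commit identity below are placeholders.
work=$(mktemp -d)

# A tiny stand-in for the upstream repository.
git init -q "$work/origin"
echo hello > "$work/origin/file.txt"
git -C "$work/origin" add file.txt
git -C "$work/origin" -c user.name=Example -c user.email=example@example.com \
    commit -q -m initial

# An ordinary clone, then a second clone that uses the first as reference.
git clone -q "$work/origin" "$work/first"
git clone -q --reference "$work/first" "$work/origin" "$work/second"

# The borrowing repository records the reference's object directory here.
cat "$work/second/.git/objects/info/alternates"
```

The borrowing clone depends on the reference repository staying intact: if
objects are deleted or pruned there, the second clone can break.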
>
>> +
>> +For repositories with a large number of files, of which you only need
>> +a few present in the working tree, you can use sparse checkout
>> +(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
>> +checkout can be used with either a normal repository or a shallow
>> +one.
>> +
>> +Git uses lstat(3) to detect changes in working tree. A huge number
>> +of lstat(3) calls may impact performance, especially on systems with
>> +slow lstat(3). In some cases you can reduce the number of lstat(3)
>> +calls by specifying what directories you are interested in, so no
>> +lstat(3) outside is needed.
>> +
>> +For repositories with a large number of files, where you need all of
>> +them present in the working tree but know in advance that only a few
>> +may be modified, consider using the assume-unchanged bit (see
>> +linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
>> +calls.
>> +
>> +Some Git commands need entire file content in memory to process.
>> +You may want to avoid using them if possible on large files. Those
>> +commands include:
>> +
>> +* All checkout commands (linkgit:git-checkout[1],
>> + linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
>> + linkgit:git-clone[1]...)
>> +* All diff-related commands (linkgit:git-diff[1],
>> + linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
>> +* All commands that need file conversion processing
>> +
>
> This addresses one of my comments above. It might be worth talking about
> using git bundles as an alternative to cloning over unreliable connections.
Thanks.
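A sketch of the bundle route, assuming the bundle file is moved by whatever
transport copes with the unreliable link (resumable HTTP, rsync, a USB
stick). The paths and the commit identity are placeholders.

```shell
#!/bin/sh
# Sketch: pack a repository into one file, then clone from that file.
# Paths and the commit identity are placeholders.
bdir=$(mktemp -d)

# A tiny stand-in for the repository to be transferred.
git init -q "$bdir/origin"
echo hello > "$bdir/origin/file.txt"
git -C "$bdir/origin" add file.txt
git -C "$bdir/origin" -c user.name=Example -c user.email=example@example.com \
    commit -q -m initial

# Sender side: write all refs and their history into a single file.
git -C "$bdir/origin" bundle create "$bdir/repo.bundle" --all

# Receiver side: check the file, then clone from it like from any remote.
git bundle verify "$bdir/repo.bundle"
git clone -q "$bdir/repo.bundle" "$bdir/copy"
ls "$bdir/copy"
```

After the clone, the remote URL in the copy can be pointed at the real
server so that later fetches only transfer the new objects.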
--
Duy