git.vger.kernel.org archive mirror
* [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
@ 2010-10-05 13:00 Nguyễn Thái Ngọc Duy
  2010-10-05 16:12 ` Alex Riesen
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-10-05 13:00 UTC (permalink / raw)
  To: git
  Cc: weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 I wanted to make a more detailed description, per command. It would
 serve as guidance for people on special repos, also as TODOs for Git
 developers. But that seems a lot of work on analyzing each command.

 Instead I made this text to warn users where performance may decrease,
 and to hint at features that might help. Did I miss anything?

 There were discussions in the past on maintaining large files out-of-repo,
 with symlinks to them in-repo. That sounds like good advice, doesn't it?

 Documentation/git.txt |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/Documentation/git.txt b/Documentation/git.txt
index dd57bdc..8408923 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
 for a given pathname.  These stages are used to hold the various
 unmerged version of a file when a merge is in progress.
 
+Performance concerns
+--------------------
+
+Git is written with performance in mind and it works extremely well
+with its typical repositories (i.e. source code repositories, with
+a moderate number of small text files, possibly with long history).
+Non-typical repositories (huge number of files, or very large
+files...) may experience performance degradation. This section describes
+how Git behaves in such repositories and how to reduce impact.
+
+For repositories with really long history, you may want to work on
+a shallow clone of it (see linkgit:git-clone[1], option '--depth').
+A shallow repository does not contain full history, so it may consume
+less disk space and network bandwidth. On the other hand, you cannot
+fetch from it. And obviously you cannot look further back than what
+it has in history (you can deepen history though).
+
+For repositories with a large number of files, but you only need
+a few of them present in working tree, you can use sparse checkout
+(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
+checkout can be used with either a normal repository, or a shallow
+one.
+
+Git uses lstat(3) to detect changes in working tree. A huge number
+of lstat(3) calls may impact performance, especially on systems with
+slow lstat(3). In some cases you can reduce the number of lstat(3)
+calls by specifying what directories you are interested in, so no
+lstat(3) outside is needed.
+
+For repositories with a large number of files, you need all of them
+present in working tree, but you know in advance only a few of them
+may be modified, please consider using assume-unchanged bit (see
+linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
+calls.
+
+Some Git commands need entire file content in memory to process.
+You may want to avoid using them if possible on large files. Those
+commands include:
+
+* All checkout commands (linkgit:git-checkout[1],
+  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
+  linkgit:git-clone[1]...)
+* All diff-related commands (linkgit:git-diff[1],
+  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
+* All commands that need file conversion processing
+
 Authors
 -------
 * git's founding father is Linus Torvalds <torvalds@osdl.org>.
-- 
1.7.0.2.445.gcbdb3


* Re: [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-05 13:00 [RFC PATCH] git.txt: document limitations on non-typical repos (and hints) Nguyễn Thái Ngọc Duy
@ 2010-10-05 16:12 ` Alex Riesen
  2010-10-05 16:18 ` Chris Packham
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Alex Riesen @ 2010-10-05 16:12 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, weigelt, spearce, jrnieder, Matthieu.Moy, Junio C Hamano

On Tue, Oct 5, 2010 at 15:00, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> +Git uses lstat(3) to detect changes in working tree. A huge number

"lstat" is a syscall, manpage section 2.

  http://linux.die.net/man/2/lstat


* Re: [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-05 13:00 [RFC PATCH] git.txt: document limitations on non-typical repos (and hints) Nguyễn Thái Ngọc Duy
  2010-10-05 16:12 ` Alex Riesen
@ 2010-10-05 16:18 ` Chris Packham
  2010-10-05 23:52   ` Nguyen Thai Ngoc Duy
  2010-10-06 14:21 ` Nguyễn Thái Ngọc Duy
  2010-10-06 14:23 ` [PATCH] " pclouds
  3 siblings, 1 reply; 8+ messages in thread
From: Chris Packham @ 2010-10-05 16:18 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	Junio C Hamano

On 05/10/10 06:00, Nguyễn Thái Ngọc Duy wrote:
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  I wanted to make a more detailed description, per command. It would
>  serve as guidance for people on special repos, also as TODOs for Git
>  developers. But that seems a lot of work on analyzing each command.
> 
>  Instead I made this text to warn users where performance may decrease,
>  and to hint at features that might help. Did I miss anything?
> 
>  There were discussions in the past on maintaining large files out-of-repo,
>  with symlinks to them in-repo. That sounds like good advice, doesn't it?
> 
>  Documentation/git.txt |   46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index dd57bdc..8408923 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
>  for a given pathname.  These stages are used to hold the various
>  unmerged version of a file when a merge is in progress.
>  
> +Performance concerns
> +--------------------
> +
> +Git is written with performance in mind and it works extremely well
> +with its typical repositories (i.e. source code repositories, with
> +a moderate number of small text files, possibly with long history).
> +Non-typical repositories (huge number of files, or very large
> +files...) may experience performance degradation. This section describes
> +how Git behaves in such repositories and how to reduce impact.

How huge is "huge" and how large is "large"? From previous threads on
this list I'm guessing "large" means files bigger than physical RAM. I've
not really run into a situation where a huge number of files causes
performance problems.

Maybe there should be a distinction of where a user might see
performance problems e.g. initial clone, subsequent fetches, commit,
checkout or diff.

> +
> +For repositories with really long history, you may want to work on
> +a shallow clone of it (see linkgit:git-clone[1], option '--depth').
> +A shallow repository does not contain full history, so it may consume
> +less disk space and network bandwidth. On the other hand, you cannot
> +fetch from it. And obviously you cannot look further back than what
> +it has in history (you can deepen history though).

You might want to mention git clone --reference and the
.git/objects/info/alternates for those concerned with disk usage.
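
For illustration, a minimal sketch of what that looks like in practice.
The repository names and the temporary directory are made up for the
example, and the file:// URL is only there so the clone goes through the
normal transport instead of the local hardlink shortcut:

```shell
set -e
demo=$(mktemp -d)

# Hypothetical upstream repository with a single commit.
git init -q "$demo/upstream"
echo hello > "$demo/upstream/file.txt"
git -C "$demo/upstream" add file.txt
git -C "$demo/upstream" -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# A full clone that later clones can borrow objects from.
git clone -q "$demo/upstream" "$demo/first"

# Second clone: objects already available in "first" need not be
# stored again; git records the borrowed object store in
# objects/info/alternates.
git clone -q --reference "$demo/first" "file://$demo/upstream" "$demo/second"

# Show which object store the second clone borrows from.
cat "$demo/second/.git/objects/info/alternates"
```

The usual caveat applies: the second clone depends on the referenced
repository staying around with its objects intact.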

> +
> +For repositories with a large number of files, but you only need
> +a few of them present in working tree, you can use sparse checkout
> +(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
> +checkout can be used with either a normal repository, or a shallow
> +one.
> +
> +Git uses lstat(3) to detect changes in working tree. A huge number
> +of lstat(3) calls may impact performance, especially on systems with
> +slow lstat(3). In some cases you can reduce the number of lstat(3)
> +calls by specifying what directories you are interested in, so no
> +lstat(3) outside is needed.
> +
> +For repositories with a large number of files, you need all of them
> +present in working tree, but you know in advance only a few of them
> +may be modified, please consider using assume-unchanged bit (see
> +linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
> +calls.
> +
> +Some Git commands need entire file content in memory to process.
> +You may want to avoid using them if possible on large files. Those
> +commands include:
> +
> +* All checkout commands (linkgit:git-checkout[1],
> +  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
> +  linkgit:git-clone[1]...)
> +* All diff-related commands (linkgit:git-diff[1],
> +  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
> +* All commands that need file conversion processing
> +

This addresses one of my comments above. It might be worth talking about
using git bundles as an alternative to cloning over unreliable connections.
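
A rough sketch of the bundle route, assuming a throwaway repository
created just for the demonstration (all paths and names below are
hypothetical):

```shell
set -e
demo=$(mktemp -d)

# Hypothetical source repository with a single commit.
git init -q "$demo/src"
echo data > "$demo/src/file.txt"
git -C "$demo/src" add file.txt
git -C "$demo/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# Pack all refs and their history into one ordinary file. That file can
# then be moved by whatever resumable means is at hand.
git -C "$demo/src" bundle create "$demo/repo.bundle" --all

# Check the bundle, then clone from it like from any other repository.
git -C "$demo/src" bundle verify "$demo/repo.bundle"
git clone -q "$demo/repo.bundle" "$demo/from-bundle"
```

After the clone, the remote URL can be pointed at the real server so
that later fetches only transfer what is new.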

>  Authors
>  -------
>  * git's founding father is Linus Torvalds <torvalds@osdl.org>.


* Re: [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-05 16:18 ` Chris Packham
@ 2010-10-05 23:52   ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 8+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-10-05 23:52 UTC (permalink / raw)
  To: Chris Packham
  Cc: git, weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	Junio C Hamano

2010/10/5 Chris Packham <judge.packham@gmail.com>:
> On 05/10/10 06:00, Nguyễn Thái Ngọc Duy wrote:
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>>  I wanted to make a more detailed description, per command. It would
>>  serve as guidance for people on special repos, also as TODOs for Git
>>  developers. But that seems a lot of work on analyzing each command.
>>
>>  Instead I made this text to warn users where performance may decrease,
>>  and to hint at features that might help. Did I miss anything?
>>
>>  There were discussions in the past on maintaining large files out-of-repo,
>>  with symlinks to them in-repo. That sounds like good advice, doesn't it?
>>
>>  Documentation/git.txt |   46 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/git.txt b/Documentation/git.txt
>> index dd57bdc..8408923 100644
>> --- a/Documentation/git.txt
>> +++ b/Documentation/git.txt
>> @@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
>>  for a given pathname.  These stages are used to hold the various
>>  unmerged version of a file when a merge is in progress.
>>
>> +Performance concerns
>> +--------------------
>> +
>> +Git is written with performance in mind and it works extremely well
>> +with its typical repositories (i.e. source code repositories, with
>> +a moderate number of small text files, possibly with long history).
>> +Non-typical repositories (huge number of files, or very large
>> +files...) may experience performance degradation. This section describes

Probably should have written "experience mild performance degradation"

>> +how Git behaves in such repositories and how to reduce impact.
>
> How huge is "huge" and how large is "large"? From previous threads on
> this list I'm guessing "large" is files bigger than physical RAM. I've

A significant portion of RAM is enough to start swapping. There's also
a hard limit imposed by mmap(): a file cannot be larger than available
address space (2-3G on x86, probably larger on x86_64).

> not really run into a situation where a huge number of files causes
> performance problems.

gentoo-x86 has ~100k files. Cold cache time is definitely long. Even
with hot cache, a full cache refresh may take, I don't remember, half
a second or so. It depends on many factors. I don't think I can draw a
clear limit.

>
> Maybe there should be a distinction of where a user might see
> performance problems e.g. initial clone, subsequent fetches, commit,
> checkout or diff.
>
>> +
>> +For repositories with really long history, you may want to work on
>> +a shallow clone of it (see linkgit:git-clone[1], option '--depth').
>> +A shallow repository does not contain full history, so it may consume
>> +less disk space and network bandwidth. On the other hand, you cannot
>> +fetch from it. And obviously you cannot look further back than what
>> +it has in history (you can deepen history though).
>
> You might want to mention git clone --reference and the
> .git/objects/info/alternates for those concerned with disk usage.

Thanks

>
>> +
>> +For repositories with a large number of files, but you only need
>> +a few of them present in working tree, you can use sparse checkout
>> +(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
>> +checkout can be used with either a normal repository, or a shallow
>> +one.
>> +
>> +Git uses lstat(3) to detect changes in working tree. A huge number
>> +of lstat(3) calls may impact performance, especially on systems with
>> +slow lstat(3). In some cases you can reduce the number of lstat(3)
>> +calls by specifying what directories you are interested in, so no
>> +lstat(3) outside is needed.
>> +
>> +For repositories with a large number of files, you need all of them
>> +present in working tree, but you know in advance only a few of them
>> +may be modified, please consider using assume-unchanged bit (see
>> +linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
>> +calls.
>> +
>> +Some Git commands need entire file content in memory to process.
>> +You may want to avoid using them if possible on large files. Those
>> +commands include:
>> +
>> +* All checkout commands (linkgit:git-checkout[1],
>> +  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
>> +  linkgit:git-clone[1]...)
>> +* All diff-related commands (linkgit:git-diff[1],
>> +  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
>> +* All commands that need file conversion processing
>> +
>
> This addresses one of my comments above. It might be worth talking about
> using git bundles as an alternative to cloning over unreliable connections.

Thanks.
-- 
Duy


* [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-05 13:00 [RFC PATCH] git.txt: document limitations on non-typical repos (and hints) Nguyễn Thái Ngọc Duy
  2010-10-05 16:12 ` Alex Riesen
  2010-10-05 16:18 ` Chris Packham
@ 2010-10-06 14:21 ` Nguyễn Thái Ngọc Duy
  2010-10-06 14:23 ` [PATCH] " pclouds
  3 siblings, 0 replies; 8+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-10-06 14:21 UTC (permalink / raw)
  To: git
  Cc: weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	Junio C Hamano, judge.packham,
	Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 I wanted to make a more detailed description, per command. It would
 serve as guidance for people on special repos, also as TODOs for Git
 developers. But that seems a lot of work on analyzing each command.

 Instead I made this text to warn users where performance may decrease,
 and to hint at features that might help. Did I miss anything?

 There were discussions in the past on maintaining large files out-of-repo,
 with symlinks to them in-repo. That sounds like good advice, doesn't it?

 Documentation/git.txt |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/Documentation/git.txt b/Documentation/git.txt
index dd57bdc..8408923 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
 for a given pathname.  These stages are used to hold the various
 unmerged version of a file when a merge is in progress.
 
+Performance concerns
+--------------------
+
+Git is written with performance in mind and it works extremely well
+with its typical repositories (i.e. source code repositories, with
+a moderate number of small text files, possibly with long history).
+Non-typical repositories (huge number of files, or very large
+files...) may experience performance degradation. This section describes
+how Git behaves in such repositories and how to reduce impact.
+
+For repositories with really long history, you may want to work on
+a shallow clone of it (see linkgit:git-clone[1], option '--depth').
+A shallow repository does not contain full history, so it may consume
+less disk space and network bandwidth. On the other hand, you cannot
+fetch from it. And obviously you cannot look further back than what
+it has in history (you can deepen history though).
+
+For repositories with a large number of files, but you only need
+a few of them present in working tree, you can use sparse checkout
+(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
+checkout can be used with either a normal repository, or a shallow
+one.
+
+Git uses lstat(3) to detect changes in working tree. A huge number
+of lstat(3) calls may impact performance, especially on systems with
+slow lstat(3). In some cases you can reduce the number of lstat(3)
+calls by specifying what directories you are interested in, so no
+lstat(3) outside is needed.
+
+For repositories with a large number of files, you need all of them
+present in working tree, but you know in advance only a few of them
+may be modified, please consider using assume-unchanged bit (see
+linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
+calls.
+
+Some Git commands need entire file content in memory to process.
+You may want to avoid using them if possible on large files. Those
+commands include:
+
+* All checkout commands (linkgit:git-checkout[1],
+  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
+  linkgit:git-clone[1]...)
+* All diff-related commands (linkgit:git-diff[1],
+  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
+* All commands that need file conversion processing
+
 Authors
 -------
 * git's founding father is Linus Torvalds <torvalds@osdl.org>.
-- 
1.7.0.2.445.gcbdb3


* [PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-05 13:00 [RFC PATCH] git.txt: document limitations on non-typical repos (and hints) Nguyễn Thái Ngọc Duy
                   ` (2 preceding siblings ...)
  2010-10-06 14:21 ` Nguyễn Thái Ngọc Duy
@ 2010-10-06 14:23 ` pclouds
  2010-10-06 16:32   ` Junio C Hamano
  3 siblings, 1 reply; 8+ messages in thread
From: pclouds @ 2010-10-06 14:23 UTC (permalink / raw)
  To: git
  Cc: weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	Junio C Hamano, judge.packham,
	Nguyễn Thái Ngọc Duy

From: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

---
 Revised version. I dropped shallow clone because it does not really
 relate to performance.

 Documentation/git.txt |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/Documentation/git.txt b/Documentation/git.txt
index dd57bdc..129947f 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -729,6 +729,47 @@ The index is also capable of storing multiple entries (called "stages")
 for a given pathname.  These stages are used to hold the various
 unmerged version of a file when a merge is in progress.
 
+Performance concerns
+--------------------
+
+Git is written with performance in mind and it works extremely well
+with its typical repositories (i.e. source code repositories, with
+a moderate number of small text files, possibly with long history).
+Non-typical repositories (a lot of files, or very large files...)
+may experience mild performance degradation. This section describes
+how Git behaves in such repositories and how to reduce impact.
+
+For repositories with a large number of files (~50k files or more),
+but you only need a few of them present in working tree, you can use
+sparse checkout (see linkgit:git-read-tree[1], section 'Sparse
+checkout'). If you need all of them present in working tree, but you
+know in advance only a few of them may be modified, please consider
+using assume-unchanged bit (see linkgit:git-update-index[1]). This
+helps reduce the number of lstat(2) calls.
+
+Git uses lstat(2) to detect changes in working tree, one call for each
+tracked file, in what is called "index refresh". A significant number of
+lstat(2) calls may create a small delay for many commands, especially
+on systems with slow lstat(2). In some cases you can reduce the number
+of lstat(2) calls by specifying what directories you are interested
+in, so no lstat(2) outside is needed. The following commands are
+however known to do full index refresh in some cases:
+linkgit:git-commit[1], linkgit:git-status[1], linkgit:git-diff[1],
+linkgit:git-reset[1], linkgit:git-checkout[1], linkgit:git-merge[1].
+
+Some commands need entire file content in memory to process.
+Files that have size a significant portion of physical RAM may
+affect performance. You may want to avoid using the following
+commands if possible on such large files:
+
+* All checkout commands (linkgit:git-checkout[1],
+  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
+  linkgit:git-clone[1]...)
+* All diff-related commands (linkgit:git-diff[1],
+  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
+* All commands that need file conversion processing (see
+  linkgit:gitattributes[5])
+
 Authors
 -------
 * git's founding father is Linus Torvalds <torvalds@osdl.org>.
-- 
1.7.0.2.445.gcbdb3


* Re: [PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-06 14:23 ` [PATCH] " pclouds
@ 2010-10-06 16:32   ` Junio C Hamano
  2010-10-07  2:25     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2010-10-06 16:32 UTC (permalink / raw)
  To: pclouds
  Cc: git, weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	judge.packham

pclouds@gmail.com writes:

> From: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>
> ---
>  Revised version. I dropped shallow clone because it does not really
>  relate to performance.
>
>  Documentation/git.txt |   41 +++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index dd57bdc..129947f 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -729,6 +729,47 @@ The index is also capable of storing multiple entries (called "stages")
>  for a given pathname.  These stages are used to hold the various
>  unmerged version of a file when a merge is in progress.
>  
> +Performance concerns
> +--------------------
> +
> +Git is written with performance in mind and it works extremely well
> +with its typical repositories (i.e. source code repositories, with
> +a moderate number of small text files, possibly with long history).
> +Non-typical repositories (a lot of files, or very large files...)
> +may experience mild performance degradation. This section describes
> +how Git behaves in such repositories and how to reduce impact.
> +

I have seen this "mild" suggested in the discussion, but do we want any
adjective here?  The runtime for, say, "git log" from the tip to the root
obviously would grow proportionally to the length of the history, i.e. the
number of records you would want to see, and it may not be "mild" if your
history is very deep.  Same for the runtime for "git diff" in a wide
project with many changed paths.

More importantly, what is "degradation"?  It is not a degradation if "git
log" took 100x as long for a project with 100k commits compared to a
similar project with 1k commits.

If you do not have enough core to hold the part of the ancestry graph that
is involved to compute "git log A..B" to show a gazillion commits, it will
eat into the swap, take a lot more time than it takes "git log B" to show
the same number of commits.  That _is_ degradation, and I suspect it won't
be mild at all.

> +For repositories with a large number of files (~50k files or more),

How did you come up with this 50k number?

> +but you only need a few of them present in working tree, you can use
> +sparse checkout (see linkgit:git-read-tree[1], section 'Sparse
> +checkout').

Is "sparse checkout" a real feature that has been made usable by mere
mortals, battle tested, and shown to be reliable?

It feels funny that we have to refer to the documentation of plumbing
read-tree when the key verb in this paragraph is "checkout".  With the
current documentation set, you can follow read-tree page that mentions
some magic called skip-worktree-bit, get tempted to jump to update-index
page and get lost in the implementation details of the feature, which is
irrelevant to the end user.  If you resist the temptation and keep
reading read-tree page, you see the description of info/sparse-checkout to
learn how to control the feature, but it does not come with an
easy-to-follow example.  A few concrete suggestions to "Sparse checkout"
section in read-tree:

    - Move the section to a separate file and include it in read-tree
      page, so that we can include it later in checkout page;

    - Drop the first paragraph;

    - Move the second and third paragraph, that still describe the
      machinery more than the usage, much later in the section;

    - Start the section with the description of info/sparse-checkout; the
      first sentence ("while ... is usually used") needs to be rewritten,
      because (1) it is not a complete sentence and grammatically
      incorrect, and (2) it reads as if you will say an alternative file
      can be used instead of info/sparse-checkout, which is not what you
      wanted to do; perhaps "$GIT_DIR/info/sparse-checkout is used to
      specify which paths are to be (and not to be) checked out. It lists
      glob patterns to match paths to be checked out. Prefix the pattern
      with a '!' to specify a pattern to match paths not to be checked
      out.  Note: a bug in the implementation requires you to end a
      pattern with a trailing slash to match a directory".

    - Show examples; not just samples of what the contents of that control
      file look like, but also a concrete command sequence (e.g. (1)
      run "git clone -n", (2) edit info/sparse-checkout to contain this,
      (3) run "git checkout", (4) here is how to widen/narrow the sparse
      checkout--first edit info/sparse-checkout to look like this and then
      run "git xxx" to match the updated definition, etc.).

    - Drop BUGS section from read-tree documentation (but see Note: above
      in my example); the bug mentioned there is not a bug of read-tree,
      but is a bug in the sparse-checkout feature.
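
As a rough sketch of the kind of command sequence meant in the "Show
examples" item above: the project layout and all paths here are
hypothetical, and using "git read-tree -mu HEAD" for step (3) is an
assumption made for this sketch rather than the final recommended
interface.

```shell
set -e
demo=$(mktemp -d)

# Hypothetical project with two top-level directories.
git init -q "$demo/origin"
mkdir "$demo/origin/docs" "$demo/origin/src"
echo text > "$demo/origin/docs/readme.txt"
echo code > "$demo/origin/src/main.c"
git -C "$demo/origin" add .
git -C "$demo/origin" -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# (1) Clone without populating the working tree.
git clone -q -n "$demo/origin" "$demo/work"

# (2) Turn the feature on and list the paths to check out
#     (note the trailing slash on the directory pattern).
git -C "$demo/work" config core.sparseCheckout true
mkdir -p "$demo/work/.git/info"
echo "docs/" > "$demo/work/.git/info/sparse-checkout"

# (3) Populate the working tree according to the patterns.
git -C "$demo/work" read-tree -mu HEAD

# (4) To widen the checkout later, add a pattern and re-apply:
#     echo "src/" >> "$demo/work/.git/info/sparse-checkout"
#     git -C "$demo/work" read-tree -mu HEAD
```

Only the paths matched by info/sparse-checkout end up in the working
tree; the rest stay in the index without being checked out.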

I think the suggestion to use Sparse checkout in git(1)---i.e. your patch
we are discussing here, is a bit premature before the above happens.

> +... If you need all of them present in working tree, but you
> +know in advance only a few of them may be modified, please consider
> +using assume-unchanged bit (see linkgit:git-update-index[1]).
> +... The following commands are
> +however known to do full index refresh in some cases:

It is "need to", not "are known to", isn't it?

> +Some commands need entire file content in memory to process.
> +Files that have size a significant portion of physical RAM may
> +affect performance. You may want to avoid using the following
> +commands if possible on such large files:

"If possible" is not a good excuse.  How would one _avoid_ checkout of a
file if one wants to use it?  You can't.  Similarly to "diff".  This
advice is pretty much useless, isn't it?  It's not much better than saying
"if your machine has too little RAM, things will get slow---deal with it".

> +* All checkout commands (linkgit:git-checkout[1],
> +  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
> +  linkgit:git-clone[1]...)
> +* All diff-related commands (linkgit:git-diff[1],
> +  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
> +* All commands that need file conversion processing (see
> +  linkgit:gitattributes[5])
> +
>  Authors
>  -------
>  * git's founding father is Linus Torvalds <torvalds@osdl.org>.
> -- 
> 1.7.0.2.445.gcbdb3


* Re: [PATCH] git.txt: document limitations on non-typical repos (and hints)
  2010-10-06 16:32   ` Junio C Hamano
@ 2010-10-07  2:25     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 8+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-10-07  2:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, weigelt, spearce, jrnieder, Matthieu.Moy, raa.lkml,
	judge.packham

On Wed, Oct 6, 2010 at 11:32 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> +Performance concerns
>> +--------------------
>> +
>> +Git is written with performance in mind and it works extremely well
>> +with its typical repositories (i.e. source code repositories, with
>> +a moderate number of small text files, possibly with long history).
>> +Non-typical repositories (a lot of files, or very large files...)
>> +may experience mild performance degradation. This section describes
>> +how Git behaves in such repositories and how to reduce impact.
>> +
>
> I have seen this "mild" suggested in the discussion, but do we want any
> adjective here?  The runtime for, say, "git log" from the tip to the root
> obviously would grow proportionally to the length of the history, i.e. the
> number of records you would want to see, and it may not be "mild" if your
> history is very deep.  Same for the runtime for "git diff" in a wide
> project with many changed paths.

I don't want to give an impression that the sky will fall when someone
puts a 200MB file in his repo.

> More importantly, what is "degradation"?  It is not a degradation if "git
> log" took 100x as long for a project with 100k commits compared to a
> similar project with 1k commits.

From my perspective, git commands that are instant in typical repos
should still be instant in non-typical ones. Yes "git add hugefile"
will take longer than "git add git.c", but it should not take, say, 1
hour for that command. It's hard to draw a clear line here.

> If you do not have enough core to hold the part of the ancestry graph that
> is involved to compute "git log A..B" to show a gazillion commits, it will
> eat into the swap, take a lot more time than it takes "git log B" to show
> the same number of commits.  That _is_ degradation, and I suspect it won't
> be mild at all.
>
>> +For repositories with a large number of files (~50k files or more),
>
> How did you come up with this 50k number?

Quite unscientific. I started with gentoo-x86 (~130k files), on which I
know git performs less than satisfactorily. I also looked at how big other
repos are (wine.git, linux-2.6.git...), then chose a number in the
middle.

>> +but you only need a few of them present in working tree, you can use
>> +sparse checkout (see linkgit:git-read-tree[1], section 'Sparse
>> +checkout').
>
> Is "sparse checkout" a real feature that has been made usable by mere
> mortals, battle tested, and shown to be reliable?

Hopefully. In the 2010 survey, 331 respondents answered that they use
"partial (sparse) checkout". I hope that they meant this feature, not
something else.

> It feels funny that we have to refer to the documentation of plumbing
> read-tree when the key verb in this paragraph is "checkout".  With the
> current documentation set, you can follow read-tree page that mentions
> some magic called skip-worktree-bit, get tempted to jump to update-index
> page and get lost in the implementation details of the feature, which is
> irrelevant to the end user.  If you resist the temptation and keep
> reading read-tree page, you see the description of info/sparse-checkout to
> learn how to control the feature, but it does not come with an
> easy-to-follow example.  A few concrete suggestions to "Sparse checkout"
> section in read-tree:
>
> ...
>

Hmm.. yeah. Will do something.

> I think the suggestion to use Sparse checkout in git(1)---i.e. your patch
> we are discussing here, is a bit premature before the above happens.
>
>> +... If you need all of them present in working tree, but you
>> +know in advance only a few of them may be modified, please consider
>> +using assume-unchanged bit (see linkgit:git-update-index[1]).
>> +... The following commands are
>> +however known to do full index refresh in some cases:
>
> It is "need to", not "are known to", isn't it?

In case of "git commit", as you said in another mail, index refresh is
needed because of post-commit hook. If there are no hooks, I think
index refresh can be skipped. But yes, probably "need to".
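
For reference, a small sketch of how the assume-unchanged bit behaves,
using a throwaway repository and a made-up file name. It also shows the
price: once the bit is set, a real modification goes unnoticed.

```shell
set -e
demo=$(mktemp -d)

# Hypothetical repository with one tracked file.
git init -q "$demo/repo"
echo one > "$demo/repo/big-but-static.txt"
git -C "$demo/repo" add big-but-static.txt
git -C "$demo/repo" -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# Promise git that this path will not change, so it can skip checking it.
git -C "$demo/repo" update-index --assume-unchanged big-but-static.txt

# Entries carrying the bit are listed with a lowercase tag by "ls-files -v".
git -C "$demo/repo" ls-files -v

# The flip side: a real modification is no longer reported.
echo two > "$demo/repo/big-but-static.txt"
git -C "$demo/repo" status --porcelain

# Clear the bit again when the file does need to be watched for changes:
#   git -C "$demo/repo" update-index --no-assume-unchanged big-but-static.txt
```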

>> +Some commands need entire file content in memory to process.
>> +Files that have size a significant portion of physical RAM may
>> +affect performance. You may want to avoid using the following
>> +commands if possible on such large files:
>
> "If possible" is not a good excuse.  How would one _avoid_ checkout of a
> file if one wants to use it?  You can't.  Similarly to "diff".  This
> advice is pretty much useless, isn't it?  It's not much better than saying
> "if your machine has too little RAM, things will get slow---deal with it".

That's more of a bug acknowledgement, or a list of to-be-improved TODOs.
I didn't want to say that out loud. Should I?
-- 
Duy
