Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
@ 2024-05-01  5:26 Dhruva Krishnamurthy
  2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Dhruva Krishnamurthy @ 2024-05-01  5:26 UTC (permalink / raw)
  To: git

Hello,
Cloning by specifying depth exhibits performance regression in
pack-objects (~20x). The repository I am cloning is on NFS (mounted
with NFSv3 & positive lookup cache enabled).

Ran under 'perf' command to capture profiling information to see if
something really stands out. There is a significant overhead in calls
to file open/open64, fstat64 & mmap/munmap in git 2.44 compared to git
2.42. Not sure if there is an increase in the number of calls or
something more is done.

Could someone please guide me on how to troubleshoot this better?

--- Details of the test environment and the clone commands with output ---
# There are 10 loose objects under objects/08
$ git count-objects -vH
count: 3627
size: 25.84 MiB
in-pack: 1108374
packs: 2
size-pack: 303.38 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

# Simple driver script to enable performance tracking for upload-pack only
$ cat trace-git-upload-pack
#!/usr/bin/env bash
export GIT_TRACE_PERFORMANCE=true
exec git-upload-pack "$@"

# git clone with 2.42: pack objects take 17s
$ /opt/git/bin/git clone --no-checkout --no-local --depth=500
--upload-pack=$(pwd)/trace-git-upload-pack .. prod
Cloning into 'prod'...
remote: Enumerating objects: 669941, done.
remote: Counting objects: 100% (669941/669941), done.
remote: Compressing objects: 100% (154988/154988), done.
Receiving objects: 100% (669941/669941), 144.54 MiB | 25.78 MiB/s, done.
remote: Total 669941 (delta 533745), reused 645666 (delta 512193), pack-reused 0
remote: 05:35:40.654828 trace.c:414             performance:
17.098198597 s: git command: git --shallow-file '' pack-objects --revs
--thin --stdout --shallow --progress --delta-base-offset --include-tag
        05:35:40.708162 trace.c:414             performance:
24.764812264 s: git command: git-upload-pack /large_repo/perf/..
Resolving deltas: 100% (533745/533745), done.
Checking connectivity: 669940, done.

# git clone with 2.44: pack objects take 325s
$ /opt/gitn/bin/git clone --no-checkout --no-local --depth=500
--upload-pack=$(pwd)/trace-git-upload-pack .. dev
Cloning into 'dev'...
remote: Enumerating objects: 669941, done.
remote: Counting objects: 100% (669941/669941), done.
remote: Compressing objects: 100% (154988/154988), done.
Receiving objects: 100% (669941/669941), 144.66 MiB | 29.08 MiB/s, done.
remote: Total 669941 (delta 533742), reused 645666 (delta 512193),
pack-reused 0 (from 0)
remote: 05:42:01.017156 trace.c:414             performance:
325.552424902 s: git command: git --shallow-file '' pack-objects
--revs --thin --stdout --shallow --progress --delta-base-offset
--include-tag
        05:42:01.063013 trace.c:414             performance:
330.965731114 s: git command: git-upload-pack /large_repo/perf/..
Resolving deltas: 100% (533742/533742), done.
Checking connectivity: 669940, done.

Best regards,
Dhruva

^ permalink raw reply	[flat|nested] 20+ messages in thread

* using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-01  5:26 Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42 Dhruva Krishnamurthy
@ 2024-05-01 22:00 ` Jeff King
  2024-05-01 22:37   ` rsbecker
                     ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Jeff King @ 2024-05-01 22:00 UTC (permalink / raw)
  To: Dhruva Krishnamurthy; +Cc: John Cai, Karthik Nayak, git

On Tue, Apr 30, 2024 at 10:26:32PM -0700, Dhruva Krishnamurthy wrote:

> Cloning by specifying depth exhibits performance regression in
> pack-objects (~20x). The repository I am cloning is on NFS (mounted
> with NFSv3 & positive lookup cache enabled).
> 
> Ran under 'perf' command to capture profiling information to see if
> something really stands out. There is a significant overhead in calls
> to file open/open64, fstat64 & mmap/munmap in git 2.44 compared to git
> 2.42. Not sure if there is an increase in the number of calls or
> something more is done.
> 
> Could someone please guide me on how to troubleshoot this better?

Oof. I was able to reproduce this regression, and the impact can be
pretty severe. You can reproduce it without NFS and without shallow
clones. Get a bare clone of something big like the kernel:

  git clone --bare https://github.com/torvalds/linux
  cd linux.git

and then do something similar to the server side of a clone:

  time git pack-objects --all --stdout </dev/null >/dev/null

Most of the time will be spent in the "Enumerating objects" phase,
walking the graph (since the repo is fully packed, delta searching and
writing are fairly quick).

With v2.42.0, I get:

  real	1m12.409s
  user	1m13.021s
  sys	0m6.091s

With v2.44.0, I get:

  real	4m12.075s
  user	4m12.763s
  sys	0m5.546s

Bisecting show the culprit is 2386535511 (attr: read attributes from
HEAD when bare repo, 2023-10-13), which is in v2.43.0. Before that, a
bare repository would only look for attributes in the info/attributes
file. But after, we look at the HEAD tree-ish, too. And pack-objects
will check the "delta" attribute for every path of every object we are
packing. And remember that in-tree lookups for foo/bar/baz require
looking not just for .gitattributes, but also foo/.gitattributes,
foo/bar/.gitattributes, and foo/bar/baz/.gitattributes.

In a non-bare repo, this isn't _too_ bad. We'll try an open() for each
path, but the common case is that we don't have such a file, so the
total cost is the one syscall per object. But when we're pulling
attributes from HEAD, each lookup requires walking the whole chain of
trees (so for the final one, tree foo points to tree bar points to tree
baz, where we see there is no .gitattributes entry). And so all of that
extra time goes to reading in trees over and over.

You can repeat the same test with git.git. It gets slower, but it's
barely perceptible. This is because we have a relatively shallow tree,
so most lookups are hitting root-level .gitattributes or maybe one or
two levels deep. The kernel has a much deeper tree, so those lookups are
more expensive (and of course there are a lot more of them).

Getting back to shallow clones and NFS:

  - the effect is more pronounced on NFS because the cost to access
    objects is higher. With git.git I was even able to see some
    slowdown, probably just from the cost/latency of opening up all of
    those objects.

  - the problem doesn't show up if the repo has reachability bitmaps.
    This is because the bitmap result doesn't have the pathnames of each
    object (we do have the "name hash", but it's not enough for us to do
    an attr lookup), and so objects we get from a bitmap do not
    respect the delta attribute at all.

    But when doing a shallow clone, we have to disable bitmaps and do a
    regular traversal. So even if you have bitmaps, you still run into
    the problem.

    The example above should not have bitmaps (we do build them by
    default when repacking bare repos these days, but I don't think
    we'll do so right after cloning). If you have a local repo that
    already has bitmaps, you should be able to see the difference by
    using "git -c pack.usebitmaps=false pack-objects".

    So even if you are a server which generally enables bitmaps, you can
    still get bit by this for shallow clones, but also for other
    non-bitmap invocations, like say "git repack -ad". There I see the
    same 3-minute slowdown in the enumeration phase.

So what to do? It seems like some kind of caching would help here. We're
looking up the same paths over and over, for two reasons:

  1. We'll have many objects with the same paths, one for each time the
     path was modified through history.

  2. Adjacent objects share the higher-level lookups. Both "dir/a" and
     "dir/b" will need to look up "dir/.gitattributes" (and all the way
     up to ".gitattributes").

So even something simple and stupid like this:

diff --git a/attr.c b/attr.c
index 679e42258c..b32af9a78b 100644
--- a/attr.c
+++ b/attr.c
@@ -24,6 +24,7 @@
 #include "thread-utils.h"
 #include "tree-walk.h"
 #include "object-name.h"
+#include "strmap.h"
 
 const char *git_attr_tree;
 
@@ -795,25 +796,31 @@ static struct attr_stack *read_attr_from_blob(struct index_state *istate,
 					      const struct object_id *tree_oid,
 					      const char *path, unsigned flags)
 {
-	struct object_id oid;
-	unsigned long sz;
-	enum object_type type;
+	static struct strmap cache = STRMAP_INIT;
+	void *CACHE_MISSING = (void *)1;
 	void *buf;
-	unsigned short mode;
 
 	if (!tree_oid)
 		return NULL;
 
-	if (get_tree_entry(istate->repo, tree_oid, path, &oid, &mode))
-		return NULL;
-
-	buf = repo_read_object_file(istate->repo, &oid, &type, &sz);
-	if (!buf || type != OBJ_BLOB) {
-		free(buf);
-		return NULL;
+	buf = strmap_get(&cache, path);
+	if (!buf) {
+		struct object_id oid;
+		unsigned short mode;
+		if (!get_tree_entry(istate->repo, tree_oid, path, &oid, &mode)) {
+			unsigned long sz;
+			enum object_type type;
+			buf = repo_read_object_file(istate->repo, &oid, &type, &sz);
+			if (!buf || type != OBJ_BLOB) {
+				free(buf);
+				buf = NULL;
+			}
+		}
+		strmap_put(&cache, path, buf ? buf : CACHE_MISSING);
 	}
 
-	return read_attr_from_buf(buf, path, flags);
+	return buf == CACHE_MISSING ? NULL :
+		read_attr_from_buf(buf, path, flags);
 }
 
 static struct attr_stack *read_attr_from_index(struct index_state *istate,

restores the v2.42 performance. But there are probably better options:

  - this is caching whole .gitattributes buffers. In pack-objects we
    only care about a single bit for try_delta. For linux.git it doesn't
    really matter, as 99% of our entries are just CACHE_MISSING and the
    real value is avoiding the negative lookups. But the same problem
    exists to a lesser degree for "git log -p" in a bare repo. So I
    think it makes sense to try to solve it in the attr layer.

  - the string keys have a lot of duplication. You'll have
    "foo/.gitattributes", "foo/bar/.gitattributes", and so on. A trie
    structure split by path component would let you store each component
    just once. And perhaps have even faster lookups. I think this is
    roughly the same issue faced by the kernel VFS for doing path
    lookups, so something dentry/dcache-like would help.

    I don't know how much it matters in practice, though. The sum of all
    of the paths in HEAD for linux.git is ~3.5MB, which is a rounding
    error on the needs of the rest of the packing process.

  - Something dcache-like could also be pushed down lower, to the
    get_tree_entry() API (imagine it quietly caching uncompressed trees
    behind the scenes and then using them to traverse). That depends on
    high locality of requests, though. I don't know if we have that
    here, because we do our lookups in traversal order. So you'd look at
    entries in HEAD^{tree}, then HEAD~1^{tree}, then HEAD~2^{tree}, and
    so on.

  - Speaking of locality, the attr code tries to make use of request
    locality in its stack (so if we ask for attributes for "foo/bar",
    then "foo/baz", we should be able to keep the parsed data for
    "foo/.gitattributes" available between them). But our pattern here
    violates that. Again, not that big an issue if you don't have that
    many .gitattributes files in the first place, so it might not be
    worth worrying about. But if we did want to rearrange the lookups to
    exploit locality, it might change the overall caching strategy.

  - As noted above, most entries are just CACHE_MISSING. So rather than
    lazily looking up and caching entries, we could just prepopulate the
    cache. And then you know that if an entry isn't present in the
    cache, it does not exist in the tree. The downside is that you pay
    to walk the all of HEAD^{tree}, even if you only have a few lookups
    to do. That's a good tradeoff for pack-objects (which usually ends
    up looking for every path anyway), but not for "git diff" (where you
    only care about a few changed paths.

  - the cache here is static-local in the function. It should probably
    at least be predicated on the tree_oid, and maybe attached to the
    repository object? I think having one per repository at a time would
    be fine (generally the tree_oid is set once per process, so it's not
    like you're switching between multiple options).

I've cc'd John as the author of 2386535511. But really, that was just
enabling by default the attr-tree code added by 47cfc9bd7d (attr: add
flag `--source` to work with tree-ish, 2023-01-14). Although in that
original context (git check-attr) the lack of caching would be much less
important.

-Peff

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* RE: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
@ 2024-05-01 22:37   ` rsbecker
  2024-05-01 22:40   ` Junio C Hamano
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: rsbecker @ 2024-05-01 22:37 UTC (permalink / raw)
  To: 'Jeff King', 'Dhruva Krishnamurthy'
  Cc: 'John Cai', 'Karthik Nayak', git

On Wednesday, May 1, 2024 6:01 PM, Jeff King wrote:
>On Tue, Apr 30, 2024 at 10:26:32PM -0700, Dhruva Krishnamurthy wrote:
>
>> Cloning by specifying depth exhibits performance regression in
>> pack-objects (~20x). The repository I am cloning is on NFS (mounted
>> with NFSv3 & positive lookup cache enabled).
>>
>> Ran under 'perf' command to capture profiling information to see if
>> something really stands out. There is a significant overhead in calls
>> to file open/open64, fstat64 & mmap/munmap in git 2.44 compared to git
>> 2.42. Not sure if there is an increase in the number of calls or
>> something more is done.
>>
>> Could someone please guide me on how to troubleshoot this better?
>
>Oof. I was able to reproduce this regression, and the impact can be pretty severe.
>You can reproduce it without NFS and without shallow clones. Get a bare clone of
>something big like the kernel:
>
>  git clone --bare https://github.com/torvalds/linux
>  cd linux.git
>
>and then do something similar to the server side of a clone:
>
>  time git pack-objects --all --stdout </dev/null >/dev/null
>
>Most of the time will be spent in the "Enumerating objects" phase, walking the
>graph (since the repo is fully packed, delta searching and writing are fairly quick).
>
>With v2.42.0, I get:
>
>  real	1m12.409s
>  user	1m13.021s
>  sys	0m6.091s
>
>With v2.44.0, I get:
>
>  real	4m12.075s
>  user	4m12.763s
>  sys	0m5.546s
>
>Bisecting show the culprit is 2386535511 (attr: read attributes from HEAD when
>bare repo, 2023-10-13), which is in v2.43.0. Before that, a bare repository would
>only look for attributes in the info/attributes file. But after, we look at the HEAD
>tree-ish, too. And pack-objects will check the "delta" attribute for every path of
>every object we are packing. And remember that in-tree lookups for foo/bar/baz
>require looking not just for .gitattributes, but also foo/.gitattributes,
>foo/bar/.gitattributes, and foo/bar/baz/.gitattributes.
>
>In a non-bare repo, this isn't _too_ bad. We'll try an open() for each path, but the
>common case is that we don't have such a file, so the total cost is the one syscall per
>object. But when we're pulling attributes from HEAD, each lookup requires walking
>the whole chain of trees (so for the final one, tree foo points to tree bar points to
>tree baz, where we see there is no .gitattributes entry). And so all of that extra time
>goes to reading in trees over and over.
>
>You can repeat the same test with git.git. It gets slower, but it's barely perceptible.
>This is because we have a relatively shallow tree, so most lookups are hitting root-
>level .gitattributes or maybe one or two levels deep. The kernel has a much deeper
>tree, so those lookups are more expensive (and of course there are a lot more of
>them).
>
>Getting back to shallow clones and NFS:
>
>  - the effect is more pronounced on NFS because the cost to access
>    objects is higher. With git.git I was even able to see some
>    slowdown, probably just from the cost/latency of opening up all of
>    those objects.
>
>  - the problem doesn't show up if the repo has reachability bitmaps.
>    This is because the bitmap result doesn't have the pathnames of each
>    object (we do have the "name hash", but it's not enough for us to do
>    an attr lookup), and so objects we get from a bitmap do not
>    respect the delta attribute at all.
>
>    But when doing a shallow clone, we have to disable bitmaps and do a
>    regular traversal. So even if you have bitmaps, you still run into
>    the problem.
>
>    The example above should not have bitmaps (we do build them by
>    default when repacking bare repos these days, but I don't think
>    we'll do so right after cloning). If you have a local repo that
>    already has bitmaps, you should be able to see the difference by
>    using "git -c pack.usebitmaps=false pack-objects".
>
>    So even if you are a server which generally enables bitmaps, you can
>    still get bit by this for shallow clones, but also for other
>    non-bitmap invocations, like say "git repack -ad". There I see the
>    same 3-minute slowdown in the enumeration phase.
>
>So what to do? It seems like some kind of caching would help here. We're looking
>up the same paths over and over, for two reasons:
>
>  1. We'll have many objects with the same paths, one for each time the
>     path was modified through history.
>
>  2. Adjacent objects share the higher-level lookups. Both "dir/a" and
>     "dir/b" will need to look up "dir/.gitattributes" (and all the way
>     up to ".gitattributes").
>
>So even something simple and stupid like this:
>
>diff --git a/attr.c b/attr.c
>index 679e42258c..b32af9a78b 100644
>--- a/attr.c
>+++ b/attr.c
>@@ -24,6 +24,7 @@
> #include "thread-utils.h"
> #include "tree-walk.h"
> #include "object-name.h"
>+#include "strmap.h"
>
> const char *git_attr_tree;
>
>@@ -795,25 +796,31 @@ static struct attr_stack *read_attr_from_blob(struct
>index_state *istate,
> 					      const struct object_id *tree_oid,
> 					      const char *path, unsigned flags)  {
>-	struct object_id oid;
>-	unsigned long sz;
>-	enum object_type type;
>+	static struct strmap cache = STRMAP_INIT;
>+	void *CACHE_MISSING = (void *)1;
> 	void *buf;
>-	unsigned short mode;
>
> 	if (!tree_oid)
> 		return NULL;
>
>-	if (get_tree_entry(istate->repo, tree_oid, path, &oid, &mode))
>-		return NULL;
>-
>-	buf = repo_read_object_file(istate->repo, &oid, &type, &sz);
>-	if (!buf || type != OBJ_BLOB) {
>-		free(buf);
>-		return NULL;
>+	buf = strmap_get(&cache, path);
>+	if (!buf) {
>+		struct object_id oid;
>+		unsigned short mode;
>+		if (!get_tree_entry(istate->repo, tree_oid, path, &oid, &mode)) {
>+			unsigned long sz;
>+			enum object_type type;
>+			buf = repo_read_object_file(istate->repo, &oid, &type,
>&sz);
>+			if (!buf || type != OBJ_BLOB) {
>+				free(buf);
>+				buf = NULL;
>+			}
>+		}
>+		strmap_put(&cache, path, buf ? buf : CACHE_MISSING);
> 	}
>
>-	return read_attr_from_buf(buf, path, flags);
>+	return buf == CACHE_MISSING ? NULL :
>+		read_attr_from_buf(buf, path, flags);
> }
>
> static struct attr_stack *read_attr_from_index(struct index_state *istate,
>
>restores the v2.42 performance. But there are probably better options:
>
>  - this is caching whole .gitattributes buffers. In pack-objects we
>    only care about a single bit for try_delta. For linux.git it doesn't
>    really matter, as 99% of our entries are just CACHE_MISSING and the
>    real value is avoiding the negative lookups. But the same problem
>    exists to a lesser degree for "git log -p" in a bare repo. So I
>    think it makes sense to try to solve it in the attr layer.
>
>  - the string keys have a lot of duplication. You'll have
>    "foo/.gitattributes", "foo/bar/.gitattributes", and so on. A trie
>    structure split by path component would let you store each component
>    just once. And perhaps have even faster lookups. I think this is
>    roughly the same issue faced by the kernel VFS for doing path
>    lookups, so something dentry/dcache-like would help.
>
>    I don't know how much it matters in practice, though. The sum of all
>    of the paths in HEAD for linux.git is ~3.5MB, which is a rounding
>    error on the needs of the rest of the packing process.
>
>  - Something dcache-like could also be pushed down lower, to the
>    get_tree_entry() API (imagine it quietly caching uncompressed trees
>    behind the scenes and then using them to traverse). That depends on
>    high locality of requests, though. I don't know if we have that
>    here, because we do our lookups in traversal order. So you'd look at
>    entries in HEAD^{tree}, then HEAD~1^{tree}, then HEAD~2^{tree}, and
>    so on.
>
>  - Speaking of locality, the attr code tries to make use of request
>    locality in its stack (so if we ask for attributes for "foo/bar",
>    then "foo/baz", we should be able to keep the parsed data for
>    "foo/.gitattributes" available between them). But our pattern here
>    violates that. Again, not that big an issue if you don't have that
>    many .gitattributes files in the first place, so it might not be
>    worth worrying about. But if we did want to rearrange the lookups to
>    exploit locality, it might change the overall caching strategy.
>
>  - As noted above, most entries are just CACHE_MISSING. So rather than
>    lazily looking up and caching entries, we could just prepopulate the
>    cache. And then you know that if an entry isn't present in the
>    cache, it does not exist in the tree. The downside is that you pay
>    to walk the all of HEAD^{tree}, even if you only have a few lookups
>    to do. That's a good tradeoff for pack-objects (which usually ends
>    up looking for every path anyway), but not for "git diff" (where you
>    only care about a few changed paths.
>
>  - the cache here is static-local in the function. It should probably
>    at least be predicated on the tree_oid, and maybe attached to the
>    repository object? I think having one per repository at a time would
>    be fine (generally the tree_oid is set once per process, so it's not
>    like you're switching between multiple options).
>
>I've cc'd John as the author of 2386535511. But really, that was just enabling by
>default the attr-tree code added by 47cfc9bd7d (attr: add flag `--source` to work
>with tree-ish, 2023-01-14). Although in that original context (git check-attr) the
>lack of caching would be much less important.

Although this is an unbaked thought... 

I am not sure this will help, but I have been considering this since the trie data structure was introduced. Should we consider moving to a sparse trie with a lazy load approach? There is no current indication in the trie structure in path.c that a node is incomplete (not as far as I can tell anyway), so the assumption is the trie must be fully populated. Loading tries are generally the same performance as searching - Order(1) - given there are limited numbers of splits compared to a balanced or red-black tree; so a lazy load would not significantly change the trie load time. Each lookup might expand the node population but could cut down times where parts of the trie are ignored. Although this would be a non-trivial change, knowing that the trie is incomplete in a sparse situation might help here.

--Randall


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
  2024-05-01 22:37   ` rsbecker
@ 2024-05-01 22:40   ` Junio C Hamano
  2024-05-02  0:33   ` Taylor Blau
  2024-05-02  0:45   ` Dhruva Krishnamurthy
  3 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-05-01 22:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Dhruva Krishnamurthy, John Cai, Karthik Nayak, git

Jeff King <peff@peff.net> writes:

>   - the cache here is static-local in the function. It should probably
>     at least be predicated on the tree_oid, and maybe attached to the
>     repository object? I think having one per repository at a time would
>     be fine (generally the tree_oid is set once per process, so it's not
>     like you're switching between multiple options).

It should be per tree_oid or you will get a stale and incorrect
result when you read paths from a different tree.  But thanks for
that "something simple and stupid" code to clearly demonstrate
that repeated reading of the attributes data is the problem.

Given a tree with a name, the result of reading a path from that
tree does not depend on the repository the tree appears in, so the
cache does not have a reason to be tied to a particular repository.
Generally we work only inside a single repository, so attaching the
cache to that single repository would be a good way to make it
available globally without adding another global variable, as
the_repository can serve as the starting point for the global state,
but other than that there is no reason.

I agree that the attribute layer may be a better place to cache this
data.  As you pointed out, it already has a caching behaviour in its
attr_stack data structure that is optimized for local walk that
visits every path in a tree in depth first order, but it is likely
that a different caching scheme that is more suitable for random
access may need to be introduced.  The cache eviction strategy may
need some thought (the attr_stack based caching has an obviously
optimal eviction strategy---to evict the attribute data read from a
directory when the traversal leaves that directory) in order to
avoid unbounded bloat of the cached data.

> I've cc'd John as the author of 2386535511. But really, that was just
> enabling by default the attr-tree code added by 47cfc9bd7d (attr: add
> flag `--source` to work with tree-ish, 2023-01-14). Although in that
> original context (git check-attr) the lack of caching would be much less
> important.
>
> -Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
  2024-05-01 22:37   ` rsbecker
  2024-05-01 22:40   ` Junio C Hamano
@ 2024-05-02  0:33   ` Taylor Blau
  2024-05-02 17:33     ` Taylor Blau
  2024-05-02  0:45   ` Dhruva Krishnamurthy
  3 siblings, 1 reply; 20+ messages in thread
From: Taylor Blau @ 2024-05-02  0:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Dhruva Krishnamurthy, John Cai, Karthik Nayak, git

On Wed, May 01, 2024 at 06:00:30PM -0400, Jeff King wrote:
> Bisecting show the culprit is 2386535511 (attr: read attributes from
> HEAD when bare repo, 2023-10-13), which is in v2.43.0. Before that, a
> bare repository would only look for attributes in the info/attributes
> file. But after, we look at the HEAD tree-ish, too. And pack-objects
> will check the "delta" attribute for every path of every object we are
> packing. And remember that in-tree lookups for foo/bar/baz require
> looking not just for .gitattributes, but also foo/.gitattributes,
> foo/bar/.gitattributes, and foo/bar/baz/.gitattributes.

Thanks for the explanation and bisection. I agree that 2386535511 makes
sense as a likely culprit given what you wrote here.

>   - the problem doesn't show up if the repo has reachability bitmaps.
>     This is because the bitmap result doesn't have the pathnames of each
>     object (we do have the "name hash", but it's not enough for us to do
>     an attr lookup), and so objects we get from a bitmap do not
>     respect the delta attribute at all.
>
>     But when doing a shallow clone, we have to disable bitmaps and do a
>     regular traversal. So even if you have bitmaps, you still run into
>     the problem.
>
>     The example above should not have bitmaps (we do build them by
>     default when repacking bare repos these days, but I don't think
>     we'll do so right after cloning). If you have a local repo that
>     already has bitmaps, you should be able to see the difference by
>     using "git -c pack.usebitmaps=false pack-objects".

Yikes. I was hoping that bitmaps would be a saving grace here for setups
that have bitmap generation enabled, but it makes sense that it doesn't
help if you are doing a shallow clone where you have to disable bitmaps.

>     So even if you are a server which generally enables bitmaps, you can
>     still get bit by this for shallow clones, but also for other
>     non-bitmap invocations, like say "git repack -ad". There I see the
>     same 3-minute slowdown in the enumeration phase.

That's also pretty scary, and a worthwhile callout.

> So what to do? It seems like some kind of caching would help here. We're
> looking up the same paths over and over, for two reasons:
>
>   1. We'll have many objects with the same paths, one for each time the
>      path was modified through history.
>
>   2. Adjacent objects share the higher-level lookups. Both "dir/a" and
>      "dir/b" will need to look up "dir/.gitattributes" (and all the way
>      up to ".gitattributes").

Right. I guess you need to cache something like on the order of the set
of dirnames of all modified paths in the repository (and recursively the
dirnames of those dirnames up until you get to the root).

> So even something simple and stupid like this:

...makes sense.

> restores the v2.42 performance. But there are probably better options:
>
>   - this is caching whole .gitattributes buffers. In pack-objects we
>     only care about a single bit for try_delta. For linux.git it doesn't
>     really matter, as 99% of our entries are just CACHE_MISSING and the
>     real value is avoiding the negative lookups. But the same problem
>     exists to a lesser degree for "git log -p" in a bare repo. So I
>     think it makes sense to try to solve it in the attr layer.
>
>   - the string keys have a lot of duplication. You'll have
>     "foo/.gitattributes", "foo/bar/.gitattributes", and so on. A trie
>     structure split by path component would let you store each component
>     just once. And perhaps have even faster lookups. I think this is
>     roughly the same issue faced by the kernel VFS for doing path
>     lookups, so something dentry/dcache-like would help.
>
>     I don't know how much it matters in practice, though. The sum of all
>     of the paths in HEAD for linux.git is ~3.5MB, which is a rounding
>     error on the needs of the rest of the packing process.

This was my gut reaction when I started reading this bullet point, too.
I have a hard time imagining a repository that would be so large that it
would have a lot of unique paths, but not so large that it would be
otherwise cheap to run pack-objects.

>   - As noted above, most entries are just CACHE_MISSING. So rather than
>     lazily looking up and caching entries, we could just prepopulate the
>     cache. And then you know that if an entry isn't present in the
>     cache, it does not exist in the tree. The downside is that you pay
>     to walk the all of HEAD^{tree}, even if you only have a few lookups
>     to do. That's a good tradeoff for pack-objects (which usually ends
>     up looking for every path anyway), but not for "git diff" (where you
>     only care about a few changed paths.

That seems reasonable to do.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02  0:33   ` Taylor Blau
@ 2024-05-02 17:33     ` Taylor Blau
  2024-05-02 17:44       ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Taylor Blau @ 2024-05-02 17:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Dhruva Krishnamurthy, John Cai, Karthik Nayak, git

On Wed, May 01, 2024 at 08:33:52PM -0400, Taylor Blau wrote:
> On Wed, May 01, 2024 at 06:00:30PM -0400, Jeff King wrote:
> > Bisecting show the culprit is 2386535511 (attr: read attributes from
> > HEAD when bare repo, 2023-10-13), which is in v2.43.0. Before that, a
> > bare repository would only look for attributes in the info/attributes
> > file. But after, we look at the HEAD tree-ish, too. And pack-objects
> > will check the "delta" attribute for every path of every object we are
> > packing. And remember that in-tree lookups for foo/bar/baz require
> > looking not just for .gitattributes, but also foo/.gitattributes,
> > foo/bar/.gitattributes, and foo/bar/baz/.gitattributes.
>
> Thanks for the explanation and bisection. I agree that 2386535511 makes
> sense as a likely culprit given what you wrote here.

Here is one possible approach, which is a partial revert of 2386535511.
I thought about suggesting that we revert 2386535511 entirely, but I
think that may be too strong of an approach especially if there are
plans to otherwise improve the performance of attr lookups with some
caching layer.

Instead, this patch changes the behavior to only fallback to "HEAD" in
bare repositories from check-attr, but leaves pack-objects, archive, and
all other builtins alone.

I should note, this is a pretty hacky approach to use the extern'd
git_attr_tree variable from within the check-attr builtin, but I think
that this does do the trick.

Alternatively, if this is too hacky or magical that check-attr does one
thing but every other command does something else, I would personally be
fine with a full revert of 2386535511.

Anyway, here is the patch:

--- 8< ---

Subject: [PATCH] attr.c: only read attributes from HEAD via check-attr

This patch is a partial revert of commit 2386535511d (attr: read
attributes from HEAD when bare repo, 2023-10-13), which caused Git to
start reading from .gitattributes files from HEAD^{tree} when invoked in
a bare repository.

This patch has an unfortunate side-effect of significantly slowing down
pack-objects, for example, when invoked in a bare repository without
using reachability bitmaps.

Prior to 2386535511d, pack-objects would only look at the
info/attributes file when working in a bare repository. But after,
pack-objects ends up looking at every "delta" attribute not just in the
info/attributes file, but for every .gitattributes file in each tree
recursively from the root down to the dirname of whatever path we're
inspecting. In other words, gathering attributes for path foo/bar/baz
requires reading .gitattributes, foo/.gitattributes,
foo/bar/.gitattributes, and foo/bar/baz/.gitattributes.

Restore the pre-2386535511d behavior for commands other than check-attr
(which was the intended target of the change described in 2386535511d).

If we want to cause pack-objects to use HEAD^{tree} as an attributes
source in bare repositories by default again, it would likely come after
some caching layer to avoid the performance penalty.

Reported-by: Dhruva Krishnamurthy <dhruvakm@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 attr.c                  | 7 -------
 builtin/check-attr.c    | 2 ++
 t/t5001-archive-attr.sh | 2 +-
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/attr.c b/attr.c
index 7c380c17317..33473bdce01 100644
--- a/attr.c
+++ b/attr.c
@@ -1220,17 +1220,10 @@ static void compute_default_attr_source(struct object_id *attr_source)
 	if (!default_attr_source_tree_object_name && git_attr_tree) {
 		default_attr_source_tree_object_name = git_attr_tree;
 		ignore_bad_attr_tree = 1;
 	}

-	if (!default_attr_source_tree_object_name &&
-	    startup_info->have_repository &&
-	    is_bare_repository()) {
-		default_attr_source_tree_object_name = "HEAD";
-		ignore_bad_attr_tree = 1;
-	}
-
 	if (!default_attr_source_tree_object_name || !is_null_oid(attr_source))
 		return;

 	if (repo_get_oid_treeish(the_repository,
 				 default_attr_source_tree_object_name,
diff --git a/builtin/check-attr.c b/builtin/check-attr.c
index c1da1d184e9..9b445fe33c6 100644
--- a/builtin/check-attr.c
+++ b/builtin/check-attr.c
@@ -188,10 +188,12 @@ int cmd_check_attr(int argc, const char **argv, const char *prefix)

 	if (source) {
 		if (repo_get_oid_tree(the_repository, source, &initialized_oid))
 			die("%s: not a valid tree-ish source", source);
 		set_git_attr_source(source);
+	} else if (startup_info->have_repository && is_bare_repository()) {
+		git_attr_tree = "HEAD";
 	}

 	if (stdin_paths)
 		check_attr_stdin_paths(prefix, check, all_attrs);
 	else {
diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh
index eaf959d8f63..0ff47a239db 100755
--- a/t/t5001-archive-attr.sh
+++ b/t/t5001-archive-attr.sh
@@ -136,11 +136,11 @@ test_expect_success 'git archive with worktree attributes, bare' '
 	(cd bare && git archive --worktree-attributes HEAD) >bare-worktree.tar &&
 	(mkdir bare-worktree && cd bare-worktree && "$TAR" xf -) <bare-worktree.tar
 '

 test_expect_missing	bare-worktree/ignored
-test_expect_missing	bare-worktree/ignored-by-tree
+test_expect_exists	bare-worktree/ignored-by-tree
 test_expect_exists	bare-worktree/ignored-by-worktree

 test_expect_success 'export-subst' '
 	git log "--pretty=format:A${SUBSTFORMAT}O" HEAD >substfile1.expected &&
 	test_cmp nosubstfile archive/nosubstfile &&
--
2.45.0.1.g3e84e921a0a.dirty

--- >8 ---

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 17:33     ` Taylor Blau
@ 2024-05-02 17:44       ` Junio C Hamano
  2024-05-02 17:55         ` Taylor Blau
  2024-05-02 18:34         ` using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42 Dhruva Krishnamurthy
  0 siblings, 2 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-05-02 17:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, Dhruva Krishnamurthy, John Cai, Karthik Nayak, git

Taylor Blau <me@ttaylorr.com> writes:

> Instead, this patch changes the behavior to only fallback to "HEAD" in
> bare repositories from check-attr, but leaves pack-objects, archive, and
> all other builtins alone.

I thought the whole point of the exercise was to allow server-side
(which typically is bare and cannot use anything from the working
tree) to pay attention to the attributes.  This patch rips that out
and piles even more new and unproven code on top?  I am not sure.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 17:44       ` Junio C Hamano
@ 2024-05-02 17:55         ` Taylor Blau
  2024-05-02 19:01           ` Karthik Nayak
  2024-05-02 18:34         ` using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42 Dhruva Krishnamurthy
  1 sibling, 1 reply; 20+ messages in thread
From: Taylor Blau @ 2024-05-02 17:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Dhruva Krishnamurthy, John Cai, Karthik Nayak, git

On Thu, May 02, 2024 at 10:44:20AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Instead, this patch changes the behavior to only fallback to "HEAD" in
> > bare repositories from check-attr, but leaves pack-objects, archive, and
> > all other builtins alone.
>
> I thought the whole point of the exercise was to allow server-side
> (which typically is bare and cannot use anything from the working
> tree) to pay attention to the attributes.  This patch rips that out
> and piles even more new and unproven code on top?  I am not sure.

I thought the point of John's patch was to allow just check-attr to read
from HEAD^{tree} in bare repositories, and not to touch other commands.

I could be misunderstanding the original intent of John's patch (the
commit message there isn't clear whether the change was intended to
target just check-attr or all of Git). But my hope is that it was the
former, which this patch preserves.

I do not know whether servers should in general be trusting
user-provided attributes for things like "delta".

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 17:55         ` Taylor Blau
@ 2024-05-02 19:01           ` Karthik Nayak
  2024-05-02 21:08             ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Karthik Nayak @ 2024-05-02 19:01 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano
  Cc: Jeff King, Dhruva Krishnamurthy, John Cai, git

[-- Attachment #1: Type: text/plain, Size: 1338 bytes --]

Taylor Blau <me@ttaylorr.com> writes:

> On Thu, May 02, 2024 at 10:44:20AM -0700, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>> > Instead, this patch changes the behavior to only fallback to "HEAD" in
>> > bare repositories from check-attr, but leaves pack-objects, archive, and
>> > all other builtins alone.
>>
>> I thought the whole point of the exercise was to allow server-side
>> (which typically is bare and cannot use anything from the working
>> tree) to pay attention to the attributes.  This patch rips that out
>> and piles even more new and unproven code on top?  I am not sure.
>
> I thought the point of John's patch was to allow just check-attr to read
> from HEAD^{tree} in bare repositories, and not to touch other commands.
>
> I could be misunderstanding the original intent of John's patch (the
> commit message there isn't clear whether the change was intended to
> target just check-attr or all of Git). But my hope is that it was the
> former, which this patch preserves.
>

From the series [1] it becomes more clear that the intention was to
target all commands.

[1]: https://lore.kernel.org/git/pull.1577.v5.git.git.1697218770.gitgitgadget@gmail.com/

> I do not know whether servers should in general be trusting
> user-provided attributes for things like "delta".
>
> Thanks,
> Taylor

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 19:01           ` Karthik Nayak
@ 2024-05-02 21:08             ` Junio C Hamano
  2024-05-03  5:37               ` Dhruva Krishnamurthy
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-05-02 21:08 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: Taylor Blau, Jeff King, Dhruva Krishnamurthy, John Cai, git

Karthik Nayak <karthik.188@gmail.com> writes:

> Taylor Blau <me@ttaylorr.com> writes:
>
>> I could be misunderstanding the original intent of John's patch (the
>> commit message there isn't clear whether the change was intended to
>> target just check-attr or all of Git). But my hope is that it was the
>> former, which this patch preserves.
>>
>
> From the series [1] it becomes more clear that the intention was to
> target all commands.
>
> [1]: https://lore.kernel.org/git/pull.1577.v5.git.git.1697218770.gitgitgadget@gmail.com/

True.  

We could drop [1/2] from the series in the meantime to make it a
GitLab installation specific issue where they explicitly use
attr.tree to point at HEAD ;-) That is not solving anything for
those who set attr.tree (in a sense, they are buying the feature
with overhead of reading attributes from the named tree), but at
least for most people who are used to seeing the bare repository
ignoring the attributes, it would be an improvement to drop the
"bare repositories the tree of the HEAD commit is used to look up
attributes files by default" half from the series.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 21:08             ` Junio C Hamano
@ 2024-05-03  5:37               ` Dhruva Krishnamurthy
  2024-05-03 15:34                 ` Re* " Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Dhruva Krishnamurthy @ 2024-05-03  5:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Karthik Nayak, Taylor Blau, Jeff King, John Cai, git

On Thu, May 2, 2024 at 2:08 PM Junio C Hamano <gitster@pobox.com> wrote:
> We could drop [1/2] from the series in the meantime to make it a
> GitLab installation specific issue where they explicitly use
> attr.tree to point at HEAD ;-) That is not solving anything for
> those who set attr.tree (in a sense, they are buying the feature
> with overhead of reading attributes from the named tree), but at
> least for most people who are used to seeing the bare repository
> ignoring the attributes, it would be an improvement to drop the
> "bare repositories the tree of the HEAD commit is used to look up
> attributes files by default" half from the series.
>

A hack (without knowing side effects if any) is to use an empty tree
for attr source:
$ git config --add attr.tree $(git hash-object -t tree /dev/null)

This gives me performance comparable to git 2.42

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re* using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-03  5:37               ` Dhruva Krishnamurthy
@ 2024-05-03 15:34                 ` Junio C Hamano
  2024-05-03 17:46                   ` Jeff King
                                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-05-03 15:34 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak, Taylor Blau, Jeff King, John Cai,
	Dhruva Krishnamurthy

Dhruva Krishnamurthy <dhruvakm@gmail.com> writes:

> On Thu, May 2, 2024 at 2:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>> We could drop [1/2] from the series in the meantime to make it a
>> GitLab installation specific issue where they explicitly use
>> attr.tree to point at HEAD ;-) That is not solving anything for
>> those who set attr.tree (in a sense, they are buying the feature
>> with overhead of reading attributes from the named tree), but at
>> least for most people who are used to seeing the bare repository
>> ignoring the attributes, it would be an improvement to drop the
>> "bare repositories the tree of the HEAD commit is used to look up
>> attributes files by default" half from the series.
>>
>
> A hack (without knowing side effects if any) is to use an empty tree
> for attr source:
> $ git config --add attr.tree $(git hash-object -t tree /dev/null)
>
> This gives me performance comparable to git 2.42

That is clever.  Instead of crawling a potentially large tree that
is at the HEAD of the main project payload to find ".gitattributes"
files that may be relevant (and often not), folks can set an empty
tree to attr.tree to the configuration until this gets corrected.

And for folks who had been happy with the pre 2.42 behaviour,
we could do something like the attached as the first step to a real fix.

----- >8 --------- >8 --------- >8 --------- >8 -----
Subject: [PATCH] stop using HEAD for attributes in bare repository by default

With 23865355 (attr: read attributes from HEAD when bare repo,
2023-10-13), we started to use the HEAD tree as the default
attribute source in a bare repository.  One argument for such a
behaviour is that it would make things like "git archive" run in
bare and non-bare repositories for the same commit consistent.
This changes was merged to Git 2.43 but without an explicit mention
in its release notes.

It turns out that this change destroys performance of shallowly
cloning from a bare repository.  As the "server" installations are
expected to be mostly bare, and "git pack-objects", which is the
core of driving the other side of "git clone" and "git fetch" wants
to see if a path is set not to delta with blobs from other paths via
the attribute system, the change forces the server side to traverse
the tree of the HEAD commit needlessly to find if each and every
paths the objects it sends out has the attribute that controls the
deltification.  Given that (1) most projects do not configure such
an attribute, and (2) it is dubious for the server side to honor
such an end-user supplied attribute anyway, this was a poor choice
of the default.

To mitigate the current situation, let's revert the change that uses
the tree of HEAD in a bare repository by default as the attribute
source.  This will help most people who have been happy with the
behaviour of Git 2.42 and before.

Two things to note:

 * If you are stuck with versions of Git 2.43 or newer, that is
   older than the release this fix appears in, you can explicitly
   set the attr.tree configuration variable to point at an empty
   tree object, i.e.

	$ git config attr.tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904

 * If you like the behaviour we are reverting, you can explicitly
   set the attr.tree configuration variable to HEAD, i.e.

	$ git config attr.tree HEAD

The right fix for this is to optimize the code paths that allow
accesses to attributes in tree objects, but that is a much more
involved change and is left as a longer-term project, outside the
scope of this "first step" fix.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 attr.c                  |  7 -------
 t/t0003-attributes.sh   | 10 ++++++++--
 t/t5001-archive-attr.sh |  3 ++-
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git c/attr.c w/attr.c
index 679e42258c..6af7151088 100644
--- c/attr.c
+++ w/attr.c
@@ -1223,13 +1223,6 @@ static void compute_default_attr_source(struct object_id *attr_source)
 		ignore_bad_attr_tree = 1;
 	}

-	if (!default_attr_source_tree_object_name &&
-	    startup_info->have_repository &&
-	    is_bare_repository()) {
-		default_attr_source_tree_object_name = "HEAD";
-		ignore_bad_attr_tree = 1;
-	}
-
 	if (!default_attr_source_tree_object_name || !is_null_oid(attr_source))
 		return;

diff --git c/t/t0003-attributes.sh w/t/t0003-attributes.sh
index 774b52c298..d755cc3c29 100755
--- c/t/t0003-attributes.sh
+++ w/t/t0003-attributes.sh
@@ -398,13 +398,19 @@ test_expect_success 'bad attr source defaults to reading .gitattributes file' '
 	)
 '

-test_expect_success 'bare repo defaults to reading .gitattributes from HEAD' '
+test_expect_success 'bare repo no longer defaults to reading .gitattributes from HEAD' '
 	test_when_finished rm -rf test bare_with_gitattribute &&
 	git init test &&
 	test_commit -C test gitattributes .gitattributes "f/path test=val" &&
 	git clone --bare test bare_with_gitattribute &&
-	echo "f/path: test: val" >expect &&
+
+	echo "f/path: test: unspecified" >expect &&
 	git -C bare_with_gitattribute check-attr test -- f/path >actual &&
+	test_cmp expect actual &&
+
+	echo "f/path: test: val" >expect &&
+	git -C bare_with_gitattribute -c attr.tree=HEAD \
+		check-attr test -- f/path >actual &&
 	test_cmp expect actual
 '

diff --git c/t/t5001-archive-attr.sh w/t/t5001-archive-attr.sh
index eaf959d8f6..7310774af5 100755
--- c/t/t5001-archive-attr.sh
+++ w/t/t5001-archive-attr.sh
@@ -133,7 +133,8 @@ test_expect_success 'git archive vs. bare' '
 '

 test_expect_success 'git archive with worktree attributes, bare' '
-	(cd bare && git archive --worktree-attributes HEAD) >bare-worktree.tar &&
+	(cd bare &&
+	git -c attr.tree=HEAD archive --worktree-attributes HEAD) >bare-worktree.tar &&
 	(mkdir bare-worktree && cd bare-worktree && "$TAR" xf -) <bare-worktree.tar
 '

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Re* using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-03 15:34                 ` Re* " Junio C Hamano
@ 2024-05-03 17:46                   ` Jeff King
  2024-05-06 20:28                     ` Taylor Blau
  2024-05-13 20:16                   ` John Cai
  2024-06-05 21:43                   ` [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2024-05-03 17:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Karthik Nayak, Taylor Blau, John Cai, Dhruva Krishnamurthy

On Fri, May 03, 2024 at 08:34:27AM -0700, Junio C Hamano wrote:

> And for folks who had been happy with the pre 2.42 behaviour,
> we could do something like the attached as the first step to a real fix.

It looks like lots of discussion happened with out me, and everybody
already posted all of the responses I was going to. Good. :)

In particular...

> ----- >8 --------- >8 --------- >8 --------- >8 -----
> Subject: [PATCH] stop using HEAD for attributes in bare repository by default
> [...]
> The right fix for this is to optimize the code paths that allow
> accesses to attributes in tree objects, but that is a much more
> involved change and is left as a longer-term project, outside the
> scope of this "first step" fix.

...this was the exact first step I was going to suggest. And your patch
looks correct to me. I assume you'd target this for 'maint'. The
regression goes back to v2.43.0, so it's not exactly new, but given the
severity in some cases it seems like it's worth getting it into a
release sooner rather than later.

I am mildly surprised nobody noticed the issue until now. I wonder if
t/perf would notice it and nobody is running it, or if this is a gap in
our coverage there. If the latter, it might be worth adding such a
script, which should be able to show off that your change here takes us
back to the v2.42 state.

Running the perf suite against linux.git between 2.42 and 2.43 would
answer the "is this a gap" question, but I haven't had a chance to do
so, and it takes a while.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re* using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-03 17:46                   ` Jeff King
@ 2024-05-06 20:28                     ` Taylor Blau
  0 siblings, 0 replies; 20+ messages in thread
From: Taylor Blau @ 2024-05-06 20:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Karthik Nayak, John Cai,
	Dhruva Krishnamurthy

On Fri, May 03, 2024 at 01:46:53PM -0400, Jeff King wrote:
> On Fri, May 03, 2024 at 08:34:27AM -0700, Junio C Hamano wrote:
>
> > And for folks who had been happy with the pre 2.42 behaviour,
> > we could do something like the attached as the first step to a real fix.
>
> It looks like lots of discussion happened with out me, and everybody
> already posted all of the responses I was going to. Good. :)
>
> In particular...
>
> > ----- >8 --------- >8 --------- >8 --------- >8 -----
> > Subject: [PATCH] stop using HEAD for attributes in bare repository by default
> > [...]
> > The right fix for this is to optimize the code paths that allow
> > accesses to attributes in tree objects, but that is a much more
> > involved change and is left as a longer-term project, outside the
> > scope of this "first step" fix.
>
> ...this was the exact first step I was going to suggest. And your patch
> looks correct to me. I assume you'd target this for 'maint'. The
> regression goes back to v2.43.0, so it's not exactly new, but given the
> severity in some cases it seems like it's worth getting it into a
> release sooner rather than later.

For what it's worth, I am in favor of the patch that Junio proposed
here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re* using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-03 15:34                 ` Re* " Junio C Hamano
  2024-05-03 17:46                   ` Jeff King
@ 2024-05-13 20:16                   ` John Cai
  2024-06-05 21:43                   ` [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo Junio C Hamano
  2 siblings, 0 replies; 20+ messages in thread
From: John Cai @ 2024-05-13 20:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Karthik Nayak, Taylor Blau, Jeff King, Dhruva Krishnamurthy


On 3 May 2024, at 11:34, Junio C Hamano wrote:

> Dhruva Krishnamurthy <dhruvakm@gmail.com> writes:
>
>> On Thu, May 2, 2024 at 2:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>>> We could drop [1/2] from the series in the meantime to make it a
>>> GitLab installation specific issue where they explicitly use
>>> attr.tree to point at HEAD ;-) That is not solving anything for
>>> those who set attr.tree (in a sense, they are buying the feature
>>> with overhead of reading attributes from the named tree), but at
>>> least for most people who are used to seeing the bare repository
>>> ignoring the attributes, it would be an improvement to drop the
>>> "bare repositories the tree of the HEAD commit is used to look up
>>> attributes files by default" half from the series.
>>>
>>
>> A hack (without knowing side effects if any) is to use an empty tree
>> for attr source:
>> $ git config --add attr.tree $(git hash-object -t tree /dev/null)
>>
>> This gives me performance comparable to git 2.42
>
> That is clever.  Instead of crawling a potentially large tree that
> is at the HEAD of the main project payload to find ".gitattributes"
> files that may be relevant (and often not), folks can set an empty
> tree to attr.tree to the configuration until this gets corrected.
>
> And for folks who had been happy with the pre 2.42 behaviour,
> we could do something like the attached as the first step to a real fix.
>
> ----- >8 --------- >8 --------- >8 --------- >8 -----
> Subject: [PATCH] stop using HEAD for attributes in bare repository by default
>
> With 23865355 (attr: read attributes from HEAD when bare repo,
> 2023-10-13), we started to use the HEAD tree as the default
> attribute source in a bare repository.  One argument for such a
> behaviour is that it would make things like "git archive" run in
> bare and non-bare repositories for the same commit consistent.
> This changes was merged to Git 2.43 but without an explicit mention
> in its release notes.
>
> It turns out that this change destroys performance of shallowly
> cloning from a bare repository.  As the "server" installations are
> expected to be mostly bare, and "git pack-objects", which is the
> core of driving the other side of "git clone" and "git fetch" wants
> to see if a path is set not to delta with blobs from other paths via
> the attribute system, the change forces the server side to traverse
> the tree of the HEAD commit needlessly to find if each and every
> paths the objects it sends out has the attribute that controls the
> deltification.  Given that (1) most projects do not configure such
> an attribute, and (2) it is dubious for the server side to honor
> such an end-user supplied attribute anyway, this was a poor choice
> of the default.
>
> To mitigate the current situation, let's revert the change that uses
> the tree of HEAD in a bare repository by default as the attribute
> source.  This will help most people who have been happy with the
> behaviour of Git 2.42 and before.

This change makes sense to me, and glad it got uncovered. Thanks to all who
chimed in with this root cause analysis and the proposed patches. Sorry I
haven't replied until now-I was traveling the past two weeks.

thanks
John

>
> Two things to note:
>
>  * If you are stuck with versions of Git 2.43 or newer, that is
>    older than the release this fix appears in, you can explicitly
>    set the attr.tree configuration variable to point at an empty
>    tree object, i.e.
>
> 	$ git config attr.tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
>
>  * If you like the behaviour we are reverting, you can explicitly
>    set the attr.tree configuration variable to HEAD, i.e.
>
> 	$ git config attr.tree HEAD
>
> The right fix for this is to optimize the code paths that allow
> accesses to attributes in tree objects, but that is a much more
> involved change and is left as a longer-term project, outside the
> scope of this "first step" fix.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  attr.c                  |  7 -------
>  t/t0003-attributes.sh   | 10 ++++++++--
>  t/t5001-archive-attr.sh |  3 ++-
>  3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git c/attr.c w/attr.c
> index 679e42258c..6af7151088 100644
> --- c/attr.c
> +++ w/attr.c
> @@ -1223,13 +1223,6 @@ static void compute_default_attr_source(struct object_id *attr_source)
>  		ignore_bad_attr_tree = 1;
>  	}
>
> -	if (!default_attr_source_tree_object_name &&
> -	    startup_info->have_repository &&
> -	    is_bare_repository()) {
> -		default_attr_source_tree_object_name = "HEAD";
> -		ignore_bad_attr_tree = 1;
> -	}
> -
>  	if (!default_attr_source_tree_object_name || !is_null_oid(attr_source))
>  		return;
>
> diff --git c/t/t0003-attributes.sh w/t/t0003-attributes.sh
> index 774b52c298..d755cc3c29 100755
> --- c/t/t0003-attributes.sh
> +++ w/t/t0003-attributes.sh
> @@ -398,13 +398,19 @@ test_expect_success 'bad attr source defaults to reading .gitattributes file' '
>  	)
>  '
>
> -test_expect_success 'bare repo defaults to reading .gitattributes from HEAD' '
> +test_expect_success 'bare repo no longer defaults to reading .gitattributes from HEAD' '
>  	test_when_finished rm -rf test bare_with_gitattribute &&
>  	git init test &&
>  	test_commit -C test gitattributes .gitattributes "f/path test=val" &&
>  	git clone --bare test bare_with_gitattribute &&
> -	echo "f/path: test: val" >expect &&
> +
> +	echo "f/path: test: unspecified" >expect &&
>  	git -C bare_with_gitattribute check-attr test -- f/path >actual &&
> +	test_cmp expect actual &&
> +
> +	echo "f/path: test: val" >expect &&
> +	git -C bare_with_gitattribute -c attr.tree=HEAD \
> +		check-attr test -- f/path >actual &&
>  	test_cmp expect actual
>  '
>
> diff --git c/t/t5001-archive-attr.sh w/t/t5001-archive-attr.sh
> index eaf959d8f6..7310774af5 100755
> --- c/t/t5001-archive-attr.sh
> +++ w/t/t5001-archive-attr.sh
> @@ -133,7 +133,8 @@ test_expect_success 'git archive vs. bare' '
>  '
>
>  test_expect_success 'git archive with worktree attributes, bare' '
> -	(cd bare && git archive --worktree-attributes HEAD) >bare-worktree.tar &&
> +	(cd bare &&
> +	git -c attr.tree=HEAD archive --worktree-attributes HEAD) >bare-worktree.tar &&
>  	(mkdir bare-worktree && cd bare-worktree && "$TAR" xf -) <bare-worktree.tar
>  '

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo
  2024-05-03 15:34                 ` Re* " Junio C Hamano
  2024-05-03 17:46                   ` Jeff King
  2024-05-13 20:16                   ` John Cai
@ 2024-06-05 21:43                   ` Junio C Hamano
  2024-06-06  8:32                     ` Jeff King
  2 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-06-05 21:43 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak, Taylor Blau, Jeff King, John Cai,
	Dhruva Krishnamurthy

51441e64 (stop using HEAD for attributes in bare repository by
default, 2024-05-03) has addressed a recent performance regression
by partially reverting a topic that was merged at 26dd307c (Merge
branch 'jc/attr-tree-config', 2023-10-30).  But it forgot to update
the documentation to remove the mention of a special case in bare
repositories.

Let's update the document before the update hits the next release.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/attr.txt | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git c/Documentation/config/attr.txt w/Documentation/config/attr.txt
index 1a482d6af2..c4a5857993 100644
--- c/Documentation/config/attr.txt
+++ w/Documentation/config/attr.txt
@@ -1,7 +1,6 @@
 attr.tree::
 	A reference to a tree in the repository from which to read attributes,
-	instead of the `.gitattributes` file in the working tree. In a bare
-	repository, this defaults to `HEAD:.gitattributes`. If the value does
-	not resolve to a valid tree object, an empty tree is used instead.
+	instead of the `.gitattributes` file in the working tree. If the value
+	does not resolve to a valid tree object, an empty tree is used instead.
 	When the `GIT_ATTR_SOURCE` environment variable or `--attr-source`
 	command line option are used, this configuration variable has no effect.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo
  2024-06-05 21:43                   ` [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo Junio C Hamano
@ 2024-06-06  8:32                     ` Jeff King
  2024-06-06 16:02                       ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2024-06-06  8:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Karthik Nayak, Taylor Blau, John Cai, Dhruva Krishnamurthy

On Wed, Jun 05, 2024 at 02:43:03PM -0700, Junio C Hamano wrote:

> 51441e64 (stop using HEAD for attributes in bare repository by
> default, 2024-05-03) has addressed a recent performance regression
> by partially reverting a topic that was merged at 26dd307c (Merge
> branch 'jc/attr-tree-config', 2023-10-30).  But it forgot to update
> the documentation to remove the mention of a special case in bare
> repositories.
> 
> Let's update the document before the update hits the next release.

Good catch, and the patch looks good.

I think 51441e64 is essentially a revert of 2386535511 (attr: read
attributes from HEAD when bare repo, 2023-10-13). I don't know how you
prepared it, but I'd probably have started with "cherry-pick -n". But
that wouldn't help, because the documentation didn't come until after
that in 9f9c40cf34 (attr: add attr.tree for setting the treeish to read
attributes from, 2023-10-13).

Not that it really matters much now, but always just curious about how
we can avoid missing stuff like this next time.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo
  2024-06-06  8:32                     ` Jeff King
@ 2024-06-06 16:02                       ` Junio C Hamano
  0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-06-06 16:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Karthik Nayak, Taylor Blau, John Cai, Dhruva Krishnamurthy

Jeff King <peff@peff.net> writes:

> I think 51441e64 is essentially a revert of 2386535511 (attr: read
> attributes from HEAD when bare repo, 2023-10-13). I don't know how you
> prepared it, but I'd probably have started with "cherry-pick -n". But
> that wouldn't help, because the documentation didn't come until after
> that in 9f9c40cf34 (attr: add attr.tree for setting the treeish to read
> attributes from, 2023-10-13).

"revert -m 1" followed by "commit --amend" might have worked well in
this case to get rid of the code that came from one and doc update
that came from the other in a two patch series, but in general, that
would be too much noise to wade through in general.

> Not that it really matters much now, but always just curious about how
> we can avoid missing stuff like this next time.

The series first did "HEAD tree is used in bare" without doc, and
followed up with "configuration can be used to name any tree" with
doc that mentions the behaviour of the first step as a special case
of default value for the configuration variable.  The only way it
could have been made easier to spot is to introduce the variable
with documentation first, and then do the "bare repo uses HEAD as
the default" thing on top.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-02 17:44       ` Junio C Hamano
  2024-05-02 17:55         ` Taylor Blau
@ 2024-05-02 18:34         ` Dhruva Krishnamurthy
  1 sibling, 0 replies; 20+ messages in thread
From: Dhruva Krishnamurthy @ 2024-05-02 18:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Jeff King, John Cai, Karthik Nayak, git

On Thu, May 2, 2024 at 10:44 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Instead, this patch changes the behavior to only fallback to "HEAD" in
> > bare repositories from check-attr, but leaves pack-objects, archive, and
> > all other builtins alone.
>
> I thought the whole point of the exercise was to allow server-side
> (which typically is bare and cannot use anything from the working
> tree) to pay attention to the attributes.  This patch rips that out
> and piles even more new and unproven code on top?  I am not sure.

Yes, I am particularly interested in bare repositories (server/hosting side).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42
  2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
                     ` (2 preceding siblings ...)
  2024-05-02  0:33   ` Taylor Blau
@ 2024-05-02  0:45   ` Dhruva Krishnamurthy
  3 siblings, 0 replies; 20+ messages in thread
From: Dhruva Krishnamurthy @ 2024-05-02  0:45 UTC (permalink / raw)
  To: Jeff King; +Cc: John Cai, Karthik Nayak, git

On Wed, May 1, 2024 at 3:00 PM Jeff King <peff@peff.net> wrote:
> Oof. I was able to reproduce this regression, and the impact can be
> pretty severe. You can reproduce it without NFS and without shallow
> clones. Get a bare clone of something big like the kernel:

Wow, that was really fast! I spent quite some time narrowing down and
still was far off. I hope to develop such deep insights and
understanding of git code some day. Thank you very much for sharing
all the details and giving me leads to further explore.

-dhruva

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-06-06 16:02 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-01  5:26 Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42 Dhruva Krishnamurthy
2024-05-01 22:00 ` using tree as attribute source is slow, was " Jeff King
2024-05-01 22:37   ` rsbecker
2024-05-01 22:40   ` Junio C Hamano
2024-05-02  0:33   ` Taylor Blau
2024-05-02 17:33     ` Taylor Blau
2024-05-02 17:44       ` Junio C Hamano
2024-05-02 17:55         ` Taylor Blau
2024-05-02 19:01           ` Karthik Nayak
2024-05-02 21:08             ` Junio C Hamano
2024-05-03  5:37               ` Dhruva Krishnamurthy
2024-05-03 15:34                 ` Re* " Junio C Hamano
2024-05-03 17:46                   ` Jeff King
2024-05-06 20:28                     ` Taylor Blau
2024-05-13 20:16                   ` John Cai
2024-06-05 21:43                   ` [PATCH] attr.tree: HEAD:.gitattributes is no longer the default in a bare repo Junio C Hamano
2024-06-06  8:32                     ` Jeff King
2024-06-06 16:02                       ` Junio C Hamano
2024-05-02 18:34         ` using tree as attribute source is slow, was Re: Help troubleshoot performance regression cloning with depth: git 2.44 vs git 2.42 Dhruva Krishnamurthy
2024-05-02  0:45   ` Dhruva Krishnamurthy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).