* [PATCH] attr: avoid recursion when expanding attribute macros
@ 2025-11-11 22:36 Jeff King
2025-11-12 1:30 ` Ben Knoble
2025-11-12 6:57 ` Patrick Steinhardt
0 siblings, 2 replies; 8+ messages in thread
From: Jeff King @ 2025-11-11 22:36 UTC (permalink / raw)
To: git; +Cc: Ben Stav
Given a set of attribute macros like:
[attr]a1 a2
[attr]a2 a3
...
[attr]a300000 -text
file a1
expanding the attributes for "file" requires expanding "a1" to "a2",
"a2" to "a3", and so on until hitting a non-macro expansion ("-text", in
this case). We implement this via recursion: fill_one() calls
macroexpand_one(), which then recurses back to fill_one(). As a result,
very deep macro chains like the one above can run out of stack space and
cause us to segfault.
The required stack space is fairly small; I needed on the order of
200,000 entries to get a segfault on Linux. So it's unlikely anybody
would hit this accidentally, leaving only malicious inputs. There you
can easily construct a repo which will segfault on clone (we look at
attributes during the checkout step, but you'd see the same trying to do
other operations, like diff in a bare repo). It's mostly harmless, since
anybody constructing such a repo is only preventing victims from cloning
their evil garbage, but it could be a nuisance for hosting sites.
One option to prevent this is to limit the depth of recursion we'll
allow. This is conceptually easy to implement, but it raises other
questions: what should the limit be, and do we need a configuration knob
for it?
The recursion here is simple enough that we can avoid those questions by
just converting it to iteration instead. Rather than iterate over the
states of a match_attr in fill_one(), we'll put them all in a queue, and
the expansion of each can add to the queue rather than recursing. Note
that this is a LIFO queue in order to keep the same depth-first order we
did with the recursive implementation. I've avoided using the word
"stack" in the code because the term is already heavily used to refer to
the stack of .gitattributes files that matches the tree structure of the
repository.
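A self-contained toy illustration of that shape (not the patch code; the
follow[] chain is made up and realloc() error checking is omitted for
brevity): each entry names the next "macro" to expand or is terminal, and
a heap-allocated LIFO worklist replaces the recursion.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* follow[i] is the next macro in the chain, or -1 for a terminal */
	static const int follow[] = { 1, 2, 3, -1 };
	int *todo = NULL;
	size_t nr = 0, alloc = 0;

	/* push the starting entry */
	todo = realloc(todo, ++alloc * sizeof(*todo));
	todo[nr++] = 0;

	while (nr) {
		int cur = todo[--nr];	/* pop: LIFO keeps the depth-first order */
		printf("expanding a%d\n", cur);
		if (follow[cur] >= 0) {
			if (nr == alloc)
				todo = realloc(todo, ++alloc * sizeof(*todo));
			todo[nr++] = follow[cur];	/* push instead of recursing */
		}
	}
	free(todo);
	return 0;
}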
The test uses a limited stack size so we can trigger the problem with a
much smaller input than the one shown above. The value here (3000) is
enough to trigger the issue on my x86_64 Linux machine.
Reported-by: Ben Stav <benstav@miggo.io>
Signed-off-by: Jeff King <peff@peff.net>
---
 attr.c                | 50 +++++++++++++++++++++++++++++--------------
 t/t0003-attributes.sh | 20 +++++++++++++++++
 2 files changed, 54 insertions(+), 16 deletions(-)
diff --git a/attr.c b/attr.c
index d1daeb0b4d..4999b7e09d 100644
--- a/attr.c
+++ b/attr.c
@@ -1064,24 +1064,52 @@ static int path_matches(const char *pathname, int pathlen,
 			      pattern, prefix, pat->patternlen);
 }
 
-static int macroexpand_one(struct all_attrs_item *all_attrs, int nr, int rem);
+struct attr_state_queue {
+	const struct attr_state **items;
+	size_t alloc, nr;
+};
+
+static void attr_state_queue_push(struct attr_state_queue *t,
+				  const struct match_attr *a)
+{
+	for (size_t i = 0; i < a->num_attr; i++) {
+		ALLOC_GROW(t->items, t->nr + 1, t->alloc);
+		t->items[t->nr++] = &a->state[i];
+	}
+}
+
+static const struct attr_state *attr_state_queue_pop(struct attr_state_queue *t)
+{
+	return t->nr ? t->items[--t->nr] : NULL;
+}
+
+static void attr_state_queue_release(struct attr_state_queue *t)
+{
+	free(t->items);
+}
 
 static int fill_one(struct all_attrs_item *all_attrs,
 		    const struct match_attr *a, int rem)
 {
-	size_t i;
+	struct attr_state_queue todo = { 0 };
+	const struct attr_state *state;
 
-	for (i = a->num_attr; rem > 0 && i > 0; i--) {
-		const struct git_attr *attr = a->state[i - 1].attr;
+	attr_state_queue_push(&todo, a);
+	while (rem > 0 && (state = attr_state_queue_pop(&todo))) {
+		const struct git_attr *attr = state->attr;
 		const char **n = &(all_attrs[attr->attr_nr].value);
-		const char *v = a->state[i - 1].setto;
+		const char *v = state->setto;
 
 		if (*n == ATTR__UNKNOWN) {
+			const struct all_attrs_item *item =
+				&all_attrs[attr->attr_nr];
 			*n = v;
 			rem--;
-			rem = macroexpand_one(all_attrs, attr->attr_nr, rem);
+			if (item->macro && item->value == ATTR__TRUE)
+				attr_state_queue_push(&todo, item->macro);
 		}
 	}
 
+	attr_state_queue_release(&todo);
 	return rem;
 }
@@ -1106,16 +1134,6 @@ static int fill(const char *path, int pathlen, int basename_offset,
 	return rem;
 }
 
-static int macroexpand_one(struct all_attrs_item *all_attrs, int nr, int rem)
-{
-	const struct all_attrs_item *item = &all_attrs[nr];
-
-	if (item->macro && item->value == ATTR__TRUE)
-		return fill_one(all_attrs, item->macro, rem);
-	else
-		return rem;
-}
-
 /*
  * Marks the attributes which are macros based on the attribute stack.
  * This prevents having to search through the attribute stack each time
diff --git a/t/t0003-attributes.sh b/t/t0003-attributes.sh
index 3c98b622f2..582e207aa1 100755
--- a/t/t0003-attributes.sh
+++ b/t/t0003-attributes.sh
@@ -664,4 +664,24 @@ test_expect_success 'user defined builtin_objectmode values are ignored' '
 	test_cmp expect err
 '
 
+test_expect_success ULIMIT_STACK_SIZE 'deep macro recursion' '
+	n=3000 &&
+	{
+		i=0 &&
+		while test $i -lt $n; do
+			echo "[attr]a$i a$((i+1))" &&
+			i=$((i+1)) ||
+			return 1
+		done &&
+		echo "[attr]a$n -text" &&
+		echo "file a0"
+	} >.gitattributes &&
+	{
+		echo "file: text: unset" &&
+		test_seq -f "file: a%d: set" 0 $n
+	} >expect &&
+	run_with_limited_stack git check-attr -a file >actual &&
+	test_cmp expect actual
+'
+
 test_done
--
2.52.0.rc1.258.g0c10df6ae5
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-11 22:36 [PATCH] attr: avoid recursion when expanding attribute macros Jeff King
@ 2025-11-12 1:30 ` Ben Knoble
2025-11-12 7:09 ` Jeff King
2025-11-12 6:57 ` Patrick Steinhardt
1 sibling, 1 reply; 8+ messages in thread
From: Ben Knoble @ 2025-11-12 1:30 UTC (permalink / raw)
To: Jeff King; +Cc: git, Ben Stav
> On Nov 11, 2025, at 17:37, Jeff King <peff@peff.net> wrote:
>
> Given a set of attribute macros like:
>
> [attr]a1 a2
> [attr]a2 a3
> ...
> [attr]a300000 -text
> file a1
>
> expanding the attributes for "file" requires expanding "a1" to "a2",
> "a2" to "a3", and so on until hitting a non-macro expansion ("-text", in
> this case). We implement this via recursion: fill_one() calls
> macroexpand_one(), which then recurses back to fill_one(). As a result,
> very deep macro chains like the one above can run out of stack space and
> cause us to segfault.
>
> The required stack space is fairly small; I needed on the order of
> 200,000 entries to get a segfault on Linux. So it's unlikely anybody
> would hit this accidentally, leaving only malicious inputs. There you
> can easily construct a repo which will segfault on clone (we look at
> attributes during the checkout step, but you'd see the same trying to do
> other operations, like diff in a bare repo). It's mostly harmless, since
> anybody constructing such a repo is only preventing victims from cloning
> their evil garbage, but it could be a nuisance for hosting sites.
>
> One option to prevent this is to limit the depth of recursion we'll
> allow. This is conceptually easy to implement, but it raises other
> questions: what should the limit be, and do we need a configuration knob
> for it?
>
> The recursion here is simple enough that we can avoid those questions by
> just converting it to iteration instead. Rather than iterate over the
> states of a match_attr in fill_one(), we'll put them all in a queue, and
> the expansion of each can add to the queue rather than recursing. Note
> that this is a LIFO queue in order to keep the same depth-first order we
> did with the recursive implementation. I've avoided using the word
> "stack" in the code because the term is already heavily used to refer to
> the stack of .gitattributes files that matches the tree structure of the
> repository.
Worth catching, and I agree with your choice of in-memory iteration over tunable depth.
My knowledge on memory models is a bit weak and I didn’t check directly, but are we implicitly assuming that we are less likely to run out of heap memory in such an evil case? In effect I suppose we’re turning a stack overflow segfault into an OOM error?
That seems like a fine assumption to me (I’m used to languages where the call stack lives more or less efficiently on the heap), just wanted to check my understanding.
The memory use has to go somewhere ;) presuming there’s no good way to keep only the relevant entries in memory. I can of course construct a large example that also uses each intermediate macro, so the code would need to get a lot smarter (collapse equivalence classes, prune unused paths, etc.), which seems like a poor investment for what AFAICT is a little-used feature*.
*I love it, and use it for custom diff drivers and the baked-in hunk headers. But the diff definitions aren’t easily shareable, and I have to be aware of when to turn them off, so I end up not sharing the corresponding attributes either.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-12 1:30 ` Ben Knoble
@ 2025-11-12 7:09 ` Jeff King
2025-11-12 7:17 ` Jeff King
0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2025-11-12 7:09 UTC (permalink / raw)
To: Ben Knoble; +Cc: git, Ben Stav
On Tue, Nov 11, 2025 at 08:30:58PM -0500, Ben Knoble wrote:
> My knowledge on memory models is a bit weak and I didn’t check
> directly, but are we implicitly assuming that we are less likely to
> run out of heap memory in such an evil case? In effect I suppose we’re
> turning a stack overflow segfault into an OOM error?
Yes, I think you could think of it that way. But there are two reasons
to prefer heap:
1. The heap limits are _way_ bigger. The stack size on Linux is
usually 8MB, and that is considered large. It's much smaller on
other platforms (and especially if you have multiple threads).
2. In C, you don't have many options for detecting the case of running
out of stack, let alone recovering from it. Whereas you can check
for heap allocation failures. We don't tend to do anything besides
die() in git, but it's still nicer to have a controlled die than a
segfault.
So switching from stack recursion to spending heap memory essentially
makes the problem go away, or at least turns it into one of the zillion
other ways that you can convince Git to allocate a bunch of heap memory. ;)
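(A toy, non-Git illustration of point 2; the helper below is made up and
is not Git's actual xmalloc(). A failed heap allocation can be noticed
and turned into a controlled exit, while C offers no portable way to
notice that the next call is about to blow the stack.)

#include <stdio.h>
#include <stdlib.h>

static void *xmalloc_or_die(size_t len)
{
	void *ret = malloc(len);
	if (!ret) {
		/* a controlled failure: report and exit instead of segfaulting */
		fprintf(stderr, "fatal: out of memory, tried to allocate %zu bytes\n", len);
		exit(128);
	}
	return ret;
}

int main(void)
{
	char *buf = xmalloc_or_die(1 << 20);
	buf[0] = 0;	/* touch the memory so the allocation is actually used */
	free(buf);
	return 0;
}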
> The memory use has to go somewhere ;) presuming there’s no good way to
> only keep the relevant entries in memory, since I can of course find a
> large example that also uses each intermediate macro, so the code
> would need to get a lot smarter to collapse equivalence classes, prune
> unused paths, etc., which seems like a poor investment for what AFAICT
> is a little-used feature*.
We have a hard limit of 100MB on attributes files, which is mostly a
made-up number (it was the size that GitHub had been limiting for all
blobs for years, so we knew nobody would complain about instituting it).
From the research in 3c50032ff5 (attr: ignore overly large gitattributes
files, 2022-12-01), it would probably be fine to drop it by a factor of
10 or more.
Though I think you might be able to chain macros across files (so
".gitattributes" introduces macro "foo", and the "sub/.gitattributes"
introduces "bar" which resolves to "foo", and so on). In which case your
total size is larger, and only eventually limited by how deep a tree
we'll accept (another place where we recurse, but there is a
configurable depth limit).
So for the most part Git's protection against this sort of resource
consumption attacks is: die if the process wants too many resources, and
people who try to tickle those limits are only hurting their own repos.
It does put people who host arbitrary Git repos on the hook for managing
resources at the OS level (so greedy and malicious processes are killed
rather than bringing down the rest of the system).
-Peff
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-12 7:09 ` Jeff King
@ 2025-11-12 7:17 ` Jeff King
0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2025-11-12 7:17 UTC (permalink / raw)
To: Ben Knoble; +Cc: git, Ben Stav
On Wed, Nov 12, 2025 at 02:09:07AM -0500, Jeff King wrote:
> Though I think you might be able to chain macros across files (so
> ".gitattributes" introduces macro "foo", and the "sub/.gitattributes"
> introduces "bar" which resolves to "foo", and so on). In which case your
> total size is larger, and only eventually limited by how deep a tree
> we'll accept (another place where we recurse, but there is a
> configurable depth limit).
I did poke at this briefly, and the answer is: no, you can't do that. We
allow macro definitions only at the top-level. Which makes sense, as
otherwise you get into confusing dependencies between files.
-Peff
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-11 22:36 [PATCH] attr: avoid recursion when expanding attribute macros Jeff King
2025-11-12 1:30 ` Ben Knoble
@ 2025-11-12 6:57 ` Patrick Steinhardt
2025-11-12 7:16 ` Jeff King
2025-11-12 17:40 ` Junio C Hamano
1 sibling, 2 replies; 8+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 6:57 UTC (permalink / raw)
To: Jeff King; +Cc: git, Ben Stav
On Tue, Nov 11, 2025 at 05:36:47PM -0500, Jeff King wrote:
> Given a set of attribute macros like:
>
> [attr]a1 a2
> [attr]a2 a3
> ...
> [attr]a300000 -text
> file a1
>
> expanding the attributes for "file" requires expanding "a1" to "a2",
> "a2" to "a3", and so on until hitting a non-macro expansion ("-text", in
> this case). We implement this via recursion: fill_one() calls
> macroexpand_one(), which then recurses back to fill_one(). As a result,
> very deep macro chains like the one above can run out of stack space and
> cause us to segfault.
>
> The required stack space is fairly small; I needed on the order of
> 200,000 entries to get a segfault on Linux. So it's unlikely anybody
> would hit this accidentally, leaving only malicious inputs. There you
> can easily construct a repo which will segfault on clone (we look at
> attributes during the checkout step, but you'd see the same trying to do
> other operations, like diff in a bare repo). It's mostly harmless, since
> anybody constructing such a repo is only preventing victims from cloning
> their evil garbage, but it could be a nuisance for hosting sites.
>
> One option to prevent this is to limit the depth of recursion we'll
> allow. This is conceptually easy to implement, but it raises other
> questions: what should the limit be, and do we need a configuration knob
> for it?
That's fair, and as you demonstrate it's easy enough to turn recursion
into iteration. But it doesn't really solve the main problem: given
malicious input we'd now still crash eventually, even though we
ourselves control how exactly we crash. The main difference is that with
iteration it'll both:
- take longer for us to crash
- require way more memory along the way
So the evil garbage would continue to be a nuisance for users who want
to clone such a repository, but now it's going to be more of a nuisance
for hosting sites given that it could lead to out-of-memory situations.
I guess the reasoning here is that for this to become a real problem the
".gitattributes" file would need to be excessively huge. We're probably
talking about many millions or even billions of attributes before this
could cause an OOM situation. And such a file would be large enough to
bust the typical limits that the likes of GitHub and GitLab have in
place, so Git hosters already protect themselves against this crafted
input, even if only indirectly so.
The other angle is of course the wasted compute that an adversary can
cause. But I don't really mind that too much: there are enough benign
operations that require a bunch of compute, so I don't really see a
reason why one would need to craft a "compute waster" with malicious
input.
So personally I would've probably leaned in the direction of enforcing
a hard limit. I don't see a reason why anybody would need more than a
couple of levels of recursion; a limit curbs both compute and memory
growth, and it allows us to have a proper error message in case the
limit is busted.
Furthermore, we can demonstrate right now that it wasn't possible to
have unlimited recursion anyway, which makes it easier to put a new
limit into place.
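(For comparison, a toy sketch of that depth-limit alternative; not a
proposed patch, and MAX_MACRO_DEPTH plus the follow[] chain are made up.)

#include <stdio.h>
#include <stdlib.h>

#define MAX_MACRO_DEPTH 100	/* made-up value */

/* follow[i] is the next macro in the chain, or -1 for a terminal */
static const int follow[] = { 1, 2, 3, -1 };

static int expand(int idx, int depth)
{
	if (depth > MAX_MACRO_DEPTH) {
		/* a controlled error instead of unbounded recursion */
		fprintf(stderr, "fatal: attribute macro chain deeper than %d\n",
			MAX_MACRO_DEPTH);
		exit(128);
	}
	if (follow[idx] < 0)
		return idx;
	return expand(follow[idx], depth + 1);
}

int main(void)
{
	printf("a0 resolves to a%d\n", expand(0, 0));
	return 0;
}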
But following my above reasoning I think it's okay to turn this into
iteration as well; I'd just like to hear whether my train of thought
matches yours.
Thanks!
Patrick
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-12 6:57 ` Patrick Steinhardt
@ 2025-11-12 7:16 ` Jeff King
2025-11-12 10:21 ` Patrick Steinhardt
2025-11-12 17:40 ` Junio C Hamano
1 sibling, 1 reply; 8+ messages in thread
From: Jeff King @ 2025-11-12 7:16 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Ben Stav
On Wed, Nov 12, 2025 at 07:57:14AM +0100, Patrick Steinhardt wrote:
> So personally I would've probably leaned into the direction of enforcing
> a hard limit. I don't see a reason why anybody would need more than a
> couple of recursions, it culls both compute and memory growth, and it
> allows us to have a proper error message in case the limit is busted.
> Furthermore, we can demonstrate right now that it wasn't possible to
> have unlimited recursion anyway, which makes it easier to put a new
> limit into place.
>
> But following my above reasoning I think it's okay to turn this into
> iteration, as well, though, but I'd like to hear whether my train of
> thought matches yours.
Yeah, it does match mine. If I wanted to waste a bunch of CPU and memory
on a hosting site, there are a lot easier ways to do that than with
really long gitattributes.
I'm not at all opposed to putting in a hard limit on top. My general
feeling is that it never hurts to convert recursion to iteration; it
only gives us more options. I'm not planning to work on a hard limit
myself, but if you want to, be my guest. :)
I think if we do (or even if we don't), it may also be reasonable to
shrink the max attribute file size to 10MB or even smaller.
-Peff
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-12 7:16 ` Jeff King
@ 2025-11-12 10:21 ` Patrick Steinhardt
0 siblings, 0 replies; 8+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 10:21 UTC (permalink / raw)
To: Jeff King; +Cc: git, Ben Stav
On Wed, Nov 12, 2025 at 02:16:51AM -0500, Jeff King wrote:
> On Wed, Nov 12, 2025 at 07:57:14AM +0100, Patrick Steinhardt wrote:
>
> > So personally I would've probably leaned into the direction of enforcing
> > a hard limit. I don't see a reason why anybody would need more than a
> > couple of recursions, it culls both compute and memory growth, and it
> > allows us to have a proper error message in case the limit is busted.
> > Furthermore, we can demonstrate right now that it wasn't possible to
> > have unlimited recursion anyway, which makes it easier to put a new
> > limit into place.
> >
> > But following my above reasoning I think it's okay to turn this into
> > iteration, as well, though, but I'd like to hear whether my train of
> > thought matches yours.
>
> Yeah, it does match mine. If I wanted to waste a bunch of CPU and memory
> on a hosting site, there are a lot easier ways to do that than with
> really long gitattributes.
>
> I'm not at all opposed to putting in a hard limit on top. My general
> feeling is that it never hurts to convert recursion to iteration; it
> only gives us more options. I'm not planning to work on a hard limit
> myself, but if you want to, be my guest. :)
>
> I think if we do (or even if we don't), it may also be reasonable to
> shrink the max attribute file size to 10MB or even smaller.
I think for now it's okay to convert this into iteration and not
introduce a limit, at least as long as we keep an open mind about
introducing such a limit in the future. I don't really expect that
anyone will ever abuse this, but if I'm wrong and it does happen at some
point, we may have to introduce the limit retroactively.
So: I'm happy with your patch, but it might make sense to summarize the
discussion in the commit message.
Thanks!
Patrick
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] attr: avoid recursion when expanding attribute macros
2025-11-12 6:57 ` Patrick Steinhardt
2025-11-12 7:16 ` Jeff King
@ 2025-11-12 17:40 ` Junio C Hamano
1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2025-11-12 17:40 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Jeff King, git, Ben Stav
Patrick Steinhardt <ps@pks.im> writes:
> That's fair, and as you demonstrate it's easy enough to turn recursion
> into iteration. But it doesn't really solve the main problem: given
> malicious input we'd now still crash eventually, even though we
> ...
> So the evil garbage would continue to be a nuisance for users who want
> to clone such a repository, but now it's going to be more of a nuisance
> for hosting sites given that it could lead to out-of-memory situations.
That assumes there are users who want to clone such a repository
with evil garbage in it, doesn't it? I am not sure how likely it is that
such people exist, and even less sure if we want to actively support
such users or discourage them.
I like the conversion from recursion to iteration as a general
principle, but somehow I do not think this particular one is an
issue that warrants more than minimal effort.
I also wonder how common the use of attribute macros (other than the
built-in ones) is. Do folks working at hosting sites have easy
access to public data (i.e., a super "git grep" that lets them sample
some random subset of the many public repositories and work on them)?
Thanks.
^ permalink raw reply [flat|nested] 8+ messages in thread