git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] range-diff: add configurable memory limit for cost matrix
@ 2025-08-26 17:18 Paulo Casaretto via GitGitGadget
  2025-08-26 19:18 ` Junio C Hamano
  2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
  0 siblings, 2 replies; 18+ messages in thread
From: Paulo Casaretto via GitGitGadget @ 2025-08-26 17:18 UTC (permalink / raw)
  To: git; +Cc: Paulo Casaretto, pcasaretto

From: pcasaretto <paulo.casaretto@shopify.com>

When comparing large commit ranges (e.g., 250,000+ commits), range-diff
attempts to allocate an n×n cost matrix that can exhaust available
memory. For example, with 256,784 commits (n = 513,568), the matrix
would require approximately 256GB of memory (513,568² × 4 bytes),
causing either immediate segmentation faults due to integer overflow or
system hangs.

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
  git range-diff --max-memory=1G branch1...branch2
  git range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/

Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
Signed-off-by: pcasaretto <paulo.casaretto@shopify.com>
---
    range-diff: add configurable memory limit for cost matrix
    
    
    Problem Description
    ===================
    
    When git range-diff is given extremely large ranges, it can result in
    either:
    
     1. Segmentation fault due to integer overflow in array index
        calculations
     2. Excessive memory consumption leading to system hangs or OOM kills
     3. Poor user experience with the command appearing to hang for minutes
    
    
    Reproduction Case
    =================
    
    In a Shopify's large monorepo a range-diff command like this crashes
    after several minutes with a SIGBUS error
    
    $ git range-diff 4430f36511..316c1276c6 cb5240b6a8..2bbd292091
    
    
    Range statistics:
    
     * First range: 256,783 commits
     * Second range: 1 commit
     * Total: 256,784 commits
     * Memory required for cost matrix: n² × 4 bytes = ~260GB
    
    
    Stack Trace (Segmentation Fault)
    ================================
    
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e000ae3b8)
      * frame #0: 0x000000010029a284 git`get_correspondences(a=0x000000016fde6188, b=0x000000016fde6160, creation_factor=60) at range-diff.c:356:20
        frame #1: 0x0000000100299310 git`show_range_diff(range1="4430f36511cbacf5c517c6851e2b8508a72dfd30..316c1276c63f55ad9413fa18bf3b6483564a9cf4", range2="cb5240b6a8ba59b4a1f282559ee0742721b0cafc..2bbd292091e376d177ce62e264dae8872ca6be5a", range_diff_opts=0x000000016fde6308) at range-diff.c:593:3
        frame #2: 0x00000001000c719c git`cmd_range_diff(argc=2, argv=0x0000600000bcd8c0, prefix="areas/core/shopify/", repo=0x0000000100468b20) at range-diff.c:167:8
        frame #3: 0x000000010000277c git`run_builtin(p=0x00000001004408d8, argc=3, argv=0x0000600000bcd8c0, repo=0x0000000100468b20) at git.c:480:11
        frame #4: 0x0000000100001020 git`handle_builtin(args=0x000000016fde6c70) at git.c:746:9
        frame #5: 0x0000000100002074 git`run_argv(args=0x000000016fde6c70) at git.c:813:4
        frame #6: 0x0000000100000d3c git`cmd_main(argc=3, argv=0x000000016fde7350) at git.c:953:19
        frame #7: 0x000000010012750c git`main(argc=4, argv=0x000000016fde7348) at common-main.c:9:11
        frame #8: 0x000000018573ab98 dyld`start + 6076
    
    
    
    Root Cause Analysis
    ===================
    
    The crash occurs in get_correspondences() at line 356:
    
    static void get_correspondences(struct string_list *a, struct string_list *b, ...)
    {
        int n = a->nr + b->nr;  // Integer overflow: 256,784 fits in int
        ...
        ALLOC_ARRAY(cost, st_mult(n, n));  // Would allocate ~260GB
        ...
        cost[i + n * j] = c;  // Line 356: Invalid memory access
    }
    
    
    Problems:
    
     1. Integer overflow: While n=256,784 fits in an int, n*n overflows
     2. Memory allocation: Even with proper types, allocating 260GB is
        impractical
    
    
    Solution
    ========
    
    Add a memory limit check in get_correspondences() before allocating the
    cost matrix. This check uses the total size in bytes (n² × sizeof(int))
    and compares it against a configurable maximum, preventing both
    excessive memory usage and integer overflow issues.
    
    The limit is configurable via a new --max-memory option that accepts
    human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
    systems and 2GB for 32 bit systems. This allows comparing ranges of
    approximately 32,000 (16,000) commits - generous for real-world use
    cases while preventing impractical operations.
    
    When the limit is exceeded, range-diff now displays a clear error
    message showing both the requested memory size and the maximum allowed,
    formatted in human-readable units for better user experience.
    
    Example usage: git range-diff --max-memory=1G branch1...branch2 git
    range-diff --max-memory=500M base..topic1 base..topic2
    
    This approach was chosen over alternatives:
    
     * Pre-counting commits: Would require spawning additional git processes
       and reading all commits twice
     * Limiting by commit count: Less precise than actual memory usage
     * Streaming approach: Would require significant refactoring of the
       current algorithm
    
    This issue was previously discussed in:
    https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
    
    Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
    [https://github.com/gitgitgadget/git/pull/1958#issuecomment-3224133138]

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1958%2Fpcasaretto%2Frange-diff-size-limit-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1958/pcasaretto/range-diff-size-limit-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1958

 builtin/log.c        |  1 +
 builtin/range-diff.c | 30 ++++++++++++++++++++++++++----
 log-tree.c           |  1 +
 range-diff.c         | 35 +++++++++++++++++++++++++----------
 range-diff.h         |  5 +++++
 5 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index c2f8bbf8630..5f552d14c0f 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1404,6 +1404,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = rev->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts,
 			.other_arg = &other_arg
 		};
diff --git a/builtin/range-diff.c b/builtin/range-diff.c
index a563abff5fe..10fa82fd285 100644
--- a/builtin/range-diff.c
+++ b/builtin/range-diff.c
@@ -6,6 +6,7 @@
 #include "parse-options.h"
 #include "range-diff.h"
 #include "config.h"
+#include "parse.h"
 
 
 static const char * const builtin_range_diff_usage[] = {
@@ -15,6 +16,22 @@ N_("git range-diff [<options>] <base> <old-tip> <new-tip>"),
 NULL
 };
 
+static int parse_max_memory(const struct option *opt, const char *arg, int unset)
+{
+	size_t *max_memory = opt->value;
+	uintmax_t val;
+
+	if (unset) {
+		return 0;
+	}
+
+	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
+		return error(_("invalid max-memory value: %s"), arg);
+
+	*max_memory = (size_t)val;
+	return 0;
+}
+
 int cmd_range_diff(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -25,6 +42,7 @@ int cmd_range_diff(int argc,
 	struct strvec diff_merges_arg = STRVEC_INIT;
 	struct range_diff_options range_diff_opts = {
 		.creation_factor = RANGE_DIFF_CREATION_FACTOR_DEFAULT,
+		.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 		.diffopt = &diffopt,
 		.other_arg = &other_arg
 	};
@@ -33,17 +51,21 @@ int cmd_range_diff(int argc,
 		OPT_INTEGER(0, "creation-factor",
 			    &range_diff_opts.creation_factor,
 			    N_("percentage by which creation is weighted")),
+		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
+				  N_("style"), N_("passed to 'git log'"), 0),
+		OPT_BOOL(0, "left-only", &left_only,
+			 N_("only emit output related to the first range")),
+		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
+			     N_("size"),
+			     N_("maximum memory for cost matrix (default 4G)"),
+			     parse_max_memory),
 		OPT_BOOL(0, "no-dual-color", &simple_color,
 			    N_("use simple diff colors")),
 		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
 				  N_("notes"), N_("passed to 'git log'"),
 				  PARSE_OPT_OPTARG),
-		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
-				  N_("style"), N_("passed to 'git log'"), 0),
 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
-		OPT_BOOL(0, "left-only", &left_only,
-			 N_("only emit output related to the first range")),
 		OPT_BOOL(0, "right-only", &right_only,
 			 N_("only emit output related to the second range")),
 		OPT_END()
diff --git a/log-tree.c b/log-tree.c
index 233bf9f227c..73d21f71764 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -717,6 +717,7 @@ static void show_diff_of_diff(struct rev_info *opt)
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = opt->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts
 		};
 
diff --git a/range-diff.c b/range-diff.c
index 8a2dcbee322..6e9b6b115e5 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -21,6 +21,7 @@
 #include "apply.h"
 #include "revision.h"
 
+
 struct patch_util {
 	/* For the search for an exact match */
 	struct hashmap_entry e;
@@ -287,8 +288,8 @@ static void find_exact_matches(struct string_list *a, struct string_list *b)
 }
 
 static int diffsize_consume(void *data,
-			     char *line UNUSED,
-			     unsigned long len UNUSED)
+			    char *line UNUSED,
+			    unsigned long len UNUSED)
 {
 	(*(int *)data)++;
 	return 0;
@@ -325,13 +326,24 @@ static int diffsize(const char *a, const char *b)
 }
 
 static void get_correspondences(struct string_list *a, struct string_list *b,
-				int creation_factor)
+				int creation_factor, size_t max_memory)
 {
 	int n = a->nr + b->nr;
 	int *cost, c, *a2b, *b2a;
 	int i, j;
-
-	ALLOC_ARRAY(cost, st_mult(n, n));
+	size_t cost_size = st_mult(n, n);
+	size_t cost_bytes = st_mult(sizeof(int), cost_size);
+	if (cost_bytes >= max_memory) {
+		struct strbuf cost_str = STRBUF_INIT;
+		struct strbuf max_str = STRBUF_INIT;
+		strbuf_humanise_bytes(&cost_str, cost_bytes);
+		strbuf_humanise_bytes(&max_str, max_memory);
+		die(_("range-diff: unable to compute the range-diff, since it "
+		      "exceeds the maximum memory for the cost matrix: %s "
+		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
+		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
+	}
+	ALLOC_ARRAY(cost, cost_size);
 	ALLOC_ARRAY(a2b, n);
 	ALLOC_ARRAY(b2a, n);
 
@@ -351,7 +363,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
 		}
 
 		c = a_util->matching < 0 ?
-			a_util->diffsize * creation_factor / 100 : COST_MAX;
+			    a_util->diffsize * creation_factor / 100 :
+			    COST_MAX;
 		for (j = b->nr; j < n; j++)
 			cost[i + n * j] = c;
 	}
@@ -360,7 +373,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
 		struct patch_util *util = b->items[j].util;
 
 		c = util->matching < 0 ?
-			util->diffsize * creation_factor / 100 : COST_MAX;
+			    util->diffsize * creation_factor / 100 :
+			    COST_MAX;
 		for (i = a->nr; i < n; i++)
 			cost[i + n * j] = c;
 	}
@@ -539,7 +553,7 @@ static void output(struct string_list *a, struct string_list *b,
 		if (i < a->nr && a_util->matching < 0) {
 			if (!range_diff_opts->right_only)
 				output_pair_header(&opts, patch_no_width,
-					   &buf, &dashes, a_util, NULL);
+						   &buf, &dashes, a_util, NULL);
 			i++;
 			continue;
 		}
@@ -548,7 +562,7 @@ static void output(struct string_list *a, struct string_list *b,
 		while (j < b->nr && b_util->matching < 0) {
 			if (!range_diff_opts->left_only)
 				output_pair_header(&opts, patch_no_width,
-					   &buf, &dashes, NULL, b_util);
+						   &buf, &dashes, NULL, b_util);
 			b_util = ++j < b->nr ? b->items[j].util : NULL;
 		}
 
@@ -591,7 +605,8 @@ int show_range_diff(const char *range1, const char *range2,
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2,
-				    range_diff_opts->creation_factor);
+				    range_diff_opts->creation_factor,
+				    range_diff_opts->max_memory);
 		output(&branch1, &branch2, range_diff_opts);
 	}
 
diff --git a/range-diff.h b/range-diff.h
index cd85000b5a0..9d39818e349 100644
--- a/range-diff.h
+++ b/range-diff.h
@@ -5,6 +5,10 @@
 #include "strvec.h"
 
 #define RANGE_DIFF_CREATION_FACTOR_DEFAULT 60
+#define RANGE_DIFF_MAX_MEMORY_DEFAULT \
+	(sizeof(void*) >= 8 ? \
+		((size_t)(1024L * 1024L) * (size_t)(4L * 1024L)) : /* 4GB on 64-bit */ \
+		((size_t)(1024L * 1024L) * (size_t)(2L * 1024L)))   /* 2GB on 32-bit */
 
 /*
  * A much higher value than the default, when we KNOW we are comparing
@@ -17,6 +21,7 @@ struct range_diff_options {
 	unsigned dual_color:1;
 	unsigned left_only:1, right_only:1;
 	unsigned include_merges:1;
+	size_t max_memory;
 	const struct diff_options *diffopt; /* may be NULL */
 	const struct strvec *other_arg; /* may be NULL */
 };

base-commit: 954d33a9757fcfab723a824116902f1eb16e05f7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] range-diff: add configurable memory limit for cost matrix
  2025-08-26 17:18 [PATCH] range-diff: add configurable memory limit for cost matrix Paulo Casaretto via GitGitGadget
@ 2025-08-26 19:18 ` Junio C Hamano
  2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
  1 sibling, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-08-26 19:18 UTC (permalink / raw)
  To: Paulo Casaretto via GitGitGadget; +Cc: git, Paulo Casaretto, pcasaretto

"Paulo Casaretto via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: pcasaretto <paulo.casaretto@shopify.com>

<administrivia>

It is usual to see a less human readable name embedded in the commit
object than the mail header when a mail comes from GGG.  

Just in case you want to be known to this community as "Paulo
Casaretto", not "pcasaretto", I thought I'd point it out that you
may want to redo the commit.  I do not mind what name you like to
use, as long as it is identifiable, and From: identity matches the
identity you add your Signed-off-by: with.

</administrivia>

> Acked-by: Johannes Schindelin johannes.schindelin@gmx.de

It is unusual to lack <> around e-mail address here.

> Signed-off-by: pcasaretto <paulo.casaretto@shopify.com>
> ---
>     range-diff: add configurable memory limit for cost matrix

> +static int parse_max_memory(const struct option *opt, const char *arg, int unset)
> +{
> +	size_t *max_memory = opt->value;
> +	uintmax_t val;
> +
> +	if (unset) {
> +		return 0;
> +	}

No unnecessary {braces} around a single statement, please.

> +	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
> +		return error(_("invalid max-memory value: %s"), arg);
> +
> +	*max_memory = (size_t)val;
> +	return 0;
> +}

> @@ -33,17 +51,21 @@ int cmd_range_diff(int argc,
>  		OPT_INTEGER(0, "creation-factor",
>  			    &range_diff_opts.creation_factor,
>  			    N_("percentage by which creation is weighted")),
> +		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> +				  N_("style"), N_("passed to 'git log'"), 0),
> +		OPT_BOOL(0, "left-only", &left_only,
> +			 N_("only emit output related to the first range")),
> +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> +			     N_("size"),
> +			     N_("maximum memory for cost matrix (default 4G)"),
> +			     parse_max_memory),
>  		OPT_BOOL(0, "no-dual-color", &simple_color,
>  			    N_("use simple diff colors")),
>  		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
>  				  N_("notes"), N_("passed to 'git log'"),
>  				  PARSE_OPT_OPTARG),
> -		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> -				  N_("style"), N_("passed to 'git log'"), 0),
>  		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
>  				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
> -		OPT_BOOL(0, "left-only", &left_only,
> -			 N_("only emit output related to the first range")),
>  		OPT_BOOL(0, "right-only", &right_only,
>  			 N_("only emit output related to the second range")),
>  		OPT_END()

This seems to mix unrelated changes.  Please don't.

Or if the reordering of options do have a reason to exist in _this_
commit, please justify it in your proposed log message.  Even if
there were a good reason for reordering existing options, I strongly
suspect that the change would want to be done in a separate,
preparatory-clean-up commit (i.e., making this topic a two-patch
series), because it has nothing to do with preventing inefficient
cost matrix computation from consuming too much memory, which _is_
the theme of this commit.

> diff --git a/range-diff.c b/range-diff.c
> index 8a2dcbee322..6e9b6b115e5 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -21,6 +21,7 @@
>  #include "apply.h"
>  #include "revision.h"
>  
> +

Unrelated, unexplained, and unnecessary change snuck in?  Please
proof-read the patch yourself before sending.

> @@ -287,8 +288,8 @@ static void find_exact_matches(struct string_list *a, struct string_list *b)
>  }
>  
>  static int diffsize_consume(void *data,
> -			     char *line UNUSED,
> -			     unsigned long len UNUSED)
> +			    char *line UNUSED,
> +			    unsigned long len UNUSED)

What is this change about???

>  static void get_correspondences(struct string_list *a, struct string_list *b,
> -				int creation_factor)
> +				int creation_factor, size_t max_memory)
>  {
>  	int n = a->nr + b->nr;
>  	int *cost, c, *a2b, *b2a;
>  	int i, j;
> -
> -	ALLOC_ARRAY(cost, st_mult(n, n));
> +	size_t cost_size = st_mult(n, n);
> +	size_t cost_bytes = st_mult(sizeof(int), cost_size);
> +	if (cost_bytes >= max_memory) {
> +		struct strbuf cost_str = STRBUF_INIT;
> +		struct strbuf max_str = STRBUF_INIT;
> +		strbuf_humanise_bytes(&cost_str, cost_bytes);
> +		strbuf_humanise_bytes(&max_str, max_memory);
> +		die(_("range-diff: unable to compute the range-diff, since it "
> +		      "exceeds the maximum memory for the cost matrix: %s "
> +		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
> +		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
> +	}
> +	ALLOC_ARRAY(cost, cost_size);

Nicely done.

> @@ -351,7 +363,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
>  		}
>  
>  		c = a_util->matching < 0 ?
> -			a_util->diffsize * creation_factor / 100 : COST_MAX;
> +			    a_util->diffsize * creation_factor / 100 :
> +			    COST_MAX;
>  		for (j = b->nr; j < n; j++)
>  			cost[i + n * j] = c;
>  	}

There seem to be other unrelated changes indentation-only changes
mixed in to the changes to this file, not just this one.

As a style fix, 

		c = a_util->matching < 0
		  ? a_util->diffsize * creation_factor / 100
		  : COST_MAX;

would be easier to follow and read, but please do not do such a
cosmetic clean-up in the same patch.  Do them in a separate
preliminary clean-up patch before the "real work".

> @@ -591,7 +605,8 @@ int show_range_diff(const char *range1, const char *range2,
>  	if (!res) {
>  		find_exact_matches(&branch1, &branch2);
>  		get_correspondences(&branch1, &branch2,
> -				    range_diff_opts->creation_factor);
> +				    range_diff_opts->creation_factor,
> +				    range_diff_opts->max_memory);
>  		output(&branch1, &branch2, range_diff_opts);
>  	}

OK.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 0/2] range-diff: add configurable memory limit for cost matrix
  2025-08-26 17:18 [PATCH] range-diff: add configurable memory limit for cost matrix Paulo Casaretto via GitGitGadget
  2025-08-26 19:18 ` Junio C Hamano
@ 2025-08-28  8:38 ` Paulo Casaretto via GitGitGadget
  2025-08-28  8:38   ` [PATCH v2 1/2] range-diff: reorder options lexicographically pcasaretto via GitGitGadget
                     ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Paulo Casaretto via GitGitGadget @ 2025-08-28  8:38 UTC (permalink / raw)
  To: git; +Cc: Paulo Casaretto


Problem Description
===================

When git range-diff is given extremely large ranges, it can result in
either:

 1. Segmentation fault due to integer overflow in array index calculations
 2. Excessive memory consumption leading to system hangs or OOM kills
 3. Poor user experience with the command appearing to hang for minutes


Reproduction Case
=================

In a Shopify's large monorepo a range-diff command like this crashes after
several minutes with a SIGBUS error

$ git range-diff 4430f36511..316c1276c6 cb5240b6a8..2bbd292091


Range statistics:

 * First range: 256,783 commits
 * Second range: 1 commit
 * Total: 256,784 commits
 * Memory required for cost matrix: n² × 4 bytes = ~260GB


Stack Trace (Segmentation Fault)
================================

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e000ae3b8)
  * frame #0: 0x000000010029a284 git`get_correspondences(a=0x000000016fde6188, b=0x000000016fde6160, creation_factor=60) at range-diff.c:356:20
    frame #1: 0x0000000100299310 git`show_range_diff(range1="4430f36511cbacf5c517c6851e2b8508a72dfd30..316c1276c63f55ad9413fa18bf3b6483564a9cf4", range2="cb5240b6a8ba59b4a1f282559ee0742721b0cafc..2bbd292091e376d177ce62e264dae8872ca6be5a", range_diff_opts=0x000000016fde6308) at range-diff.c:593:3
    frame #2: 0x00000001000c719c git`cmd_range_diff(argc=2, argv=0x0000600000bcd8c0, prefix="areas/core/shopify/", repo=0x0000000100468b20) at range-diff.c:167:8
    frame #3: 0x000000010000277c git`run_builtin(p=0x00000001004408d8, argc=3, argv=0x0000600000bcd8c0, repo=0x0000000100468b20) at git.c:480:11
    frame #4: 0x0000000100001020 git`handle_builtin(args=0x000000016fde6c70) at git.c:746:9
    frame #5: 0x0000000100002074 git`run_argv(args=0x000000016fde6c70) at git.c:813:4
    frame #6: 0x0000000100000d3c git`cmd_main(argc=3, argv=0x000000016fde7350) at git.c:953:19
    frame #7: 0x000000010012750c git`main(argc=4, argv=0x000000016fde7348) at common-main.c:9:11
    frame #8: 0x000000018573ab98 dyld`start + 6076



Root Cause Analysis
===================

The crash occurs in get_correspondences() at line 356:

static void get_correspondences(struct string_list *a, struct string_list *b, ...)
{
    int n = a->nr + b->nr;  // Integer overflow: 256,784 fits in int
    ...
    ALLOC_ARRAY(cost, st_mult(n, n));  // Would allocate ~260GB
    ...
    cost[i + n * j] = c;  // Line 356: Invalid memory access
}


Problems:

 1. Integer overflow: While n=256,784 fits in an int, n*n overflows
 2. Memory allocation: Even with proper types, allocating 260GB is
    impractical


Solution
========

Add a memory limit check in get_correspondences() before allocating the cost
matrix. This check uses the total size in bytes (n² × sizeof(int)) and
compares it against a configurable maximum, preventing both excessive memory
usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error message
showing both the requested memory size and the maximum allowed, formatted in
human-readable units for better user experience.

Example usage: git range-diff --max-memory=1G branch1...branch2 git
range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:

 * Pre-counting commits: Would require spawning additional git processes and
   reading all commits twice
 * Limiting by commit count: Less precise than actual memory usage
 * Streaming approach: Would require significant refactoring of the current
   algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/

Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
[https://github.com/gitgitgadget/git/pull/1958#issuecomment-3224133138]

pcasaretto (2):
  range-diff: reorder options lexicographically
  range-diff: add configurable memory limit for cost matrix

 builtin/log.c        |  1 +
 builtin/range-diff.c | 29 +++++++++++++++++++++++++----
 log-tree.c           |  1 +
 range-diff.c         | 20 ++++++++++++++++----
 range-diff.h         |  5 +++++
 5 files changed, 48 insertions(+), 8 deletions(-)


base-commit: 954d33a9757fcfab723a824116902f1eb16e05f7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1958%2Fpcasaretto%2Frange-diff-size-limit-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1958/pcasaretto/range-diff-size-limit-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1958

Range-diff vs v1:

 -:  ----------- > 1:  ec5dcdf9d00 range-diff: reorder options lexicographically
 1:  5cf3e8921a7 ! 2:  c81f920fee0 range-diff: add configurable memory limit for cost matrix
     @@ Commit message
          This issue was previously discussed in:
          https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
      
     -    Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
     -    Signed-off-by: pcasaretto <paulo.casaretto@shopify.com>
     +    Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
     +    Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
      
       ## builtin/log.c ##
      @@ builtin/log.c: static void make_cover_letter(struct rev_info *rev, int use_separate_file,
     @@ builtin/range-diff.c: N_("git range-diff [<options>] <base> <old-tip> <new-tip>"
      +	size_t *max_memory = opt->value;
      +	uintmax_t val;
      +
     -+	if (unset) {
     ++	if (unset)
      +		return 0;
     -+	}
      +
      +	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
      +		return error(_("invalid max-memory value: %s"), arg);
     @@ builtin/range-diff.c: int cmd_range_diff(int argc,
       		.other_arg = &other_arg
       	};
      @@ builtin/range-diff.c: int cmd_range_diff(int argc,
     - 		OPT_INTEGER(0, "creation-factor",
     - 			    &range_diff_opts.creation_factor,
     - 			    N_("percentage by which creation is weighted")),
     -+		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
     -+				  N_("style"), N_("passed to 'git log'"), 0),
     -+		OPT_BOOL(0, "left-only", &left_only,
     -+			 N_("only emit output related to the first range")),
     + 				  N_("style"), N_("passed to 'git log'"), 0),
     + 		OPT_BOOL(0, "left-only", &left_only,
     + 			 N_("only emit output related to the first range")),
      +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
      +			     N_("size"),
      +			     N_("maximum memory for cost matrix (default 4G)"),
     @@ builtin/range-diff.c: int cmd_range_diff(int argc,
       		OPT_BOOL(0, "no-dual-color", &simple_color,
       			    N_("use simple diff colors")),
       		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
     - 				  N_("notes"), N_("passed to 'git log'"),
     - 				  PARSE_OPT_OPTARG),
     --		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
     --				  N_("style"), N_("passed to 'git log'"), 0),
     - 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
     - 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
     --		OPT_BOOL(0, "left-only", &left_only,
     --			 N_("only emit output related to the first range")),
     - 		OPT_BOOL(0, "right-only", &right_only,
     - 			 N_("only emit output related to the second range")),
     - 		OPT_END()
      
       ## log-tree.c ##
      @@ log-tree.c: static void show_diff_of_diff(struct rev_info *opt)
     @@ log-tree.c: static void show_diff_of_diff(struct rev_info *opt)
       
      
       ## range-diff.c ##
     -@@
     - #include "apply.h"
     - #include "revision.h"
     - 
     -+
     - struct patch_util {
     - 	/* For the search for an exact match */
     - 	struct hashmap_entry e;
     -@@ range-diff.c: static void find_exact_matches(struct string_list *a, struct string_list *b)
     - }
     - 
     - static int diffsize_consume(void *data,
     --			     char *line UNUSED,
     --			     unsigned long len UNUSED)
     -+			    char *line UNUSED,
     -+			    unsigned long len UNUSED)
     - {
     - 	(*(int *)data)++;
     - 	return 0;
      @@ range-diff.c: static int diffsize(const char *a, const char *b)
       }
       
     @@ range-diff.c: static int diffsize(const char *a, const char *b)
       	ALLOC_ARRAY(a2b, n);
       	ALLOC_ARRAY(b2a, n);
       
     -@@ range-diff.c: static void get_correspondences(struct string_list *a, struct string_list *b,
     - 		}
     - 
     - 		c = a_util->matching < 0 ?
     --			a_util->diffsize * creation_factor / 100 : COST_MAX;
     -+			    a_util->diffsize * creation_factor / 100 :
     -+			    COST_MAX;
     - 		for (j = b->nr; j < n; j++)
     - 			cost[i + n * j] = c;
     - 	}
     -@@ range-diff.c: static void get_correspondences(struct string_list *a, struct string_list *b,
     - 		struct patch_util *util = b->items[j].util;
     - 
     - 		c = util->matching < 0 ?
     --			util->diffsize * creation_factor / 100 : COST_MAX;
     -+			    util->diffsize * creation_factor / 100 :
     -+			    COST_MAX;
     - 		for (i = a->nr; i < n; i++)
     - 			cost[i + n * j] = c;
     - 	}
     -@@ range-diff.c: static void output(struct string_list *a, struct string_list *b,
     - 		if (i < a->nr && a_util->matching < 0) {
     - 			if (!range_diff_opts->right_only)
     - 				output_pair_header(&opts, patch_no_width,
     --					   &buf, &dashes, a_util, NULL);
     -+						   &buf, &dashes, a_util, NULL);
     - 			i++;
     - 			continue;
     - 		}
     -@@ range-diff.c: static void output(struct string_list *a, struct string_list *b,
     - 		while (j < b->nr && b_util->matching < 0) {
     - 			if (!range_diff_opts->left_only)
     - 				output_pair_header(&opts, patch_no_width,
     --					   &buf, &dashes, NULL, b_util);
     -+						   &buf, &dashes, NULL, b_util);
     - 			b_util = ++j < b->nr ? b->items[j].util : NULL;
     - 		}
     - 
      @@ range-diff.c: int show_range_diff(const char *range1, const char *range2,
       	if (!res) {
       		find_exact_matches(&branch1, &branch2);

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/2] range-diff: reorder options lexicographically
  2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
@ 2025-08-28  8:38   ` pcasaretto via GitGitGadget
  2025-08-28 15:21     ` Junio C Hamano
  2025-08-28  8:38   ` [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix pcasaretto via GitGitGadget
  2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
  2 siblings, 1 reply; 18+ messages in thread
From: pcasaretto via GitGitGadget @ 2025-08-28  8:38 UTC (permalink / raw)
  To: git; +Cc: Paulo Casaretto, pcasaretto

From: pcasaretto <paulo.casaretto@shopify.com>

Reorder the command-line options in builtin/range-diff.c to be in
lexicographic order for better organization and readability. This is
a preparatory cleanup with no functional changes.

Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
---
 builtin/range-diff.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/range-diff.c b/builtin/range-diff.c
index a563abff5fee..283583a80d0b 100644
--- a/builtin/range-diff.c
+++ b/builtin/range-diff.c
@@ -33,17 +33,17 @@ int cmd_range_diff(int argc,
 		OPT_INTEGER(0, "creation-factor",
 			    &range_diff_opts.creation_factor,
 			    N_("percentage by which creation is weighted")),
+		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
+				  N_("style"), N_("passed to 'git log'"), 0),
+		OPT_BOOL(0, "left-only", &left_only,
+			 N_("only emit output related to the first range")),
 		OPT_BOOL(0, "no-dual-color", &simple_color,
 			    N_("use simple diff colors")),
 		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
 				  N_("notes"), N_("passed to 'git log'"),
 				  PARSE_OPT_OPTARG),
-		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
-				  N_("style"), N_("passed to 'git log'"), 0),
 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
-		OPT_BOOL(0, "left-only", &left_only,
-			 N_("only emit output related to the first range")),
 		OPT_BOOL(0, "right-only", &right_only,
 			 N_("only emit output related to the second range")),
 		OPT_END()
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix
  2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
  2025-08-28  8:38   ` [PATCH v2 1/2] range-diff: reorder options lexicographically pcasaretto via GitGitGadget
@ 2025-08-28  8:38   ` pcasaretto via GitGitGadget
  2025-08-28 17:04     ` Elijah Newren
  2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
  2 siblings, 1 reply; 18+ messages in thread
From: pcasaretto via GitGitGadget @ 2025-08-28  8:38 UTC (permalink / raw)
  To: git; +Cc: Paulo Casaretto, pcasaretto

From: pcasaretto <paulo.casaretto@shopify.com>

When comparing large commit ranges (e.g., 250,000+ commits), range-diff
attempts to allocate an n×n cost matrix that can exhaust available
memory. For example, with 256,784 commits (n = 513,568), the matrix
would require approximately 256GB of memory (513,568² × 4 bytes),
causing either immediate segmentation faults due to integer overflow or
system hangs.

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
  git range-diff --max-memory=1G branch1...branch2
  git range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
---
 builtin/log.c        |  1 +
 builtin/range-diff.c | 21 +++++++++++++++++++++
 log-tree.c           |  1 +
 range-diff.c         | 20 ++++++++++++++++----
 range-diff.h         |  5 +++++
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index c2f8bbf86301..5f552d14c0fe 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1404,6 +1404,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = rev->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts,
 			.other_arg = &other_arg
 		};
diff --git a/builtin/range-diff.c b/builtin/range-diff.c
index 283583a80d0b..79956d83e400 100644
--- a/builtin/range-diff.c
+++ b/builtin/range-diff.c
@@ -6,6 +6,7 @@
 #include "parse-options.h"
 #include "range-diff.h"
 #include "config.h"
+#include "parse.h"
 
 
 static const char * const builtin_range_diff_usage[] = {
@@ -15,6 +16,21 @@ N_("git range-diff [<options>] <base> <old-tip> <new-tip>"),
 NULL
 };
 
+static int parse_max_memory(const struct option *opt, const char *arg, int unset)
+{
+	size_t *max_memory = opt->value;
+	uintmax_t val;
+
+	if (unset)
+		return 0;
+
+	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
+		return error(_("invalid max-memory value: %s"), arg);
+
+	*max_memory = (size_t)val;
+	return 0;
+}
+
 int cmd_range_diff(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -25,6 +41,7 @@ int cmd_range_diff(int argc,
 	struct strvec diff_merges_arg = STRVEC_INIT;
 	struct range_diff_options range_diff_opts = {
 		.creation_factor = RANGE_DIFF_CREATION_FACTOR_DEFAULT,
+		.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 		.diffopt = &diffopt,
 		.other_arg = &other_arg
 	};
@@ -37,6 +54,10 @@ int cmd_range_diff(int argc,
 				  N_("style"), N_("passed to 'git log'"), 0),
 		OPT_BOOL(0, "left-only", &left_only,
 			 N_("only emit output related to the first range")),
+		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
+			     N_("size"),
+			     N_("maximum memory for cost matrix (default 4G)"),
+			     parse_max_memory),
 		OPT_BOOL(0, "no-dual-color", &simple_color,
 			    N_("use simple diff colors")),
 		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
diff --git a/log-tree.c b/log-tree.c
index 233bf9f227c6..73d21f71764e 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -717,6 +717,7 @@ static void show_diff_of_diff(struct rev_info *opt)
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = opt->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts
 		};
 
diff --git a/range-diff.c b/range-diff.c
index 8a2dcbee322e..e31f71c73d20 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -325,13 +325,24 @@ static int diffsize(const char *a, const char *b)
 }
 
 static void get_correspondences(struct string_list *a, struct string_list *b,
-				int creation_factor)
+				int creation_factor, size_t max_memory)
 {
 	int n = a->nr + b->nr;
 	int *cost, c, *a2b, *b2a;
 	int i, j;
-
-	ALLOC_ARRAY(cost, st_mult(n, n));
+	size_t cost_size = st_mult(n, n);
+	size_t cost_bytes = st_mult(sizeof(int), cost_size);
+	if (cost_bytes >= max_memory) {
+		struct strbuf cost_str = STRBUF_INIT;
+		struct strbuf max_str = STRBUF_INIT;
+		strbuf_humanise_bytes(&cost_str, cost_bytes);
+		strbuf_humanise_bytes(&max_str, max_memory);
+		die(_("range-diff: unable to compute the range-diff, since it "
+		      "exceeds the maximum memory for the cost matrix: %s "
+		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
+		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
+	}
+	ALLOC_ARRAY(cost, cost_size);
 	ALLOC_ARRAY(a2b, n);
 	ALLOC_ARRAY(b2a, n);
 
@@ -591,7 +602,8 @@ int show_range_diff(const char *range1, const char *range2,
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2,
-				    range_diff_opts->creation_factor);
+				    range_diff_opts->creation_factor,
+				    range_diff_opts->max_memory);
 		output(&branch1, &branch2, range_diff_opts);
 	}
 
diff --git a/range-diff.h b/range-diff.h
index cd85000b5a0d..9d39818e349c 100644
--- a/range-diff.h
+++ b/range-diff.h
@@ -5,6 +5,10 @@
 #include "strvec.h"
 
 #define RANGE_DIFF_CREATION_FACTOR_DEFAULT 60
+#define RANGE_DIFF_MAX_MEMORY_DEFAULT \
+	(sizeof(void*) >= 8 ? \
+		((size_t)(1024L * 1024L) * (size_t)(4L * 1024L)) : /* 4GB on 64-bit */ \
+		((size_t)(1024L * 1024L) * (size_t)(2L * 1024L)))   /* 2GB on 32-bit */
 
 /*
  * A much higher value than the default, when we KNOW we are comparing
@@ -17,6 +21,7 @@ struct range_diff_options {
 	unsigned dual_color:1;
 	unsigned left_only:1, right_only:1;
 	unsigned include_merges:1;
+	size_t max_memory;
 	const struct diff_options *diffopt; /* may be NULL */
 	const struct strvec *other_arg; /* may be NULL */
 };
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] range-diff: reorder options lexicographically
  2025-08-28  8:38   ` [PATCH v2 1/2] range-diff: reorder options lexicographically pcasaretto via GitGitGadget
@ 2025-08-28 15:21     ` Junio C Hamano
  2025-08-28 17:12       ` Elijah Newren
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-08-28 15:21 UTC (permalink / raw)
  To: pcasaretto via GitGitGadget; +Cc: git, Paulo Casaretto, pcasaretto

"pcasaretto via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: pcasaretto <paulo.casaretto@shopify.com>
>
> Reorder the command-line options in builtin/range-diff.c to be in
> lexicographic order for better organization and readability. This is
> a preparatory cleanup with no functional changes.
>
> Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
> ---
>  builtin/range-diff.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

Thanks for splitting this out into its own commit.

I am not sure if "lexicographic order" fits well in the context of
"git cmd -h" that spews out many many options, shown with related
options together in groups.  I find it aggressively annoying to show
left/right-only far apart.  A user unfamiliar with the command would
look at the list, find "left-only" sitting in the list alone, and
waste time and break concentration wondering what in the first range
is so special to deserve such an option, until they see "right-only"
further down to realize that they are symmetric.

I'd rather not to see this "lexicographic" change done, but others
may have better justification (note: "for better organization and
readability" I just disagreed is a good justification) that may make
me change my mind.

What I would change, if there is something suboptimal in the current
output from "git range-diff -h" that deserves improvement, is the
lack of the grouping header before the options for range-diff
operation (i.e. creation-factor to left/right-only, before the next
"diff output" group begins).

Thanks.

> diff --git a/builtin/range-diff.c b/builtin/range-diff.c
> index a563abff5fee..283583a80d0b 100644
> --- a/builtin/range-diff.c
> +++ b/builtin/range-diff.c
> @@ -33,17 +33,17 @@ int cmd_range_diff(int argc,
>  		OPT_INTEGER(0, "creation-factor",
>  			    &range_diff_opts.creation_factor,
>  			    N_("percentage by which creation is weighted")),
> +		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> +				  N_("style"), N_("passed to 'git log'"), 0),
> +		OPT_BOOL(0, "left-only", &left_only,
> +			 N_("only emit output related to the first range")),
>  		OPT_BOOL(0, "no-dual-color", &simple_color,
>  			    N_("use simple diff colors")),
>  		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
>  				  N_("notes"), N_("passed to 'git log'"),
>  				  PARSE_OPT_OPTARG),
> -		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> -				  N_("style"), N_("passed to 'git log'"), 0),
>  		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
>  				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
> -		OPT_BOOL(0, "left-only", &left_only,
> -			 N_("only emit output related to the first range")),
>  		OPT_BOOL(0, "right-only", &right_only,
>  			 N_("only emit output related to the second range")),
>  		OPT_END()

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix
  2025-08-28  8:38   ` [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix pcasaretto via GitGitGadget
@ 2025-08-28 17:04     ` Elijah Newren
  2025-08-28 21:22       ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Elijah Newren @ 2025-08-28 17:04 UTC (permalink / raw)
  To: pcasaretto via GitGitGadget; +Cc: git, Paulo Casaretto, pcasaretto

On Thu, Aug 28, 2025 at 2:00 AM pcasaretto via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: pcasaretto <paulo.casaretto@shopify.com>
> Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>

The names (and emails) in these should match; I believe the name in
the From field is set by Gitgitgadget based on your profile settings;
see https://github.com/settings/profile and set your name there.

>  static void get_correspondences(struct string_list *a, struct string_list *b,
> -                               int creation_factor)
> +                               int creation_factor, size_t max_memory)
>  {
>         int n = a->nr + b->nr;
>         int *cost, c, *a2b, *b2a;
>         int i, j;
> -
> -       ALLOC_ARRAY(cost, st_mult(n, n));
> +       size_t cost_size = st_mult(n, n);
> +       size_t cost_bytes = st_mult(sizeof(int), cost_size);
> +       if (cost_bytes >= max_memory) {
> +               struct strbuf cost_str = STRBUF_INIT;
> +               struct strbuf max_str = STRBUF_INIT;
> +               strbuf_humanise_bytes(&cost_str, cost_bytes);
> +               strbuf_humanise_bytes(&max_str, max_memory);
> +               die(_("range-diff: unable to compute the range-diff, since it "
> +                     "exceeds the maximum memory for the cost matrix: %s "
> +                     "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),

available?  I'm worried the error message will report in users
checking system memory, claiming they have 14GB available on their
system, and then reporting a "bug".

Perhaps something like:

+                     "(%"PRIuMAX" bytes) needed, limited to %s
(%"PRIuMAX" bytes)"),

?


The rest of the patch looks good to me.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] range-diff: reorder options lexicographically
  2025-08-28 15:21     ` Junio C Hamano
@ 2025-08-28 17:12       ` Elijah Newren
  2025-08-29 10:56         ` Paulo L F Casaretto
  0 siblings, 1 reply; 18+ messages in thread
From: Elijah Newren @ 2025-08-28 17:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: pcasaretto via GitGitGadget, git, Paulo Casaretto, pcasaretto

On Thu, Aug 28, 2025 at 8:24 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "pcasaretto via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: pcasaretto <paulo.casaretto@shopify.com>
> > Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>

Same issue with name here.

> I am not sure if "lexicographic order" fits well in the context of
> "git cmd -h" that spews out many many options, shown with related
> options together in groups.  I find it aggressively annoying to show
> left/right-only far apart.  A user unfamiliar with the command would
> look at the list, find "left-only" sitting in the list alone, and
> waste time and break concentration wondering what in the first range
> is so special to deserve such an option, until they see "right-only"
> further down to realize that they are symmetric.
>
> I'd rather not to see this "lexicographic" change done, but others
> may have better justification (note: "for better organization and
> readability" I just disagreed is a good justification) that may make
> me change my mind.
>
> What I would change, if there is something suboptimal in the current
> output from "git range-diff -h" that deserves improvement, is the
> lack of the grouping header before the options for range-diff
> operation (i.e. creation-factor to left/right-only, before the next
> "diff output" group begins).
>
> Thanks.

I do like lexicographic ordering for unrelated options, but I prefer
options to be grouped by intent/use first, then by lexicographic
ordering.  And here, not only are--left-only & --right-only related as
Junio points out, to me --diff-merges and --remerge-diff are a similar
grouping that belong together.  So, my $0.02 is that I'd lean towards
calling both changes in the patch a reduction in organization rather
than an improvement.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix
  2025-08-28 17:04     ` Elijah Newren
@ 2025-08-28 21:22       ` Junio C Hamano
  2025-08-28 21:34         ` Elijah Newren
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-08-28 21:22 UTC (permalink / raw)
  To: Elijah Newren
  Cc: pcasaretto via GitGitGadget, git, Paulo Casaretto, pcasaretto

Elijah Newren <newren@gmail.com> writes:

> <gitgitgadget@gmail.com> wrote:
>>
>> From: pcasaretto <paulo.casaretto@shopify.com>
>> Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
>
> The names (and emails) in these should match; I believe the name in
> the From field is set by Gitgitgadget based on your profile settings;
> see https://github.com/settings/profile and set your name there.
>
>>  static void get_correspondences(struct string_list *a, struct string_list *b,
>> -                               int creation_factor)
>> +                               int creation_factor, size_t max_memory)
>>  {
>>         int n = a->nr + b->nr;
>>         int *cost, c, *a2b, *b2a;
>>         int i, j;
>> -
>> -       ALLOC_ARRAY(cost, st_mult(n, n));
>> +       size_t cost_size = st_mult(n, n);
>> +       size_t cost_bytes = st_mult(sizeof(int), cost_size);
>> +       if (cost_bytes >= max_memory) {
>> +               struct strbuf cost_str = STRBUF_INIT;
>> +               struct strbuf max_str = STRBUF_INIT;
>> +               strbuf_humanise_bytes(&cost_str, cost_bytes);
>> +               strbuf_humanise_bytes(&max_str, max_memory);
>> +               die(_("range-diff: unable to compute the range-diff, since it "
>> +                     "exceeds the maximum memory for the cost matrix: %s "
>> +                     "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
>
> available?  I'm worried the error message will report in users
> checking system memory, claiming they have 14GB available on their
> system, and then reporting a "bug".
>
> Perhaps something like:
>
> +                     "(%"PRIuMAX" bytes) needed, limited to %s
> (%"PRIuMAX" bytes)"),

Sounds like a good idea.

I am not a huge fan of configuration variables that do not have a
command line option.  Assuming that it is not like you'd be doing
overly huge range-diff that would not fit your memory every day,
shouldn't we start this with a command line option without a
configuration variable to gauge how useful it would be for users
with such a need, and then after it proves useful and we identify a
workflow where a user would be passing this option all the time, add
a configuration to allow it always be in effect (with command line
override still available)?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix
  2025-08-28 21:22       ` Junio C Hamano
@ 2025-08-28 21:34         ` Elijah Newren
  2025-08-28 21:45           ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Elijah Newren @ 2025-08-28 21:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: pcasaretto via GitGitGadget, git, Paulo Casaretto, pcasaretto

On Thu, Aug 28, 2025 at 2:22 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> I am not a huge fan of configuration variables that do not have a
> command line option.  Assuming that it is not like you'd be doing
> overly huge range-diff that would not fit your memory every day,
> shouldn't we start this with a command line option without a
> configuration variable to gauge how useful it would be for users
> with such a need, and then after it proves useful and we identify a
> workflow where a user would be passing this option all the time, add
> a configuration to allow it always be in effect (with command line
> override still available)?

Isn't that what Paulo's patch does?  Maybe I'm just blind, but I've
looked over the patch a couple times and don't see where he's reading
from a configuration variable; am I just missing it?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix
  2025-08-28 21:34         ` Elijah Newren
@ 2025-08-28 21:45           ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-08-28 21:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: pcasaretto via GitGitGadget, git, Paulo Casaretto, pcasaretto

Elijah Newren <newren@gmail.com> writes:

> On Thu, Aug 28, 2025 at 2:22 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> I am not a huge fan of configuration variables that do not have a
>> command line option.  Assuming that it is not like you'd be doing
>> overly huge range-diff that would not fit your memory every day,
>> shouldn't we start this with a command line option without a
>> configuration variable to gauge how useful it would be for users
>> with such a need, and then after it proves useful and we identify a
>> workflow where a user would be passing this option all the time, add
>> a configuration to allow it always be in effect (with command line
>> override still available)?
>
> Isn't that what Paulo's patch does?  Maybe I'm just blind, but I've
> looked over the patch a couple times and don't see where he's reading
> from a configuration variable; am I just missing it?

Ah, I just blindly trusted that the "configurable memory limit" on
the subject line is talking about configuring memory limit with some
mechanism.  Thanks for correcting me.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] range-diff: reorder options lexicographically
  2025-08-28 17:12       ` Elijah Newren
@ 2025-08-29 10:56         ` Paulo L F Casaretto
  2025-08-29 15:15           ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Paulo L F Casaretto @ 2025-08-29 10:56 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, pcasaretto via GitGitGadget, git, pcasaretto

Yes, I concur. I noticed these were "out of order" when I added the
new flag but now it's obvious that there was order. I'll remove this
commit.
Regarding the name problem, I've checked and I do have "Paulo
Casaretto" set as my name in my Github public profile.
I fixed my local git config and apparently that fixed it.


On Thu, Aug 28, 2025 at 7:12 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 8:24 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > "pcasaretto via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > From: pcasaretto <paulo.casaretto@shopify.com>
> > > Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
>
> Same issue with name here.
>
> > I am not sure if "lexicographic order" fits well in the context of
> > "git cmd -h" that spews out many many options, shown with related
> > options together in groups.  I find it aggressively annoying to show
> > left/right-only far apart.  A user unfamiliar with the command would
> > look at the list, find "left-only" sitting in the list alone, and
> > waste time and break concentration wondering what in the first range
> > is so special to deserve such an option, until they see "right-only"
> > further down to realize that they are symmetric.
> >
> > I'd rather not to see this "lexicographic" change done, but others
> > may have better justification (note: "for better organization and
> > readability" I just disagreed is a good justification) that may make
> > me change my mind.
> >
> > What I would change, if there is something suboptimal in the current
> > output from "git range-diff -h" that deserves improvement, is the
> > lack of the grouping header before the options for range-diff
> > operation (i.e. creation-factor to left/right-only, before the next
> > "diff output" group begins).
> >
> > Thanks.
>
> I do like lexicographic ordering for unrelated options, but I prefer
> options to be grouped by intent/use first, then by lexicographic
> ordering.  And here, not only are--left-only & --right-only related as
> Junio points out, to me --diff-merges and --remerge-diff are a similar
> grouping that belong together.  So, my $0.02 is that I'd lean towards
> calling both changes in the patch a reduction in organization rather
> than an improvement.



-- 
Paulo L F Casaretto

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] range-diff: add configurable memory limit for cost matrix
  2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
  2025-08-28  8:38   ` [PATCH v2 1/2] range-diff: reorder options lexicographically pcasaretto via GitGitGadget
  2025-08-28  8:38   ` [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix pcasaretto via GitGitGadget
@ 2025-08-29 11:00   ` Paulo Casaretto via GitGitGadget
  2025-08-29 15:21     ` Elijah Newren
                       ` (2 more replies)
  2 siblings, 3 replies; 18+ messages in thread
From: Paulo Casaretto via GitGitGadget @ 2025-08-29 11:00 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Paulo Casaretto, Paulo Casaretto

From: Paulo Casaretto <pcasaretto@gmail.com>

When comparing large commit ranges (e.g., 250,000+ commits), range-diff
attempts to allocate an n×n cost matrix that can exhaust available
memory. For example, with 256,784 commits (n = 513,568), the matrix
would require approximately 256GB of memory (513,568² × 4 bytes),
causing either immediate segmentation faults due to integer overflow or
system hangs.

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
  git range-diff --max-memory=1G branch1...branch2
  git range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Paulo Casaretto <pcasaretto@gmail.com>
---
    range-diff: add configurable memory limit for cost matrix
    
    
    Problem Description
    ===================
    
    When git range-diff is given extremely large ranges, it can result in
    either:
    
     1. Segmentation fault due to integer overflow in array index
        calculations
     2. Excessive memory consumption leading to system hangs or OOM kills
     3. Poor user experience with the command appearing to hang for minutes
    
    
    Reproduction Case
    =================
    
    In a Shopify's large monorepo a range-diff command like this crashes
    after several minutes with a SIGBUS error
    
    $ git range-diff 4430f36511..316c1276c6 cb5240b6a8..2bbd292091
    
    
    Range statistics:
    
     * First range: 256,783 commits
     * Second range: 1 commit
     * Total: 256,784 commits
     * Memory required for cost matrix: n² × 4 bytes = ~260GB
    
    
    Stack Trace (Segmentation Fault)
    ================================
    
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e000ae3b8)
      * frame #0: 0x000000010029a284 git`get_correspondences(a=0x000000016fde6188, b=0x000000016fde6160, creation_factor=60) at range-diff.c:356:20
        frame #1: 0x0000000100299310 git`show_range_diff(range1="4430f36511cbacf5c517c6851e2b8508a72dfd30..316c1276c63f55ad9413fa18bf3b6483564a9cf4", range2="cb5240b6a8ba59b4a1f282559ee0742721b0cafc..2bbd292091e376d177ce62e264dae8872ca6be5a", range_diff_opts=0x000000016fde6308) at range-diff.c:593:3
        frame #2: 0x00000001000c719c git`cmd_range_diff(argc=2, argv=0x0000600000bcd8c0, prefix="areas/core/shopify/", repo=0x0000000100468b20) at range-diff.c:167:8
        frame #3: 0x000000010000277c git`run_builtin(p=0x00000001004408d8, argc=3, argv=0x0000600000bcd8c0, repo=0x0000000100468b20) at git.c:480:11
        frame #4: 0x0000000100001020 git`handle_builtin(args=0x000000016fde6c70) at git.c:746:9
        frame #5: 0x0000000100002074 git`run_argv(args=0x000000016fde6c70) at git.c:813:4
        frame #6: 0x0000000100000d3c git`cmd_main(argc=3, argv=0x000000016fde7350) at git.c:953:19
        frame #7: 0x000000010012750c git`main(argc=4, argv=0x000000016fde7348) at common-main.c:9:11
        frame #8: 0x000000018573ab98 dyld`start + 6076
    
    
    
    Root Cause Analysis
    ===================
    
    The crash occurs in get_correspondences() at line 356:
    
    static void get_correspondences(struct string_list *a, struct string_list *b, ...)
    {
        int n = a->nr + b->nr;  // Integer overflow: 256,784 fits in int
        ...
        ALLOC_ARRAY(cost, st_mult(n, n));  // Would allocate ~260GB
        ...
        cost[i + n * j] = c;  // Line 356: Invalid memory access
    }
    
    
    Problems:
    
     1. Integer overflow: While n=256,784 fits in an int, n*n overflows
     2. Memory allocation: Even with proper types, allocating 260GB is
        impractical
    
    
    Solution
    ========
    
    Add a memory limit check in get_correspondences() before allocating the
    cost matrix. This check uses the total size in bytes (n² × sizeof(int))
    and compares it against a configurable maximum, preventing both
    excessive memory usage and integer overflow issues.
    
    The limit is configurable via a new --max-memory option that accepts
    human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
    systems and 2GB for 32 bit systems. This allows comparing ranges of
    approximately 32,000 (16,000) commits - generous for real-world use
    cases while preventing impractical operations.
    
    When the limit is exceeded, range-diff now displays a clear error
    message showing both the requested memory size and the maximum allowed,
    formatted in human-readable units for better user experience.
    
    Example usage: git range-diff --max-memory=1G branch1...branch2 git
    range-diff --max-memory=500M base..topic1 base..topic2
    
    This approach was chosen over alternatives:
    
     * Pre-counting commits: Would require spawning additional git processes
       and reading all commits twice
     * Limiting by commit count: Less precise than actual memory usage
     * Streaming approach: Would require significant refactoring of the
       current algorithm
    
    This issue was previously discussed in:
    https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
    
    Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
    [https://github.com/gitgitgadget/git/pull/1958#issuecomment-3224133138]

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1958%2Fpcasaretto%2Frange-diff-size-limit-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1958/pcasaretto/range-diff-size-limit-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1958

Range-diff vs v2:

 1:  ec5dcdf9d0 < -:  ---------- range-diff: reorder options lexicographically
 2:  c81f920fee ! 1:  6d7ff43e56 range-diff: add configurable memory limit for cost matrix
     @@
       ## Metadata ##
     -Author: pcasaretto <paulo.casaretto@shopify.com>
     +Author: Paulo Casaretto <pcasaretto@gmail.com>
      
       ## Commit message ##
          range-diff: add configurable memory limit for cost matrix
     @@ Commit message
          https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
      
          Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
     -    Signed-off-by: Paulo Casaretto <paulo.casaretto@shopify.com>
     +    Signed-off-by: Paulo Casaretto <pcasaretto@gmail.com>
      
       ## builtin/log.c ##
      @@ builtin/log.c: static void make_cover_letter(struct rev_info *rev, int use_separate_file,
     @@ builtin/range-diff.c: int cmd_range_diff(int argc,
       		.other_arg = &other_arg
       	};
      @@ builtin/range-diff.c: int cmd_range_diff(int argc,
     + 				  PARSE_OPT_OPTARG),
     + 		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
       				  N_("style"), N_("passed to 'git log'"), 0),
     - 		OPT_BOOL(0, "left-only", &left_only,
     - 			 N_("only emit output related to the first range")),
      +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
      +			     N_("size"),
      +			     N_("maximum memory for cost matrix (default 4G)"),
      +			     parse_max_memory),
     - 		OPT_BOOL(0, "no-dual-color", &simple_color,
     - 			    N_("use simple diff colors")),
     - 		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
     + 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
     + 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
     + 		OPT_BOOL(0, "left-only", &left_only,
      
       ## log-tree.c ##
      @@ log-tree.c: static void show_diff_of_diff(struct rev_info *opt)


 builtin/log.c        |  1 +
 builtin/range-diff.c | 21 +++++++++++++++++++++
 log-tree.c           |  1 +
 range-diff.c         | 20 ++++++++++++++++----
 range-diff.h         |  5 +++++
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index c2f8bbf863..5f552d14c0 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1404,6 +1404,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = rev->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts,
 			.other_arg = &other_arg
 		};
diff --git a/builtin/range-diff.c b/builtin/range-diff.c
index a563abff5f..aafcc99b96 100644
--- a/builtin/range-diff.c
+++ b/builtin/range-diff.c
@@ -6,6 +6,7 @@
 #include "parse-options.h"
 #include "range-diff.h"
 #include "config.h"
+#include "parse.h"
 
 
 static const char * const builtin_range_diff_usage[] = {
@@ -15,6 +16,21 @@ N_("git range-diff [<options>] <base> <old-tip> <new-tip>"),
 NULL
 };
 
+static int parse_max_memory(const struct option *opt, const char *arg, int unset)
+{
+	size_t *max_memory = opt->value;
+	uintmax_t val;
+
+	if (unset)
+		return 0;
+
+	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
+		return error(_("invalid max-memory value: %s"), arg);
+
+	*max_memory = (size_t)val;
+	return 0;
+}
+
 int cmd_range_diff(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -25,6 +41,7 @@ int cmd_range_diff(int argc,
 	struct strvec diff_merges_arg = STRVEC_INIT;
 	struct range_diff_options range_diff_opts = {
 		.creation_factor = RANGE_DIFF_CREATION_FACTOR_DEFAULT,
+		.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 		.diffopt = &diffopt,
 		.other_arg = &other_arg
 	};
@@ -40,6 +57,10 @@ int cmd_range_diff(int argc,
 				  PARSE_OPT_OPTARG),
 		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
 				  N_("style"), N_("passed to 'git log'"), 0),
+		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
+			     N_("size"),
+			     N_("maximum memory for cost matrix (default 4G)"),
+			     parse_max_memory),
 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
 		OPT_BOOL(0, "left-only", &left_only,
diff --git a/log-tree.c b/log-tree.c
index 233bf9f227..73d21f7176 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -717,6 +717,7 @@ static void show_diff_of_diff(struct rev_info *opt)
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = opt->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts
 		};
 
diff --git a/range-diff.c b/range-diff.c
index 8a2dcbee32..e31f71c73d 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -325,13 +325,24 @@ static int diffsize(const char *a, const char *b)
 }
 
 static void get_correspondences(struct string_list *a, struct string_list *b,
-				int creation_factor)
+				int creation_factor, size_t max_memory)
 {
 	int n = a->nr + b->nr;
 	int *cost, c, *a2b, *b2a;
 	int i, j;
-
-	ALLOC_ARRAY(cost, st_mult(n, n));
+	size_t cost_size = st_mult(n, n);
+	size_t cost_bytes = st_mult(sizeof(int), cost_size);
+	if (cost_bytes >= max_memory) {
+		struct strbuf cost_str = STRBUF_INIT;
+		struct strbuf max_str = STRBUF_INIT;
+		strbuf_humanise_bytes(&cost_str, cost_bytes);
+		strbuf_humanise_bytes(&max_str, max_memory);
+		die(_("range-diff: unable to compute the range-diff, since it "
+		      "exceeds the maximum memory for the cost matrix: %s "
+		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
+		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
+	}
+	ALLOC_ARRAY(cost, cost_size);
 	ALLOC_ARRAY(a2b, n);
 	ALLOC_ARRAY(b2a, n);
 
@@ -591,7 +602,8 @@ int show_range_diff(const char *range1, const char *range2,
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2,
-				    range_diff_opts->creation_factor);
+				    range_diff_opts->creation_factor,
+				    range_diff_opts->max_memory);
 		output(&branch1, &branch2, range_diff_opts);
 	}
 
diff --git a/range-diff.h b/range-diff.h
index cd85000b5a..9d39818e34 100644
--- a/range-diff.h
+++ b/range-diff.h
@@ -5,6 +5,10 @@
 #include "strvec.h"
 
 #define RANGE_DIFF_CREATION_FACTOR_DEFAULT 60
+#define RANGE_DIFF_MAX_MEMORY_DEFAULT \
+	(sizeof(void*) >= 8 ? \
+		((size_t)(1024L * 1024L) * (size_t)(4L * 1024L)) : /* 4GB on 64-bit */ \
+		((size_t)(1024L * 1024L) * (size_t)(2L * 1024L)))   /* 2GB on 32-bit */
 
 /*
  * A much higher value than the default, when we KNOW we are comparing
@@ -17,6 +21,7 @@ struct range_diff_options {
 	unsigned dual_color:1;
 	unsigned left_only:1, right_only:1;
 	unsigned include_merges:1;
+	size_t max_memory;
 	const struct diff_options *diffopt; /* may be NULL */
 	const struct strvec *other_arg; /* may be NULL */
 };

base-commit: 954d33a9757fcfab723a824116902f1eb16e05f7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] range-diff: reorder options lexicographically
  2025-08-29 10:56         ` Paulo L F Casaretto
@ 2025-08-29 15:15           ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-08-29 15:15 UTC (permalink / raw)
  To: Paulo L F Casaretto
  Cc: Elijah Newren, pcasaretto via GitGitGadget, git, pcasaretto

Paulo L F Casaretto <pcasaretto@gmail.com> writes:

> Yes, I concur. I noticed these were "out of order" when I added the
> new flag but now it's obvious that there was order. I'll remove this
> commit.
> Regarding the name problem, I've checked and I do have "Paulo
> Casaretto" set as my name in my Github public profile.
> I fixed my local git config and apparently that fixed it.

Yeah, these in-body From: lines GigGitGadget adds come from the
authorship of the commits you are sending (in other words, what you
see in "git cat-file commit <commit>" for these commits), and your
GitHub profile would not affect it (and you do not want your GitHub
profile name be used---otherwise you cannot send a series that
contains a change written by somebody else without overtaking the
authorship of their commits).

I see v3 posted there; thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] range-diff: add configurable memory limit for cost matrix
  2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
@ 2025-08-29 15:21     ` Elijah Newren
  2025-08-29 16:33       ` Junio C Hamano
  2025-08-29 15:40     ` Junio C Hamano
  2025-08-29 16:02     ` [PATCH v4] " Paulo Casaretto via GitGitGadget
  2 siblings, 1 reply; 18+ messages in thread
From: Elijah Newren @ 2025-08-29 15:21 UTC (permalink / raw)
  To: Paulo Casaretto via GitGitGadget; +Cc: git, Paulo Casaretto

On Fri, Aug 29, 2025 at 4:00 AM Paulo Casaretto via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> -
> -       ALLOC_ARRAY(cost, st_mult(n, n));
> +       size_t cost_size = st_mult(n, n);
> +       size_t cost_bytes = st_mult(sizeof(int), cost_size);
> +       if (cost_bytes >= max_memory) {
> +               struct strbuf cost_str = STRBUF_INIT;
> +               struct strbuf max_str = STRBUF_INIT;
> +               strbuf_humanise_bytes(&cost_str, cost_bytes);
> +               strbuf_humanise_bytes(&max_str, max_memory);
> +               die(_("range-diff: unable to compute the range-diff, since it "
> +                     "exceeds the maximum memory for the cost matrix: %s "
> +                     "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
> +                   cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
> +       }
> +       ALLOC_ARRAY(cost, cost_size);
>         ALLOC_ARRAY(a2b, n);
>         ALLOC_ARRAY(b2a, n);
>

This still has the same wording issue that I commented on in v2:
https://lore.kernel.org/git/CABPp-BEDje5dYZHEyYMN6j_LdR5CqRN1cxc0riRK06qK-OxiTA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] range-diff: add configurable memory limit for cost matrix
  2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
  2025-08-29 15:21     ` Elijah Newren
@ 2025-08-29 15:40     ` Junio C Hamano
  2025-08-29 16:02     ` [PATCH v4] " Paulo Casaretto via GitGitGadget
  2 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-08-29 15:40 UTC (permalink / raw)
  To: Paulo Casaretto via GitGitGadget; +Cc: git, Elijah Newren, Paulo Casaretto

"Paulo Casaretto via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Paulo Casaretto <pcasaretto@gmail.com>
>
> When comparing large commit ranges (e.g., 250,000+ commits), range-diff
> attempts to allocate an n×n cost matrix that can exhaust available
> memory. For example, with 256,784 commits (n = 513,568), the matrix
> would require approximately 256GB of memory (513,568² × 4 bytes),
> causing either immediate segmentation faults due to integer overflow or
> system hangs.
>
> Add a memory limit check in get_correspondences() before allocating the
> cost matrix. This check uses the total size in bytes (n² × sizeof(int))
> and compares it against a configurable maximum, preventing both
> excessive memory usage and integer overflow issues.
>
> The limit is configurable via a new --max-memory option that accepts
> human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
> systems and 2GB for 32 bit systems. This allows comparing ranges of
> approximately 32,000 (16,000) commits - generous for real-world use cases
> while preventing impractical operations.
>
> When the limit is exceeded, range-diff now displays a clear error
> message showing both the requested memory size and the maximum allowed,
> formatted in human-readable units for better user experience.
>
> Example usage:
>   git range-diff --max-memory=1G branch1...branch2
>   git range-diff --max-memory=500M base..topic1 base..topic2
>
> This approach was chosen over alternatives:
> - Pre-counting commits: Would require spawning additional git processes
>   and reading all commits twice
> - Limiting by commit count: Less precise than actual memory usage
> - Streaming approach: Would require significant refactoring of the
>   current algorithm
>
> This issue was previously discussed in:
> https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
>
> Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Paulo Casaretto <pcasaretto@gmail.com>
> ---

Looks good, especially without the reordering existing entries in
the options list.  The authorship information above looks much
better, too.

> @@ -40,6 +57,10 @@ int cmd_range_diff(int argc,
>  				  PARSE_OPT_OPTARG),
>  		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
>  				  N_("style"), N_("passed to 'git log'"), 0),
> +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> +			     N_("size"),
> +			     N_("maximum memory for cost matrix (default 4G)"),
> +			     parse_max_memory),
>  		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
>  				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
>  		OPT_BOOL(0, "left-only", &left_only,

Among existing options (an excerpt from "git range-diff h")

    --[no-]creation-factor <n>
                          percentage by which creation is weighted

    This controls how correspondence between commits on old and new
    branches are computed.

    --no-dual-color       use simple diff colors
    --dual-color          opposite of --no-dual-color

    These control how the findings are shown, by painting the lines
    in distinct colors. 

    --[no-]notes[=<notes>]
                          passed to 'git log'
    --[no-]diff-merges <style>
                          passed to 'git log'
    --[no-]remerge-diff   passed to 'git log'

    These control what text are used to represent each commit and
    participate in comparison and display.

    --[no-]left-only      only emit output related to the first range
    --[no-]right-only     only emit output related to the second range

    These again control how the findings are shown, by omitting some
    commits from the output.

So there is no perfectly logical place to place the new option, but
between diff-merges and remerge-diff somewhat feels a bit odder
choice than other possible places.

Will queue as is.  If some users find the location in the "-h"
output too odd and disturbing, they can later send in a reordering
patch on top, but I would think the chosen location is good enough.

As #leftoverbits we might want to

 * Group range-diff specific options with OPT_GROUP()

 * Instead of having to match the full NxN matrix, perhaps reduce
   the matrix by keeping the most promising M (which is much smaller
   than N) for each N, or something?

but that (especially the latter) is totally outside the scope of
this patch.

Thanks.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4] range-diff: add configurable memory limit for cost matrix
  2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
  2025-08-29 15:21     ` Elijah Newren
  2025-08-29 15:40     ` Junio C Hamano
@ 2025-08-29 16:02     ` Paulo Casaretto via GitGitGadget
  2 siblings, 0 replies; 18+ messages in thread
From: Paulo Casaretto via GitGitGadget @ 2025-08-29 16:02 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Paulo Casaretto, Paulo Casaretto

From: Paulo Casaretto <pcasaretto@gmail.com>

When comparing large commit ranges (e.g., 250,000+ commits), range-diff
attempts to allocate an n×n cost matrix that can exhaust available
memory. For example, with 256,784 commits (n = 513,568), the matrix
would require approximately 256GB of memory (513,568² × 4 bytes),
causing either immediate segmentation faults due to integer overflow or
system hangs.

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
  git range-diff --max-memory=1G branch1...branch2
  git range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Paulo Casaretto <pcasaretto@gmail.com>
---
    range-diff: add configurable memory limit for cost matrix
    
    
    Problem Description
    ===================
    
    When git range-diff is given extremely large ranges, it can result in
    either:
    
     1. Segmentation fault due to integer overflow in array index
        calculations
     2. Excessive memory consumption leading to system hangs or OOM kills
     3. Poor user experience with the command appearing to hang for minutes
    
    
    Reproduction Case
    =================
    
    In a Shopify's large monorepo a range-diff command like this crashes
    after several minutes with a SIGBUS error
    
    $ git range-diff 4430f36511..316c1276c6 cb5240b6a8..2bbd292091
    
    
    Range statistics:
    
     * First range: 256,783 commits
     * Second range: 1 commit
     * Total: 256,784 commits
     * Memory required for cost matrix: n² × 4 bytes = ~260GB
    
    
    Stack Trace (Segmentation Fault)
    ================================
    
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e000ae3b8)
      * frame #0: 0x000000010029a284 git`get_correspondences(a=0x000000016fde6188, b=0x000000016fde6160, creation_factor=60) at range-diff.c:356:20
        frame #1: 0x0000000100299310 git`show_range_diff(range1="4430f36511cbacf5c517c6851e2b8508a72dfd30..316c1276c63f55ad9413fa18bf3b6483564a9cf4", range2="cb5240b6a8ba59b4a1f282559ee0742721b0cafc..2bbd292091e376d177ce62e264dae8872ca6be5a", range_diff_opts=0x000000016fde6308) at range-diff.c:593:3
        frame #2: 0x00000001000c719c git`cmd_range_diff(argc=2, argv=0x0000600000bcd8c0, prefix="areas/core/shopify/", repo=0x0000000100468b20) at range-diff.c:167:8
        frame #3: 0x000000010000277c git`run_builtin(p=0x00000001004408d8, argc=3, argv=0x0000600000bcd8c0, repo=0x0000000100468b20) at git.c:480:11
        frame #4: 0x0000000100001020 git`handle_builtin(args=0x000000016fde6c70) at git.c:746:9
        frame #5: 0x0000000100002074 git`run_argv(args=0x000000016fde6c70) at git.c:813:4
        frame #6: 0x0000000100000d3c git`cmd_main(argc=3, argv=0x000000016fde7350) at git.c:953:19
        frame #7: 0x000000010012750c git`main(argc=4, argv=0x000000016fde7348) at common-main.c:9:11
        frame #8: 0x000000018573ab98 dyld`start + 6076
    
    
    
    Root Cause Analysis
    ===================
    
    The crash occurs in get_correspondences() at line 356:
    
    static void get_correspondences(struct string_list *a, struct string_list *b, ...)
    {
        int n = a->nr + b->nr;  // Integer overflow: 256,784 fits in int
        ...
        ALLOC_ARRAY(cost, st_mult(n, n));  // Would allocate ~260GB
        ...
        cost[i + n * j] = c;  // Line 356: Invalid memory access
    }
    
    
    Problems:
    
     1. Integer overflow: While n=256,784 fits in an int, n*n overflows
     2. Memory allocation: Even with proper types, allocating 260GB is
        impractical
    
    
    Solution
    ========
    
    Add a memory limit check in get_correspondences() before allocating the
    cost matrix. This check uses the total size in bytes (n² × sizeof(int))
    and compares it against a configurable maximum, preventing both
    excessive memory usage and integer overflow issues.
    
    The limit is configurable via a new --max-memory option that accepts
    human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
    systems and 2GB for 32 bit systems. This allows comparing ranges of
    approximately 32,000 (16,000) commits - generous for real-world use
    cases while preventing impractical operations.
    
    When the limit is exceeded, range-diff now displays a clear error
    message showing both the requested memory size and the maximum allowed,
    formatted in human-readable units for better user experience.
    
    Example usage: git range-diff --max-memory=1G branch1...branch2 git
    range-diff --max-memory=500M base..topic1 base..topic2
    
    This approach was chosen over alternatives:
    
     * Pre-counting commits: Would require spawning additional git processes
       and reading all commits twice
     * Limiting by commit count: Less precise than actual memory usage
     * Streaming approach: Would require significant refactoring of the
       current algorithm
    
    This issue was previously discussed in:
    https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com/
    
    Acked-by: Johannes Schindelin johannes.schindelin@gmx.de
    [https://github.com/gitgitgadget/git/pull/1958#issuecomment-3224133138]

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1958%2Fpcasaretto%2Frange-diff-size-limit-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1958/pcasaretto/range-diff-size-limit-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1958

Range-diff vs v3:

 1:  6d7ff43e56 ! 1:  203113ea2a range-diff: add configurable memory limit for cost matrix
     @@ range-diff.c: static int diffsize(const char *a, const char *b)
      +		strbuf_humanise_bytes(&max_str, max_memory);
      +		die(_("range-diff: unable to compute the range-diff, since it "
      +		      "exceeds the maximum memory for the cost matrix: %s "
     -+		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
     ++		      "(%"PRIuMAX" bytes) needed, limited to %s (%"PRIuMAX" bytes)"),
      +		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
      +	}
      +	ALLOC_ARRAY(cost, cost_size);


 builtin/log.c        |  1 +
 builtin/range-diff.c | 21 +++++++++++++++++++++
 log-tree.c           |  1 +
 range-diff.c         | 20 ++++++++++++++++----
 range-diff.h         |  5 +++++
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index c2f8bbf863..5f552d14c0 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1404,6 +1404,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = rev->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts,
 			.other_arg = &other_arg
 		};
diff --git a/builtin/range-diff.c b/builtin/range-diff.c
index a563abff5f..aafcc99b96 100644
--- a/builtin/range-diff.c
+++ b/builtin/range-diff.c
@@ -6,6 +6,7 @@
 #include "parse-options.h"
 #include "range-diff.h"
 #include "config.h"
+#include "parse.h"
 
 
 static const char * const builtin_range_diff_usage[] = {
@@ -15,6 +16,21 @@ N_("git range-diff [<options>] <base> <old-tip> <new-tip>"),
 NULL
 };
 
+static int parse_max_memory(const struct option *opt, const char *arg, int unset)
+{
+	size_t *max_memory = opt->value;
+	uintmax_t val;
+
+	if (unset)
+		return 0;
+
+	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
+		return error(_("invalid max-memory value: %s"), arg);
+
+	*max_memory = (size_t)val;
+	return 0;
+}
+
 int cmd_range_diff(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -25,6 +41,7 @@ int cmd_range_diff(int argc,
 	struct strvec diff_merges_arg = STRVEC_INIT;
 	struct range_diff_options range_diff_opts = {
 		.creation_factor = RANGE_DIFF_CREATION_FACTOR_DEFAULT,
+		.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 		.diffopt = &diffopt,
 		.other_arg = &other_arg
 	};
@@ -40,6 +57,10 @@ int cmd_range_diff(int argc,
 				  PARSE_OPT_OPTARG),
 		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
 				  N_("style"), N_("passed to 'git log'"), 0),
+		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
+			     N_("size"),
+			     N_("maximum memory for cost matrix (default 4G)"),
+			     parse_max_memory),
 		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
 				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
 		OPT_BOOL(0, "left-only", &left_only,
diff --git a/log-tree.c b/log-tree.c
index 233bf9f227..73d21f7176 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -717,6 +717,7 @@ static void show_diff_of_diff(struct rev_info *opt)
 		struct range_diff_options range_diff_opts = {
 			.creation_factor = opt->creation_factor,
 			.dual_color = 1,
+			.max_memory = RANGE_DIFF_MAX_MEMORY_DEFAULT,
 			.diffopt = &opts
 		};
 
diff --git a/range-diff.c b/range-diff.c
index 8a2dcbee32..ca449a0769 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -325,13 +325,24 @@ static int diffsize(const char *a, const char *b)
 }
 
 static void get_correspondences(struct string_list *a, struct string_list *b,
-				int creation_factor)
+				int creation_factor, size_t max_memory)
 {
 	int n = a->nr + b->nr;
 	int *cost, c, *a2b, *b2a;
 	int i, j;
-
-	ALLOC_ARRAY(cost, st_mult(n, n));
+	size_t cost_size = st_mult(n, n);
+	size_t cost_bytes = st_mult(sizeof(int), cost_size);
+	if (cost_bytes >= max_memory) {
+		struct strbuf cost_str = STRBUF_INIT;
+		struct strbuf max_str = STRBUF_INIT;
+		strbuf_humanise_bytes(&cost_str, cost_bytes);
+		strbuf_humanise_bytes(&max_str, max_memory);
+		die(_("range-diff: unable to compute the range-diff, since it "
+		      "exceeds the maximum memory for the cost matrix: %s "
+		      "(%"PRIuMAX" bytes) needed, limited to %s (%"PRIuMAX" bytes)"),
+		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
+	}
+	ALLOC_ARRAY(cost, cost_size);
 	ALLOC_ARRAY(a2b, n);
 	ALLOC_ARRAY(b2a, n);
 
@@ -591,7 +602,8 @@ int show_range_diff(const char *range1, const char *range2,
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2,
-				    range_diff_opts->creation_factor);
+				    range_diff_opts->creation_factor,
+				    range_diff_opts->max_memory);
 		output(&branch1, &branch2, range_diff_opts);
 	}
 
diff --git a/range-diff.h b/range-diff.h
index cd85000b5a..9d39818e34 100644
--- a/range-diff.h
+++ b/range-diff.h
@@ -5,6 +5,10 @@
 #include "strvec.h"
 
 #define RANGE_DIFF_CREATION_FACTOR_DEFAULT 60
+#define RANGE_DIFF_MAX_MEMORY_DEFAULT \
+	(sizeof(void*) >= 8 ? \
+		((size_t)(1024L * 1024L) * (size_t)(4L * 1024L)) : /* 4GB on 64-bit */ \
+		((size_t)(1024L * 1024L) * (size_t)(2L * 1024L)))   /* 2GB on 32-bit */
 
 /*
  * A much higher value than the default, when we KNOW we are comparing
@@ -17,6 +21,7 @@ struct range_diff_options {
 	unsigned dual_color:1;
 	unsigned left_only:1, right_only:1;
 	unsigned include_merges:1;
+	size_t max_memory;
 	const struct diff_options *diffopt; /* may be NULL */
 	const struct strvec *other_arg; /* may be NULL */
 };

base-commit: 954d33a9757fcfab723a824116902f1eb16e05f7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] range-diff: add configurable memory limit for cost matrix
  2025-08-29 15:21     ` Elijah Newren
@ 2025-08-29 16:33       ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-08-29 16:33 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Paulo Casaretto via GitGitGadget, git, Paulo Casaretto

Elijah Newren <newren@gmail.com> writes:

> On Fri, Aug 29, 2025 at 4:00 AM Paulo Casaretto via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> -
>> -       ALLOC_ARRAY(cost, st_mult(n, n));
>> +       size_t cost_size = st_mult(n, n);
>> +       size_t cost_bytes = st_mult(sizeof(int), cost_size);
>> +       if (cost_bytes >= max_memory) {
>> +               struct strbuf cost_str = STRBUF_INIT;
>> +               struct strbuf max_str = STRBUF_INIT;
>> +               strbuf_humanise_bytes(&cost_str, cost_bytes);
>> +               strbuf_humanise_bytes(&max_str, max_memory);
>> +               die(_("range-diff: unable to compute the range-diff, since it "
>> +                     "exceeds the maximum memory for the cost matrix: %s "
>> +                     "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
>> +                   cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
>> +       }
>> +       ALLOC_ARRAY(cost, cost_size);
>>         ALLOC_ARRAY(a2b, n);
>>         ALLOC_ARRAY(b2a, n);
>>
>
> This still has the same wording issue that I commented on in v2:
> https://lore.kernel.org/git/CABPp-BEDje5dYZHEyYMN6j_LdR5CqRN1cxc0riRK06qK-OxiTA@mail.gmail.com/

Right.  I overlooked it, sorry.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-08-29 16:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-26 17:18 [PATCH] range-diff: add configurable memory limit for cost matrix Paulo Casaretto via GitGitGadget
2025-08-26 19:18 ` Junio C Hamano
2025-08-28  8:38 ` [PATCH v2 0/2] " Paulo Casaretto via GitGitGadget
2025-08-28  8:38   ` [PATCH v2 1/2] range-diff: reorder options lexicographically pcasaretto via GitGitGadget
2025-08-28 15:21     ` Junio C Hamano
2025-08-28 17:12       ` Elijah Newren
2025-08-29 10:56         ` Paulo L F Casaretto
2025-08-29 15:15           ` Junio C Hamano
2025-08-28  8:38   ` [PATCH v2 2/2] range-diff: add configurable memory limit for cost matrix pcasaretto via GitGitGadget
2025-08-28 17:04     ` Elijah Newren
2025-08-28 21:22       ` Junio C Hamano
2025-08-28 21:34         ` Elijah Newren
2025-08-28 21:45           ` Junio C Hamano
2025-08-29 11:00   ` [PATCH v3] " Paulo Casaretto via GitGitGadget
2025-08-29 15:21     ` Elijah Newren
2025-08-29 16:33       ` Junio C Hamano
2025-08-29 15:40     ` Junio C Hamano
2025-08-29 16:02     ` [PATCH v4] " Paulo Casaretto via GitGitGadget

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).