Re: [PATCH 1/4] xdiff: reduce size of action arrays

Git development
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Phillip Wood <phillip.wood123@gmail.com>
Cc: git@vger.kernel.org,  Ezekiel Newren <ezekielnewren@gmail.com>
Subject: Re: [PATCH 1/4] xdiff: reduce size of action arrays
Date: Thu, 02 Apr 2026 12:19:24 -0700	[thread overview]
Message-ID: <xmqqy0j5p3ur.fsf@gitster.g> (raw)
In-Reply-To: <447b8c0af1746d61bfa26e7908a784583ab5dc2e.1775141855.git.phillip.wood@dunelm.org.uk> (Phillip Wood's message of "Thu, 2 Apr 2026 15:57:41 +0100")

Phillip Wood <phillip.wood123@gmail.com> writes:

> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> When the myers algorithm is selected the input files are pre-processed
> to remove any common prefix and suffix. Then any lines that appear
> only in one side of the diff are marked as changed and frequently
> occurring lines are marked as changed if they are adjacent to a
> changed line. This step requires a couple of temporary arrays. As as
> the common prefix and suffix have already been removed, the arrays
> only need to be big enough to hold the lines between them, not the
> whole file. Reduce the size of the arrays and adjust the loops that
> use them accordingly while taking care to keep indexing the arrays
> in xdfile_t with absolute line numbers.

"As as"???

> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  xdiff/xprepare.c | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 1f2e8c6b4b9..4bb3a8ef41c 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -273,16 +273,19 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  	uint8_t *action1 = NULL, *action2 = NULL;
>  	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
>  	int ret = 0;
> +	ptrdiff_t off = xdf1->dstart;
> +	ptrdiff_t len1 = xdf1->dend - off + 1;
> +	ptrdiff_t len2 = xdf2->dend - off + 1;
>  
>  	/*
>  	 * Create temporary arrays that will help us decide if
>  	 * changed[i] should remain false, or become true.
>  	 */
> -	if (!XDL_CALLOC_ARRAY(action1, xdf1->nrec + 1)) {
> +	if (!XDL_CALLOC_ARRAY(action1, len1)) {
>  		ret = -1;
>  		goto cleanup;
>  	}
> -	if (!XDL_CALLOC_ARRAY(action2, xdf2->nrec + 1)) {
> +	if (!XDL_CALLOC_ARRAY(action2, len2)) {
>  		ret = -1;
>  		goto cleanup;
>  	}

OK, so we used to allocate for the whole thing, but now we only
allocate for lines starting at dstart.  "off" is the difference
between [i], the index into these action arrays, and the true line
numbers.

> @@ -299,8 +302,8 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  	/*
>  	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>  	 */
> -	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
> -		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
> +	for (i = 0; i < len1; i++) {
> +		size_t mph1 = xdf1->recs[i + off].minimal_perfect_hash;

And we iterate as many times as we have entries in the action array,
but we need to offset the [i] with off when looking at the record.

>  		rcrec = cf->rcrecs[mph1];
>  		nm = rcrec ? rcrec->len2 : 0;
>  		if (nm == 0)
> @@ -311,8 +314,8 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  			action1[i] = INVESTIGATE;
>  	}
>  
> -	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
> -		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
> +	for (i = 0; i < len2; i++) {
> +		size_t mph2 = xdf2->recs[i + off].minimal_perfect_hash;

Likewise.

> @@ -328,37 +331,37 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  	 * false, or become true.
>  	 */
>  	xdf1->nreff = 0;
> -	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
> +	for (i = 0; i < len1; i++) {
>  		if (action1[i] == INVESTIGATE) {
> -			if (!xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))
> +			if (!xdl_clean_mmatch(action1, i, 0, len1 - 1))

Let me think aloud to see if I can follow the logic here.  Looking
at the implementation of the xdl_clean_mmatch() function, it takes
an action array, an offset 'i' into it (starting from dstart), the
beginning 'dstart' and the end 'dend' offsets.  The idea is that an
index derived from 'i' is used to index the action array and the
beginning and the end offsets are used to limit how much far the
access can deviate from 'i'.

Now we stripped the first xdf1->dstart elements from action1[]
array, 'i' in this loop runs from 0 (i.e. one beyond the initial
common section) and len1 (i.e. one before the tail end of the common
section), i.e., everything is consistently shifted down by xdf1->dstart
in this call.  So xdl_clean_mmatch() does not even need to know that
it is fed a shortened action[] array.


>  				action1[i] = KEEP;
>  			else
>  				action1[i] = DISCARD;
>  		}
>  
>  		if (action1[i] == KEEP) {
> -			xdf1->reference_index[xdf1->nreff++] = i;
> +			xdf1->reference_index[xdf1->nreff++] = i + off;
>  			/* changed[i] remains false */
>  		} else if (action1[i] == DISCARD)
> -			xdf1->changed[i] = true;
> +			xdf1->changed[i + off] = true;

But these two arrays are not shrunk, so we need to compensate by the 'off'
offset.

And the remainder is similar but for xdf2 instead of xdf1 above.

>  		else
>  			BUG("Illegal state for action1[i]");
>  	}
>  
>  	xdf2->nreff = 0;
> -	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
> +	for (i = 0; i < len2; i++) {
>  		if (action2[i] == INVESTIGATE) {
> -			if (!xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))
> +			if (!xdl_clean_mmatch(action2, i, 0, len2 - 1))
>  				action2[i] = KEEP;
>  			else
>  				action2[i] = DISCARD;
>  		}
>  
>  		if (action2[i] == KEEP) {
> -			xdf2->reference_index[xdf2->nreff++] = i;
> +			xdf2->reference_index[xdf2->nreff++] = i + off;
>  			/* changed[i] remains false */
>  		} else if (action2[i] == DISCARD)
> -			xdf2->changed[i] = true;
> +			xdf2->changed[i + off] = true;
>  		else
>  			BUG("Illegal state for action2[i]");
>  	}

next prev parent reply	other threads:[~2026-04-02 19:19 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-02 14:57 [PATCH 0/4] xdiff: reduce the size of a couple of arrays Phillip Wood
2026-04-02 14:57 ` [PATCH 1/4] xdiff: reduce size of action arrays Phillip Wood
2026-04-02 19:19   ` Junio C Hamano [this message]
2026-04-02 14:57 ` [PATCH 2/4] xdiff: cleanup xdl_clean_mmatch() Phillip Wood
2026-04-02 19:20   ` Junio C Hamano
2026-04-02 14:57 ` [PATCH 3/4] xprepare: simplify error handling Phillip Wood
2026-04-02 19:24   ` Junio C Hamano
2026-04-02 14:57 ` [PATCH 4/4] xdiff: reduce the size of array Phillip Wood
2026-04-02 19:44   ` Junio C Hamano
2026-05-04 14:06 ` [PATCH v2 0/4] xdiff: reduce the size of a couple of arrays Phillip Wood
2026-05-04 14:06   ` [PATCH v2 1/4] xdiff: reduce size of action arrays Phillip Wood
2026-05-04 14:06   ` [PATCH v2 2/4] xdiff: cleanup xdl_clean_mmatch() Phillip Wood
2026-05-04 14:06   ` [PATCH v2 3/4] xprepare: simplify error handling Phillip Wood
2026-05-04 14:06   ` [PATCH v2 4/4] xdiff: reduce the size of array Phillip Wood

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqy0j5p3ur.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=ezekielnewren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=phillip.wood123@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox