git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Tighten patch header parsing in patch-id
@ 2024-06-21 23:18 Junio C Hamano
  2024-06-21 23:18 ` [PATCH 1/5] t4204: patch-id supports various input format Junio C Hamano
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git; +Cc: Rob Linden

The patch-id command loops over a series of patches, picking up the
origin commit object name (which is on the "patch header" line) and
then computing the patch identifier out of the "patch" (series of
"diff") that follows the "patch header".

The parser is structured in a bit "strange" way.  It repeatedly
calls a single helper function get_one_patchid() that returns when a
patch header is recognised, or skips until the "patch" part begins
and then computes the "patch id" over the "patch" part, until it
sees a patch header.  The caller knows that it gets just the "patch
header" for the first patch with its first call, and the second call
is about computing the patch id for the first patch, whose
originating commit was obtained from the first call, etc.

During the second and subsequent call (i.e. after finding a patch
header which caused the get_one_patchid() to return, calling the
helper again, expecting it to skip the commit log and find the patch
for which we are asked to compute the patch id), we shouldn't look
for the patch header at all.  Otherwise, a line that looks like a
patch header in the log message can easily be mistaken to be the
beginning of a new patch header, as if the current message did not
have any patch text.

This 5-patch series is organized as follows:

 - patch 1 is about setting the baseline.  We need to recognise the
   patch header produced by format-patch, log, and diff-tree --stdin.

 - patch 2 to patch 4 are bit of code restructuring without changing
   the behaviour.

 - patch 5 stops looking for a patch header when we shouldn't, and
   adds tests.

Junio C Hamano (5):
  t4204: patch-id supports various input format
  patch-id: call flush_current_id() only when needed
  patch-id: make get_one_patchid() more extensible
  patch-id: rewrite code that detects the beginning of a patch
  patch-id: tighten code to detect the patch header

 builtin/patch-id.c  | 82 +++++++++++++++++++++++++++++++++------------
 t/t4204-patch-id.sh | 40 ++++++++++++++++++++++
 2 files changed, 101 insertions(+), 21 deletions(-)

-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] t4204: patch-id supports various input format
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
@ 2024-06-21 23:18 ` Junio C Hamano
  2024-06-21 23:18 ` [PATCH 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git

"git patch-id" was first developed to read from "git diff-tree
--stdin -p" output.  Later it was enhanced to read from "git
diff-tree --stdin -p -v", which was the downstream of an early
imitation of "git log" ("git rev-list" run in the upstream of a pipe
to feed the "diff-tree").  These days, we also read from "git
format-patch".

Their output begins slightly differently, but the patch-id computed
over them for the same commit should be the same.  Ensure that we
won't accidentally break this expectation.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4204-patch-id.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index a7fa94ce0a..1627fdda1b 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -114,6 +114,29 @@ test_expect_success 'patch-id supports git-format-patch output' '
 	test "$2" = $(git rev-parse HEAD)
 '
 
+test_expect_success 'patch-id computes the same for various formats' '
+	# This test happens to consider "git log -p -1" output
+	# the canonical input format, so use it as the norm.
+	git log -1 -p same >log-p.output &&
+	git patch-id <log-p.output >expect &&
+
+	# format-patch begins with "From <commit object name>"
+	git format-patch -1 --stdout same >format-patch.output &&
+	git patch-id <format-patch.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -p" begins with "<commit object name>"
+	same=$(git rev-parse same) &&
+	echo $same | git diff-tree --stdin -p >diff-tree.output &&
+	git patch-id <diff-tree.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -v -p" begins with "commit <commit object name>"
+	echo $same | git diff-tree --stdin -p -v >diff-tree-v.output &&
+	git patch-id <diff-tree-v.output >actual &&
+	test_cmp actual expect
+'
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&
-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] patch-id: call flush_current_id() only when needed
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  2024-06-21 23:18 ` [PATCH 1/5] t4204: patch-id supports various input format Junio C Hamano
@ 2024-06-21 23:18 ` Junio C Hamano
  2024-06-21 23:18 ` [PATCH 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git

The caller passes a flag that is used to become no-op when calling
flush_current_id().  Instead of calling something that becomes a
no-op, teach the caller not to call it in the first place.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 3894d2b970..0f262e7a03 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -6,10 +6,9 @@
 #include "hex.h"
 #include "parse-options.h"
 
-static void flush_current_id(int patchlen, struct object_id *id, struct object_id *result)
+static void flush_current_id(struct object_id *id, struct object_id *result)
 {
-	if (patchlen)
-		printf("%s %s\n", oid_to_hex(result), oid_to_hex(id));
+	printf("%s %s\n", oid_to_hex(result), oid_to_hex(id));
 }
 
 static int remove_space(char *line)
@@ -181,7 +180,8 @@ static void generate_id_list(int stable, int verbatim)
 	oidclr(&oid);
 	while (!feof(stdin)) {
 		patchlen = get_one_patchid(&n, &result, &line_buf, stable, verbatim);
-		flush_current_id(patchlen, &oid, &result);
+		if (patchlen)
+			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
 	}
 	strbuf_release(&line_buf);
-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] patch-id: make get_one_patchid() more extensible
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  2024-06-21 23:18 ` [PATCH 1/5] t4204: patch-id supports various input format Junio C Hamano
  2024-06-21 23:18 ` [PATCH 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
@ 2024-06-21 23:18 ` Junio C Hamano
  2024-07-29 12:02   ` Patrick Steinhardt
  2024-06-21 23:18 ` [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git

We pass two independent Boolean flags (i.e. do we want the stable
variant of patch-id?  do we want to hash the stuff verbatim?) into
the function as two separate parameters.  Before adding the third
one and make the interface even wider, let's consolidate them into
a single flag word.

No changes in behaviour.  Just a trivial interface change.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 0f262e7a03..128e0997d8 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -58,9 +58,14 @@ static int scan_hunk_header(const char *p, int *p_before, int *p_after)
 	return 1;
 }
 
+#define GOPID_STABLE   01
+#define GOPID_VERBATIM 02
+
 static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
-			   struct strbuf *line_buf, int stable, int verbatim)
+			   struct strbuf *line_buf, unsigned flags)
 {
+	int stable = flags & GOPID_STABLE;
+	int verbatim = flags & GOPID_VERBATIM;
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
 	int diff_is_binary = 0;
@@ -171,7 +176,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 	return patchlen;
 }
 
-static void generate_id_list(int stable, int verbatim)
+static void generate_id_list(unsigned flags)
 {
 	struct object_id oid, n, result;
 	int patchlen;
@@ -179,7 +184,7 @@ static void generate_id_list(int stable, int verbatim)
 
 	oidclr(&oid);
 	while (!feof(stdin)) {
-		patchlen = get_one_patchid(&n, &result, &line_buf, stable, verbatim);
+		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
 		if (patchlen)
 			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
@@ -218,6 +223,7 @@ int cmd_patch_id(int argc, const char **argv, const char *prefix)
 	/* if nothing is set, default to unstable */
 	struct patch_id_opts config = {0, 0};
 	int opts = 0;
+	unsigned flags = 0;
 	struct option builtin_patch_id_options[] = {
 		OPT_CMDMODE(0, "unstable", &opts,
 		    N_("use the unstable patch-id algorithm"), 1),
@@ -237,7 +243,11 @@ int cmd_patch_id(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, builtin_patch_id_options,
 			     patch_id_usage, 0);
 
-	generate_id_list(opts ? opts > 1 : config.stable,
-			 opts ? opts == 3 : config.verbatim);
+	if (opts ? opts > 1 : config.stable)
+		flags |= GOPID_STABLE;
+	if (opts ? opts == 3 : config.verbatim)
+		flags |= GOPID_VERBATIM;
+	generate_id_list(flags);
+
 	return 0;
 }
-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                   ` (2 preceding siblings ...)
  2024-06-21 23:18 ` [PATCH 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
@ 2024-06-21 23:18 ` Junio C Hamano
  2024-07-29 12:03   ` Patrick Steinhardt
  2024-06-21 23:18 ` [PATCH 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  5 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git

The get_one_patchid() function reads input lines until it finds a
patch header (the line that begins a patch), whose beginning is one
of:

 (1) an "<object name>", which is "git diff-tree --stdin" shows;
 (2) "commit <object name>", which is "git log" shows; or
 (3) "From <object name>",  which is "git log --format=email" gives.

When it finds such a line, it returns to the caller, reporting the
<object name> it found, and the size of the "patch" it processed.

The caller then calls the function again, which then ignores the
commit log message, and then processes the lines in the patch part
until it hits another "beginning of a patch".

The above logic was fairly easy to see until 2bb73ae8 (patch-id: use
starts_with() and skip_prefix(), 2016-05-28) reorganized the code,
which made another logic that has nothing to do with the "where does
the next patch begin?" logic, which came from 2485eab5
(git-patch-id: do not trip over "no newline" markers, 2011-02-17)
that ignores the "\ No newline at the end", rolled into the same
single if() statement.

Let's split it out.  The "\ No newline at the end" marker is part of
the patch, should not appear before we start reading the patch part,
and does not belong to the detection of patch header.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 128e0997d8..a649966f31 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -80,16 +80,19 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		const char *p = line;
 		int len;
 
-		/* Possibly skip over the prefix added by "log" or "format-patch" */
-		if (!skip_prefix(line, "commit ", &p) &&
-		    !skip_prefix(line, "From ", &p) &&
-		    starts_with(line, "\\ ") && 12 < strlen(line)) {
+		/*
+		 * If we see a line that begins with "<object name>",
+		 * "commit <object name>" or "From <object name>", it is
+		 * the beginning of a patch.  Return to the caller, as
+		 * we are done with the one we have been processing.
+		 */
+		if (skip_prefix(line, "commit ", &p))
+			;
+		else if (skip_prefix(line, "From ", &p))
+			;
+		if (!get_oid_hex(p, next_oid)) {
 			if (verbatim)
 				the_hash_algo->update_fn(&ctx, line, strlen(line));
-			continue;
-		}
-
-		if (!get_oid_hex(p, next_oid)) {
 			found_next = 1;
 			break;
 		}
@@ -130,6 +133,16 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 				break;
 		}
 
+		/*
+		 * A hunk about an incomplete line may have this
+		 * marker at the end, which should just be ignored.
+		 */
+		if (starts_with(line, "\\ ") && 12 < strlen(line)) {
+			if (verbatim)
+				the_hash_algo->update_fn(&ctx, line, strlen(line));
+			continue;
+		}
+
 		if (diff_is_binary) {
 			if (starts_with(line, "diff ")) {
 				diff_is_binary = 0;
-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] patch-id: tighten code to detect the patch header
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                   ` (3 preceding siblings ...)
  2024-06-21 23:18 ` [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
@ 2024-06-21 23:18 ` Junio C Hamano
  2024-07-29 12:07   ` Patrick Steinhardt
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  5 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-06-21 23:18 UTC (permalink / raw)
  To: git

The get_one_patchid() function unconditionally takes a line that
matches the patch header (namely, a line that begins with a full
object name, possibly prefixed by "commit" or "From" plus a space)
as the beginning of a patch.  Even when it is *not* looking for one
(namely, when the previous call found the patch header and returned,
and then we are called again to skip the log message and process the
patch whose header was found by the previous invocation).

As a consequence, a line in the commit log message that begins with
one of these patterns can be mistaken to start another patch, with
current message entirely skipped (because we haven't even reached
the patch at all).

Allow the caller to tell us if it called us already and saw the
patch header (in which case we shouldn't be looking for another one,
until we see the "diff" part of the patch; instead we simply should
be skipping these lines as part of the commit log message), and skip
the header processing logic when that is the case.  In the helper
function, it also needs to flip this "are we looking for a header?"
bit, once it finished skipping the commit log message and started
processing the patches, as the patch header of the _next_ message is
the only clue in the input that the current patch is done.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c  | 43 ++++++++++++++++++++++++++++++-------------
 t/t4204-patch-id.sh | 17 +++++++++++++++++
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index a649966f31..0e6aab1ca2 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -60,12 +60,14 @@ static int scan_hunk_header(const char *p, int *p_before, int *p_after)
 
 #define GOPID_STABLE   01
 #define GOPID_VERBATIM 02
+#define GOPID_FIND_HEADER 04
 
 static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 			   struct strbuf *line_buf, unsigned flags)
 {
 	int stable = flags & GOPID_STABLE;
 	int verbatim = flags & GOPID_VERBATIM;
+	int find_header = flags & GOPID_FIND_HEADER;
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
 	int diff_is_binary = 0;
@@ -81,26 +83,39 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		int len;
 
 		/*
-		 * If we see a line that begins with "<object name>",
-		 * "commit <object name>" or "From <object name>", it is
-		 * the beginning of a patch.  Return to the caller, as
-		 * we are done with the one we have been processing.
+		 * The caller hasn't seen us find a patch header and
+		 * return to it, or we have started processing patch
+		 * and may encounter the beginning of the next patch.
 		 */
-		if (skip_prefix(line, "commit ", &p))
-			;
-		else if (skip_prefix(line, "From ", &p))
-			;
-		if (!get_oid_hex(p, next_oid)) {
-			if (verbatim)
-				the_hash_algo->update_fn(&ctx, line, strlen(line));
-			found_next = 1;
-			break;
+		if (find_header) {
+			/*
+			 * If we see a line that begins with "<object name>",
+			 * "commit <object name>" or "From <object name>", it is
+			 * the beginning of a patch.  Return to the caller, as
+			 * we are done with the one we have been processing.
+			 */
+			if (skip_prefix(line, "commit ", &p))
+				;
+			else if (skip_prefix(line, "From ", &p))
+				;
+			if (!get_oid_hex(p, next_oid)) {
+				if (verbatim)
+					the_hash_algo->update_fn(&ctx, line, strlen(line));
+				found_next = 1;
+				break;
+			}
 		}
 
 		/* Ignore commit comments */
 		if (!patchlen && !starts_with(line, "diff "))
 			continue;
 
+		/*
+		 * We are past the commit log message.  Prepare to
+		 * stop at the beginning of the next patch header.
+		 */
+		find_header = 1;
+
 		/* Parsing diff header?  */
 		if (before == -1) {
 			if (starts_with(line, "GIT binary patch") ||
@@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
 	struct strbuf line_buf = STRBUF_INIT;
 
 	oidclr(&oid);
+	flags |= GOPID_FIND_HEADER;
 	while (!feof(stdin)) {
 		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
 		if (patchlen)
 			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
+		flags &= ~GOPID_FIND_HEADER;
 	}
 	strbuf_release(&line_buf);
 }
diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index 1627fdda1b..b1d98d4110 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -137,6 +137,23 @@ test_expect_success 'patch-id computes the same for various formats' '
 	test_cmp actual expect
 '
 
+hash=$(git rev-parse same:)
+for cruft in "$hash" "commit $hash is bad" "From $hash status"
+do
+	test_expect_success "patch-id with <$cruft> in log message" '
+		git format-patch -1 --stdout same >patch-0 &&
+		git patch-id <patch-0 >expect &&
+
+		{
+			sed -e "/^$/q" patch-0 &&
+			printf "random message\n%s\n\n" "$cruft" &&
+			sed -e "1,/^$/d" patch-0
+		} >patch-cruft &&
+		git patch-id <patch-cruft >actual &&
+		test_cmp actual expect
+	'
+done
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&
-- 
2.45.2-786-g49444cbe9a


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] patch-id: make get_one_patchid() more extensible
  2024-06-21 23:18 ` [PATCH 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
@ 2024-07-29 12:02   ` Patrick Steinhardt
  2024-07-29 20:03     ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-29 12:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2351 bytes --]

On Fri, Jun 21, 2024 at 04:18:24PM -0700, Junio C Hamano wrote:
> We pass two independent Boolean flags (i.e. do we want the stable
> variant of patch-id?  do we want to hash the stuff verbatim?) into
> the function as two separate parameters.  Before adding the third
> one and make the interface even wider, let's consolidate them into
> a single flag word.
> 
> No changes in behaviour.  Just a trivial interface change.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  builtin/patch-id.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/builtin/patch-id.c b/builtin/patch-id.c
> index 0f262e7a03..128e0997d8 100644
> --- a/builtin/patch-id.c
> +++ b/builtin/patch-id.c
> @@ -58,9 +58,14 @@ static int scan_hunk_header(const char *p, int *p_before, int *p_after)
>  	return 1;
>  }
>  
> +#define GOPID_STABLE   01
> +#define GOPID_VERBATIM 02
> +

This certainly is a worthwhile change. I have to wonder about code style
though:

  - Using 01 and 02 as constants feels somewhat weird to me. Don't we
    typically use `(1 << 0)` and `(1 << 1)` for such binary flags?

  - What is our preferred style nowadays? Do we prefer defines over
    enums? I rather had the feeling that enums are the go-to style for
    things like this nowadays.

It would also be nice to have documentation for the flags.

In any case, all of these are really just smallish nits and I think that
this is a strict improvement regardless of whether we massage the style
or not.

> @@ -237,7 +243,11 @@ int cmd_patch_id(int argc, const char **argv, const char *prefix)
>  	argc = parse_options(argc, argv, prefix, builtin_patch_id_options,
>  			     patch_id_usage, 0);
>  
> -	generate_id_list(opts ? opts > 1 : config.stable,
> -			 opts ? opts == 3 : config.verbatim);
> +	if (opts ? opts > 1 : config.stable)
> +		flags |= GOPID_STABLE;
> +	if (opts ? opts == 3 : config.verbatim)
> +		flags |= GOPID_VERBATIM;

I was wondering whether we could use `OPT_BIT()` here to set those as
flags directly. I guess that would require a bit more refactoring, but
if we also converted `struct patch_id_opts` to have a `flags` field then
this might overall be easier to read than the weird massaging of opts
that we did before and after your change.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch
  2024-06-21 23:18 ` [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
@ 2024-07-29 12:03   ` Patrick Steinhardt
  0 siblings, 0 replies; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-29 12:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 525 bytes --]

On Fri, Jun 21, 2024 at 04:18:25PM -0700, Junio C Hamano wrote:
> The get_one_patchid() function reads input lines until it finds a
> patch header (the line that begins a patch), whose beginning is one
> of:
> 
>  (1) an "<object name>", which is "git diff-tree --stdin" shows;
>  (2) "commit <object name>", which is "git log" shows; or
>  (3) "From <object name>",  which is "git log --format=email" gives.

All of these items should probably say "which is what ...", where "what"
is what is missing.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] patch-id: tighten code to detect the patch header
  2024-06-21 23:18 ` [PATCH 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
@ 2024-07-29 12:07   ` Patrick Steinhardt
  2024-07-29 20:12     ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-29 12:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 827 bytes --]

On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
>  	struct strbuf line_buf = STRBUF_INIT;
>  
>  	oidclr(&oid);
> +	flags |= GOPID_FIND_HEADER;
>  	while (!feof(stdin)) {
>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
>  		if (patchlen)
>  			flush_current_id(&oid, &result);
>  		oidcpy(&oid, &n);
> +		flags &= ~GOPID_FIND_HEADER;
>  	}

I think I'm missing the obvious. But why don't we have to set
`GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
because we know that `get_one_patchid()` stops once it finds the next
line starting with a commit? Makes me wonder what happens when there is
non-diff garbage between patches for which we are about to generate
patch IDs.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] patch-id: make get_one_patchid() more extensible
  2024-07-29 12:02   ` Patrick Steinhardt
@ 2024-07-29 20:03     ` Junio C Hamano
  0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-29 20:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

>> No changes in behaviour.  Just a trivial interface change.
>> 
>> Signed-off-by: Junio C Hamano <gitster@pobox.com>
>> ...
>> +#define GOPID_STABLE   01
>> +#define GOPID_VERBATIM 02
>> +
>
> This certainly is a worthwhile change. I have to wonder about code style
> though:
>
>   - Using 01 and 02 as constants feels somewhat weird to me. Don't we
>     typically use `(1 << 0)` and `(1 << 1)` for such binary flags?
>
>   - What is our preferred style nowadays? Do we prefer defines over
>     enums? I rather had the feeling that enums are the go-to style for
>     things like this nowadays.
>
> It would also be nice to have documentation for the flags.

For an internal implementation detail that does not even cross file
boundaries with descriptive _STABLE and _VERBATIM that corresponds
to the member names of config structure?  I doubt it.

> In any case, all of these are really just smallish nits and I think that
> this is a strict improvement regardless of whether we massage the style
> or not.
> ...
> I was wondering whether we could use `OPT_BIT()` here to set those as
> flags directly. I guess that would require a bit more refactoring, but
> if we also converted `struct patch_id_opts` to have a `flags` field then
> this might overall be easier to read than the weird massaging of opts
> that we did before and after your change.

As a longer direction, I envision that most of the implementation we
see in this file and what diff.c:diff_get_patch_id() does should be
refactored and one of them should just go.  So until that happens, I
am inclined to keep the changes to this file to an absolute minimum.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] patch-id: tighten code to detect the patch header
  2024-07-29 12:07   ` Patrick Steinhardt
@ 2024-07-29 20:12     ` Junio C Hamano
  2024-07-30  4:55       ` Patrick Steinhardt
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2024-07-29 20:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
>> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
>>  	struct strbuf line_buf = STRBUF_INIT;
>>  
>>  	oidclr(&oid);
>> +	flags |= GOPID_FIND_HEADER;
>>  	while (!feof(stdin)) {
>>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
>>  		if (patchlen)
>>  			flush_current_id(&oid, &result);
>>  		oidcpy(&oid, &n);
>> +		flags &= ~GOPID_FIND_HEADER;
>>  	}
>
> I think I'm missing the obvious. But why don't we have to set
> `GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
> because we know that `get_one_patchid()` stops once it finds the next
> line starting with a commit?

Yup the original control flow is rather convoluted.  The first call
stops when it finds the header that begins the log message part and
returns, but the subsequent calls are to (1) skip the log message
and then (2) parse and hash the diff part, until it finds another
header that begins the log message part of the _next_ patch and
return.  GOPID_FIND_HEADER bit is used to tell the callee when we
haven't found the header (hence we can stop at a line whose
beginning looks like a hash) or the previous round already found the
header and we positively know we are now in the "skip the log
message" phase (hence a line whose beginning looks like a hash is
not a new header).

> Makes me wonder what happens when there is
> non-diff garbage between patches for which we are about to generate
> patch IDs.

"Skip non-diff garbage until we see a patch" is the mechanism used
to skip the log message, so it would be a reasonable thing to skip
such no-diff garbage between patches, no?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 0/5] Tighten patch header parsing in patch-id
  2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                   ` (4 preceding siblings ...)
  2024-06-21 23:18 ` [PATCH 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
@ 2024-07-30  1:17 ` Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 1/5] t4204: patch-id supports various input format Junio C Hamano
                     ` (5 more replies)
  5 siblings, 6 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

    Updates from v1

     - changed flag bits used internally from CPP macro to enum and
       added a bit of comments.

The patch-id command loops over a series of patches, picking up the
origin commit object name (which is on the "patch header" line) and
then computing the patch identifier out of the "patch" (series of
"diff") that follows the "patch header".

The parser is structured in a bit "strange" way.  It repeatedly
calls a single helper function get_one_patchid() that returns when a
patch header is recognised, or skips until the "patch" part begins
and then computes the "patch id" over the "patch" part, until it
sees a patch header.  The caller knows that it gets just the "patch
header" for the first patch with its first call, and the second call
is about computing the patch id for the first patch, whose
originating commit was obtained from the first call, etc.

During the second and subsequent call (i.e. after finding a patch
header which caused the get_one_patchid() to return, calling the
helper again, expecting it to skip the commit log and find the patch
for which we are asked to compute the patch id), we shouldn't look
for the patch header at all.  Otherwise, a line that looks like a
patch header in the log message can easily be mistaken to be the
beginning of a new patch header, as if the current message did not
have any patch text.

This 5-patch series is organized as follows:

 - patch 1 is about setting the baseline.  We need to recognise the
   patch header produced by format-patch, log, and diff-tree --stdin.

 - patch 2 to patch 4 are bit of code restructuring without changing
   the behaviour.

 - patch 5 stops looking for a patch header when we shouldn't, and
   adds tests.


Junio C Hamano (5):
  t4204: patch-id supports various input format
  patch-id: call flush_current_id() only when needed
  patch-id: make get_one_patchid() more extensible
  patch-id: rewrite code that detects the beginning of a patch
  patch-id: tighten code to detect the patch header

 builtin/patch-id.c  | 93 +++++++++++++++++++++++++++++++++++----------
 t/t4204-patch-id.sh | 40 +++++++++++++++++++
 2 files changed, 112 insertions(+), 21 deletions(-)

Range-diff against v1:
1:  c1ff38c0b8 = 1:  e68a30f6c9 t4204: patch-id supports various input format
2:  a201f344a6 = 2:  3afc63b210 patch-id: call flush_current_id() only when needed
3:  237f8910ca ! 3:  22dd5e7a5b patch-id: make get_one_patchid() more extensible
    @@ builtin/patch-id.c: static int scan_hunk_header(const char *p, int *p_before, in
      	return 1;
      }
      
    -+#define GOPID_STABLE   01
    -+#define GOPID_VERBATIM 02
    ++/*
    ++ * flag bits to control get_one_patchid()'s behaviour.
    ++ */
    ++enum {
    ++	GOPID_STABLE = (1<<0),		/* --stable */
    ++	GOPID_VERBATIM = (1<<1),	/* --verbatim */
    ++};
     +
      static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
     -			   struct strbuf *line_buf, int stable, int verbatim)
4:  d6d068c9dc = 4:  0cca1ed513 patch-id: rewrite code that detects the beginning of a patch
5:  51af73722c ! 5:  ef422df7c1 patch-id: tighten code to detect the patch header
    @@ Commit message
      ## builtin/patch-id.c ##
     @@ builtin/patch-id.c: static int scan_hunk_header(const char *p, int *p_before, int *p_after)
      
    - #define GOPID_STABLE   01
    - #define GOPID_VERBATIM 02
    -+#define GOPID_FIND_HEADER 04
    + /*
    +  * flag bits to control get_one_patchid()'s behaviour.
    ++ *
    ++ * STABLE/VERBATIM are given from the command line option as
    ++ * --stable/--verbatim.  FIND_HEADER conveys the internal state
    ++ * maintained by the caller to allow the function to avoid mistaking
    ++ * lines of log message before seeing the "diff" part as the beginning
    ++ * of the next patch.
    +  */
    + enum {
    + 	GOPID_STABLE = (1<<0),		/* --stable */
    + 	GOPID_VERBATIM = (1<<1),	/* --verbatim */
    ++	GOPID_FIND_HEADER = (1<<2),	/* stop at the beginning of patch message */
    + };
      
      static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
    - 			   struct strbuf *line_buf, unsigned flags)
    +@@ builtin/patch-id.c: static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
      {
      	int stable = flags & GOPID_STABLE;
      	int verbatim = flags & GOPID_VERBATIM;
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/5] t4204: patch-id supports various input format
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
@ 2024-07-30  1:17   ` Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

"git patch-id" was first developed to read from "git diff-tree
--stdin -p" output.  Later it was enhanced to read from "git
diff-tree --stdin -p -v", which was the downstream of an early
imitation of "git log" ("git rev-list" run in the upstream of a pipe
to feed the "diff-tree").  These days, we also read from "git
format-patch".

Their output begins slightly differently, but the patch-id computed
over them for the same commit should be the same.  Ensure that we
won't accidentally break this expectation.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4204-patch-id.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index a7fa94ce0a..1627fdda1b 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -114,6 +114,29 @@ test_expect_success 'patch-id supports git-format-patch output' '
 	test "$2" = $(git rev-parse HEAD)
 '
 
+test_expect_success 'patch-id computes the same for various formats' '
+	# This test happens to consider "git log -p -1" output
+	# the canonical input format, so use it as the norm.
+	git log -1 -p same >log-p.output &&
+	git patch-id <log-p.output >expect &&
+
+	# format-patch begins with "From <commit object name>"
+	git format-patch -1 --stdout same >format-patch.output &&
+	git patch-id <format-patch.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -p" begins with "<commit object name>"
+	same=$(git rev-parse same) &&
+	echo $same | git diff-tree --stdin -p >diff-tree.output &&
+	git patch-id <diff-tree.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -v -p" begins with "commit <commit object name>"
+	echo $same | git diff-tree --stdin -p -v >diff-tree-v.output &&
+	git patch-id <diff-tree-v.output >actual &&
+	test_cmp actual expect
+'
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/5] patch-id: call flush_current_id() only when needed
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 1/5] t4204: patch-id supports various input format Junio C Hamano
@ 2024-07-30  1:17   ` Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

The caller passes a flag that is used to become no-op when calling
flush_current_id().  Instead of calling something that becomes a
no-op, teach the caller not to call it in the first place.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 3894d2b970..0f262e7a03 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -6,10 +6,9 @@
 #include "hex.h"
 #include "parse-options.h"
 
-static void flush_current_id(int patchlen, struct object_id *id, struct object_id *result)
+static void flush_current_id(struct object_id *id, struct object_id *result)
 {
-	if (patchlen)
-		printf("%s %s\n", oid_to_hex(result), oid_to_hex(id));
+	printf("%s %s\n", oid_to_hex(result), oid_to_hex(id));
 }
 
 static int remove_space(char *line)
@@ -181,7 +180,8 @@ static void generate_id_list(int stable, int verbatim)
 	oidclr(&oid);
 	while (!feof(stdin)) {
 		patchlen = get_one_patchid(&n, &result, &line_buf, stable, verbatim);
-		flush_current_id(patchlen, &oid, &result);
+		if (patchlen)
+			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
 	}
 	strbuf_release(&line_buf);
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/5] patch-id: make get_one_patchid() more extensible
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 1/5] t4204: patch-id supports various input format Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
@ 2024-07-30  1:17   ` Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

We pass two independent Boolean flags (i.e. do we want the stable
variant of patch-id?  do we want to hash the stuff verbatim?) into
the function as two separate parameters.  Before adding the third
one and make the interface even wider, let's consolidate them into
a single flag word.

No changes in behaviour.  Just a trivial interface change.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 0f262e7a03..1d3b7ff12b 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -58,9 +58,19 @@ static int scan_hunk_header(const char *p, int *p_before, int *p_after)
 	return 1;
 }
 
+/*
+ * flag bits to control get_one_patchid()'s behaviour.
+ */
+enum {
+	GOPID_STABLE = (1<<0),		/* --stable */
+	GOPID_VERBATIM = (1<<1),	/* --verbatim */
+};
+
 static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
-			   struct strbuf *line_buf, int stable, int verbatim)
+			   struct strbuf *line_buf, unsigned flags)
 {
+	int stable = flags & GOPID_STABLE;
+	int verbatim = flags & GOPID_VERBATIM;
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
 	int diff_is_binary = 0;
@@ -171,7 +181,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 	return patchlen;
 }
 
-static void generate_id_list(int stable, int verbatim)
+static void generate_id_list(unsigned flags)
 {
 	struct object_id oid, n, result;
 	int patchlen;
@@ -179,7 +189,7 @@ static void generate_id_list(int stable, int verbatim)
 
 	oidclr(&oid);
 	while (!feof(stdin)) {
-		patchlen = get_one_patchid(&n, &result, &line_buf, stable, verbatim);
+		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
 		if (patchlen)
 			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
@@ -218,6 +228,7 @@ int cmd_patch_id(int argc, const char **argv, const char *prefix)
 	/* if nothing is set, default to unstable */
 	struct patch_id_opts config = {0, 0};
 	int opts = 0;
+	unsigned flags = 0;
 	struct option builtin_patch_id_options[] = {
 		OPT_CMDMODE(0, "unstable", &opts,
 		    N_("use the unstable patch-id algorithm"), 1),
@@ -237,7 +248,11 @@ int cmd_patch_id(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, builtin_patch_id_options,
 			     patch_id_usage, 0);
 
-	generate_id_list(opts ? opts > 1 : config.stable,
-			 opts ? opts == 3 : config.verbatim);
+	if (opts ? opts > 1 : config.stable)
+		flags |= GOPID_STABLE;
+	if (opts ? opts == 3 : config.verbatim)
+		flags |= GOPID_VERBATIM;
+	generate_id_list(flags);
+
 	return 0;
 }
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/5] patch-id: rewrite code that detects the beginning of a patch
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                     ` (2 preceding siblings ...)
  2024-07-30  1:17   ` [PATCH v2 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
@ 2024-07-30  1:17   ` Junio C Hamano
  2024-07-30  1:17   ` [PATCH v2 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
  2024-07-30  5:12   ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Patrick Steinhardt
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

The get_one_patchid() function reads input lines until it finds a
patch header (the line that begins a patch), whose beginning is one
of:

 (1) an "<object name>", which is what "git diff-tree --stdin" shows;
 (2) "commit <object name>", which is what "git log" shows; or
 (3) "From <object name>",  which is what "git log --format=email" shows.

When it finds such a line, it returns to the caller, reporting the
<object name> it found, and the size of the "patch" it processed.

The caller then calls the function again, which then ignores the
commit log message, and then processes the lines in the patch part
until it hits another "beginning of a patch".

The above logic was fairly easy to see until 2bb73ae8 (patch-id: use
starts_with() and skip_prefix(), 2016-05-28) reorganized the code,
which made another logic that has nothing to do with the "where does
the next patch begin?" logic, which came from 2485eab5
(git-patch-id: do not trip over "no newline" markers, 2011-02-17)
that ignores the "\ No newline at the end", rolled into the same
single if() statement.

Let's split it out.  The "\ No newline at the end" marker is part of
the patch, should not appear before we start reading the patch part,
and does not belong to the detection of patch header.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 1d3b7ff12b..a2fdc0505d 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -85,16 +85,19 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		const char *p = line;
 		int len;
 
-		/* Possibly skip over the prefix added by "log" or "format-patch" */
-		if (!skip_prefix(line, "commit ", &p) &&
-		    !skip_prefix(line, "From ", &p) &&
-		    starts_with(line, "\\ ") && 12 < strlen(line)) {
+		/*
+		 * If we see a line that begins with "<object name>",
+		 * "commit <object name>" or "From <object name>", it is
+		 * the beginning of a patch.  Return to the caller, as
+		 * we are done with the one we have been processing.
+		 */
+		if (skip_prefix(line, "commit ", &p))
+			;
+		else if (skip_prefix(line, "From ", &p))
+			;
+		if (!get_oid_hex(p, next_oid)) {
 			if (verbatim)
 				the_hash_algo->update_fn(&ctx, line, strlen(line));
-			continue;
-		}
-
-		if (!get_oid_hex(p, next_oid)) {
 			found_next = 1;
 			break;
 		}
@@ -135,6 +138,16 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 				break;
 		}
 
+		/*
+		 * A hunk about an incomplete line may have this
+		 * marker at the end, which should just be ignored.
+		 */
+		if (starts_with(line, "\\ ") && 12 < strlen(line)) {
+			if (verbatim)
+				the_hash_algo->update_fn(&ctx, line, strlen(line));
+			continue;
+		}
+
 		if (diff_is_binary) {
 			if (starts_with(line, "diff ")) {
 				diff_is_binary = 0;
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/5] patch-id: tighten code to detect the patch header
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                     ` (3 preceding siblings ...)
  2024-07-30  1:17   ` [PATCH v2 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
@ 2024-07-30  1:17   ` Junio C Hamano
  2024-07-30  5:12   ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Patrick Steinhardt
  5 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2024-07-30  1:17 UTC (permalink / raw)
  To: git

The get_one_patchid() function unconditionally takes a line that
matches the patch header (namely, a line that begins with a full
object name, possibly prefixed by "commit" or "From" plus a space)
as the beginning of a patch.  Even when it is *not* looking for one
(namely, when the previous call found the patch header and returned,
and then we are called again to skip the log message and process the
patch whose header was found by the previous invocation).

As a consequence, a line in the commit log message that begins with
one of these patterns can be mistaken to start another patch, with
current message entirely skipped (because we haven't even reached
the patch at all).

Allow the caller to tell us if it called us already and saw the
patch header (in which case we shouldn't be looking for another one,
until we see the "diff" part of the patch; instead we simply should
be skipping these lines as part of the commit log message), and skip
the header processing logic when that is the case.  In the helper
function, it also needs to flip this "are we looking for a header?"
bit, once it finished skipping the commit log message and started
processing the patches, as the patch header of the _next_ message is
the only clue in the input that the current patch is done.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c  | 49 +++++++++++++++++++++++++++++++++------------
 t/t4204-patch-id.sh | 17 ++++++++++++++++
 2 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index a2fdc0505d..f2e8feb6d3 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -60,10 +60,17 @@ static int scan_hunk_header(const char *p, int *p_before, int *p_after)
 
 /*
  * flag bits to control get_one_patchid()'s behaviour.
+ *
+ * STABLE/VERBATIM are given from the command line option as
+ * --stable/--verbatim.  FIND_HEADER conveys the internal state
+ * maintained by the caller to allow the function to avoid mistaking
+ * lines of log message before seeing the "diff" part as the beginning
+ * of the next patch.
  */
 enum {
 	GOPID_STABLE = (1<<0),		/* --stable */
 	GOPID_VERBATIM = (1<<1),	/* --verbatim */
+	GOPID_FIND_HEADER = (1<<2),	/* stop at the beginning of patch message */
 };
 
 static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
@@ -71,6 +78,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 {
 	int stable = flags & GOPID_STABLE;
 	int verbatim = flags & GOPID_VERBATIM;
+	int find_header = flags & GOPID_FIND_HEADER;
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
 	int diff_is_binary = 0;
@@ -86,26 +94,39 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		int len;
 
 		/*
-		 * If we see a line that begins with "<object name>",
-		 * "commit <object name>" or "From <object name>", it is
-		 * the beginning of a patch.  Return to the caller, as
-		 * we are done with the one we have been processing.
+		 * The caller hasn't seen us find a patch header and
+		 * return to it, or we have started processing patch
+		 * and may encounter the beginning of the next patch.
 		 */
-		if (skip_prefix(line, "commit ", &p))
-			;
-		else if (skip_prefix(line, "From ", &p))
-			;
-		if (!get_oid_hex(p, next_oid)) {
-			if (verbatim)
-				the_hash_algo->update_fn(&ctx, line, strlen(line));
-			found_next = 1;
-			break;
+		if (find_header) {
+			/*
+			 * If we see a line that begins with "<object name>",
+			 * "commit <object name>" or "From <object name>", it is
+			 * the beginning of a patch.  Return to the caller, as
+			 * we are done with the one we have been processing.
+			 */
+			if (skip_prefix(line, "commit ", &p))
+				;
+			else if (skip_prefix(line, "From ", &p))
+				;
+			if (!get_oid_hex(p, next_oid)) {
+				if (verbatim)
+					the_hash_algo->update_fn(&ctx, line, strlen(line));
+				found_next = 1;
+				break;
+			}
 		}
 
 		/* Ignore commit comments */
 		if (!patchlen && !starts_with(line, "diff "))
 			continue;
 
+		/*
+		 * We are past the commit log message.  Prepare to
+		 * stop at the beginning of the next patch header.
+		 */
+		find_header = 1;
+
 		/* Parsing diff header?  */
 		if (before == -1) {
 			if (starts_with(line, "GIT binary patch") ||
@@ -201,11 +222,13 @@ static void generate_id_list(unsigned flags)
 	struct strbuf line_buf = STRBUF_INIT;
 
 	oidclr(&oid);
+	flags |= GOPID_FIND_HEADER;
 	while (!feof(stdin)) {
 		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
 		if (patchlen)
 			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
+		flags &= ~GOPID_FIND_HEADER;
 	}
 	strbuf_release(&line_buf);
 }
diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index 1627fdda1b..b1d98d4110 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -137,6 +137,23 @@ test_expect_success 'patch-id computes the same for various formats' '
 	test_cmp actual expect
 '
 
+hash=$(git rev-parse same:)
+for cruft in "$hash" "commit $hash is bad" "From $hash status"
+do
+	test_expect_success "patch-id with <$cruft> in log message" '
+		git format-patch -1 --stdout same >patch-0 &&
+		git patch-id <patch-0 >expect &&
+
+		{
+			sed -e "/^$/q" patch-0 &&
+			printf "random message\n%s\n\n" "$cruft" &&
+			sed -e "1,/^$/d" patch-0
+		} >patch-cruft &&
+		git patch-id <patch-cruft >actual &&
+		test_cmp actual expect
+	'
+done
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&
-- 
2.46.0-69-gd0749fd195


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] patch-id: tighten code to detect the patch header
  2024-07-29 20:12     ` Junio C Hamano
@ 2024-07-30  4:55       ` Patrick Steinhardt
  2024-07-30  5:12         ` Patrick Steinhardt
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-30  4:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2273 bytes --]

On Mon, Jul 29, 2024 at 01:12:42PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> >> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
> >>  	struct strbuf line_buf = STRBUF_INIT;
> >>  
> >>  	oidclr(&oid);
> >> +	flags |= GOPID_FIND_HEADER;
> >>  	while (!feof(stdin)) {
> >>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
> >>  		if (patchlen)
> >>  			flush_current_id(&oid, &result);
> >>  		oidcpy(&oid, &n);
> >> +		flags &= ~GOPID_FIND_HEADER;
> >>  	}
> >
> > I think I'm missing the obvious. But why don't we have to set
> > `GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
> > because we know that `get_one_patchid()` stops once it finds the next
> > line starting with a commit?
> 
> Yup the original control flow is rather convoluted.  The first call
> stops when it finds the header that begins the log message part and
> returns, but the subsequent calls are to (1) skip the log message
> and then (2) parse and hash the diff part, until it finds another
> header that begins the log message part of the _next_ patch and
> return.  GOPID_FIND_HEADER bit is used to tell the callee when we
> haven't found the header (hence we can stop at a line whose
> beginning looks like a hash) or the previous round already found the
> header and we positively know we are now in the "skip the log
> message" phase (hence a line whose beginning looks like a hash is
> not a new header).

Okay.

> > Makes me wonder what happens when there is
> > non-diff garbage between patches for which we are about to generate
> > patch IDs.
> 
> "Skip non-diff garbage until we see a patch" is the mechanism used
> to skip the log message, so it would be a reasonable thing to skip
> such no-diff garbage between patches, no?

Oh, yes, it is reasonable. I just didn't quite figure out the flow of
the above loop when reading through the code. As you say, it is somewhat
convoluted and not all that straight forward.

In any case though, your changes improve readability, so the fact that
things are not quite straight forward is not the fault of this patch
series here.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] patch-id: tighten code to detect the patch header
  2024-07-30  4:55       ` Patrick Steinhardt
@ 2024-07-30  5:12         ` Patrick Steinhardt
  0 siblings, 0 replies; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-30  5:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1914 bytes --]

On Tue, Jul 30, 2024 at 06:55:14AM +0200, Patrick Steinhardt wrote:
> On Mon, Jul 29, 2024 at 01:12:42PM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > > On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> > "Skip non-diff garbage until we see a patch" is the mechanism used
> > to skip the log message, so it would be a reasonable thing to skip
> > such no-diff garbage between patches, no?
> 
> Oh, yes, it is reasonable. I just didn't quite figure out the flow of
> the above loop when reading through the code. As you say, it is somewhat
> convoluted and not all that straight forward.

As far as I can see we didn't have a test for this yet, so I did have a
quick go at it to reassure myself that things work as expected before
and after your change. Feel free to pick it up if you feel like it, or
to just ignore it :)

Patrick

test_expect_success 'patch-id handles diffs with garbage in between' '
	cat >diff-with-garbage <<-\EOF &&
	$(git rev-parse HEAD)
	diff --git a/bar b/bar
	index bdaf90f..31051f6 100644
	--- a/bar
	+++ b/bar
	@@ -2 +2,2 @@
	 b
	+c
	some
	garbage
	lines
	$(git rev-parse HEAD)
	diff --git a/car b/car
	index 00750ed..2ae5e34 100644
	--- a/car
	+++ b/car
	@@ -1 +1,2 @@
	 3
	+d
	EOF

	cat >diff-without-garbage <<-\EOF &&
	$(git rev-parse HEAD)
	diff --git a/bar b/bar
	index bdaf90f..31051f6 100644
	--- a/bar
	+++ b/bar
	@@ -2 +2,2 @@
	 b
	+c
	$(git rev-parse HEAD)
	diff --git a/car b/car
	index 00750ed..2ae5e34 100644
	--- a/car
	+++ b/car
	@@ -1 +1,2 @@
	 3
	+d
	EOF

	for stable in true false
	do
		test_config patchid.stable $stable &&
		git patch-id <diff-with-garbage >id-with-garbage &&
		git patch-id <diff-without-garbage >id-without-garbage &&
		test_line_count -eq 2 id-with-garbage &&
		test_cmp id-with-garbage id-without-garbage ||
		return 1
	done
'

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/5] Tighten patch header parsing in patch-id
  2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
                     ` (4 preceding siblings ...)
  2024-07-30  1:17   ` [PATCH v2 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
@ 2024-07-30  5:12   ` Patrick Steinhardt
  5 siblings, 0 replies; 20+ messages in thread
From: Patrick Steinhardt @ 2024-07-30  5:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 301 bytes --]

On Mon, Jul 29, 2024 at 06:17:33PM -0700, Junio C Hamano wrote:
>     Updates from v1
> 
>      - changed flag bits used internally from CPP macro to enum and
>        added a bit of comments.

The changes look like what I expected, so this series looks good to me
overall. Thanks!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-07-30  5:12 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-21 23:18 [PATCH 0/5] Tighten patch header parsing in patch-id Junio C Hamano
2024-06-21 23:18 ` [PATCH 1/5] t4204: patch-id supports various input format Junio C Hamano
2024-06-21 23:18 ` [PATCH 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
2024-06-21 23:18 ` [PATCH 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
2024-07-29 12:02   ` Patrick Steinhardt
2024-07-29 20:03     ` Junio C Hamano
2024-06-21 23:18 ` [PATCH 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
2024-07-29 12:03   ` Patrick Steinhardt
2024-06-21 23:18 ` [PATCH 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
2024-07-29 12:07   ` Patrick Steinhardt
2024-07-29 20:12     ` Junio C Hamano
2024-07-30  4:55       ` Patrick Steinhardt
2024-07-30  5:12         ` Patrick Steinhardt
2024-07-30  1:17 ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Junio C Hamano
2024-07-30  1:17   ` [PATCH v2 1/5] t4204: patch-id supports various input format Junio C Hamano
2024-07-30  1:17   ` [PATCH v2 2/5] patch-id: call flush_current_id() only when needed Junio C Hamano
2024-07-30  1:17   ` [PATCH v2 3/5] patch-id: make get_one_patchid() more extensible Junio C Hamano
2024-07-30  1:17   ` [PATCH v2 4/5] patch-id: rewrite code that detects the beginning of a patch Junio C Hamano
2024-07-30  1:17   ` [PATCH v2 5/5] patch-id: tighten code to detect the patch header Junio C Hamano
2024-07-30  5:12   ` [PATCH v2 0/5] Tighten patch header parsing in patch-id Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).